Re: [Maria-developers] [GSoC] Optimize mysql-test-runs - Results of new strategy
Hello Elena and all, First, addressing the previous email: Looking at the dump, I see it can also happen that the dump contains
several records for a pair platform/bbnum. I am not sure why it happens; I think it shouldn't, and it might be a bug in buildbot and/or configuration, or environmental problems. Anyway, due to the way we store output files, they can well overwrite each other in this case, so for several platform/bbnum records you will have only one file. I suppose that's what was hard to resolve, sorry about that.
No worries ; ). There are several cases where platform and build number are the same. The system just names the files as follows:
<platform>_<build_id>-log-test_1-stdio
<platform>_<build_id>-log-test_2-stdio
.....
<platform>_<build_id>-log-test_5-stdio
These files seem to correspond temporally with the test runs: *test_1-stdio belongs to the first test_run of the same platform/build number, and so on. Unfortunately, there are some cases where there are more test_runs in the dump than files available, and this means that it's impossible to be sure which file belongs to which test_run exactly.
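In other words, the intended correspondence is roughly the following (a simplified sketch with made-up function names; the real script also reads the test_run records from the dump and handles the ambiguous cases less gracefully):

import re
from collections import defaultdict

FILE_RE = re.compile(r'^(?P<platform>.+)_(?P<bbnum>\d+)-log-test(?:_(?P<seq>\d+))?-stdio$')

def match_files_to_test_runs(filenames, test_runs):
    # test_runs: list of (test_run_id, platform, bbnum) tuples, ordered by test_run_id.
    files = defaultdict(list)
    for name in filenames:
        m = FILE_RE.match(name)
        if m:
            seq = int(m.group('seq') or 1)   # assume an unnumbered file is the first one
            files[(m.group('platform'), int(m.group('bbnum')))].append((seq, name))
    matches = {}
    seen = defaultdict(int)
    for run_id, platform, bbnum in test_runs:
        seen[(platform, bbnum)] += 1          # this is the i-th run of this platform/bbnum
        for seq, name in sorted(files.get((platform, bbnum), [])):
            if seq == seen[(platform, bbnum)]:
                matches[run_id] = name        # the i-th run gets the file numbered i, if any
                break
    return matches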
You should consider skipped tests, at least for now. Your logic that they are skipped because they can't be run is generally correct; unfortunately, MTR first produces the *full* list of tests to run, and determines whether a test can be run or not on a later stage, when it starts running the tests. Your tool will receive the initial test list, and I'm not sure it's realistic to re-write MTR so that it takes into account limitations that cause skipping tests before creating the list.
I see. Okay then, duly noted. Possibly it's better to skip a test run altogether if there is no input
list for it; it would definitely be best if there were 5K (or whatever slice you are currently using) continuous test runs with input lists; if it so happens that there are lists for some branches but not others, you can skip the branch entirely.
This doesn't seem like a good option. Recall drops seriously, and the test_runs that have a corresponding file don't seem to follow any special pattern; they tend to have long gaps between them, so the information becomes stale and, seemingly, not useful.
The core module should take as parameters - list of tests to choose from, - size of the running set (%), - branch/platform (if we use them in the end), and produce a new list of tests of the size of the running set.
The wrapper module should - read the list of tests from the outside world (for now, from a file), - receive branch/platform as command-line options, - have the running set size set as an easily changeable constant or as a configuration parameter,
and return the list of tests -- let's say for now, in the form of <test suite>.<test name>, blank-separated, e.g. main.select innodb.create-index ...
I am almost done 'translating' the code into a solution that divides it into 'core' and 'wrapper'. There are a few bugs that I still haven't figured out, but I believe I can iron those out pretty soon. I will also use a percentage rather than a fixed running_set size.
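Roughly, the split I am aiming for looks like this sketch (function, option and constant names here are made up and may differ from the actual code in the repository):

# core.py -- strategy logic only; knows nothing about files or command lines.
def choose_tests(candidate_tests, running_set_pct, branch=None, platform=None):
    # Return the subset of candidate_tests to run, sized running_set_pct percent.
    ranked = sorted(candidate_tests, key=lambda t: relevance(t, branch, platform),
                    reverse=True)
    size = max(1, int(len(ranked) * running_set_pct / 100.0))
    return ranked[:size]

def relevance(test, branch, platform):
    return 0.0   # placeholder: the real metric is the failure-history based score

# wrapper.py -- all I/O: reads the incoming test list, parses options,
# and prints the chosen tests as blank-separated <suite>.<name> entries.
import argparse, sys

RUNNING_SET_PCT = 30   # easily changeable constant / configuration parameter

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('test_list_file')
    parser.add_argument('--branch')
    parser.add_argument('--platform')
    args = parser.parse_args()
    with open(args.test_list_file) as f:
        candidates = f.read().split()
    chosen = choose_tests(candidates, RUNNING_SET_PCT, args.branch, args.platform)
    sys.stdout.write(' '.join(chosen) + '\n')

if __name__ == '__main__':
    main()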
Now, regarding the state of the project (and the recall numbers that I am able to achieve so far), here are some observations:
- Unfortunately, I am running out of ideas to improve recall. I tried tuning some parameters, giving more weight to some over others, etc. I still wasn't able to push recall beyond ~87% with the strategy that uses file correlations. From what I've seen, some failures are just extremely hard to predict.
- The strategy that uses only a weighted average of the failure frequency achieves a higher recall, but for a shorter time; the recall decays quickly afterwards. I may try to add some file correlations to this strategy, to see if the recall can be sustained for a longer term.
- There is one problem that I see regarding the data and the potential real-world implementation of the program: by verifying the recall against the historical data, we run the risk of overfitting, so the results measured against the historical data and the results that a real-world implementation could have obtained are potentially different. A possible way to address that issue would require modifying buildbot to gather more data over a longer term.
So having said that, I am looking for some advice in the following regards:
- I will try to take a step back from the new strategy, and see how I can adapt the original strategy to prevent the recall function from declining so sharply with time.
- I will also spend some time maintaining a codebase that better fits the model we need for the implementation. I will upload code soon. All suggestions are welcome.
- Nonetheless, I feel that more data would allow the algorithm to be improved greatly. Is it possible to add logging to buildbot that would allow for more precise data collection? A slower, more iterative process, working more closely with buildbot and doing more detailed data collection, might deliver better results. (I understand that this would probably influence the time scope of the project.)
Let me know what you think about my suggestions.
Regards
Pablo
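P.S. For reference, the weighted average of failure frequency I mention above is essentially an exponentially decayed score, something like this rough sketch (class and constant names are made up; the real code also handles the file correlations and the input lists):

class FailureFrequency(object):
    # Weighted average of past failures: recent runs weigh more, and the
    # contribution of old runs decays exponentially.
    DECAY = 0.95   # hypothetical decay factor

    def __init__(self):
        self.score = {}   # test name -> current relevance

    def update(self, executed_tests, failed_tests):
        for test in executed_tests:
            observed = 1.0 if test in failed_tests else 0.0
            old = self.score.get(test, 0.0)
            # exponentially decayed moving average of "did this test fail?"
            self.score[test] = self.DECAY * old + (1.0 - self.DECAY) * observed

    def rank(self, candidate_tests):
        return sorted(candidate_tests, key=lambda t: self.score.get(t, 0.0),
                      reverse=True)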
Hi Pablo, On 21.07.2014 19:28, Pablo Estrada wrote:
So having said that, I am looking for some advice in the following regards:
- I will also spend some time maintaining a codebase that better fits the model we need for the implementation. I will upload code soon. All suggestions are welcome.
It's hard to make suggestions without seeing what you currently have, please let me know when you have pushed the code.
- Nonetheless, I feel that more data would allow the algorithm to be improved greatly. Is it possible to add logging to buildbot that would allow for more precise data collection? A slower, more iterative process, working more closely with buildbot and doing more detailed data collection, might deliver better results. (I understand that this would probably influence the time scope of the project.)
Could you please explain what you mean by logging into buildbot (and by more precise data collection via it)? How exactly are you planning to work with buildbot interactively? In the part that concerns our task, buildbot picks up a push, gets it compiled and runs MTR with certain predefined parameters. There isn't really much room for interaction. Possibly I totally misunderstand your question, so please elaborate on it. What I can do (it also concerns your previous comment about the non-continuous data) is upload a fresh data dump for you; hopefully it will have [almost] all matching logs, so you'll get a consistent chunk of test runs to experiment with. Regards, Elena
Let me know what you think about my suggestions. Regards Pablo
Hi Elena, It's hard to make suggestions without seeing what you currently have,
please let me know when you have pushed the code.
I just finished cleaning up the code with the new implementation, but the strategy is exactly the same; it is the strategy itself that I have been looking for advice on. In any case, I just uploaded the new code: https://github.com/pabloem/Kokiri/tree/core-wrapper_architecture The strategy of using file correlation is still the same. Could you please explain what you mean by logging into buildbot (and by
more precise data collection via it)? How exactly you are planning to work with buildbot interactively? In the part that concerns our task, buildbot picks up a push, gets it compiled and runs MTR with certain predefined parameters. There isn't really much room for interaction. Possibly I totally misunderstand your question, so please elaborate on it.
What I can do (it also concerns your previous comment about the non-continuous data) is upload a fresh data dump for you; hopefully it will have [almost] all matching logs, so you'll get a consistent chunk of test runs to experiment with.
I mean adding some code that logs extra information, such as which tests were run in each test_run. This would be the main thing. I understand that the logfiles that you sent me contain this information, but storing them is not scalable, and even with a fresh dump, I'm not sure there would be a continuous set of data. I made a small script that analyzes how the files match the test_runs in the database dump, and the matching is quite erratic. Towards the end there is more matching, but it is still irregular and doesn't stay consistent for long: https://raw.githubusercontent.com/pabloem/random/master/matches.txt If you look close to the end, there is already a stretch of 20 consecutive test runs without a matching file:
148484: - kvm-bintar-centos5-x86_1066-log-test-stdio
Skip 20
148485: - winx64-packages_3203-log-test-stdio
So, what I suggested was to log more data about each test run: mainly which tests ran, but as much information as possible. For now, yes, if you'd be so kind, please upload a fresh dump of the database : ) Regards Pablo
Hi Pablo, On 23.07.2014 15:51, Pablo Estrada wrote:
Hi Elena,
It's hard to make suggestions without seeing what you currently have,
please let me know when you have pushed the code.
I just finished cleaning up the code with the new implementation, but in any case, the strategy is exactly the same. I have been looking for advice with the strategy.
In any case, I just uploaded the new code: https://github.com/pabloem/Kokiri/tree/core-wrapper_architecture But the strategy of using file correlation is still the same.
Thanks. I hoped you would have results of the experiments involving incoming lists of tests, as I think it's an important factor which might affect the results (and hence the strategy); but I'll look at what we have now.
Could you please explain what you mean by logging into buildbot (and by
more precise data collection via it)? How exactly you are planning to work with buildbot interactively? In the part that concerns our task, buildbot picks up a push, gets it compiled and runs MTR with certain predefined parameters. There isn't really much room for interaction. Possibly I totally misunderstand your question, so please elaborate on it.
What I can do (it also concerns your previous comment about the non-continuous data) is upload a fresh data dump for you; hopefully it will have [almost] all matching logs, so you'll get a consistent chunk of test runs to experiment with.
I mean adding some code that does logging of extra information such as which tests were run on each test_run. This would be the main thing. I understand that the logfiles that you sent me contain this information, but storing them is not scalable, and even with a fresh dump, I'm not sure there would be a continuous set of data. I made a small script that analyzes the matches of the files with the dump from the database, and their matching is quite random.
I will see what we can do about getting reliable lists one way or another; certainly the log files are a temporary solution, but it would be nice to use them for experiments and see the results anyway, because modifying the MTR/buildbot tandem and especially collecting new data of considerable volume will take time.
Towards the end there is more matching, but it still is quite random, and it doesn't seem to have consistent matching for too long:
https://raw.githubusercontent.com/pabloem/random/master/matches.txt
If you observe, close to the end, there is already a continuous set of 20 skipped test runs:
148484: - kvm-bintar-centos5-x86_1066-log-test-stdio
Skip 20
148485: - winx64-packages_3203-log-test-stdio
If I interpret your list correctly, you mean that logs for test runs with id between 148464 and 148483 (included) are missing. It's a bit strange. I see logs for the following runs:
148466 - winx64-packages_3170-log-test-stdio
148467 - win32-packages_3172-log-test-stdio
148470 - win-rqg-se_309-log-test-stdio
148471 - kvm-deb-lucid-x86_3313-log-test_4-stdio
148472 - win32-packages_3173-log-test-stdio
148473 - kvm-deb-debian6-amd64_2705-log-test_4-stdio
148474 - winx64-packages_3171-log-test-stdio
148476 - win-rqg-se_310-log-test-stdio
148478 - kvm-deb-debian6-x86_2850-log-test_4-stdio
148481 - win-rqg-se_311-log-test-stdio
148482 - kvm-bintar-centos5-amd64_359-log-test-stdio
148483 - kvm-deb-precise-amd64_2709-log-test_4-stdio
This is not to say that parsing logs is the best way to do things, but apparently something went wrong either with my archiving or with your matching. If you don't have these files, please let me know. Now, regarding the misses:
148464, 148475, 148480 are bld-dan-release. For this builder we indeed don't seem to have logs, and the tests are not reliable there, so it should be all right to ignore failures from it.
148465 - that's a miss; something went wrong while storing the logs.
148468, 148469, 148477, 148479 - these are real misses; we don't have these logs.
Most of them should not happen for newer test runs. For example, logs for labrador only start from June, while our database dump was from April.
So, what I had suggested was to log more data about each test run e.g. mainly, which tests ran, but as much information as possible.
For now, yes, if you'd be so kind, please upload a fresh dump of the database : )
I've uploaded the fresh dump. Same location, file name buildbot-20140722.dump.gz. Regards, Elena
Regards Pablo
Hi Elena, Thanks. I hoped you would have results of the experiments involving
incoming lists of tests, as I think it's an important factor which might affect the results (and hence the strategy); but I'll look at what we have now.
I have them now. There was one more bug I hadn't figured out. There are still a couple of bugs related to matching of the input test lists, but these results should be quite close to the expected ones. I ran them with 3000 rounds of training, and about 1500 rounds of prediction (skipping all runs without an input list). Although the results are not as originally expected (a 20-80 ratio), I feel that they are quite acceptable. I will see what we can do about getting reliable lists one or another way;
certainly the log files are a temporary solution, but it would be nice to use them for experiments and see the results anyway, because modifying MTR/buildbot tandem and especially collecting the new data of considerable volume will take time.
I understand, nonetheless I feel that this is a reasonable long-term goal for this project.
This is not to say that parsing logs is the best way to do things, but apparently something went wrong either with my archiving or with your matching. If you don't have these files, please let me know.
It seems there's a bug with matching. I am looking at it now. I've uploaded the fresh dump. Same location, file name
buildbot-20140722.dump.gz.
I will run more detailed tests with the fresh dump. I will focus on a running set size of 30%. I believe the results will be reasonable.
Thanks. Pablo
Hi Elena, I tracked down the issue with matching files and test_runs. It was simpler than we thought. 1. I was using the index in the array, rather than the test_run.id field to identify test runs. Sorry, that was my bad. I changed and reuploaded the list:
https://raw.githubusercontent.com/pabloem/random/master/matches.txt
This accounted for cases: 148470, 148471, 148472, 148473, 148474, 148476, 148478, 148481, 148482, 148483. 2. The other 'false misses' happened because there are earlier test_runs that match the files:
148467 - win32-packages_3172-log-test-stdio
It happens that earlier test_runs (100940, 101104) have the same platform and build id, so the file is matched with them first. By the way, I just ran some tests with a running_set size of 30% and the results were quite consistent, around 80%, even for long runs. Over time it decreases, albeit slowly. I still feel that much more consistent performance can be obtained with consistent input lists. Regards Pablo On Thu, Jul 24, 2014 at 5:00 PM, Pablo Estrada <polecito.em@gmail.com> wrote:
Hi Elena,
Thanks. I hoped you would have results of the experiments involving
incoming lists of tests, as I think it's an important factor which might affect the results (and hence the strategy); but I'll look at what we have now.
I have them now. There was one more bug I hadn't figured out. There are still a couple bugs related to matching of input test list, but these results must be quite close to the expected ones. I did them with 3000 rounds of training, and about 1500 rounds of prediction (skipping all runs without input list).
Although the results are not as originally expected (20-80 ratio, I feel that they are quite acceptable.
I will see what we can do about getting reliable lists one or another way;
certainly the log files are a temporary solution, but it would be nice to use them for experiments and see the results anyway, because modifying MTR/buildbot tandem and especially collecting the new data of considerable volume will take time.
I understand, nonetheless I feel that this is a reasonable long-term goal for this project.
This is not to say that parsing logs is the best way to do things, but apparently something went wrong either with my archiving or with your matching. If you don't have these files, please let me know.
It seems there's a bug with matching. I am looking at it now.
I've uploaded the fresh dump. Same location, file name
buildbot-20140722.dump.gz.
I will run more detailed tests with the new fresh dump. I will focus on a running set size of 30%. I believe they will be reasonable.
Thanks. Pablo
Hi Pablo, Okay, thanks for the update. As I understand, the last two graphs were for the new strategy taking into account all edited files, no branch/platform, no time factor? If it's not quite so, could you please indicate which exact options/metrics you used? Also, if it's not too long and if it's possible with your current code, can you run the old strategy on the same exact data, learning/running set, and input files, so that we could clearly see the difference? Meanwhile, I will look at what we have and maybe come up with some ideas for improving the results. I suppose your new tree does not include the input lists? Are you using the raw log files, or have you pre-processed them and made clean lists? If you are using the raw files, did you rename them? Regards, Elena On 24.07.2014 14:51, Pablo Estrada wrote:
Hi Elena, I tracked down the issue with matching files and test_runs. It was simpler than we thought. 1. I was using the index in the array, rather than the test_run.id field to identify test runs. Sorry, that was my bad. I changed and reuploaded the list:
https://raw.githubusercontent.com/pabloem/random/master/matches.txt
This accounted for cases: 148470, 148471, 148472, 148473, 148474, 148476, 148478, 148481, 148482, 148483.
2. The other 'false misses' happened because there are earlier test_runs that match the files:
148467 - win32-packages_3172-log-test-stdio
It happens that test_run 100940,101104, has the same platform and build id, so the file is matched with it earlier.
By the way, I just ran some tests with running_set size 30% and the results were quite consistent around 80%, even for long runs. Over time it decreases, albeit slowly. I still feel that a lot more consistent performance can be obtained with consistent input lists.
Regards Pablo
Hi Elena, On Thu, Jul 24, 2014 at 8:06 PM, Elena Stepanova <elenst@montyprogram.com> wrote:
Hi Pablo,
Okay, thanks for the update.
As I understand, the last two graphs were for the new strategy taking into account all edited files, no branch/platform, no time factor?
- Yes, new strategy, using 'co-occurrence' of code file edits and failures, plus a weighted average of failures.
- No time factor.
- No branch/platform scores are kept. The data for the tests is the same, no matter the platform.
- But when calculating relevance, we use the failures that occurred in the last run as a parameter. The last run does depend on the branch and platform.
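To make that concrete, the relevance computation has roughly this shape (a simplified sketch; the exact weights and data structures in the actual code differ):

from collections import defaultdict

class CoOccurrenceRelevance(object):
    # Relevance = decayed average of past failures, boosted by how often the test
    # has failed together with edits to the files changed in the current push.
    def __init__(self, decay=0.95, file_weight=1.0):
        self.decay = decay
        self.file_weight = file_weight
        self.fail_score = defaultdict(float)   # test -> decayed failure average
        self.cooccur = defaultdict(float)      # (file, test) -> co-occurrence score

    def record_run(self, executed_tests, failed_tests, changed_files):
        for test in executed_tests:
            observed = 1.0 if test in failed_tests else 0.0
            self.fail_score[test] = (self.decay * self.fail_score[test]
                                     + (1 - self.decay) * observed)
        for test in failed_tests:
            for f in changed_files:
                self.cooccur[(f, test)] += 1.0

    def relevance(self, test, changed_files):
        file_term = sum(self.cooccur[(f, test)] for f in changed_files)
        return self.fail_score[test] + self.file_weight * file_term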
Also, if it's not too long and if it's possible with your current code, can you run the old strategy on the same exact data, learning/running set, and input files, so that we could clearly see the difference?
I have not incorporated the logic for input file list for the old strategy, but I will work on it, and it should be ready by tomorrow, hopefully.
I suppose your new tree does not include the input lists? Are you using the raw log files, or have you pre-processed them and made clean lists? If you are using the raw files, did you rename them?
It does not include them. I am using the raw files. I included a tiny shell script (downlaod_files.sh) that you can execute to download and decompress the files into the directory where the program will look by default. Also, I forgot to change it before uploading, but in basic_testcase.py you would need to erase the file_dir parameter passed to s.wrapper(), so that the program defaults to looking for the files there. Regards Pablo
Hi Elena, I just ran the tests comparing both strategies. To my surprise, according to the tests, the results from the 'original' strategy are a lot higher than those of the 'new' strategy. The difference in results might come from one of many possibilities, but I feel it's the following: using the lists of run tests allows the relevance of a test to decrease only if it is selected to run and it actually runs. That way, high-relevance tests that would have run, but were not in the list, don't run, and thus can still hit their failures later on rather than losing relevance. I will have charts in a few hours, and I will review the code more deeply to make sure that the results are accurate. For now I can inform you that for a 50% size of the running set, the 'original' strategy, with no randomization, time factor or edit factor, achieved a recall of 0.90 in the tests that I ran. Regards Pablo On Thu, Jul 24, 2014 at 8:18 PM, Pablo Estrada <polecito.em@gmail.com> wrote:
Hi Elena,
On Thu, Jul 24, 2014 at 8:06 PM, Elena Stepanova <elenst@montyprogram.com> wrote:
Hi Pablo,
Okay, thanks for the update.
As I understand, the last two graphs were for the new strategy taking into account all edited files, no branch/platform, no time factor?
- Yes, new strategy. Using 'co-occurrence' of code file edits and failures. Also a weighted average of failures. - No time factor. - No branch/platform scores are kept. The data for the tests is the same, no matter platform. - But when calculating relevance, we use the failures occurred in the last run as parameter. The last run does depend of branch and platform.
Also, if it's not too long and if it's possible with your current code, can you run the old strategy on the same exact data, learning/running set, and input files, so that we could clearly see the difference?
I have not incorporated the logic for input file list for the old strategy, but I will work on it, and it should be ready by tomorrow, hopefully.
I suppose your new tree does not include the input lists? Are you using the raw log files, or have you pre-processed them and made clean lists? If you are using the raw files, did you rename them?
It does not include them.
I am using the raw files. I included a tiny shell (downlaod_files.sh) that you can execute to download and decompress the files in the directory where the program will look by default. Also, I forgot to change it when uploading, but in basic_testcase.py, you would need to erase the file_dir parameter passed to s.wrapper(), so that the program defaults in looking for the files.
Regards Pablo
Hello Elena, Concluding with the results of the recent experimentation, here is the available information: I have ported the basic code for the 'original' strategy into the core-wrapper architecture, and uploaded it to the 'master' branch. Now both strategies can be tested equivalently.
Branch: master <https://github.com/pabloem/Kokiri> - Original strategy, using exponential decay. The performance increased a little bit after incorporating randomization of the end of the queue.
Branch: core-wrapper_architecture <https://github.com/pabloem/Kokiri/tree/core-wrapper_architecture> - 'New' strategy, using co-occurrence between file changes and failures to calculate relevance.
I think they are both reasonably useful strategies. My theory for why the 'original' strategy performs better with the input_test lists is that we now know which tests ran, and so only the relevance of tests which ran is affected (whereas previously, all tests were having their relevance reduced). The tests were run with *3000 rounds of training* and *7000 rounds of prediction*. I think that now the most reasonable option would be to gather data for a longer period, just to be sure that the performance of the 'original' strategy holds for the long term. We already discussed that it would be desirable for buildbot to incorporate functionality to keep track of which tests were run, or considered to run (since buildbot already parses the output of MTR, the changes should be quite quick, but I understand that, being a production system, extreme care must be taken with the changes and the design). Finally, I fixed the chart comparing the results, sorry about the confusion yesterday. Let me know what you think, and how you'd like to proceed now. : ) Regards Pablo On Sat, Jul 26, 2014 at 8:26 PM, Pablo Estrada <polecito.em@gmail.com> wrote:
Hi Elena, I just ran the tests comparing both strategies. To my surprise, according to the tests, the results from the 'original' strategy are a lot higher that the 'new' strategy. The difference in results might come from one of many possibilities, but I feel it's the following:
Using the lists of run tests allows the relevance of a test to decrease only if it is considered to run and it runs. That way, tests with high relevance that would run, but were not in the list, don't run and thus are able to be hit their failures later on, rather than losing relevance.
I will have charts in a few hours, and I will review the code more deeply, to make sure that the results are accurate. For now I can inform you that for a 50% size of the running set, the 'original' strategy, with no randomization, time factor or edit factor achieved a recall of 0.90 in the tests that I ran.
Regards Pablo
Hi Pablo, Thanks for the update, I'm looking into it. There is one more important factor in choosing which strategy to put further effort into. Do they perform similarly time-wise? I mean, you now ran the same sets of tests on both strategies. Did it take approximately the same time? And in case you measured it, what about 3000 + 1 rounds, which is closer to the real-life test case? And what absolute time does one round take? I realize it depends on the machine and other things, but roughly -- is it seconds, or minutes, or tens of minutes? We should constantly watch it, because the whole point is to reduce test execution time; but the test execution time will include using the tool, so if it turns out that it takes as much time as we later save on tests, doing it makes little sense. Regards, Elena On 27.07.2014 11:51, Pablo Estrada wrote:
Hello Elena, Concluding with the results of the recent experimentation, here is the available information: I have ported the basic code for the 'original' strategy into the core-wrapper architecture, and uploaded it to the 'master' branch. Now both strategies can be tested equivalently. Branch: master <https://github.com/pabloem/Kokiri> - Original strategy, using exponential decay. The performance increased a little bit after incorporating randomizing of the end of the queue. Branch: core-wrapper_architecture <https://github.com/pabloem/Kokiri/tree/core-wrapper_architecture> - 'New' strategy using co occurrence between file changes and failures to calculate relevance.
I think they are both reasonably useful strategies. My theory is that the 'original' strategy performs better with the input_test lists is that we now know which tests ran, and so only the relevance of tests which ran is affected (whereas previously, all tests were having their relevance reduced). The tests were run with *3000 rounds of training* and *7000 rounds of prediction*.
I think that now the most reasonable option would be to gather data for a longer period, just to be sure that the performance of the 'original' strategy holds for the long term. We already discussed that it would be desirable that buildbot incorporated functionality to keep track of which tests were run, or considered to run (since buildbot already parses the output of MTR, the changes should be quite quick, but I understand that being a production system, extreme care must be had in the changes and the design).
Finally, I fixed the chart comparing the results, sorry about the confusion yesterday.
Let me know what you think, and how you'd like to proceed now. : ) Regards
Pablo
Hello Elena, I am very sorry about that. The trees were left a bit messy with the changes. I have pushed fixes for that just now. The file where you can start now is basic_testcase.py. Before starting you should decompress csv/direct_file_changes.tar.gz into csv/direct_file_changes.csv, and update the directory that contains the input_test_lists in basic_testcase.py.
Regarding your previous email, the way the project works now is as follows:
1. Learning cycle. Populate information about tests.
2. Make predictions.
3. Update results - in memory.
4. Repeat from step 2.
In this way, the project takes several minutes to run 7000 rounds. The 'standard' strategy takes about 20 minutes, and the 'new' one takes about 25. When I think about the project, I expect it to work differently from this. In real life (in buildbot), I believe the project would work by storing the *test_info* data structure in a file, or in the database, and loading it into memory on every test_run, as follows:
1. Load data into memory (from database, or a file).
2. Make predictions.
3. Update results - in memory.
4. Save data (to database or a file).
Steps 2 and 3 are the same in both cases. It takes from 0.05 to 0.35 seconds to do each round of prediction and update of results (depending on the length of the input list, and on the number of modified files for the 'new' strategy). If we make it work like this, then we just need to add the time it would take to load the data structure (and file_changes for the 'new' strategy). This should amount to less than a couple of seconds. I can gather more detailed data regarding time if necessary. Let me know. Regards Pablo
On Sun, Jul 27, 2014 at 6:16 PM, Elena Stepanova <elenst@montyprogram.com> wrote:
Hi Pablo,
Thanks for the update, I'm looking into it.
There is one more important factor to choose which strategy to put the further effort on. Do they perform similarly time-wise?
I mean, you now ran the same sets of tests on both strategies. Did it take approximately the same time? And in case you measured it, what about 3000 + 1 rounds, which is closer to the real-life test case?
And what absolute time does one round take? I realize it depends on the machine and other things, but roughly -- is it seconds, or minutes, or tens of minutes?
We should constantly watch it, because the whole point is to reduce test execution time; but the test execution time will include using the tool, so if it turns out that it takes as much time as we later save on tests, doing it makes little sense.
Regards, Elena
Hi Pablo, I've been looking into the current code, experimenting with it a bit, and also thinking about how we can incorporate it into our process better. While I'm not done with it yet, I have some thoughts and concerns, so I will share them, and you can consider them in the meantime.

First, about integration. I agree it makes much more sense to store metrics rather than calculate them each time from scratch. As an additional bonus, it becomes unimportant whether we make buildbot store lists of all executed tests as we initially discussed. If the tool stores metrics, it only needs to receive the list of tests at runtime, which seems much less invasive. That is, the flow should be _roughly_ like this:
- buildbot starts MTR;
- MTR collects tests to be run;
- MTR calls the predicting part of the tool (Pythia), passes the list of tests and maybe some other information to it;
- Pythia reads existing metrics, creates the list of tests to run and returns it back to MTR;
- MTR attempts to run tests;
- MTR calls the calculating part of the tool (Statistician), passes the list of executed tests and the list of failures to it;
- Statistician updates the metrics and writes them to the database.
So, instead of modifying buildbot to write full lists of tests somewhere, what we need now is the Statistician, which could work in learning mode. We create the table(s) for it -- either empty or populated with whatever we can calculate now -- and let it start learning. Later, when Pythia is ready and the Statistician has collected a certain amount of data, prediction can start. Another bright side of storing the metrics is that in this case we don't really care about finding the minimal size of the learning set, since it won't affect performance anyhow.

Now, about the actual algorithm. I chose the original strategy to look at, since it appeared to be better in your experiments, and I doubt we'll have time to tune the new one to that level.

One thing that I want to see there is fully developed platform mode. I see that mode option is still there, so it should not be difficult. I actually did it myself while experimenting, but since I only made hasty and crude changes, I don't expect them to be useful. The point is, we've never actually evaluated it carefully enough. What you did before in a non-standard mode was only calculate *metrics* per the mode definer ("label", in your new design). But you still used the global counts for the learning set, max count and such. So, an average learning set for a platform would not be 3000, but ~150, etc. It would have been reasonable if we really had to go through all test runs every time, but it's not so. In fact, if you are doing prediction for, let's say, 'labrador', you will only look at the data related to this builder (platform), and take 3000 runs from there only.

After some thinking, I don't care much about branch and mixed mode anymore. Tuning per branch seems a good idea in theory, but it only makes sense with granularity per major version, which would be too complicated. Otherwise, it's counter-productive to differentiate, let's say, the 5.5 tree and a 5.5-<developer name> tree; it will only cause loss of information. Earlier it had some value, since we attempted to run totally irrelevant tests on a branch that couldn't possibly run them (e.g. tests for an engine which isn't even there). But after switching to the lists of tests to run, this problem is gone. But the platform approach has potential. If applied properly, it should provide both precision and diversity, both of which should help to achieve better results. So, it would be great to have it in the code and in the schema ready for use, so we can re-evaluate it closer to the end.

Secondly, I think we've missed some important factors in recall calculation. It is now both overly optimistic and overly pessimistic, and I'm not sure they outweigh each other.

Over-optimistic part: When you calculate metrics, you use all test runs up to the current one. In reality, it's impossible. What really happens is:
- a new push arrived, revision N;
- test run 1: builder1 started tests for revision N;
- test run 2: builder2 started tests for revision N;
- test run 3: builder3 started tests for revision N;
- test X failed in test run 1;
- test X failed in test run 2;
- test X failed in test run 3;
- test run 1 finished;
- test run 2 finished;
- test run 3 finished.
It doesn't matter in which order they fail/finish; the problem is, when builder2 starts, it doesn't have information about builder1 results, and builder3 doesn't know anything about the first two. So, the metric for test X could not be increased yet. But in your current calculation, it is. So, naturally, if we happen to catch the failure on builder1, the metric rises dramatically, and the failure will definitely be caught on builders 2 and 3. It is especially important now, when you use incoming lists, and the running sets might be not identical for builders 1-3 even in standard mode. It might also be another reason why the platform mode was losing: when you calculate metrics per platform, test runs are serialized by default. It shouldn't matter if we decide to start our real-life learning cycle from scratch, because it will necessarily use only available data. But you should consider that currently it affects your measurements.

Over-pessimistic part: It is similar to the previous one, but look at the same problem from a different angle. Suppose the push broke test X, and the test started failing on all builders (platforms). So, you have 20 failures, one per test run, for the same push. Now, suppose you caught it on one platform but not on others. Your statistics will still show 19 failures missed vs 1 failure caught, and recall will be dreadful (~0.05). But in fact, the goal is achieved: the failure has been caught for this push. It doesn't really matter whether you catch it 1 time or 20 times. So, recall here should be 1. It should mainly affect the per-platform approach, but probably the standard one can also suffer if running sets are not identical for all builders.

Finally, a couple of small details.

I wonder if it's because of different versions or anything, but this didn't work for me:
exp = re.compile('([^, ]+) ?([^ ]*)? *.*\[ (fail|disabled|pass|skipped) \]')
It would give me an error. I had to modify it this way:
exp = re.compile('([^, ]+) ?([^ ]*) *.*\[ (fail|disabled|pass|skipped) \]')
From what I see, it should be the same. If you agree, please make the same change (or somehow else get rid of the error).

Also, it appears that csv/test_fail_history.csv is the old file. I replaced it with csv/fails_ptest_run.csv in the code. It doesn't matter for the final version, but might be important for experiments.

Finally, I checked some of the discrepancies in test lists that the tool reports. They are of different kinds, but I don't think it's worth spending time on them. I would just skip all files that cause any kind of confusion for the script. It brings their number down to ~28,000, which should still be enough.

That's all for the moment. As I said, I'm still looking.

Regards, Elena
On 27.07.2014 20:17, Pablo Estrada wrote:
Hello Elena, I am very sorry about that. The trees were left a bit messy with the changes. I have pushed fixes for that just now. The file where you can start now is basic_testcase.py. Before starting you should decompress csv/direct_file_changes.tar.gz into csv/direct_file_changes.csv, and update the directory that contains the input_test_lists in basic_testcase.py.
Regarding your previous email, the way the project works now is as follows:
1. Learning cycle. Populate information about tests. 2. Make predictions 3. Update results - in memory 4. Repeat from step 2
In this way, the project takes several minutes running 7000 rounds. The 'standard' strategy takes about 20 minutes, and the 'new' one takes about 25.
When I think about the project I expect it to work different than this. In real life (in buildbot), I believe the project would work by storing the *test_info* data structure into a file, or into the database, and loading it into memory every test_run, as follows:
1. Load data into memory (from database, or a file) 2. Make predictions 3. Update results - in memory 4. Save data (to database or a file)
Steps 2 and 3 are the same in both cases. It takes from 0.05 to 0.35 seconds to do each round of prediction and update of results (depending on the length of the input list, and number of modified files for the 'new' strategy). If we make it work like this, then we just need to add up the time it would take to load up the data structure (and file_changes for the 'new' strategy). This should amount to less than a couple of seconds.
I can gather more detailed data regarding time if necessary. Let me know.
Regards Pablo
Hi Elena,
First, about integration. I agree it makes much more sense to store metrics rather than calculate them each time from scratch. As an additional bonus, it becomes unimportant whether we make buildbot store lists of all executed tests as we initially discussed. If the tool stores metrics, it only needs to receive the list of tests at runtime, which seems much less invasive.
I see your point. I still think there is a benefit to having a list of run tests: if we want to run simulations or analyze other algorithms with the data, having the list of tests that ran would be more useful than just storing the test_info dictionary, as we could actually look at which tests ran, rather than just at what their relevance is. Of course, if we want to commit to one algorithm, then we don't need any extra information; but if we want more flexibility, then storing more information might be useful. Nonetheless, I do understand that the change may be invasive, and it would go into an important production system, so it is reasonable to want to avoid it. I just want to point out the advantages and disadvantages, so as not to dismiss it completely.
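Either way, with stored metrics the per-run flow on the tool side would reduce to something like this (a rough sketch; the file/table names and the hand-off to MTR are of course not decided yet):

import json

METRICS_FILE = 'test_metrics.json'   # could just as well be a table in the buildbot db
DECAY = 0.95
RUNNING_SET_PCT = 30

def load_metrics():
    try:
        with open(METRICS_FILE) as f:
            return json.load(f)          # {test name: relevance}
    except IOError:
        return {}

def save_metrics(metrics):
    with open(METRICS_FILE, 'w') as f:
        json.dump(metrics, f)

def predict(incoming_tests, metrics):
    # Pythia side: rank the incoming tests and keep the top slice.
    ranked = sorted(incoming_tests, key=lambda t: metrics.get(t, 0.0), reverse=True)
    size = max(1, int(len(ranked) * RUNNING_SET_PCT / 100))
    return ranked[:size]

def learn(metrics, executed_tests, failed_tests):
    # Statistician side: decayed update of the per-test failure score.
    for t in executed_tests:
        observed = 1.0 if t in failed_tests else 0.0
        metrics[t] = DECAY * metrics.get(t, 0.0) + (1 - DECAY) * observed

# Per test_run: MTR would call predict() before running and learn() afterwards, e.g.
#   metrics = load_metrics()
#   to_run = predict(incoming_tests, metrics)
#   ... MTR runs to_run and reports which of them failed ...
#   learn(metrics, to_run, failed)
#   save_metrics(metrics)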
One thing that I want to see there is fully developed platform mode. I see that mode option is still there, so it should not be difficult. I actually did it myself while experimenting, but since I only made hasty and crude changes, I don't expect them to be useful.
I'm not sure what code you are referring to. Can you be more specific on what seems to be missing? I might have missed something when migrating from the previous architecture... Of the code that's definitely not there, there are a couple of things that could be added:
1. When we calculate the relevance of a test on a given platform, we might want to set the relevance to 0, or we might want to derive a default relevance from other platforms (an average, the 'standard', etc.). Currently, it's just set to 0.
2. We might also, just in case, want to keep the 'standard' queue for when we don't have the data for this platform (related to the previous point).
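A hedged sketch of point 1 (falling back to an average over other platforms instead of a flat 0), with the data structure assumed rather than taken from the actual code:

    def default_relevance(relevance, platform, test):
        # 'relevance' is assumed to be a dict keyed by (platform, test) -> float.
        if (platform, test) in relevance:
            return relevance[(platform, test)]
        # No history for this platform: average the test's relevance on other platforms.
        others = [v for (p, t), v in relevance.items() if t == test and p != platform]
        return sum(others) / len(others) if others else 0.0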
It doesn't matter in which order they fail/finish; the problem is, when builder2 starts, it doesn't have information about builder1 results, and builder3 doesn't know anything about the first two. So, the metric for test X could not be increased yet.
But in your current calculation, it is. So, naturally, if we happen to catch the failure on builder1, the metric rises dramatically, and the failure will definitely be caught on builders 2 and 3.
It is especially important now, when you use incoming lists, and the running sets might not be identical for builders 1-3 even in standard mode.
Right, I see your point. Even if test_run 1 catches the error, test_run 2, although it uses the same data, might not catch the same errors if the running set is such that they are pushed out due to lower relevance. The effect might not be too big, but it definitely has the potential to affect the results.
Over-pessimistic part:
It is similar to the previous one, but look at the same problem from a different angle. Suppose the push broke test X, and the test started failing on all builders (platforms). So, you have 20 failures, one per test run, for the same push. Now, suppose you caught it on one platform but not on others. Your statistics will still show 19 failures missed vs 1 failure caught, and recall will be dreadful (~0.05). But in fact, the goal is achieved: the failure has been caught for this push. It doesn't really matter whether you catch it 1 time or 20 times. So, recall here should be 1.
It should mainly affect per-platform approach, but probably the standard one can also suffer if running sets are not identical for all builders.
Right. It seems that solving these two issues is non-trivial (the test_run table does not contain duration of the test_run, or anything). But we can keep in mind these issues.
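To make the two recall notions concrete, here is a small sketch (the record format is an assumption, not the real simulation data): per-test-run recall counts every missed failure, while per-push recall counts a push as caught if the failure was caught on at least one builder.

    def recall_per_run(results):
        # results: list of (push_id, platform, caught) tuples, one per observed failure.
        caught = sum(1 for _, _, c in results if c)
        return float(caught) / len(results) if results else 1.0

    def recall_per_push(results):
        pushes = {}
        for push_id, _, caught in results:
            # A push counts as caught if at least one builder caught the failure.
            pushes[push_id] = pushes.get(push_id, False) or caught
        return float(sum(pushes.values())) / len(pushes) if pushes else 1.0

    # 20 failures from one broken push, caught on a single builder:
    # recall_per_run  -> 0.05
    # recall_per_push -> 1.0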
Finally, a couple of small details.
I wonder if it's because of different versions or anything, but this didn't work for me:
exp = re.compile('([^, ]+) ?([^ ]*)? *.*\[ (fail|disabled|pass|skipped) \]')
It would give me an error. I had to modify it this way:
exp = re.compile('([^, ]+) ?([^ ]*) *.*\[ (fail|disabled|pass|skipped) \]')
From what I see, it should be the same. If you agree, please make the same change (or somehow else get rid of the error).
I guess it's a version issue. I fixed it.
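For reference, a small self-contained example of the fixed pattern written as a raw string (the sample line below only approximates the MTR output format being parsed, so treat it as an assumption):

    import re

    exp = re.compile(r'([^, ]+) ?([^ ]*) *.*\[ (fail|disabled|pass|skipped) \]')

    m = exp.match("main.select 'innodb'                     [ pass ]")
    if m:
        test_name, variant, result = m.groups()
        print('%s -> %s' % (test_name, result))   # main.select -> pass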
Also, it appears that csv/test_fail_history.csv is the old file. I replaced it with csv/fails_ptest_run.csv in the code. It doesn't matter for the final version, but might be important for experiments.
In the code we should be using *test_fail_history_inv.csv*. That is the updated file with ascending test_run id. I will add the instructions for creating and using these files into the readme. Regards Pablo
Hi Elena, I fixed up the repositories with updated versions of the queries, as well as instructions in the README on how to generate them. Now I am looking a bit at the buildbot code, just to try to suggest some design ideas for adding the statistician and the pythia into the MTR related classes. Regards Pablo
(sorry, forgot the list in my reply, resending) Hi Pablo, On 03.08.2014 17:51, Pablo Estrada wrote:
Hi Elena,
One thing that I want to see there is fully developed platform mode. I see that mode option is still there, so it should not be difficult. I actually did it myself while experimenting, but since I only made hasty and crude changes, I don't expect them to be useful.
I'm not sure what code you are referring to. Can you be more specific on what seems to be missing? I might have missed something when migrating from the previous architecture...
I was mainly referring to the learning stage. Currently, the learning stage is "global". You go through X test runs, collect data, distribute it between platform-specific queues, and from the X+1st test run you start predicting based on whatever platform-specific data you have at the moment. But this is bound to cause rather sporadic quality of prediction, because it could happen that out of 3000 learning runs, 1000 belong to platform A, while platform B only had 100, and platform C was introduced later, after your learning cycle. So, for platform B the statistical data will be very limited, and for platform C there will be none -- you will simply start randomizing tests from the very beginning (or using data from other platforms as you suggest below, which is still not quite the same as a pure platform-specific approach).
It seems more reasonable, if the platform-specific mode is used, to do learning per platform too. It is not just about the current investigation activity, but about the real-life implementation too. Let's suppose tomorrow we start collecting the data and calculating the metrics. Some platforms will run more often than others, so let's say in 2 weeks you will have X test runs on these platforms and can start predicting for them, while other platforms will run less frequently, and it will take a month to collect the same amount of data. And 2 months later there will be Ubuntu Utopic Unicorn, which will have no statistical data at all, and it would be cruel to jump into predicting there right away.
It sounds more complicated than it is; in fact, pretty much all you need to add to your algorithm is to make 'count' in your run_simulation a dict rather than a constant. So, I imagine that when you store your metrics after a test run, you will also store the number of test runs per platform, and only start predicting for a particular platform when the count for it reaches the configured number.
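A minimal sketch of that per-platform gate, with class and attribute names invented; the only idea taken from the message above is keeping 'count' as a dict and comparing it against a configured threshold:

    class PlatformGate(object):
        def __init__(self, learning_runs=200):
            self.learning_runs = learning_runs   # configured number of learning runs
            self.count = {}                      # test runs seen so far, per platform

        def record_run(self, platform):
            self.count[platform] = self.count.get(platform, 0) + 1

        def can_predict(self, platform):
            # Only start predicting once this platform has enough learning runs.
            return self.count.get(platform, 0) >= self.learning_runs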
Of the code that's definitely not there, there are a couple of things that could be added: 1. When we calculate the relevance of a test on a given platform, we might want to set the relevance to 0, or we might want to derive a default relevance from other platforms (an average, the 'standard', etc.). Currently, it's just set to 0.
I think you could combine this idea with what was described above. While it makes sense to run *some* full learning cycles on a new platform, it does not have to be thousands, especially since some non-LTS platforms come and go awfully fast. So, we run these not-too-many cycles, get clean platform-specific data, and if necessary enrich it with the other platforms' data.
2. We might also, just in case, want to keep the 'standard' queue for when we don't have the data for this platform (related to the previous point).
If we do what's described above, we should always have data for the platform. But if you mean calculating and storing the standard metrics, then yes -- since we are going to store the values rather than re-calculate them every time, there is no reason to be greedy about it. It might even make sense to calculate both metrics that you developed, too. Who knows, maybe one day we'll find out that the other one gives us better results.
It doesn't matter in which order they fail/finish; the problem is, when builder2 starts, it doesn't have information about builder1 results, and builder3 doesn't know anything about the first two. So, the metric for test X could not be increased yet.
But in your current calculation, it is. So, naturally, if we happen to catch the failure on builder1, the metric rises dramatically, and the failure will definitely be caught on builders 2 and 3.
It is especially important now, when you use incoming lists, and the running sets might not be identical for builders 1-3 even in standard mode.
Right, I see your point. Even if test_run 1 catches the error, test_run 2, although it uses the same data, might not catch the same errors if the running set is such that they are pushed out due to lower relevance. The effect might not be too big, but it definitely has the potential to affect the results.
Over-pessimistic part:
It is similar to the previous one, but look at the same problem from a different angle. Suppose the push broke test X, and the test started failing on all builders (platforms). So, you have 20 failures, one per test run, for the same push. Now, suppose you caught it on one platform but not on others. Your statistics will still show 19 failures missed vs 1 failure caught, and recall will be dreadful (~0.05). But in fact, the goal is achieved: the failure has been caught for this push. It doesn't really matter whether you catch it 1 time or 20 times. So, recall here should be 1.
It should mainly affect per-platform approach, but probably the standard one can also suffer if running sets are not identical for all builders.
Right. It seems that solving these two issues is non-trivial (the test_run table does not contain duration of the test_run, or anything). But we can keep in mind these issues.
Right. At this point it doesn't even make sense to solve them -- in the real-life application, the first one will be gone naturally, just because there will be no data from unfinished test runs. The second one only affects recall calculation, in other words -- the evaluation of the algorithm. It is interesting from a theoretical point of view, but not critical for the real-life application.
I fixed up the repositories with updated versions of the queries, as well as instructions in the README on how to generate them.
Now I am looking a bit at the buildbot code, just to try to suggest some design ideas for adding the statistician and the pythia into the MTR related classes.
As you know, we have the soft pencil-down in a few days, and the hard one a week later. At this point, there isn't much reason to keep frantically improving the algorithm (which is never perfect), so you are right not to plan on it. In the remaining time I suggest to
- address the points above;
- make sure that everything that should be configurable is configurable (algorithm, mode, learning set, db connection details -- see the sketch after this message);
- create structures to store the metrics and to read them from / write them to the database;
- make sure the predicting and the calculating parts can be called separately;
- update documentation, clean up logging and code in general.
As long as we have these two parts easily callable, we will find a place in buildbot/MTR to put them, so don't waste too much time on it. Regards, Elena
Regards Pablo
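As an illustration of the configurability item in the checklist above, the options could be grouped roughly like this; every name and value here is hypothetical, not taken from the actual code:

    CONFIG = {
        'algorithm': 'file_correlations',   # or 'weighted_failure_rate'
        'mode': 'platform',                 # 'standard', 'platform', 'branch', 'mixed'
        'learning_runs': 200,               # runs to observe per unit before predicting
        'running_set': 0.5,                 # fraction of the incoming test list to run
        'db': {'host': 'localhost', 'user': 'buildbot',
               'passwd': '...', 'db': 'buildbot'},
    }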
Hello Elena, I just pushed a transaction with the following changes:
1. Added an internal counter to the kokiri class, and a function to expose it. This function can show how many update-result runs and prediction runs have been run in total, or per unit (a unit being a platform, a branch or a mix of both). Using this counter, one can decide to add logic for extra learning rounds for new platforms (I added it to the wrapper class as an example).
2. Added functions to load and store status into temporary storage. They are very simple - they only serialize to a JSON file, but they can be easily modified to fit the requirements of the implementation. I can add this in the README. If you'd like me to add the capacity to connect to a database and store the data in a table, I can do that too (I think it would be easiest to store the dicts as JSON data in text fields). Let me know if you'd prefer that.
By the way, these functions allow the two parts of the algorithm to be called separately, e.g.:
Predicting phase (can be done depending on the counts of training rounds for the platform, etc.):
1. Create a kokiri instance
2. Load status (call load_status)
3. Input the test list, get a smaller output
4. Eliminate the instance from memory (no need to save state, since nothing changes until results are updated)
Training phase:
1. Create a kokiri instance
2. Load status (call load_status)
3. Feed new information
4. Save status (call save_status)
5. Eliminate the instance from memory
I added tests that check the new features to the wrapper. Both features seem to be working okay. Of course, with more prediction rounds for new platforms, the platform mode improves a bit, but not too dramatically, from what I've seen. I'll test it a bit more. I will also add these features to the file_change_correlations branch, and document everything in the README file. Regards Pablo
Hi Pablo, Thanks for the update. Couple of comments inline. On 08.08.2014 18:17, Pablo Estrada wrote:
Hello Elena, I just pushed a transaction, with the following changes:
1. Added an internal counter to the kokiri class, and a function to expose it. This function can show how many update-result runs and prediction runs have been run in total, or per unit (a unit being a platform, a branch or a mix of both). Using this counter, one can decide to add logic for extra learning rounds for new platforms (I added it to the wrapper class as an example).
2. Added functions to load and store status into temporary storage. They are very simple - they only serialize to a JSON file, but they can be easily modified to fit the requirements of the implementation. I can add this in the README. If you'd like for me to add the capacity to connect to a database and store the data in a table, I can do that too (I think it
Yes, I think we'll have to have it stored in the database. Chances are, the scripts will run on buildbot slaves rather than on the master, so storing data in a file just won't do any good.
would be easiest to store the dicts as json data in text fields). Let me know if you'd prefer that.
I don't like the idea of storing the entire dicts as json. It doesn't seem to be justified by... well... anything, except for saving a tiny bit of time on writing queries. But that's a one-time effort, while this way we won't be able to [easily] join the statistical data with, let's say, existing buildbot tables; and it generally won't be efficient and easy to read.
Besides, keep in mind that for real use, if, let's say, we are running in 'platform' mode, for each call we don't need the whole dict, we only need the part of the dict which relates to this platform, and possibly the standard one. So, there is really no point loading the other 20 platforms' data, which you will almost inevitably do if you store it all in a single json.
The real (not json-ed) data structure seems quite suitable for SQL, so it makes sense to store it as such. If you think it will take you long to do that, it's not critical: just create an example interface for connecting to a database and running *some* queries to store/read the data, and we'll tune it later.
Regards, Elena
Hello Elena, You raise good points. I have just rewritten the save_state and load_state functions. Now they work with a MySQL database and a table that looks like this:
create table kokiri_data (
    dict varchar(20),
    labels varchar(200),
    value varchar(100),
    primary key (dict, labels));
Since I wanted to store several dicts in the database, I decided to try this format. The 'dict' field indicates the dictionary that the data belongs to ('upd_count', 'pred_count' or 'test_info'). The 'labels' field contains the space-separated list of labels in the dictionary (for a more detailed explanation, check the README and the code). The 'value' field contains the value of the datum (count of runs, relevance, etc.).
Since the labels are space-separated, this assumes we are not using the mixed mode. If we use the mixed mode, we may change the separator (',' or '&' or '%' or '$' are good alternatives).
Let me know what you think about this strategy for storing into the database. I felt it was the simplest one, while still allowing some querying on the database (like loading only one metric or one 'unit' (platform/branch/mix), etc.). It may also allow storing several configurations if necessary. Regards Pablo
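A sketch of how save_state and load_state might map the dicts onto the kokiri_data table described above. It assumes a PEP 249 (DB-API) connection with the '%s' parameter style (as in MySQLdb or mysql.connector), and that each dict is already keyed by the space-separated label string; everything beyond the table layout is an assumption:

    def save_state(conn, upd_count, pred_count, test_info):
        cur = conn.cursor()
        for name, d in (('upd_count', upd_count),
                        ('pred_count', pred_count),
                        ('test_info', test_info)):
            for labels, value in d.items():
                # One row per entry; REPLACE keeps the (dict, labels) key unique.
                cur.execute(
                    "REPLACE INTO kokiri_data (dict, labels, value) VALUES (%s, %s, %s)",
                    (name, labels, str(value)))
        conn.commit()

    def load_state(conn, unit=None):
        # Optionally restrict loading to one unit (platform/branch), assuming the
        # unit is the first space-separated label -- this addresses the earlier
        # point about not loading the other platforms' data.
        cur = conn.cursor()
        if unit is None:
            cur.execute("SELECT dict, labels, value FROM kokiri_data")
        else:
            cur.execute("SELECT dict, labels, value FROM kokiri_data WHERE labels LIKE %s",
                        (unit + ' %',))
        state = {'upd_count': {}, 'pred_count': {}, 'test_info': {}}
        for name, labels, value in cur.fetchall():
            state[name][labels] = value
        return state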
Hi Pablo, On 10.08.2014 9:31, Pablo Estrada wrote:
Hello Elena, You raise good points. I have just rewritten the save_state and load_state functions. Now they work with a MySQL database and a table that looks like this:
create table kokiri_data ( dict varchar(20), labels varchar(200), value varchar(100), primary key (dict,labels));
Since I wanted to store many dicts into the database, I decided to try this format. The 'dict' field includes the dictionary that the data belongs to ('upd_count','pred_count' or 'test_info'). The 'labels' field includes the space-separated list of labels in the dictionary (for a more detailed explanation, check the README and the code). The value contains the value of the datum (count of runs, relevance, etc.)
Since the labels are space-separated, this assumes we are not using the mixed mode. If we use mixed mode, we may change the separator (, or & or % or $ are good alternatives).
Let me know what you think about this strategy to store into the database. I felt it was the most simple one, while still allowing to do some querying on the database (like loading only one metric or one 'unit' (platform/branch/mix), etc). It may also allow to store many configurations if necessary.
Okay, let's have it this way. We can change it later if we want to. In the remaining time, you can do the cleanup, check the documentation, and maybe run some last clean experiments with the existing data and different parameters (modes, metrics, etc.), to have the statistical results with the latest code, which we'll use later to decide on the final configuration. Regards, Elena
Regards Pablo
On Sat, Aug 9, 2014 at 8:26 AM, Elena Stepanova <elenst@montyprogram.com> wrote:
Hi Pablo,
Thanks for the update. Couple of comments inline.
On 08.08.2014 18:17, Pablo Estrada wrote:
Hello Elena, I just pushed a transaction, with the following changes:
1. Added an internal counter to the kokiri class, and a function to expose it. This function can show how many update result runs and prediction runs have been run in total, or per unit (an unit being a platform, a branch or a mix of both). Using this counter, one can decide to add logic for extra learning rounds for new platforms (I added it to the wrapper class as an example).
2. Added functions to load and store status into temporary storage. They are very simple - they only serialize to a JSON file, but they can be easily modified to fit the requirements of the implementation. I can add this in the README. If you'd like for me to add the capacity to connect to a database and store the data in a table, I can do that too (I think it
Yes, I think we'll have to have it stored in the database. Chances are, the scripts will run on buildbot slaves rather than on the master, so storing data in a file just won't do any good.
would be easiest to store the dicts as json data in text fields). Let me
know if you'd prefer that.
I don't like the idea of storing the entire dicts as json. It doesn't seem to be justified by... well... anything, except for saving a tiny bit of time on writing queries. But that's a one-time effort, while this way we won't be able to [easily] join the statistical data with, lets say, existing buildbot tables; and it generally won't be efficient and easy to read.
Besides, keep in mind that for real use, if, lets say, we are running in 'platform' mode, for each call we don't need the whole dict, we only need the part of dict which relates to this platform, and possibly the standard one. So, there is really no point loading other 20 platforms' data, which you will almost inevitably do if you store it in a single json.
The real (not json-ed) data structure seems quite suitable for SQL, so it makes sense to store it as such.
If you think it will take you long to do that, it's not critical: just create an example interface for connecting to a database and running *some* queries to store/read the data, and we'll tune it later.
Regards, Elena
By the way, these functions allow the two parts of the algorithm to be called separately, e.g.:
Predicting phase (can be done depending of counts of training rounds for platform, etc..) 1. Create kokiri instance 2. Load status (call load_status) 3. Input test list, get smaller output 4. Eliminate instance from memory (no need to save state since nothing changes until results are updated)
Training phase: 1. Create kokiri instance 2. Load status (call load_status) 3. Feed new information 4. Save status (call save_status) 5. Eliminate instance from memory
I added tests that check the new features to the wrapper. Both features seem to be working okay. Of course, the more prediction rounds for new platforms, the platform mode improves a bit, but not too dramatically, for what I've seen. I'll test it a bit more.
I will also add these features to the file_change_correlations branch, and document everything in the README file.
Regards Pablo
On Wed, Aug 6, 2014 at 8:04 PM, Elena Stepanova <elenst@montyprogram.com> wrote:
(sorry, forgot the list in my reply, resending)
Hi Pablo,
On 03.08.2014 17:51, Pablo Estrada wrote:
Hi Elena,
One thing that I want to see there is fully developed platform mode. I
see
that mode option is still there, so it should not be difficult. I
actually
did it myself while experimenting, but since I only made hasty and crude
changes, I don't expect them to be useful.
I'm not sure what code you are referring to. Can you be more specific on what seems to be missing? I might have missed something when migrating
from
the previous architecture...
I was mainly referring to the learning stage. Currently, the learning stage is "global". You go through X test runs, collect data, distribute it between platform-specific queues, and from X+1 test run you start predicting based on whatever platform-specific data you have at the moment.
But this is bound to cause rather sporadic quality of prediction, because it could happen that out of 3000 learning runs, 1000 belongs to platform A, while platform B only had 100, and platform C was introduced later, after your learning cycle. So, for platform B the statistical data will be very limited, and for platform C there will be none -- you will simply start randomizing tests from the very beginning (or using data from other platforms as you suggest below, which is still not quite the same as pure platform-specific approach).
It seems more reasonable, if the platform-specific mode is used, to do learning per platform too. It is not just about current investigation activity, but about the real-life implementation too.
Lets suppose tomorrow we start collecting the data and calculating the metrics. Some platforms will run more often than others, so lets say in 2 weeks you will have X test runs on these platforms so you can start predicting for them; while other platforms will run less frequently, and it will take 1 month to collect the same amount of data. And 2 months later there will be Ubuntu Utopic Unicorn which will have no statistical data at all, and it will be cruel to jump into predicting there right away, without any statistical data at all.
It sounds more complicated than it is, in fact pretty much all you need to add to your algorithm is making 'count' in your run_simulation a dict rather than a constant.
So, I imagine that when you store your metrics after a test run, you will also store a number of test runs per platform, and only start predicting for this particular platform when the count for it reaches the configured number.
Of the code that's definitely not there, there are a couple things that could be added: 1. When we calculate the relevance of a test on a given platform, we
might
want to set the relevance to 0, or we might want to derive a default relevance from other platforms (An average, the 'standard', etc...). Currently, it's just set to 0.
I think you could combine this idea with what was described above. While it makes sense to run *some* full learning cycles on a new platform, it does not have to be thousands, especially since some non-LTS platforms come and go awfully fast. So, we run these no-too-many cycles, get clean platform-specific data, and if necessary enrich it with the other platforms' data.
2. We might also, just in case, want to keep the 'standard' queue for
when
we don't have the data for this platform (related to the previous point).
If we do what's described above, we should always have data for the platform. But if you mean calculating and storing the standard metrics, then yes -- since we are going to store the values rather than re-calculate them every time, there is no reason to be greedy about it. It might even make sense to calculate both metrics that you developed, too. Who knows maybe one day we'll find out that the other one gives us better results.
It doesn't matter in which order they fail/finish; the problem is, when
builder2 starts, it doesn't have information about builder1 results, and builder3 doesn't know anything about the first two. So, the metric for
test
X could not be increased yet.
But in your current calculation, it is. So, naturally, if we happen to catch the failure on builder1, the metric raises dramatically, and the failure will be definitely caught on builders 2 and 3.
It is especially important now, when you use incoming lists, and the running sets might be not identical for builders 1-3 even in standard
mode.
Right, I see your point. Still, even if test_run 1 caught the error, test_run 2, although it would be using the same data, might not catch the same errors if the running set pushes them out due to lower relevance. The effect might not be too big, but it definitely has the potential to affect the results.
Over-pessimistic part:
It is similar to the previous one, but look at the same problem from a different angle. Suppose the push broke test X, and the test started failing on all builders (platforms). So, you have 20 failures, one per test run, for the same push. Now, suppose you caught it on one platform but not on others. Your statistics will still show 19 failures missed vs 1 failure caught, and recall will be dreadful (~0.05). But in fact, the goal is achieved: the failure has been caught for this push. It doesn't really matter whether you catch it 1 time or 20 times. So, recall here should be 1.
It should mainly affect the per-platform approach, but probably the standard one can also suffer if running sets are not identical for all builders.
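The adjusted recall calculation could look roughly like this (a sketch under assumed field names; it simply groups observed failures by push and test, and counts a failure as caught if at least one builder caught it):

# Hypothetical sketch: recall per push rather than per individual test run.
def per_push_recall(failures):
    """failures: iterable of dicts with keys 'push', 'test' and 'caught' (bool)."""
    caught_by_push = {}
    for f in failures:
        key = (f['push'], f['test'])
        caught_by_push[key] = caught_by_push.get(key, False) or f['caught']
    if not caught_by_push:
        return 1.0   # nothing failed, nothing to catch
    return sum(1 for caught in caught_by_push.values() if caught) / len(caught_by_push)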
Right. It seems that solving these two issues is non-trivial (the test_run table does not contain the duration of the test_run, or anything like that). But we can keep these issues in mind.
Right. At this point it doesn't even make sense to solve them -- in a real-life application, the first one will be gone naturally, just because there will be no data from unfinished test runs.
The second one only affects recall calculation, in other words -- evaluation of the algorithm. It is interesting from a theoretical point of view, but not critical for real-life application.
I fixed up the repositories with updated versions of the queries, as well as instructions in the README on how to generate them.
Now I am looking a bit at the buildbot code, just to try to suggest some design ideas for adding the statistician and the pythia into the MTR related classes.
As you know, we have the soft pencil-down in a few days, and the hard one a week later. At this point, there isn't much reason to keep frantically improving the algorithm (which is never perfect), so you are right not to plan on it.
In the remaining time I suggest to:
- address the points above;
- make sure that everything that should be configurable is configurable (algorithm, mode, learning set, db connection details) -- a possible shape is sketched after this list;
- create structures to store the metrics, and the code for reading from/writing to the database;
- make sure the predicting and the calculating part can be called separately;
- update documentation, clean up logging and code in general.
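As a possible shape for the configurable part (every option name and default below is an illustrative assumption, not the actual Kokiri options):

# Hypothetical example of the configuration; names and defaults are illustrative.
CONFIG = {
    'algorithm': 'fail_frequency',   # or 'file_change_correlation'
    'mode': 'platform',              # 'standard', 'platform', 'branch' or 'mixed'
    'learning_runs': 200,            # learning set size per unit
    'running_set': 0.3,              # fraction of the incoming test list to run
    'db': {
        'host': 'localhost',
        'user': 'kokiri',
        'password': '...',
        'database': 'kokiri',
    },
}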
As long as we have these two parts easily callable, we will find a place in buildbot/MTR to put them, so don't waste too much time on it.
Regards, Elena
Hello Elena and all,
I have submitted the concluding commit to the project, with a very short 'RESULTS' file that briefly explains the project, the different strategies and the results. It includes a chart with updated results for both strategies and different modes. If you think I should add anything else, please let me know. Here it is: https://github.com/pabloem/Kokiri/blob/master/RESULTS.md
Thank you very much.
Regards
Pablo
On 8/13/14, Elena Stepanova <elenst@montyprogram.com> wrote:
Hi Pablo,
On 10.08.2014 9:31, Pablo Estrada wrote:
Hello Elena, You raise good points. I have just rewritten the save_state and load_state functions. Now they work with a MySQL database and a table that looks like this:
create table kokiri_data (
  dict varchar(20),
  labels varchar(200),
  value varchar(100),
  primary key (dict, labels)
);
Since I wanted to store many dicts in the database, I decided to try this format. The 'dict' field indicates which dictionary the data belongs to ('upd_count', 'pred_count' or 'test_info'). The 'labels' field contains the space-separated list of labels in the dictionary (for a more detailed explanation, check the README and the code). The 'value' field holds the datum itself (count of runs, relevance, etc.).
Since the labels are space-separated, this assumes we are not using the mixed mode. If we use mixed mode, we may change the separator (',', '&', '%' or '$' are good alternatives).
Let me know what you think about this strategy for storing into the database. I felt it was the simplest one, while still allowing some querying on the database (like loading only one metric, or only one 'unit' (platform/branch/mix), etc.). It may also allow storing many configurations if necessary.
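For illustration, the mapping between the in-memory dicts and that table could be sketched like this (the helper names and the exact dict layout are assumptions; the cursor is any MySQL DB-API cursor, e.g. from mysql.connector, which uses the %s parameter style):

# Hypothetical sketch of saving/loading a dict against the kokiri_data table above.
SEP = ' '   # label separator; would have to change for mixed mode

def save_dict(cursor, name, data):
    """data is a flattened dict: {(label1, label2, ...): value}."""
    rows = [(name, SEP.join(labels), str(value)) for labels, value in data.items()]
    # REPLACE INTO upserts on the (dict, labels) primary key.
    cursor.executemany(
        "REPLACE INTO kokiri_data (dict, labels, value) VALUES (%s, %s, %s)", rows)

def load_dict(cursor, name, platform=None):
    """Load one dict; optionally only the rows of a single platform."""
    sql = "SELECT labels, value FROM kokiri_data WHERE dict = %s"
    params = [name]
    if platform is not None:
        sql += " AND labels LIKE %s"   # assumes the platform is the first label
        params.append(platform + SEP + '%')
    cursor.execute(sql, params)
    return {tuple(labels.split(SEP)): value for labels, value in cursor.fetchall()}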
Okay, lets have it this way. We can change it later if we want to.
In the remaining time, you can do the cleanup, check documentation, and maybe run some last clean experiments with the existing data and different parameters (modes, metrics etc.), to have the statistical results with the latest code, which we'll use later to decide on the final configuration.
Regards, Elena
Regards Pablo
On Sat, Aug 9, 2014 at 8:26 AM, Elena Stepanova <elenst@montyprogram.com> wrote:
Hi Pablo,
Thanks for the update. Couple of comments inline.
On 08.08.2014 18:17, Pablo Estrada wrote:
Hello Elena, I just pushed a transaction, with the following changes:
1. Added an internal counter to the kokiri class, and a function to expose it. This function can show how many result-update runs and prediction runs have been run in total, or per unit (a unit being a platform, a branch or a mix of both). Using this counter, one can decide to add logic for extra learning rounds for new platforms (I added it to the wrapper class as an example).
2. Added functions to load and store status into temporary storage. They are very simple - they only serialize to a JSON file, but they can be easily modified to fit the requirements of the implementation. I can add this in the README. If you'd like for me to add the capacity to connect to a database and store the data in a table, I can do that too (I think it
Yes, I think we'll have to have it stored in the database. Chances are, the scripts will run on buildbot slaves rather than on the master, so storing data in a file just won't do any good.
would be easiest to store the dicts as json data in text fields). Let me know if you'd prefer that.
I don't like the idea of storing the entire dicts as json. It doesn't seem to be justified by... well... anything, except for saving a tiny bit of time on writing queries. But that's a one-time effort, while this way we won't be able to [easily] join the statistical data with, lets say, existing buildbot tables; and it generally won't be efficient and easy to read.
Besides, keep in mind that for real use, if, lets say, we are running in 'platform' mode, for each call we don't need the whole dict, we only need the part of dict which relates to this platform, and possibly the standard one. So, there is really no point loading other 20 platforms' data, which you will almost inevitably do if you store it in a single json.
The real (not json-ed) data structure seems quite suitable for SQL, so it makes sense to store it as such.
If you think it will take you long to do that, it's not critical: just create an example interface for connecting to a database and running *some* queries to store/read the data, and we'll tune it later.
Regards, Elena
By the way, these functions allow the two parts of the algorithm to be called separately, e.g.:
Predicting phase (can be done depending on counts of training rounds for the platform, etc.):
1. Create kokiri instance
2. Load status (call load_status)
3. Input test list, get smaller output
4. Eliminate instance from memory (no need to save state since nothing changes until results are updated)
Training phase:
1. Create kokiri instance
2. Load status (call load_status)
3. Feed new information
4. Save status (call save_status)
5. Eliminate instance from memory
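A rough usage sketch of keeping the two phases independent (the 'Kokiri' class name and the predict/update method names are assumptions; load_status/save_status are the functions mentioned above):

# Hypothetical sketch of calling the two phases separately.
def predicting_phase(incoming_tests, unit):
    k = Kokiri()
    k.load_status(unit)                      # read stored metrics for this unit
    running_set = k.predict(incoming_tests)  # reduce the incoming test list
    return running_set                       # nothing to save: metrics unchanged

def training_phase(test_run_results, unit):
    k = Kokiri()
    k.load_status(unit)
    k.update_results(test_run_results)       # feed the new failure information
    k.save_status(unit)                      # persist the updated metrics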
I added tests for the new features to the wrapper. Both features seem to be working okay. Of course, the more prediction rounds for new platforms, the more the platform mode improves, but not too dramatically, from what I've seen. I'll test it a bit more.
I will also add these features to the file_change_correlations branch, and document everything in the README file.
Regards Pablo
Hi Pablo,
Thanks for the great work. Just one thing -- In RESULTS.md, paragraphs "The Fail Frequency algorithm" and "The File-change correlation algorithm" are unfinished. It's not a big deal, but I want to be sure there wasn't anything important in the lost part. Could you please double-check?
Regards, Elena
On 17.08.2014 16:32, Pablo Estrada wrote:
Hello Elena and all, I have submitted the concluding commit to the project with a very short 'RESULTS' file that explains briefly the project, the different strategies and the results. It includes a chart with updated results for both strategies and different modes. If you think I should add anything else, please let me know. Here it is: https://github.com/pabloem/Kokiri/blob/master/RESULTS.md
Thank you very much. Regards
Pablo
Elena, thank you very much : ). I just pushed the final commit. I had not realized it had failed last time.
Best.
Pablo
On Tue, Aug 19, 2014 at 4:26 PM, Elena Stepanova <elenst@montyprogram.com> wrote:
Hi Pablo,
Thanks for the great work.
Just one thing -- In RESULTS.md, paragraphs "The Fail Frequency algorithm" and "The File-change correlation algorithm" are unfinished. It's not a big deal, but I want to be sure there wasn't anything important in the lost part. Could you please double-check?
Regards, Elena