Re: [Maria-developers] [GSoC] Accepted student ready to work : )
Hello all,

I'm still interested in a call (I would make sure to write down everything as part of an email to keep it clearly organized here), but as Elena suggested, I'll try to look at Sergei's bzr repo first, and all the documentation. In any case, I already have a decent idea of how the data is arranged. I can see how it is difficult to trace test results back to code changes directly, but it might be possible. I still haven't figured out how to recognize which code changes pertain to the same 'history', or 'line' (I think that you were not considering this in your experiments... and it might turn out not to be relevant at all, but I still want to consider it, to try to make a more complete model).

So I have just two questions that should be easy to answer by email:

1. In your graphs, what exactly do you mean by Precision? What about Recall? And cutoff?
2. I looked at your launchpad profile but I can't seem to find the repo with your tests (if you even pushed it to lp at all). Is it available?

That's it for today : ) Until next time.

Pablo

On Thu, Apr 24, 2014 at 12:28 AM, Sergei Golubchik <serg@skysql.com> wrote:
On April 23, 2014 10:07:23 AM CEST, Pablo Estrada wrote:
If it's not too inconvenient, do you think we could set up a Google Hangout or a Skype call on Monday to go over a few questions quickly?
Definitely. This week I'm on vacation and without a laptop, back home on Sunday. So, Monday is fine. But please, if you have specific questions, ask them now in emails. I'll prepare answers for Monday, but it'll take time to remember everything. That is, if you wait with your questions until the call, I might not be able to remember the answers on the spot. Regards, Sergei
Hi, Pablo!

On Apr 27, Pablo Estrada wrote:
> In any case, I already have a decent idea of how the data is arranged. I
> can see how it is difficult to trace test results back to code changes
> directly, but it might be possible.

From the historical data that you have, it is strictly impossible. One can, for example:

+ do some changes
* bzr commit -> revno: 1234 (for example), revid: something@unique-1
* bzr push
+ buildbot builds and tests revno 1234
* bzr uncommit
+ do more changes
* bzr commit -> revno: 1234 (for example), revid: something@unique-2
* bzr push
+ buildbot builds and tests revno 1234

Note that two *different* revisions got the same revno! And the changes from the first revision are completely and totally lost; there is no way to retrieve them from anywhere. Revision-id is the only unique identifier for a revision; unfortunately, it's not logged in these tables. I believe we'll change buildbot so that the revid is logged in the future. But so far it wasn't needed, and this is one of the defects in the data. In my testing data I've generated fake revision ids - using md5 of the revno, the set of changed files, timestamps, perhaps; I don't remember the details. I've presumed that when this is put in production, buildbot will log real revids.

> I still haven't figured out how to recognize what code changes pertain
> to the same 'history', or 'line' (I think that you were not
> considering this in your experiments... and it might turn out to not
> be relevant at all, but I still want to consider it, to try to make a
> more complete model).

To a certain extent you can get this from revnos (but not from revids). See any complex history graph in "bzr viz". For example this: http://mhall119.com/2010/04/bzr-is-amazing/ (I've just googled it out with an image search for "bzr viz"). Note that changes in the same "line" have the same revno prefix, and only the last component of the revno is incremented. On that screenshot, you can recognize these "lines" of changes:

black: 28 -> 29 -> 30 -> 31 -> 32
blue: 19.3.1 -> 19.3.2
green: 19.1.5 -> 19.1.6 -> 19.1.7 -> 19.1.8 -> 19.1.9 -> 19.1.10 -> 19.1.11
magenta: 19.4.1
red: 19.5.1
yellow: 19.6.1 -> 19.6.2

If you'd like, you can install bzr, branch lp:maria and run "bzr viz" or "qbzr".

> So I have just two questions that should be easy to answer by email:
> 1. In your graphs, exactly, what do you mean by Precision? What about
> Recall? And cutoff?

For many years I was very interested in Information Retrieval (I'm the author of the MySQL full-text search feature). These terms come from IR: http://en.wikipedia.org/wiki/Precision_and_recall

Think of it this way:

* There's a revision we want to test in buildbot.
* There are NNN (more than 20000) tests to run in total (taking into account all combinations, test variants, and builders).
* We only want to run N (much less than NNN) tests.
* We sort all NNN tests in some specific order, and take the first N tests. N is the "cutoff".
* Now, I check the history - which tests out of all NNN have failed and which succeeded.
* Ideally, the first N tests should have failed, and all other NNN-N tests should have succeeded.
* Precision corresponds to the number of succeeded tests in the first N: 100% means that all N selected tests have failed, and no succeeded test was selected.
* Recall corresponds to the number of failed tests outside of the first N: at 100% there are no failed tests outside of the first N, so no failed test was missed.

In IR, precision and recall should always be considered together. It's trivial to optimize only one of these values. E.g. for N=0 you have perfect precision - you never select a succeeded test - and very bad recall: all failed tests are missed. Alternatively, for N=NNN you have perfect recall and bad precision.

In my results (see the readme from v1) I realized that I don't care about precision as such; I am only interested in recall and the total testing time. Although precision and time are related, time is much more meaningful here. So I optimize a {Time, Recall} pair, not {Precision, Recall}.

> 2. I looked at your launchpad profile but I can't seem to find the repo
> with your tests (if you even pushed it to lp at all). Is it available?

Uhm. I've never published it. It was pretty much write-only code in very hairy perl. I don't know if it makes any sense to look at it at all. But ok, I've just pushed it, here you are:

bzr+ssh://bazaar.launchpad.net/~maria-captains/mariadb-tools/mtr-prob-tmp

Regards, Sergei
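For concreteness, a minimal Python sketch of how precision and recall at a cutoff N could be computed from a ranked test list, as described above; the function and test names here are illustrative, not from Sergei's scripts:

    def precision_recall_at_cutoff(ranked_tests, failed_tests, n):
        # ranked_tests: all NNN test names, sorted by predicted relevance.
        # failed_tests: set of test names that actually failed in this test run.
        # n: the cutoff, i.e. the size of the selected running set.
        selected = set(ranked_tests[:n])
        caught = selected & failed_tests                     # failures inside the first n
        precision = len(caught) / n if n else 1.0            # 1.0 = every selected test failed
        recall = len(caught) / len(failed_tests) if failed_tests else 1.0
        return precision, recall

    # Example: 6 tests ranked by relevance, 2 of them actually failed.
    ranked = ["t1", "t2", "t3", "t4", "t5", "t6"]
    failed = {"t2", "t5"}
    print(precision_recall_at_cutoff(ranked, failed, 3))     # (0.333..., 0.5)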
Sergei Golubchik <serg@mariadb.org> writes:
Note that two *different* revisions got the same revno! And the changes from the first revision are completely and totally lost; there is no way to retrieve them from anywhere.
Indeed. But note that in main trees (5.1, 5.2, 5.3, 5.5, and 10.0), this cannot occur, since we have set the append_revision_only option (or "append_revisions_only", can't remember). This prevents a revision number from changing, once pushed. So in main trees, the revision number _should_ in fact be unique.
Revision-id is the only unique identifier for a revision, unfortunately, it's not logged in these tables. I believe we'll change buildbot so that revid would be logged in the future. But so far it wasn't needed, and this is one of the defects in the data.
I actually wanted to log it when I wrote the code. The problem is that the revision-id is not available to buildbot when the change is received from Launchpad. I even asked the bzr/launchpad developers to provide the revid: so it could be logged. The answer I got was that it is a deliberate feature to hide the revision id :-( https://bugs.launchpad.net/launchpad/+bug/419057 So I don't think we will get revid in Buildbot. Of course, if we go to git, we will not have this problem anymore, as it always uses a consistent, stable revision identifier. Hope this helps, - Kristian.
Hi, Kristian! On Apr 28, Kristian Nielsen wrote:
Sergei Golubchik <serg@mariadb.org> writes:
Note that two *different* revisions got the same revno! And the changes from the first revision are completely and totally lost; there is no way to retrieve them from anywhere.
Indeed.
But note that in main trees (5.1, 5.2, 5.3, 5.5, and 10.0), this cannot occur, since we have set the append_revision_only option (or "append_revisions_only", can't remember). This prevents a revision number from changing, once pushed.
So in main trees, the revision number _should_ in fact be unique.
Yes. I omitted that detail, because I hope that we can find a solution that works for all trees, without checks that only work for main trees. But, of course, as a last resort we can rely on append_revisions_only.
Revision-id is the only unique identifier for a revision, unfortunately, it's not logged in these tables. I believe we'll change buildbot so that revid would be logged in the future. But so far it wasn't needed, and this is one of the defects in the data.
I actually wanted to log it when I wrote the code. The problem is that the revision-id is not available to buildbot when the change is received from Launchpad. I even asked the bzr/launchpad developers to provide the revid: so it could be logged. The answer I got was that it is a deliberate feature to hide the revision id :-(
https://bugs.launchpad.net/launchpad/+bug/419057
So I don't think we will get revid in Buildbot. Of course, if we go to git, we will not have this problem anymore, as it always uses a consistent, stable revision identifier.
Oh, I see, thanks. Git - yes, that's not an issue. Bzr - perhaps we could figure out something regardless. Maybe get the revid on the tarbake builder - it needs the tree anyway. Or use fake revids. Or something. It is not a showstopper for this project; we can think about it later, when we finish the research part and get to the integration. Regards, Sergei
Hello everyone,

Well, I have now familiarized myself with the data, and I will start trying to simulate some scenarios. In this email, I will summarize my roadmap for the next few days. The text is quite complicated (and my writing is kind of clumsy too), and it's not necessary to read it in detail; this email is mostly to keep a record of what I'll do. Anyhow, if anybody has questions or comments, I will be happy to address them, as well as receive any feedback. I will start coding the simulations in the next few days. Also, if you want more details of what I'm doing, I can write that up too.

So here's what I'll do for the simulations:

1. Calculating the "Relevancy index" for a test. I have considered two simple options so far (a small illustrative sketch of both follows at the end of this message):

- Exponential decay: The relevancy index of a test is the sum over each failure of exp((FailureTime - CurrentTime)/DecayRate). It decreases exponentially as time passes, and increases if the test fails.
  - DecayRate is
  - i.e. if TestA failed at days 5 and 7, and now is day 9, the RI will be (exp(5-9) + exp(7-9)) = (exp(-4) + exp(-2)).
  - The unit to measure time is just seconds in UNIX_TIMESTAMP.
- Weighted moving average: The relevancy index of a test is R[now] = R[now-1]*alpha + fail*(1-alpha), where fail is 1 if the test failed in this run, and 0 if it did not fail. The value is between 0 and 1. It decreases slowly if a test runs without failing, and it increases slowly if the test fails.
  - 0 < alpha < 1 (initially set at 0.95 for testing).
  - i.e. if TestB failed for the first time in the last run, and again in this run: R[t] = 1*0.95 + 1*0.05 = 1.
  - If TestB ran once more and did not fail, then: R[t+1] = 1*0.95 + 0*0.05 = 0.95.
  - The advantage of this method is that it doesn't have to look at the whole history every time it's calculated (unlike the exponential decay method).
  - Much like the TCP protocol (http://www.cl.cam.ac.uk/~jac22/books/mm/book/node153.html).

Regarding the Relevancy Index, it can be calculated by grouping test results in many ways: roughly, using test_name+variation, or more granularly by also including branch and platform. I'll add some thoughts regarding these options at the bottom of the email.

2. To run the simulation, I'll gather data from the first few thousand test_run entries, and then start simulating results. Here's what I'll do:

1. Gather data from the first few thousand test_run entries (e.g. 4 thousand).
2. After N thousand test_runs, I'll go through the test_run entries one by one, and using the data gathered to that point, I will select 'running sets' of 100 test suites to run on each test_run entry. (The number can be adjusted.)
3. If in this test_run entry the list of failed tests contains tests that are NOT part of the running set, the failure will be ignored, and so the information of this failure will be lost (not used as part of the relevancy index). (See Comment 2)
4. If the set of failed tests in the test_run entry intersects with the running_set, this is better recall. This information will be used to continue calculating the relevancy index.

According to the results obtained from the simulations, we can adjust the algorithm (e.g. to consider the relevancy index by platform and branch, etc.)

Comments about the relevancy index:

- The methods to calculate the relevancy index are very simple. There are some other useful metrics that could be incorporated:
- Time since last run. With the current methods, if a test completely stops running, it only becomes less relevant with time, and so even if it could expose defects, it doesn't get to run because its relevancy index is just going down. Incorporating a function that increases the relevancy index as the time since the last run increases can help solve this issue. I believe this measure will be useful.
- Correlation between test failures. If two tests tend to fail together, is it better to just run one of them? Incorporating this measure seems difficult, but it is on the table, in case we should consider it.
- As you might have seen, I decided not to consider any data concerned with code changes. I'll work like this and see if the results are satisfactory.

Comments regarding buildbot infrastructure: These comments are out of the scope of this project, but they would be very desirable features for the buildbot infrastructure.

- Unfortunately, given the data available in the database, it is NOT possible to know which tests ran on each test_run. This information would be very useful, as it would help estimate the exact failure rate of a test. I didn't look into the code, but it seems that class MtrLogObserver (http://buildbot.sourcearchive.com/documentation/0.8.3p1-1/mtrlogobserver_8py_source.html) contains most of the infrastructure necessary to just add one or two more tables (test_suite and test_suite_test_run), some code, and start keeping track of this information.
- Another problem with the data available in the database is that it is not possible to know how many test suites exist. It is only possible to estimate how many different test suites have failed. This would also be helpful information.
- Actually, this information would be useful not only for this project, but in general for the book-keeping of the development of MariaDB.

Thanks to all,
Pablo
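As referenced above, here is a minimal Python sketch of the two candidate relevancy-index calculations. The alpha value of 0.95 is the one from the email; the DecayRate value and the function names are illustrative assumptions only:

    import math

    DECAY_RATE = 86400.0   # assumption: one day, with times given as UNIX_TIMESTAMP seconds
    ALPHA = 0.95           # weight of the previous value in the weighted moving average

    def relevancy_exponential_decay(failure_times, current_time):
        # Sum over each past failure of exp((FailureTime - CurrentTime)/DecayRate).
        return sum(math.exp((ft - current_time) / DECAY_RATE)
                   for ft in failure_times if ft <= current_time)

    def relevancy_weighted_average(previous, failed_now):
        # R[now] = R[now-1]*alpha + fail*(1-alpha), with fail = 1 on failure, 0 otherwise.
        return previous * ALPHA + (1.0 if failed_now else 0.0) * (1.0 - ALPHA)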
Hi, Pablo! On May 07, Pablo Estrada wrote:
So here's what I'll do for the simulations:
*1. Calculating the: "Relevancy index"* for a test, I have considered two simple options so far:
- *Exponential decay*: The relevancy index of a test is the *sum over each failure* of( *exp((FailureTime - CurrentTime)/DecayRate))*. It decreases exponentially as time passes, and increases if the test fails. - DecayRate is - i.e. If TestA failed at days 5 and 7, and now is day 9, RI will be (exp(5-9)+exp(7-9)) = (exp(-4)+exp(-2)). - The unit to measure time is just seconds in UNIX_TIMESTAMP - *Weighted moving average*: The relevancy index of a test is: *R[now] = R[now-1]*alpha + fail*(1-alpha)*, where fail is 1 if the test failed in this run, and 0 if it did not fail. The value is between 1 and 0. It decreases slowly if a test runs without failing, and it increases slowly if the test fails. - 0 < alpha < 1 (Initially set at 0.95 for testing). - i.e. If TestB failed for the first time in the last run, and again in this run: R[t] = 1*0.95 + 1*0.05 = 1 - If test B ran once more and did not fail, then: R[t+1] = 1*0.95 + 0*0.05 = 0.95 - The *advantage of this method* is that it doesn't have to look at the whole history every time it's calculated (unlike the exponential decay method)
You don't need to look at the whole history for the exponential decay. Because it is

    exp((FailureTime - CurrentTime)/DecayRate)

you simply have

    R[t]   = exp(FailureTime/DecayRate) / exp(t/DecayRate)
    R[t+1] = R[t] / exp(1/DecayRate)     (if there was no failure)
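A quick numeric check of this incremental form, as a sketch with made-up failure times and decay rate, assuming unit time steps:

    import math

    DECAY_RATE = 5.0          # illustrative decay constant, in the same units as t
    failure_times = {5, 7}    # times at which the test failed (made-up example)

    def decay_sum(t):
        # Full recomputation: sum over past failures of exp((FailureTime - t)/DecayRate).
        return sum(math.exp((ft - t) / DECAY_RATE) for ft in failure_times if ft <= t)

    # Incremental form: one multiplication per time step, no history needed.
    r = 0.0
    for t in range(10):
        r /= math.exp(1.0 / DECAY_RATE)   # R[t] = R[t-1] / exp(1/DecayRate)
        if t in failure_times:
            r += 1.0                      # a failure at time t contributes exp(0) = 1
        assert abs(r - decay_sum(t)) < 1e-9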
- Much like TCP protocol (http://www.cl.cam.ac.uk/~jac22/books/mm/book/node153.html)
Regarding the *Relevancy Index*, it can be calculated grouping test results in many ways: *Roughly* using test_name+variation, or *more granularly* by *including* *branch* and *platform*. I'll add some thoughts regarding these options at the bottom of the email.
I've tested these options earlier, you may want to try them all too and see which one delivers better results.
*2. *To* run the simulation*, I'll gather data from the first few thousands of test_run entries, and then start simulating results. Here's what I'll do:
1. *Gather data *first few thousands of test_run entries (i.e. 4 thousand) 2. After N thousand test_runs, I'll go through the test_run entries *one by one*, and using the data gathered to that point, I will select '*running sets*' of *100* *test suites* to run on each test_run entry. (The number can be adjusted)
Absolutely, it should be. This is the main parameter we can tune, after all. The larger your running set is, the better the recall will be. Maybe not now, but later it would be very useful to see these graphs: recall as a function of the running set size. It's important to know whether by increasing the running set by 10% we get a 1% recall increase or a 70% recall increase (as you've seen, there's a region where recall increases very fast as the running set grows).
3. If in this *test_run* entry, the list of *failed tests* contains tests that are *NOT part* of the *running set*, the failure will be ignored, and so the information of this failure will be lost (not used as part of the relevancy index). *(See Comment 2)* 4. If the set of *failed tests *in the *test_run* entry intersect with the *running_set*, this is better *recall*. This information will be used to continue calculating the *relevancy index*.
Could you explain the terminology you're using? What is a "test suite" and what is a "test run"? How will you calculate the "recall"?
According to the results obtained from the simulations, we can adjust the algorithm (i.e. to consider *relevancy index by* *platform* and *branch*, etc.)
Comments about the *relevancy index:*
- The methods to calculate the relevancy index are very simple. There are some other useful metrics that could be incorporated:
- Time since last run. With the current methods, if a test completely stops running, it only becomes less relevant with time, and so even if it could expose defects, it doesn't get to run because its relevancy index is just going down. Incorporating a function that increases the relevancy index as the time since the last run increases can help solve this issue. I believe this measure will be useful.
Right. I will not comment on this measure now. But I absolutely agree that this is an issue that must be solved.
- *Correlation between test failures*. If two tests tend to fail together, is it better to just run one of them? Incorporating this measure seems difficult, but it is on the table, in case we should consider it.
Agree. Taking correlations into account looks very promising, but it does seem to be difficult :)
- As you might have seen, I decided to not consider any data concerned with *code changes*. I'll work like this and see if the results are satisfactory.
Right!
Comments regarding *buildbot infrasturcture:* These comments are out of the scope of this project, but it would be very desirable features for the buildbot infrastructure.
- Unfortunately, given the data available in the database, it is NOT possible to know *which tests ran* on each *test_run*. This information would be very useful, as it would help estimate the *exact failure rate*of a test. I didn't look into the code, but it seems that *class MtrLogObserver*(http://buildbot.sourcearchive.com/documentation/0.8.3p1-1/mtrlogobserver_8py...) contains most of the infrastructure necessary to just add one or two more tables ( *test_suite*, and *test_suite_test_run*), some code, and start keeping track of this information. - Another problem with the data available in the database is that it is not possible to know *how many test suites exist*. It is only possible to estimate *how many different test suites have failed*. This would also be helpful information. - Actually, this information would be useful not only for this project, but in general for book-keeping of the development of MariaDB.
Regards, Sergei
Hello Sergei and all,

First of all, I'll explain quickly the terms that I was using:

- test_suite, test suite, test case - When I say test suite or test case, I am referring to a single test file, for instance 'pbxt.group_min_max'. These are the ones that fail, and whose failures we want to attempt to predict.
- test_run, test run - When I use this term, I refer to an entry in the test_run table of the database. A test run is a set of test_suites that run together at a certain time.

I now have a basic script in place to do the simulations. I have tried to keep the code clear, and I will upload a repository to github soon. I have already run simulations on the data. The simulations used 2000 test_runs as training data, and then attempted to predict behavior on the following 3000 test_runs. Of course, maybe a wider spectrum of data might be needed to truly assess the algorithm.

I used four different ways to calculate a 'relevancy index' for a test:

1. Keep a relevancy index by test case
2. Keep a relevancy index by test case by platform
3. Keep a relevancy index by test case by branch
4. Keep a relevancy index by test case by branch by platform (mixed)

I graphed the results. The graph is attached. As can be seen from the graph, the platform and the mixed models proved to be the best for recall. I feel the results were quite similar to what Sergei encountered.

I have not run the tests on a larger set of data (the data dump that I have available contains 200,000 test_runs, so in theory I could test the algorithm with all this data)... I feel that I want to consider a couple of things before going on to big testing.

I feel that there is a bit of a potential fallacy in the model that I'm following. Here's why: the problem that I find in the model is that we don't know a priori when a test will fail for the first time. Strictly speaking, in the model, if a test doesn't fail for the first time, it never starts running at all. In the implementation that I made, I am using the first failure of each test to start giving it a relevancy index (so the test would have to fail before it even qualifies to run). This results in a really high recall rate, because it is natural that if a test fails once, it might fail again pretty soon after; so although we might have missed the first failure, we still consider that we didn't miss it, and based on it we will catch the two or three failures that come right after. This inflates the recall rate of 'subsequent' failures, but it is not very helpful when trying to catch failures that are not part of a trend... I feel this is not realistic.

Here are changes that I'd like to incorporate into the model:

1. The failure rate should stay, and should still be measured with exponential decay or weighted average.
2. Include a new measure that increases relevancy: time since last run. The relevancy index should have a component that makes a test more relevant the longer it spends not running.
   1. A problem with this is that test suites that might have stopped being used will stay and compete for resources, although in reality they would not be relevant anymore.
3. Include also correlation. I still don't have a great idea of how correlation will be considered, but it's something like this (a small sketch follows at the end of this message):
   1. The data contains the list of test_runs where each test_suite has failed. If two test suites have failed together a certain percentage of times (>30%?), then when test A fails, the relevancy index of test B also goes up... and when test A runs without failing, the relevancy index of test B goes down too.
   2. Using only the times that tests fail together seems like a good heuristic, without having to calculate the total correlation of all the history of all the combinations of tests.

If these measures were to be incorporated, a couple of changes would also have to be considered:

1. Failures that are not spotted on a test_run might be able to be spotted on the next two or three or N test_runs? What do you think?
2. Considering these measures, recall will probably be negatively affected, but I feel that the model would be more realistic.

Any input on my new suggestions? If all seems okay, I will proceed to try to implement these. Also, I will soon upload the information so far to github. Can I also upload queries made to the database? Or are these private?

Regards
Pablo
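A minimal sketch of the co-failure bookkeeping described in point 3 above; the threshold value and all names are illustrative, not part of the actual Kokiri code:

    from collections import defaultdict
    from itertools import combinations

    CO_FAIL_THRESHOLD = 0.3            # the ">30%?" threshold from point 3

    fail_count = defaultdict(int)      # test -> number of test_runs in which it failed
    co_fail_count = defaultdict(int)   # (testA, testB) -> test_runs in which both failed

    def record_failures(failed_tests):
        # Update the counters from the set of tests that failed in one test_run.
        for t in failed_tests:
            fail_count[t] += 1
        for a, b in combinations(sorted(failed_tests), 2):
            co_fail_count[(a, b)] += 1

    def correlated_with(test):
        # Tests that failed together with `test` in more than 30% of its failures.
        if not fail_count[test]:
            return []
        related = []
        for (a, b), n in co_fail_count.items():
            if test in (a, b) and n / fail_count[test] > CO_FAIL_THRESHOLD:
                related.append(b if a == test else a)
        return related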
Here's the repository with the code I have in place so far. It's pretty simple stuff: https://github.com/pabloem/Kokiri
Hi Pablo,

Thanks for the update, it looks like a great start. Please see some comments inline; I suppose Sergei will provide his own if he has something to add or correct. I marked my most controversial points as "ATTN Sergei:".

On 21.05.2014 8:06, Pablo Estrada wrote:
Hello Sergei and all, First of all, I'll explain quickly the terms that I was using:
- *test_suite, test suite, test case* - When I say test suite or test case, I am referring to a single test file. For instance ' *pbxt.group_min_max*'. They are the ones that fail, and whose failures we want to attempt to predict. - *test_run, test run* - When I use this term, I refer to an entry in the *test_run* table of the database. A test run is a set of *test_suites* that run together at a certain time.
Since your algorithm is expected to be used with MTR-based tests, I don't think it's a good idea to redefine concepts that have a very particular meaning in the MTR universe; it's going to be very confusing for people who use it later. Also, these concepts in MTR are close enough to the formal standard definitions, which is yet another reason to stick with them. A test suite in MTR is a set of test cases under the same directory. In your example, pbxt is a test suite. (Also, all MTR test cases are collectively called the "MTR test suite", but that's irrelevant to your purposes.) A test case in MTR is what is represented by individual X.test/X.result files and, optionally, some other related files. In your example, group_min_max is a test case from the pbxt test suite. I suggest sticking with this terminology, for clarity.
I have in place now a basic script to do the simulations. I have tried to keep the code clear, and I will upload a repository to github soon. I have already run simulations on the data. The simulations used 2000 test_runs as training data, and then attempted to predict behavior on the following 3000 test_runs. Of course, maybe a wider spectrum of data might be needed to truly asses the algorithm.
I used four different ways to calculate a 'relevancy index' for a test:
1. Keep a relevancy index by test case 2. Keep a relevancy index by test case by platform 3. Keep a relevancy index by test case by branch 4. Keep a relevancy index by test case by branch by platform (mixed)
I graphed the results. The graph is attached. As can be seen from the graph, the platform and the mixed model proved to be the best for recall. I feel the results were quite similar to what Sergei encountered.
The results look as expected, which is good. Logically they are probably wrong -- given the nature of the tests, the branch should be a more important factor than the platform. But it's not your algorithm that fails, it's the data that is full of false-positives, which are impossible to separate from the real failures, and which indeed tend to be platform-dependent. But even on an ideal data set the mixed approach should still be most efficient, so it should be okay to use it even if some day we fix all the broken tests and collect reliable data.
I have not run the tests on a larger set of data (the data dump that I have available contains 200,000 test_runs, so in theory I could test the algorithm with all this data)... I feel that I want to consider a couple things before going on to big testing:
I feel that there is a bit of a potential fallacy in the model that I'm following. Here's why: The problem that I find in the model is that we don't know a-priori when a test will fail for the first time. Strictly speaking, in the model, if a test doesn't fail for the first time, it never starts running at all. In the implementation that I made, I am using the first failure of each test to start giving it a relevancy test (so the test would have to fail before it even qualifies to run). This results in a really high recall rate because it is natural that if a test fails once, it might fail pretty soon after, so although we might have missed the first failure, we still consider that we didn't miss it, and based on it we will catch the two or three failures that come right after. This inflates the recall rate of 'subsequent' failures, but it is not very helpful when trying to catch failures that are not part of a trend... I feel this is not realistic.
I don't quite get this part. What does it mean, "missed the first failure"? Missed how/why? *ATTN Sergei:* Anyway, the problem seems to originate from the fact that we do not have a complete list of tests that had ever been run, which you mentioned in the previous email, and which we should look into fixing. If that's done, you won't have to rely on the first failure to start calculating relevancy, you will do it from the first test run.
Here are changes that I'd like to incorporate to the model:
1. The failure rate should stay, and should still be measured with exponential decay or weighted average 2. Include a new measure that increases relevancy: Time since last run. The relevancy index should have a component that makes the test more relevant the longer it spends not running
I agree with the idea, but have doubts about the criteria. I think you should measure not the time, but the number of test runs that happened since the last time the test was run (it would be even better if we could count the number of revisions, but that's probably not easy). The reason is that some branches are very active, while others can be extremely slow. So, with the same time-based coefficient, the relevancy of a test can spike between two consecutive test runs just because they happened a month apart, but will change too slowly on a branch which has a dozen commits a day.
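For illustration, a sketch of this counting approach; the names and the boost formula are hypothetical, not existing code:

    # Staleness measured in test_runs rather than wall-clock time.
    runs_since_last_executed = {}   # test name -> test_runs since it last ran

    def after_test_run(all_known_tests, executed_tests):
        for t in all_known_tests:
            if t in executed_tests:
                runs_since_last_executed[t] = 0
            else:
                runs_since_last_executed[t] = runs_since_last_executed.get(t, 0) + 1

    def staleness_boost(test, weight=0.01):
        # Grows with the number of runs the test has been skipped for, not with
        # elapsed time, so fast and slow branches are treated alike.
        return weight * runs_since_last_executed.get(test, 0)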
1. A problem with this is that *test suites* that might have stopped being used will stay and compete for resources, although in reality they would not be relevant anymore
*ATTN Sergei:* I think in any case we'll have to rely on the fact that your script will choose tests not from the whole universe of tests, but from an initial list that MTR produces for this particular test run. That is, it will go something like this:

- a test run is started in buildbot;
- MTR collects test cases to run, according to the startup parameters, as it always does;
- the list is passed to your script;
- the script filters it according to the algorithm that you developed, keeps only a small portion of the initial list, and passes it back to MTR;
- MTR runs the requested tests.

That is, you do exclusion of tests rather than inclusion. This will solve two problems:

- first test run: when a new test is added, only MTR knows about it, buildbot doesn't; so, when MTR passes to you a test that you know nothing about (and assuming that we do have a list of all executed tests in buildbot), you'll know it's a new test and will act accordingly;
- abandoned tests: MTR just won't pass them to your script, so it won't take them into account.
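A minimal sketch of what such a filtering hook might look like on the script side; the function name and signature are hypothetical, not an existing MTR or buildbot interface:

    def filter_running_set(mtr_candidate_tests, relevancy, running_set_size=100):
        # mtr_candidate_tests: test names MTR collected for this test run.
        # relevancy: dict mapping test name -> current relevancy index.
        # Tests unknown to the history (e.g. newly added files) are always kept.
        new_tests = [t for t in mtr_candidate_tests if t not in relevancy]
        known = sorted((t for t in mtr_candidate_tests if t in relevancy),
                       key=lambda t: relevancy[t], reverse=True)
        keep = new_tests + known
        return keep[:max(running_set_size, len(new_tests))]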
3. Include also correlation. I still don't have a great idea of how correlation will be considered, but it's something like this: 1. The data contains the list of test_runs where each test_suite has failed. If two test suites have failed together a certain percentage of times (>30%?), then when test A fails, the relevancy test of test B also goes up... and when test A runs without failing, the relevancy test of test B goes down too.
We'll need to see how it goes. In real life correlation of this kind does exist, but I'd say much more often related failures happen due to some environmental problems, so the presumed correlation will be fake.
2. Using only the times that tests fail together seems like a good heuristic, without having to calculate the total correlation of all the history of all the combinations of tests.
If these measures were to be incorporated, a couple of changes would also have to be considered:
1. Failures that are* not spotted* *on a test_run* might be *able to be spotted *on the *next* two or three or *N test_runs*? What do you think?
Yes, it can happen and does happen often. It is not intended -- MTR tests are supposed to be deterministic, if there is a reason to fail, they should always fail. Unfortunately, it does not always work like that, and failures, including real ones, can appear sporadically. I don't think we need to put a lot of effort into dealing with it, but certainly a test should not be excluded after just one successful run, there should be some N>1. I'd say 5-10 would be reasonable.
2. Considering these measures, probably *recall* will be *negatively affected*, but I feel that the model would be *more realistic*.
Any input on my new suggestions? If all seems okay, I will proceed on to try to implement these. Also, I will soon upload the information so far to github. Can I also upload queries made to the database? Or are these private?
I saw your other email, but didn't look into the repo yet, will do soon -- I didn't want to delay this email even further. *ATTN Sergei* I don't think there is any private information in the queries (or in the data, for that matter -- it can all be seen via the public web interface anyway). Regards, Elena
Regards Pablo
On Wed, May 7, 2014 at 7:41 PM, Sergei Golubchik <serg@mariadb.org> wrote:
Hi, Pablo!
On May 07, Pablo Estrada wrote:
So here's what I'll do for the simulations:
*1. Calculating the: "Relevancy index"* for a test, I have considered two simple options so far:
- *Exponential decay*: The relevancy index of a test is the *sum over each failure* of( *exp((FailureTime - CurrentTime)/DecayRate))*. It decreases exponentially as time passes, and increases if the test
fails.
- DecayRate is - i.e. If TestA failed at days 5 and 7, and now is day 9, RI will be (exp(5-9)+exp(7-9)) = (exp(-4)+exp(-2)). - The unit to measure time is just seconds in UNIX_TIMESTAMP - *Weighted moving average*: The relevancy index of a test is:
*R[now] =
R[now-1]*alpha + fail*(1-alpha)*, where fail is 1 if the test failed
in
this run, and 0 if it did not fail. The value is between 1 and 0. It decreases slowly if a test runs without failing, and it increases
slowly if
the test fails. - 0 < alpha < 1 (Initially set at 0.95 for testing). - i.e. If TestB failed for the first time in the last run, and
again
in this run: R[t] = 1*0.95 + 1*0.5 = 1 - If test B ran once more and did not fail, then: R[t+1] = 1*0.95 + 0*0.5 = 0.95 - The *advantage of this method* is that it doesn't have to look at the whole history every time it's calculated (unlike the
exponential decay
method)
you don't need to look at the whole history for the exponential decay. Because it is
exp((FailureTime - CurrentTime)/DecayRate)
You simply have
R[t] = exp(FailureTime/DecayRate) / exp(t/DecayRate) R[t+1] = R[t] / exp(1/DecayRate) (if there was no failure)
- Much like TCP protocol (
http://www.cl.cam.ac.uk/~jac22/books/mm/book/node153.html)
Regarding the *Relevancy Index*, it can be calculated grouping test
in many ways: *Roughly* using test_name+variation, or *more granularly* by *including* *branch* and *platform*. I'll add some thoughts regarding
results these
options at the bottom of the email.
I've tested these options earlier, you may want to try them all too and see which one delivers better results.
*2. *To* run the simulation*, I'll gather data from the first few thousands of test_run entries, and then start simulating results. Here's what I'll do:
1. *Gather data *first few thousands of test_run entries (i.e. 4 thousand) 2. After N thousand test_runs, I'll go through the test_run entries *one by one*, and using the data gathered to that point, I will select '*running sets*' of *100* *test suites* to run on each test_run entry. (The number can be adjusted)
Absolutely, it should be. This is the main parameter we can tune, after all. The larger your running set is, the better will be the recall.
May be not now but later, but it would be very useful to see these graphs, recall as a function of the running set size. It's important to know whether by increasing the running set by 10% we get 1% recall increase of 70% recall increase (as you've seen, there's a region when recall increases very fast as the running set grows).
3. If in this *test_run* entry, the list of *failed tests* contains tests that are *NOT part* of the *running set*, the failure will be ignored, and so the information of this failure will be lost (not
used as
part of the relevancy index). *(See Comment 2)* 4. If the set of *failed tests *in the *test_run* entry intersect with the *running_set*, this is better *recall*. This information will be used to continue calculating the *relevancy index*.
Could you explain the terminology you're using? What is a "test suite" and what is a "test run"? How will you calculate the "recall"?
According to the results obtained from the simulations, we can adjust the algorithm (i.e. to consider *relevancy index by* *platform* and *branch*, etc.)
Comments about the *relevancy index:*
- The methods to calculate the relevancy index are very simple. There are some other useful metrics that could be incorporated - *Time since last run. *With the current methods, if a* test*completely *stops running*, it only* becomes less relevant with time*, and so even if it could expose defects, it doesn't get to run because its relevancy index is just going down. Incorporating a function that* increases the relevancy index* as the *time since the last run* *increases* can help solve this issue. I believe this measure will be useful.
Right. I will not comment on this measure now. But I absolutely agree that this is an issue that must be solved.
- *Correlation between test failures.* If two tests tend to fail together, is it better to just run one of them? Incorporating this measure seems difficult, but it is on the table in case we decide to consider it.
Agree. Taking correlations into account looks very promising, but it does seem to be difficult :)
- As you might have seen, I decided to not consider any data concerned with *code changes*. I'll work like this and see if the results are satisfactory.
Right!
Comments regarding *buildbot infrastructure*: These comments are out of the scope of this project, but these would be very desirable features for the buildbot infrastructure.
- Unfortunately, given the data available in the database, it is NOT possible to know *which tests ran* on each *test_run*. This information would be very useful, as it would help estimate the *exact failure rate* of a test. I didn't look into the code, but it seems that *class MtrLogObserver* ( http://buildbot.sourcearchive.com/documentation/0.8.3p1-1/mtrlogobserver_8py... ) contains most of the infrastructure necessary to just add one or two more tables (*test_suite* and *test_suite_test_run*), some code, and start keeping track of this information.
- Another problem with the data available in the database is that it is not possible to know *how many test suites exist*; it is only possible to estimate *how many different test suites have failed*. This would also be helpful information.
- Actually, this information would be useful not only for this project, but in general for book-keeping of the development of MariaDB.
Regards, Sergei
Hi, Elena! On May 22, Elena Stepanova wrote:
*ATTN Sergei:*
I think in any case we'll have to rely on the fact that your script will choose tests not from the whole universe of tests, but from an initial list that MTR produces for this particular test run. That is, it will go something like that: - test run is started in buildbot; - MTR collects test cases to run, according to the startup parameters, as it always does; - the list is passed to your script; - the script filters it according to the algorithm that you developed, keeps only a small portion of the initial list, and passes it back to MTR; - MTR runs the requested tests.
Right, that'd be good in production. Unfortunately, in the historical data we don't have this information. But we have the list of changed files for every commit. Which means that when a new test file is added (or maybe modified), it could be treated as if it had failed recently, and added to the test set. Regards, Sergei
Hi, Pablo! On May 21, Pablo Estrada wrote:
Hello Sergei and all, First of all, I'll explain quickly the terms that I was using:
- *test_suite, test suite, test case* - When I say test suite or test case, I am referring to a single test file. For instance ' *pbxt.group_min_max*'. They are the ones that fail, and whose failures we want to attempt to predict.
may I suggest to distinguish between a test *suite* and a test *case*? The latter is usually one test file, but a suite (for mtr) is a directory with many test files. Like "main", "pbxt", etc.
- *test_run, test run* - When I use this term, I refer to an entry in the *test_run* table of the database. A test run is a set of *test_suites* that run together at a certain time.
I now have in place a basic script to do the simulations. I have tried to keep the code clear, and I will upload a repository to github soon. I have already run simulations on the data. The simulations used 2000 test_runs as training data, and then attempted to predict behavior on the following 3000 test_runs. Of course, a wider spectrum of data might be needed to truly assess the algorithm.
I used four different ways to calculate a 'relevancy index' for a test:
1. Keep a relevancy index by test case 2. Keep a relevancy index by test case by platform 3. Keep a relevancy index by test case by branch 4. Keep a relevancy index by test case by branch by platform (mixed)
I graphed the results. The graph is attached. As can be seen from the graph, the platform and the mixed model proved to be the best for recall. I feel the results were quite similar to what Sergei encountered.
Right.
I have not run the tests on a larger set of data (the data dump that I have available contains 200,000 test_runs, so in theory I could test the algorithm with all this data)... I feel that I want to consider a couple of things before going on to larger-scale testing:
I feel that there is a potential fallacy in the model that I'm following. Here's why: the problem I find in the model is that we don't know a priori when a test will fail for the first time. Strictly speaking, in the model, until a test fails for the first time it never starts running at all. In the implementation that I made, I am using the first failure of each test to start giving it a relevancy index (so the test has to fail before it even qualifies to run). This results in a really high recall rate, because it is natural that if a test fails once, it might fail again pretty soon after; so although we actually missed the first failure, we still count it as not missed, and based on it we catch the two or three failures that come right after. This inflates the recall rate for 'subsequent' failures, but it is not very helpful when trying to catch failures that are not part of a trend... I feel this is not realistic.
Here are changes that I'd like to incorporate to the model:
1. The failure rate should stay, and should still be measured with exponential decay or a weighted average.
2. Include a new measure that increases relevancy: time since last run. The relevancy index should have a component that makes a test more relevant the longer it spends not running.
   1. A problem with this is that *test suites* that have stopped being used will stay and compete for resources, although in reality they are not relevant anymore.
3. Include correlation as well. I still don't have a great idea of how correlation will be considered, but it's something like this (a rough sketch of the bookkeeping appears below, after Sergei's reply):
   1. The data contains the list of test_runs where each test_suite has failed. If two test suites have failed together a certain percentage of times (>30%?), then when test A fails, the relevancy index of test B also goes up... and when test A runs without failing, the relevancy index of test B goes down too.
   2. Using only the times that tests fail together seems like a good heuristic, without having to calculate the full correlation over the whole history of all combinations of tests.
If these measures were to be incorporated, a couple of changes would also have to be considered:
1. Failures that are *not spotted on a test_run* might still be spotted on the *next* two or three or *N test_runs*? What do you think?
2. Considering these measures, *recall* will probably be *negatively affected*, but I feel that the model would be *more realistic*.
I don't think you should introduce artificial limitations that make the recall worse, because they "look realistic". You can do it realistic instead, not look realistic - simply pretend that your code is already running on buildbot and limits the number of tests to run. So, if the test didn't run - you don't have any failure information about it. And then you only need to do what improves recall, nothing else :) (of course, to calculate the recall you need to use all failures, even for tests that you didn't run)
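To make the co-failure idea in point 3 above concrete, one possible shape for the bookkeeping could be the following (the 30% threshold and the exact correlation measure are assumptions, not a settled design):

    from collections import Counter
    from itertools import combinations

    pair_failures = Counter()   # (test_a, test_b) -> times they failed together
    test_failures = Counter()   # test -> total failures seen so far

    def record_failures(failed_tests):
        for test in failed_tests:
            test_failures[test] += 1
        for a, b in combinations(sorted(failed_tests), 2):
            pair_failures[(a, b)] += 1

    def correlated(test_a, test_b, threshold=0.3):
        # rough co-failure rate: failures together / runs where either one failed
        key = tuple(sorted((test_a, test_b)))
        together = pair_failures[key]
        either = test_failures[test_a] + test_failures[test_b] - together
        return either > 0 and together / float(either) > threshold

When test A fails, the relevancy index of every test B for which correlated(A, B) holds could then be nudged up (and down when A passes), as described in the list above.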
Any input on my new suggestions? If all seems okay, I will proceed on to try to implement these. Also, I will soon upload the information so far to github. Can I also upload queries made to the database? Or are these private?
You mean the data tables? I think they're all public, they don't have anything one couldn't get from http://buildbot.askmonty.org/ Regards, Sergei
Hello everyone: I'm answering to both your emails here (Elena first, then Sergei) On Thu, May 22, 2014 at 4:12 PM, Elena Stepanova <elenst@montyprogram.com> wrote:
I suggest to stay with the terminology, for clarity.
You are right. I'll stick to MTR terminology.
But even on an ideal data set the mixed approach should still be the most efficient, so it should be okay to use it even if some day we fix all the broken tests and collect reliable data.
Yes, I agree. Keeping the Mixed (Branch/Platform) approach.
2. Include a new measure that increases relevancy: Time since last run.
The relevancy index should have a component that makes the test more relevant the longer it spends not running
I agree with the idea, but have doubts about the criteria. I think you should measure not the time, but the number of test runs that happened since the last time the test was run (it would be even better if we could count the number of revisions, but that's probably not easy). The reason is that some branches are very active, while others can be extremely slow. So, with the same time-based coefficient the relevancy of a test can spike between two consecutive test runs just because they happened a month apart, but will change too slowly on a branch which has a dozen commits a day.
Yes. I agree with you on this. This is what I had in mind, but I couldn't express it properly on my email : )
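A minimal sketch of that run-counting variant (per-branch counters; the bonus weight and cap are invented purely for illustration):

    from collections import defaultdict

    runs_since_last = defaultdict(int)   # (branch, test) -> runs since it last ran

    def note_test_run(branch, candidate_tests, running_set):
        for test in candidate_tests:
            key = (branch, test)
            if test in running_set:
                runs_since_last[key] = 0      # the test ran: reset the counter
            else:
                runs_since_last[key] += 1     # skipped in this test run again

    def staleness_bonus(branch, test, weight=0.01, cap=0.5):
        # relevancy bonus that grows with the number of skipped runs
        return min(cap, weight * runs_since_last[(branch, test)])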
3. Include also correlation. I still don't have a great idea of how
correlation will be considered, but it's something like this: 1. The data contains the list of test_runs where each test_suite has
failed. If two test suites have failed together a certain percentage of times (>30%?), then when test A fails, the relevancy test of test B also goes up... and when test A runs without failing, the relevancy test of test B goes down too.
We'll need to see how it goes. In real life correlation of this kind does exist, but I'd say much more often related failures happen due to some environmental problems, so the presumed correlation will be fake.
Good point. Let's see how the numbers play out, but I think you are right that this will end up with a severe bias due to test blowups and failures due to environmental problems.
I think in any case we'll have to rely on the fact that your script will choose tests not from the whole universe of tests, but from an initial list that MTR produces for this particular test run. That is, it will go something like that: - test run is started in buildbot; - MTR collects test cases to run, according to the startup parameters, as it always does; - the list is passed to your script; - the script filters it according to the algorithm that you developed, keeps only a small portion of the initial list, and passes it back to MTR; - MTR runs the requested tests.
That is, you do exclusion of tests rather than inclusion.
This will solve two problems: - first test run: when a new test is added, only MTR knows about it, buildbot doesn't; so, when MTR passes to you a test that you know nothing about (and assuming that we do have a list of all executed tests in buildbot), you'll know it's a new test and will act accordingly; - abandoned tests: MTR just won't pass them to your script, so it won't take them into account.
Great. This is good to know, to have a more precise idea of how the project would fit into the MariaDB development.
On Thu, May 22, 2014 at 5:39 PM, Sergei Golubchik <serg@mariadb.org> wrote:
- *test_suite, test suite, test case* - When I say test suite or test case, I am referring to a single test file. For instance ' *pbxt.group_min_max*'. They are the ones that fail, and whose failures we want to attempt to predict.
may I suggest to distinguish between a test *suite* and a test *case*? The latter is usually one test file, but a suite (for mtr) is a directory with many test files. Like "main", "pbxt", etc.
Right. I didn't define this properly. Let's keep the definitions exactly from MTR, as Elena suggested.
I don't think you should introduce artificial limitations that make the recall worse, because they "look realistic".
You can do it realistic instead, not look realistic - simply pretend that your code is already running on buildbot and limits the number of tests to run. So, if the test didn't run - you don't have any failure information about it.
And then you only need to do what improves recall, nothing else :)
(of course, to calculate the recall you need to use all failures, even for tests that you didn't run)
Yes, my code *already works this way*. It doesn't consider failure information from tests that were not supposed to run. The graphs that I sent are from scripts that ran like this. Of course, the recall is just the number of spotted failures from the 100% of known failures : ) Anyway, with all this, I will get to work on adapting the simulation a little bit: - Time since last run will also affect the relevancy of a test - I will try to use the list of changed files from commits to make sure new tests start running right away Any other comments are welcome. Regards Pablo
Hi, Pablo! On May 25, Pablo Estrada wrote:
On Thu, May 22, 2014 at 5:39 PM, Sergei Golubchik <serg@mariadb.org> wrote:
I don't think you should introduce artificial limitations that make the recall worse, because they "look realistic".
You can do it realistic instead, not look realistic - simply pretend that your code is already running on buildbot and limits the number of tests to run. So, if the test didn't run - you don't have any failure information about it.
And then you only need to do what improves recall, nothing else :)
(of course, to calculate the recall you need to use all failures, even for tests that you didn't run)
Yes, my code *already works this way*. It doesn't consider failure information from tests that were not supposed to run. The graphs that I sent are from scripts that ran like this.
Good. I hoped that'll be the case (but didn't check your scripts on github yet, sorry).
Of course, the recall is just the number of spotted failures from the 100% of known failures : )
Anyway, with all this, I will get to work on adapting the simulation a little bit:
- Time since last run will also affect the relevancy of a test - I will try to use the list of changed files from commits to make sure new tests start running right away
Any other comments are welcome.
Getting back to your "potential fallacy" of taking tests into account only when they fail for the first time: I agree, in real life we cannot do that. Instead we start from a complete list of tests, that is known in advance. And you don't have it, unfortunately. An option would be to create a complete list of all tests that have ever failed (and perhaps remove tests that were added in some revision present in the history), and use that as a "starting set" of tests. Alternatively, we can generate a list of all tests currently present in the 10.0 tree - everything that you have in the history tables should be a subset of that. Regards, Sergei
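For illustration, seeding the relevancy index from every test that has ever failed could look roughly like this (the column name and the neutral initial score are assumptions):

    def starting_set(failure_rows, initial_score=0.5):
        # Give every historically-failing test a neutral score up front, so a
        # test does not have to fail first before it qualifies to run at all.
        relevancy = {}
        for row in failure_rows:
            relevancy.setdefault(row["test_name"], initial_score)
        return relevancy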
Hello everyone, Here's a small report on the news that I have so far:
1. I had a slow couple of weeks because of a quick holiday that I took. I will make up for that.
2. I added the metric that considers the number of test_runs since a test_case ran for the last time. I graphed it, and it does not affect the results much at all. I still think this is useful to uncover hidden bugs that might lurk in the code for a long time, but testing this condition is difficult with our data. I would like to keep this measure, especially since it doesn't seem to affect results negatively. Opinions?
3. I liked Sergei's idea about using the changes on the test files to calculate the relevancy index. If a test has been changed recently, its relevancy index should be high. This is also more realistic, and uses information that is easy for us to figure out.
   - I am trying to match the change_files table or the mysql-test directory with the test_failure table. I was confused about the names of tests and test suites, but I am making progress on it. Once I am able to match at least 90% of the test_names in test_failure with the filenames in the change_files table, I will incorporate this data into the code and see how it works out.
   - *Question*: Looking at the change_files table, there are files that have been *ADDED* several times. Why would this be? Maybe when a new branch is created, all files are ADDED to it? Any ideas? : ) (If no one knows, I'll figure it out, but maybe you do know ; ))
4. I uploaded inline comments for my code last week, let me know if it's clear enough. You can start with run_basic_simulations.py, where the most important functions are called... and after that, you can dive into basic_simulator.py, where the simulation is actually done. The repository is a bit messy, I admit. I'll clean it up in the following commits.
This is all I have to report for now. Any advice on the way I'm proceeding is welcome : ) Have a nice week, everyone. Pablo
On Sun, May 25, 2014 at 11:43 PM, Sergei Golubchik <serg@mariadb.org> wrote:
Hi Pablo, Thanks for the update. Some comments inline. On 02.06.2014 18:44, Pablo Estrada wrote:
Hello everyone, Here's a small report on the news that I have so far:
1. I had a slow couple of weeks because of a quick holiday that I took. I will make up for that. 2. I added the metric that considers number of test_runs since a test_case ran for the last time. I graphed it, and it does not affect the results much at all. -I still think this is useful to uncover hidden bugs that might lurk in the code for a long time; but testing this condition is difficult with our data. I would like to keep this measure, specially since it doesn't seem to affect results negatively. Opinions? 3. I liked Sergei's idea about using the changes on the test files to calculate the relevancy index. If a test has been changed recently, its relevancy index should be high. This is also more realistic, and uses information that it's easy for us to figure out.
Right, change should be taken into account, but test files cannot be completely relied upon. The problem is that many MTR tests have a more complicated structure than just test/result pair. They call various other files from inside; sometimes a test file is not much more than setting a few variables and then calling some common logic. Thus, the test file itself never changes, while the logic might. I think it makes more sense to use *result* files. A result file will almost always reflect a change to the logic, it should be more accurate, although not perfect either. If a change was to fix the test itself, e.g. to get rid of a sporadic timeout and such, it's possible that the result file stays the same, while a test or an included file changes.
- I am trying to match the change_files table or the mysql-test directory with the test_failure table. I was confused about the name of tests and test suites, but I am making progress on it. Once I am able to match at least 90% of the test_names in test_failure with the filename in the change_files table, I will incorporate this data into the code and see how it works out.
I recommend to read these pages: https://mariadb.com/kb/en/mysql-test-overview/ https://mariadb.com/kb/en/mysql-test-auxiliary-files/ They might help to resolve some confusion. And of course there is a good "official" MTR manual which you've probably seen: http://dev.mysql.com/doc/mysqltest/2.0/en/index.html
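As a rough illustration of the matching itself, a changed file path (as recorded in change_files) might be mapped back to the suite.testname form used in test_failure like this; the path patterns follow the usual mysql-test layout, and special cases (included files, overlays, engine-specific suites under storage/) are deliberately not handled:

    import re

    _PATTERNS = [
        re.compile(r"mysql-test/suite/(?P<suite>[^/]+)/r/(?P<test>[^/]+)\.result$"),
        re.compile(r"mysql-test/suite/(?P<suite>[^/]+)/t/(?P<test>[^/]+)\.test$"),
        re.compile(r"mysql-test/r/(?P<test>[^/]+)\.result$"),   # "main" suite
        re.compile(r"mysql-test/t/(?P<test>[^/]+)\.test$"),
    ]

    def changed_file_to_test(path):
        for pattern in _PATTERNS:
            m = pattern.search(path)
            if m:
                suite = m.groupdict().get("suite", "main")
                return "%s.%s" % (suite, m.group("test"))
        return None   # not a test or result file we can attribute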
- *Question*: Looking at the change_files table, there are files that have been *ADDED* several times. Why would this be? Maybe when a new branch is created, all files are ADDED to it? Any ideas? : ) (If no one knows, I'll figure it out, but maybe you do know ; ))
My guess is that the most common reason is multiple merges. A file is added to, let's say, 5.1; then the change, along with others, is merged into 5.2 and hence the file is added again; then into 5.3, etc. Besides, there might be numerous custom development trees registered on buildbot that the change is also merged into, hence the file is added again. I'm sure there are more reasons.
4. I uploaded inline comments for my code last week, let me know if it's clear enough. You can start by run_basic_simulations.py, where the most important functions are called... and after, you can dive into basic_simulator.py, where the simulation is actually done. The repository is a bit messy, I admit. I'll clean it up in the following commits.
Thanks for the hints. I started looking at your code, but haven't made much progress yet. I will follow the path you recommended. Regards, /E
Hi, Pablo! On Jun 02, Pablo Estrada wrote:
Hello everyone, Here's a small report on the news that I have so far:
1. I had a slow couple of weeks because of a quick holiday that I took. I will make up for that. 2. I added the metric that considers number of test_runs since a test_case ran for the last time. I graphed it, and it does not affect the results much at all. -I still think this is useful to uncover hidden bugs that might lurk in the code for a long time; but testing this condition is difficult with our data. I would like to keep this measure, specially since it doesn't seem to affect results negatively. Opinions?
What do you mean "added the metric"? You started to use it when selecting what tests to run and found that it doesn't affect results much? Well, it's your project, you can keep any measure you want. But please mark clearly (in comments or whatever) what factors affect results and what don't. It would be very useful to be able to see the simplest possible model that still delivers reasonably good results. Even if we'll decide to use something more complicated at the end.
3. I liked Sergei's idea about using the changes on the test files to calculate the relevancy index. If a test has been changed recently, its relevancy index should be high. This is also more realistic, and uses information that it's easy for us to figure out.
Same as above, basically. I'd prefer to use not the model that simply "looks realistic", but the one that makes the best predictions. You can use whatever criteria you prefer, but if taking into account changed tests will not improve the results, I'd like it to be clearly documented or visible in the code.
- I am trying to match the change_files table or the mysql-test directory with the test_failure table. I was confused about the name of tests and test suites, but I am making progress on it. Once I am able to match at least 90% of the test_names in test_failure with the filename in the change_files table, I will incorporate this data into the code and see how it works out. - *Question*: Looking at the change_files table, there are files that have been *ADDED* several times. Why would this be? Maybe when a new branch is created, all files are ADDED to it? Any ideas? : ) (If no one knows, I'll figure it out, but maybe you do know ; ))
In most cases it shows how a revision is travelling across different branches. It was committed somewhere (adding files there), then merged into another branch (adding files there), then into yet another, and so on. In some cases it could be committed, uncommitted, committed again, repeat until satisfied. Or simply rebased. I personally use this quite often to test my changes in buildbot (of course, it's only safe to do in a personal tree, one which is not shared with others). Regards, Sergei
Hi, Pablo! To add to my last reply: On Jun 03, Sergei Golubchik wrote:
Well, it's your project, you can keep any measure you want. But please mark clearly (in comments or whatever) what factors affect results and what don't.
It would be very useful to be able to see the simplest possible model that still delivers reasonably good results. Even if we'll decide to use something more complicated at the end. ... Same as above, basically. I'd prefer to use not the model that simply "looks realistic", but the one that makes best predictions.
You can use whatever criteria you prefer, but if taking into account changed tests will not improve the results, I'd like it to be clearly documented or visible in the code.
Alternatively, you can deliver (when this GSoC project ends) two versions of the script - one with anything you want in it, and the second one as simple as possible. For example, the only really important metric is "recall as a function of total testing time". We want to reach as high a recall as possible in the shortest possible testing time, right? But according to this criterion one needs to take into account individual test execution times (it's better to run 5 fast tests than 1 slow test) and individual builder speed factors (better to run 10 tests on a fast builder than 5 tests on a slow builder). And in my tests it turned out that these complications don't improve results much. So, while they make perfect sense and make the model more realistic, the simple model can perfectly survive without them and use the "recall vs. number of tests" metric. Regards, Sergei
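If the time-based view is ever needed, the selection step might look roughly like this (per-test durations and the builder speed factor are assumed inputs, not data that exists in the dump today):

    def select_by_time_budget(ranked_tests, durations, builder_speed, budget_secs):
        # Take tests in relevancy order until the estimated wall time is spent.
        chosen, spent = [], 0.0
        for test in ranked_tests:
            cost = durations.get(test, 10.0) * builder_speed   # estimated seconds
            if spent + cost > budget_secs:
                break
            chosen.append(test)
            spent += cost
        return chosen

With a fixed cutoff instead of a time budget this degenerates to the simple "recall vs. number of tests" model described above.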
Hello everyone, Here are the highlights of this week:
1. I am keeping the script customizable, to make simple or complex runs as necessary. I am also documenting the options.
2. I have added the logic to consider test file changes into the mix. Finding first failures has improved significantly, and this is useful.
3. I am now working on considering correlation between test failures. Adding this should be quite quick, and I should be testing it by the end of the week.
4. I feel quite okay with the results so far. I have been testing with the first 5000 test runs (2000 for training, 3000 for simulation). Once I finish implementing the correlation factor, I will run tests on a wider range of data (~20,000 or ~50,000 test runs). If the results look good, we can review, make any changes/suggestions, and if everyone is okay with it, discuss details about the implementation while I finish the last touches on the simulation script. How does that seem?
Regards, and best for everyone. Pablo
On Tue, Jun 3, 2014 at 7:42 AM, Sergei Golubchik <serg@mariadb.org> wrote: