Hi, Pablo! On May 21, Pablo Estrada wrote:
Hello Sergei and all, First of all, I'll quickly explain the terms that I have been using:
- *test_suite, test suite, test case* - When I say test suite or test case, I am referring to a single test file, for instance *pbxt.group_min_max*. These are the items that fail, and whose failures we want to attempt to predict.
May I suggest distinguishing between a test *suite* and a test *case*? The latter is usually one test file, while a suite (for mtr) is a directory with many test files, like "main", "pbxt", etc.
- *test_run, test run* - When I use this term, I refer to an entry in the *test_run* table of the database. A test run is a set of *test_suites* that run together at a certain time.
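For concreteness, this is roughly how I represent the two concepts in my script (the field names below are my own shorthand for illustration, not the actual columns of the dump):

    from collections import namedtuple

    # My own shorthand for the two concepts -- not the real schema of the dump.
    TestRun = namedtuple('TestRun', ['run_id', 'branch', 'platform', 'failed_cases'])
    TestCase = namedtuple('TestCase', ['suite', 'name'])   # e.g. TestCase('pbxt', 'group_min_max')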
I now have a basic script in place to do the simulations. I have tried to keep the code clear, and I will upload a repository to github soon. I have already run simulations on the data. The simulations used 2000 test_runs as training data, and then attempted to predict behavior on the following 3000 test_runs. Of course, a wider spectrum of data might be needed to truly assess the algorithm.
I used four different ways to calculate a 'relevancy index' for a test:
1. Keep a relevancy index by test case
2. Keep a relevancy index by test case, by platform
3. Keep a relevancy index by test case, by branch
4. Keep a relevancy index by test case, by branch and by platform (mixed)
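To make the four variants concrete, the bookkeeping in my script boils down to choosing a different key for the same exponentially decayed counter (a simplified sketch; the decay constant here is an arbitrary placeholder):

    DECAY = 0.95     # arbitrary placeholder; the real value needs tuning

    relevancy = {}   # key -> exponentially decayed failure score

    def key_for(test_case, branch, platform, mode):
        """Pick the dictionary key for one of the four strategies."""
        if mode == 'by_test':
            return test_case
        if mode == 'by_platform':
            return (test_case, platform)
        if mode == 'by_branch':
            return (test_case, branch)
        if mode == 'mixed':
            return (test_case, branch, platform)
        raise ValueError(mode)

    def update(test_case, branch, platform, failed, mode):
        """Decay the old score and add 1 whenever the test case failed in this run."""
        k = key_for(test_case, branch, platform, mode)
        relevancy[k] = relevancy.get(k, 0.0) * DECAY + (1.0 if failed else 0.0)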
I graphed the results; the graph is attached. As can be seen from the graph, the per-platform and the mixed models proved to be the best for recall. I feel the results were quite similar to what Sergei encountered.
Right.
I have not run the tests on a larger set of data (the data dump I have available contains 200,000 test_runs, so in theory I could test the algorithm on all of it)... I want to consider a couple of things before moving on to large-scale testing:
I feel that there is a potential fallacy in the model that I'm following. Here's why: the problem is that we don't know a priori when a test will fail for the first time. Strictly speaking, in the model a test never starts running at all until it has already failed once. In my implementation, I use the first failure of each test to start giving it a relevancy index (so a test has to fail before it even qualifies to run). This results in a really high recall rate: it is natural that if a test fails once it may well fail again soon, so although we actually missed the first failure, we still count it as not missed, and based on it we catch the two or three failures that come right after. This inflates the recall rate for 'subsequent' failures, but it is not very helpful when trying to catch failures that are not part of a trend... I feel this is not realistic.
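To show where the inflation comes from, here is the gist of my current loop, heavily simplified (the names and the decay constant are mine, for illustration only):

    def simulate(runs, run_limit=100):
        """Simplified version of my current loop.
        `runs` is a list of (run_id, set_of_failed_test_cases) pairs."""
        relevancy = {}            # test case -> decayed failure score
        caught = missed = 0
        for run_id, failed_cases in runs:
            # Only test cases that have already failed at least once have a score,
            # so a first-time failure can never be in the predicted set.
            predicted = set(sorted(relevancy, key=relevancy.get, reverse=True)[:run_limit])
            for case in failed_cases:
                if case in predicted:
                    caught += 1   # usually a repeat of a very recent failure
                else:
                    missed += 1   # includes every first failure of every test case
            # Update the scores: decay everything, then bump this run's failures.
            for case in relevancy:
                relevancy[case] *= 0.95
            for case in failed_cases:
                relevancy[case] = relevancy.get(case, 0.0) + 1.0
        return caught, missed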
Here are changes that I'd like to incorporate to the model:
1. The failure rate should stay, and should still be measured with exponential decay or a weighted average.
2. Include a new measure that increases relevancy: time since last run. The relevancy index should have a component that makes a test more relevant the longer it spends not running.
   1. A problem with this is that *test suites* that have stopped being used will stay and compete for resources, although in reality they are not relevant anymore.
3. Include correlation as well. I still don't have a great idea of how correlation will be considered, but it's something like this:
   1. The data contains the list of test_runs where each test_suite has failed. If two test suites have failed together a certain percentage of the time (>30%?), then when test A fails, the relevancy index of test B also goes up... and when test A runs without failing, the relevancy index of test B goes down too.
   2. Using only the times that tests fail together seems like a good heuristic, without having to calculate the full correlation over the whole history of all combinations of tests.
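As a rough sketch of how the three components might be combined into a single score (the weights, the idle bonus and the correlation bump are placeholders I made up for illustration):

    def relevancy_score(case, current_run, state,
                        time_weight=0.01, corr_weight=0.5):
        """Combine the three proposed components for one test case.
        `state` holds my own made-up bookkeeping:
          fail_score[case]  -- exponentially decayed failure count (component 1)
          last_run[case]    -- index of the last test_run where `case` ran
          co_failed[case]   -- set of cases that fail together with it >30% of the time
          recent_failures   -- set of cases that failed in the previous test_run
        """
        # 1. decayed failure rate (updated elsewhere, as it is today)
        score = state['fail_score'].get(case, 0.0)
        # 2. time since last run: the longer the case sits idle, the more it gains
        idle = current_run - state['last_run'].get(case, current_run)
        score += time_weight * idle
        # 3. correlation: a recent failure of a correlated case raises the score too
        if state['co_failed'].get(case, set()) & state['recent_failures']:
            score += corr_weight
        return score

This only shows the "goes up" half of 3.1; the symmetric decrease when a correlated test keeps passing would have to be handled in the update step.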
If these measures were to be incorporated, a couple of changes would also have to be considered:
1. Failures that are *not spotted* on a test_run might be *spotted* on the *next* two or three or *N* test_runs? What do you think?
2. Considering these measures, *recall* will probably be *negatively affected*, but I feel that the model would be *more realistic*.
I don't think you should introduce artificial limitations that make the recall worse just because they "look realistic". You can make it actually realistic instead of merely looking realistic: simply pretend that your code is already running on buildbot and limiting the number of tests to run. So, if a test didn't run, you don't have any failure information about it. And then you only need to do whatever improves recall, nothing else :) (Of course, to calculate the recall you need to use all failures, even for tests that you didn't run.)
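In loop form, the suggestion might look something like this (just a sketch of the idea, not actual buildbot code; the run limit, starting scores and helper names are all made up):

    def simulate_as_if_on_buildbot(runs, all_cases, run_limit=100):
        """Pretend the predictor already runs on buildbot: only the tests that were
        actually chosen to run produce failure information, but recall is still
        computed against *all* failures.  `runs` is a list of
        (run_id, set_of_failed_test_cases) pairs; every name here is illustrative."""
        relevancy = {case: 1.0 for case in all_cases}   # everyone starts equal
        caught = total_failures = 0
        for run_id, failed_cases in runs:
            chosen = set(sorted(relevancy, key=relevancy.get, reverse=True)[:run_limit])
            observed = failed_cases & chosen        # the only failures buildbot would see
            total_failures += len(failed_cases)     # ...but recall counts every failure
            caught += len(observed)
            for case in relevancy:
                relevancy[case] *= 0.95             # decay
            for case in observed:                   # only observed failures feed back
                relevancy[case] += 1.0
        return caught / total_failures if total_failures else 1.0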
Any input on my new suggestions? If all seems okay, I will proceed to implement them. Also, I will soon upload the information so far to github. Can I also upload the queries made to the database? Or are these private?
You mean the data tables? I think they're all public; they don't have anything one couldn't get from http://buildbot.askmonty.org/

Regards,
Sergei