Hello everyone,
Well, I have now familiarized myself with the data, and I will start trying to simulate some scenarios. In this email, I will summarize my roadmap for the next few days. The text is quite dense (and my writing is kind of clumsy too), so it's not necessary to read it in detail; this email is mostly to keep a record of what I'll do.
Anyhow, if anybody has questions or comments, I will be happy to address them, and any feedback is welcome. I will start coding the simulations in the next few days. Also, if you want more details about what I'm doing, I can write that up too.
So here's what I'll do for the simulations:
1. Calculating the "relevancy index" for a test. I have considered two simple options so far:
- Exponential decay: the relevancy index of a test is the sum, over each past failure, of exp((FailureTime - CurrentTime)/DecayRate). It decays exponentially as time passes and increases whenever the test fails.
- DecayRate is a tunable constant that controls how quickly old failures lose weight.
- e.g. if TestA failed on days 5 and 7 and it is now day 9 (with DecayRate = 1 day), the RI will be exp(5-9) + exp(7-9) = exp(-4) + exp(-2).
- In practice, time will be measured in seconds (UNIX timestamps).
- Weighted moving average: the relevancy index of a test is R[now] = R[now-1]*alpha + fail*(1-alpha), where fail is 1 if the test failed in this run and 0 if it did not. The value stays between 0 and 1; it decreases slowly while a test runs without failing and increases slowly when the test fails.
- 0 < alpha < 1 (Initially set at 0.95 for testing).
- e.g. if TestB's index was 1 after failing in the last run and it fails again in this run: R[t] = 1*0.95 + 1*0.05 = 1
- If TestB then runs once more and does not fail: R[t+1] = 1*0.95 + 0*0.05 = 0.95
- The advantage of this method is that it doesn't have to look at the whole history every time it's calculated (unlike the exponential decay method)
- Much like the smoothed round-trip-time estimator in the TCP protocol (1)
Regarding the relevancy index, it can be calculated by grouping test results in several ways: coarsely, using test_name+variation, or more granularly, by also including branch and platform. I'll add some thoughts regarding these options at the bottom of the email. A rough sketch of both calculation methods follows.
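To make the two options concrete, here is a small Python sketch of both calculations (function and variable names are just illustrative placeholders, nothing from our codebase):

    import math

    # The index could be keyed per test_name+variation, or more granularly
    # per test_name+variation+branch+platform.

    def relevancy_exponential_decay(failure_times, current_time, decay_rate=86400.0):
        # Sum of exp((failure_time - current_time) / decay_rate) over all past failures.
        # decay_rate uses the same unit as the timestamps (here seconds; 86400 = one day).
        return sum(math.exp((t - current_time) / decay_rate) for t in failure_times)

    def relevancy_moving_average(previous_index, failed, alpha=0.95):
        # Weighted moving average; the result stays between 0 and 1.
        # failed is 1 if the test failed in this run, 0 otherwise.
        return previous_index * alpha + failed * (1 - alpha)

    # The TestA example above, with days as the unit and DecayRate = 1 day:
    # relevancy_exponential_decay([5, 7], 9, decay_rate=1.0)  ->  exp(-4) + exp(-2)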
2. To run the simulation, I'll gather data from the first few thousand test_run entries, and then start simulating results. Here's what I'll do (a rough sketch of this loop appears after the list):
- Gather data from the first few thousand test_run entries (e.g. 4 thousand)
- After those N thousand test_runs, I'll go through the remaining test_run entries one by one and, using the data gathered up to that point, select a 'running set' of 100 test suites to run on each test_run entry. (The number can be adjusted.)
- If the list of failed tests in this test_run entry contains tests that are NOT part of the running set, those failures will be ignored, so the information about them is lost (not used as part of the relevancy index). (See Comment 2)
- If the set of failed tests in the test_run entry intersects with the running set, that counts as better recall, and the information will be used to keep calculating the relevancy index.
Depending on the results obtained from the simulations, we can adjust the algorithm (e.g. by calculating the relevancy index per platform and branch, etc.)
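For reference, this is roughly the shape of the simulation loop I have in mind (assuming a test_runs iterable of (run_id, failed_tests) pairs already loaded from the database, with failed_tests a set of test names; the names and the moving-average update are only placeholders):

    WARMUP_RUNS = 4000       # gather statistics from the first few thousand entries
    RUNNING_SET_SIZE = 100   # how many test suites we pretend to run each time
    ALPHA = 0.95

    relevancy = {}           # test name -> current relevancy index
    caught = missed = 0

    for i, (run_id, failed_tests) in enumerate(test_runs):
        if i < WARMUP_RUNS:
            # Warm-up phase: only accumulate failure information.
            for test in failed_tests:
                relevancy[test] = relevancy.get(test, 0.0) * ALPHA + (1 - ALPHA)
            continue

        # Select the running set: the RUNNING_SET_SIZE most relevant tests so far.
        running_set = set(sorted(relevancy, key=relevancy.get, reverse=True)[:RUNNING_SET_SIZE])

        # Failures outside the running set are lost; failures inside it count as recall.
        seen_failures = failed_tests & running_set
        caught += len(seen_failures)
        missed += len(failed_tests - running_set)

        # Only the information from the running set feeds back into the relevancy index.
        for test in running_set:
            failed = 1 if test in seen_failures else 0
            relevancy[test] = relevancy.get(test, 0.0) * ALPHA + failed * (1 - ALPHA)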
Comments about the relevancy index:
- The methods to calculate the relevancy index are very simple. There are some other useful metrics that could be incorporated:
- Time since last run. With the current methods, if a test completely stops running, it only becomes less relevant over time, so even if it could expose defects, it never gets to run because its relevancy index keeps going down. Incorporating a function that increases the relevancy index as the time since the last run grows could help solve this issue. I believe this measure will be useful (a rough sketch appears after this list).
- Correlation between test failures. If two tests tend to fail together, is it better to run just one of them? Incorporating this measure seems difficult, but it is on the table in case we decide to consider it.
- As you might have seen, I decided not to consider any data concerning code changes. I'll work like this and see if the results are satisfactory.
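As a rough illustration of what I mean for the first point (the formula and the constant are only placeholders, not a decision):

    import math

    def boosted_relevancy(base_index, seconds_since_last_run, boost_scale=7 * 86400.0):
        # Add a boost that grows towards 1 as the time since the last run increases,
        # so tests that stopped running eventually get another chance to run.
        # boost_scale is a placeholder constant (here roughly one week, in seconds).
        boost = 1.0 - math.exp(-seconds_since_last_run / boost_scale)
        return base_index + boost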
Comments regarding buildbot infrastructure:
These comments are out of the scope of this project, but these would be very desirable features for the buildbot infrastructure.
- Unfortunately, given the data available in the database, it is NOT possible to know which tests ran on each test_run. This information would be very useful, as it would help estimate the exact failure rate of a test. I haven't looked into the code, but it seems that the MtrLogObserver class (2) contains most of the infrastructure needed; adding one or two more tables (test_suite and test_suite_test_run) and some code would be enough to start keeping track of this information.
- Another problem with the data available in the database is that it is not possible to know how many test suites exist; it is only possible to estimate how many different test suites have failed. Having this information would also be helpful.
- Actually, this information would be useful not only for this project, but in general for book-keeping of the development of MariaDB.
Thanks to all,
Pablo