Hello Sergei and all,
First of all, I'll explain quickly the terms that I was using:
I have in place now a basic script to do the simulations. I have tried to keep the code clear, and I will upload a repository to github soon.
I have already run simulations on the data. The simulations used 2000 test_runs as training data, and then attempted to predict behavior on the following 3000 test_runs. Of course, maybe a wider spectrum of data might be needed to truly asses the algorithm.

I used four different ways to calculate a 'relevancy index' for a test:
  1. Keep a relevancy index by test case
  2. Keep a relevancy index by test case by platform
  3. Keep a relevancy index by test case by branch
  4. Keep a relevancy index by test case by branch by platform (mixed)
I graphed the results. The graph is attached. As can be seen from the graph, the platform and the mixed model proved to be the best for recall.
I feel the results were quite similar to what Sergei encountered.

I have not run the tests on a larger set of data (the data dump that I have available contains 200,000 test_runs, so in theory I could test the algorithm with all this data)... I feel that I want to consider a couple things before going on to big testing:

I feel that there is a bit of a potential fallacy in the model that I'm following. Here's why:
The problem that I find in the model is that we don't know a-priori when a test will fail for the first time. Strictly speaking, in the model, if a test doesn't fail for the first time, it never starts running at all. In the implementation that I made, I am using the first failure of each test to start giving it a relevancy test (so the test would have to fail before it even qualifies to run).
This results in a really high recall rate because it is natural that if a test fails once, it might fail pretty soon after, so although we might have missed the first failure, we still consider that we didn't miss it, and based on it we will catch the two or three failures that come right after.
This inflates the recall rate of 'subsequent' failures, but it is not very helpful when trying to catch failures that are not part of a trend... I feel this is not realistic.

Here are changes that I'd like to incorporate to the model:
  1. The failure rate should stay, and should still be measured with exponential decay or weighted average
  2. Include a new measure that increases relevancy: Time since last run. The relevancy index should have a component that makes the test more relevant the longer it spends not running
    1. A problem with this is that test suites that might have stopped being used will stay and compete for resources, although in reality they would not be relevant anymore
  3. Include also correlation. I still don't have a great idea of how correlation will be considered, but it's something like this:
    1. The data contains the list of test_runs where each test_suite has failed. If two test suites have failed together a certain percentage of times (>30%?), then when test A fails, the relevancy test of test B also goes up... and when test A runs without failing, the relevancy test of test B goes down too.
    2. Using only the times that tests fail together seems like a good heuristic, without having to calculate the total correlation of all the history of all the combinations of tests.
If these measures were to be incorporated, a couple of changes would also have to be considered:
  1. Failures that are not spotted on a test_run might be able to be spotted on the next two or three or N test_runs? What do you think?
  2. Considering these measures, probably recall will be negatively affected, but I feel that the model would be more realistic.
Any input on my new suggestions? If all seems okay, I will proceed on to try to implement these.
Also, I will soon upload the information so far to github. Can I also upload queries made to the database? Or are these private?


On Wed, May 7, 2014 at 7:41 PM, Sergei Golubchik <serg@mariadb.org> wrote:
Hi, Pablo!

On May 07, Pablo Estrada wrote:
> So here's what I'll do for the simulations:
> *1. Calculating the: "Relevancy index"* for a test, I have considered two
> simple options so far:
>    - *Exponential decay*: The relevancy index of a test is the *sum over
>    each failure* of( *exp((FailureTime - CurrentTime)/DecayRate))*. It
>    decreases exponentially as time passes, and increases if the test fails.
>       - DecayRate is
>       - i.e. If TestA failed at days 5 and 7, and now is day 9, RI will
>       be (exp(5-9)+exp(7-9)) = (exp(-4)+exp(-2)).
>       - The unit to measure time is just seconds in UNIX_TIMESTAMP
>    - *Weighted moving average*: The relevancy index of a test is: *R[now] =
>    R[now-1]*alpha + fail*(1-alpha)*, where fail is 1 if the test failed in
>    this run, and 0 if it did not fail. The value is between 1 and 0. It
>    decreases slowly if a test runs without failing, and it increases slowly if
>    the test fails.
>       - 0 < alpha < 1 (Initially set at 0.95 for testing).
>       - i.e. If TestB failed for the first time in the last run, and again
>       in this run: R[t] = 1*0.95 + 1*0.5 = 1
>       - If test B ran once more and did not fail, then: R[t+1] = 1*0.95 +
>       0*0.5 = 0.95
>       - The *advantage of this method* is that it doesn't have to look at
>       the whole history every time it's calculated (unlike the exponential decay
>       method)

you don't need to look at the whole history for the exponential decay.
Because it is

  exp((FailureTime - CurrentTime)/DecayRate)

You simply have

  R[t] = exp(FailureTime/DecayRate) / exp(t/DecayRate)
  R[t+1] = R[t] / exp(1/DecayRate)   (if there was no failure)

>       - Much like TCP protocol (http://www.cl.cam.ac.uk/~jac22/books/mm/book/node153.html)
> Regarding the *Relevancy Index*, it can be calculated grouping test results
> in many ways: *Roughly* using test_name+variation, or *more granularly* by
> *including* *branch* and *platform*. I'll add some thoughts regarding these
> options at the bottom of the email.

I've tested these options earlier, you may want to try them all too and
see which one delivers better results.

> *2. *To* run the simulation*, I'll gather data from the first few thousands
> of test_run entries, and then start simulating results. Here's what I'll do:
>    1. *Gather data *first few thousands of test_run entries (i.e. 4
>    thousand)
>    2. After N thousand test_runs, I'll go through the test_run entries *one
>    by one*, and using the data gathered to that point, I will select '*running
>    sets*' of *100* *test suites* to run on each test_run entry. (The number
>    can be adjusted)

Absolutely, it should be. This is the main parameter we can tune, after
all. The larger your running set is, the better will be the recall.

May be not now but later, but it would be very useful to see these
graphs, recall as a function of the running set size. It's important to
know whether by increasing the running set by 10% we get 1% recall
increase of 70% recall increase (as you've seen, there's a region when
recall increases very fast as the running set grows).

>    3. If in this *test_run* entry, the list of *failed tests* contains
>    tests that are *NOT part* of the *running set*, the failure will be
>    ignored, and so the information of this failure will be lost (not used as
>    part of the relevancy index). *(See Comment 2)*
>    4. If the set of *failed tests *in the *test_run* entry intersect with
>    the *running_set*, this is better *recall*. This information will be
>    used to continue calculating the *relevancy index*.

Could you explain the terminology you're using?
What is a "test suite" and what is a "test run"?
How will you calculate the "recall"?

> According to the results obtained from the simulations, we can adjust the
> algorithm (i.e. to consider *relevancy index by* *platform* and *branch*,
> etc.)
> Comments about the *relevancy index:*
>    - The methods to calculate the relevancy index are very simple. There
>    are some other useful metrics that could be incorporated
>       - *Time since last run. *With the current methods, if a*
> test*completely *stops
>       running*, it only* becomes less relevant with time*, and so even if
>       it could expose defects, it doesn't get to run because its
> relevancy index
>       is just going down. Incorporating a function that* increases the
>       relevancy index* as the *time since the last run* *increases* can
>       help solve this issue. I believe this measure will be useful.

Right. I will not comment on this measure now.
But I absolutely agree that this is an issue that must be solved.

>       - *Correlation between test failures*. If two tests tend to fail
>       together, is it better to just run one of them? Incorporating
> this measure
>       seems difficult, but it is on the table, in case we should consider it.

Agree. Taking correlations into account looks very promising, but it
does seem to be difficult :)

>    - As you might have seen, I decided to not consider any data concerned
>    with *code changes*. I'll work like this and see if the results are
>    satisfactory.


> Comments regarding *buildbot infrasturcture:*
> These comments are out of the scope of this project, but it would be very
> desirable features for the buildbot infrastructure.
>    - Unfortunately, given the data available in the database, it is NOT
>    possible to know *which tests ran* on each *test_run*. This information
>    would be very useful, as it would help estimate the *exact failure
> rate*of a test. I didn't look into the code, but it seems that *class
>    MtrLogObserver*(http://buildbot.sourcearchive.com/documentation/0.8.3p1-1/mtrlogobserver_8py_source.html)
> contains
>    most of the infrastructure necessary to just add one or two more tables (
>    *test_suite*, and *test_suite_test_run*), some code, and start keeping
>    track of this information.
>    - Another problem with the data available in the database is that it is
>    not possible to know *how many test suites exist*. It is only possible
>    to estimate *how many different test suites have failed*. This would
>    also be helpful information.
>    - Actually, this information would be useful not only for this project,
>    but in general for book-keeping of the development of MariaDB.