Hello everyone:
I'm replying to both of your emails here (Elena first, then Sergei).


On Thu, May 22, 2014 at 4:12 PM, Elena Stepanova <elenst@montyprogram.com> wrote:
I suggest staying with the terminology, for clarity.
You are right. I'll stick to MTR terminology.

But even on an ideal data set the mixed approach should still be the most efficient, so it should be okay to use it even if some day we fix all the broken tests and collect reliable data.
Yes, I agree. Keeping the Mixed (Branch/Platform) approach.


    2. Include a new measure that increases relevancy: Time since last run.

    The relevancy index should have a component that makes the test more
    relevant the longer it spends not running

I agree with the idea, but have doubts about the criteria.
I think you should measure not the time, but the number of test runs that happened since the last time the test was run (it would be even better if we could count the number of revisions, but that's probably not easy).
The reason is that some branches are very active, while others can be extremely slow. So, with the same time-based coefficient, the relevancy of a test can spike between two consecutive test runs just because they happened a month apart, but will change too slowly on a branch that has a dozen commits a day.

Yes, I agree with you on this. This is what I had in mind, but I couldn't express it properly in my email : )
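Just to make it concrete, here is a rough sketch (in Python) of the kind of component I'm thinking of. The names and the weight constant are made up for illustration; nothing here is an existing buildbot or MTR interface:

    # Hypothetical sketch: the "staleness" boost grows with the number of
    # buildbot test runs (on this branch/platform) in which the test was
    # skipped, instead of with wall-clock time, so fast and slow branches
    # behave the same way.
    RECENCY_WEIGHT = 0.01  # assumed tuning constant

    def adjusted_relevancy(base_relevancy, runs_since_last_executed):
        """Return the relevancy index with the 'not run lately' boost applied."""
        return base_relevancy + RECENCY_WEIGHT * runs_since_last_executed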

 
    3. Include also correlation. I still don't have a great idea of how
    correlation will be considered, but it's something like this:
       1. The data contains the list of test_runs where each test_suite has
       failed. If two test suites have failed together a certain percentage of
       times (>30%?), then when test A fails, the relevancy index of test B also
       goes up... and when test A runs without failing, the relevancy index of
       test B goes down too.

We'll need to see how it goes.
In real life, correlation of this kind does exist, but I'd say that much more often related failures happen due to some environmental problem, so the presumed correlation will be spurious.

Good point. Let's see how the numbers play out, but I think you are right that this could end up with a severe bias due to test blowups and environment-related failures.
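In any case, here is roughly how I picture computing the co-failure rate from the failure history, so we can check whether the numbers are dominated by those blowups. The data layout (one set of failed tests per test run) is just an assumption for the sketch:

    from collections import Counter
    from itertools import combinations

    def co_failure_rates(failure_sets, threshold=0.30):
        """failure_sets: one set per test run with the names of the tests
        that failed in that run. Returns the pairs of tests whose co-failure
        rate exceeds the threshold."""
        fail_counts = Counter()
        pair_counts = Counter()
        for failed in failure_sets:
            fail_counts.update(failed)
            pair_counts.update(combinations(sorted(failed), 2))
        correlated = {}
        for (a, b), together in pair_counts.items():
            # fraction of the rarer test's failures in which the other also failed
            rate = together / min(fail_counts[a], fail_counts[b])
            if rate > threshold:
                correlated[(a, b)] = rate
        return correlated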

 

I think in any case we'll have to rely on the fact that your script will choose tests not from the whole universe of tests, but from an initial list that MTR produces for this particular test run. That is, it will go something like this:
- test run is started in buildbot;
- MTR collects test cases to run, according to the startup parameters, as it always does;
- the list is passed to your script;
- the script filters it according to the algorithm that you developed, keeps only a small portion of the initial list, and passes it back to MTR;
- MTR runs the requested tests.

That is, you do exclusion of tests rather than inclusion.

This will solve two problems:
- first test run: when a new test is added, only MTR knows about it, buildbot doesn't; so, when MTR passes you a test that you know nothing about (and assuming that we do have a list of all executed tests in buildbot), you'll know it's a new test and will act accordingly;
- abandoned tests: MTR just won't pass them to your script, so it won't take them into account.

Great. This is good to know; it gives me a more precise idea of how the project would fit into the MariaDB development workflow.
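So the script ends up being a filter that MTR calls with its candidate list. A minimal sketch of how I imagine that hook (all the names here are illustrative, not real MTR or buildbot interfaces):

    def filter_tests(mtr_candidates, known_tests, relevancy, budget=500):
        """Keep the 'budget' most relevant tests from the list MTR collected.

        Tests MTR doesn't pass in (abandoned ones) are never considered, and
        tests we have no history for are always kept, so new tests run at the
        first opportunity."""
        new_tests = [t for t in mtr_candidates if t not in known_tests]
        seen = sorted((t for t in mtr_candidates if t in known_tests),
                      key=lambda t: relevancy.get(t, 0.0), reverse=True)
        return new_tests + seen[:max(0, budget - len(new_tests))]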

On Thu, May 22, 2014 at 5:39 PM, Sergei Golubchik <serg@mariadb.org> wrote:
>    - *test_suite, test suite, test case* - When I say test suite or test
>    case, I am referring to a single test file. For instance
>    '*pbxt.group_min_max*'. They are the ones that fail, and whose failures
>    we want to attempt to predict.

may I suggest distinguishing between a test *suite* and a test *case*?
the latter is usually a single test file, but a suite (for mtr) is a
directory with many test files. Like, "main", "pbxt", etc.

Right, I didn't define this properly. Let's keep the definitions exactly as MTR uses them, as Elena suggested.

I don't think you should introduce artificial limitations that make the
recall worse, because they "look realistic".

You can make it realistic instead of just looking realistic - simply pretend
that your code is already running on buildbot and limits the number of
tests to run. So, if a test didn't run, you don't have any failure
information about it.

And then you only need to do what improves recall, nothing else :)

(of course, to calculate the recall you need to use all failures,
even for tests that you didn't run)

Yes, my code already works this way: it doesn't consider failure information from tests that were not selected to run.
The graphs that I sent are from scripts that ran like this.

Of course, the recall is just the number of spotted failures divided by the total number of known failures : )
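For reference, this is how the simulation measures it (a sketch with illustrative names): only the tests the script selected contribute failure information back to the model, but recall is computed against all known failures, including those in tests that were left out.

    def simulation_recall(selected_per_run, failures_per_run):
        """selected_per_run: one set per test run with the tests the script chose.
        failures_per_run:  one set per test run with ALL tests that actually
                           failed, including ones the script did not select."""
        spotted = sum(len(sel & fails)
                      for sel, fails in zip(selected_per_run, failures_per_run))
        total = sum(len(fails) for fails in failures_per_run)
        return spotted / total if total else 1.0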

Anyway, with all this, I will get to work on adapting the simulation a little bit:
  • Relevancy will also grow with the number of test runs since a test last ran (rather than raw time since last run)
  • I will try to use the list of changed files from commits to make sure new tests start running right away
Any other comments are welcome.

Regards
Pablo