Hello Elena and all,
First, addressing the previous email:

Looking at the dump, I see it can also happen that the dump contains several records for a platform/bbnum pair. I am not sure why it happens, I think it shouldn't; it might be a bug in buildbot and/or configuration, or environmental problems. Anyway, due to the way we store output files, they can well overwrite each other in this case, so for several platform/bbnum records you will have only one file. I suppose that's what was hard to resolve, sorry about that.

No worries ; ).  There are indeed several cases where multiple records share the same platform and build number. The system just names the files as follows:
<platform>_<build_id>-log-test_1-stdio
<platform>_<build_id>-log-test_2-stdio
.....
<platform>_<build_id>-log-test_5-stdio

These files seem to correspond temporally with the test runs: *test_1-stdio belongs to the first test_run of the same platform/bbnum pair, and so on. Unfortunately, there are some cases where there are more test_runs in the dump than files available, which makes it impossible to be sure which file belongs to which test_run.
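To make the pairing concrete, here is a rough sketch in Python of how I am matching the stdio files to test_runs: by ordinal position within each platform/bbnum group, skipping the ambiguous groups where the counts disagree. The names and data structures are placeholders for illustration, not the actual code I will upload:

import os
import re
from collections import defaultdict

# Filenames look like <platform>_<build_id>-log-test_<n>-stdio
FILE_RE = re.compile(r'^(?P<platform>.+)_(?P<bbnum>\d+)-log-test_(?P<ordinal>\d+)-stdio$')

def index_files(dump_dir):
    """Group stdio files by (platform, bbnum), ordered by their test_<n> suffix."""
    groups = defaultdict(list)
    for name in os.listdir(dump_dir):
        m = FILE_RE.match(name)
        if m:
            key = (m.group('platform'), int(m.group('bbnum')))
            groups[key].append((int(m.group('ordinal')), os.path.join(dump_dir, name)))
    return {key: [path for _, path in sorted(items)] for key, items in groups.items()}

def match_runs_to_files(test_runs, dump_dir):
    """test_runs: list of (platform, bbnum, run_id) tuples in chronological order.
    Returns {run_id: file path}; groups where the number of runs and files
    differ are left unmatched, since the correspondence is ambiguous."""
    files_by_key = index_files(dump_dir)
    runs_by_key = defaultdict(list)
    for platform, bbnum, run_id in test_runs:
        runs_by_key[(platform, bbnum)].append(run_id)
    matched = {}
    for key, run_ids in runs_by_key.items():
        paths = files_by_key.get(key, [])
        if len(paths) != len(run_ids):
            continue  # more test_runs than files (or vice versa): skip the group
        matched.update(zip(run_ids, paths))
    return matched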
 
You should consider skipped tests, at least for now. Your logic that they are skipped because they can't be run is generally correct; unfortunately, MTR first produces the *full* list of tests to run, and only determines whether a test can actually be run at a later stage, when it starts running the tests. Your tool will receive the initial test list, and I'm not sure it's realistic to rewrite MTR so that it takes into account the limitations that cause tests to be skipped before creating the list.

I see. Okay then, duly noted.

Possibly it's better to skip a test run altogether if there is no input list for it; it would definitely be best if there were 5K (or whatever slice you are currently using) of consecutive test runs with input lists; if it so happens that there are lists for some branches but not others, you can skip the branch entirely.

This doesn't seem like a good option. Recall drops significantly, and the test_runs that do have a corresponding file don't follow any particular pattern and tend to have long gaps between them, so the information becomes stale and, seemingly, not useful.
 
The core module should take as parameters
- list of tests to choose from,
- size of the running set (%),
- branch/platform (if we use them in the end),
and produce a new list of tests of the size of the running set.

The wrapper module should
- read the list of tests from the outside world (for now, from a file),
- receive branch/platform as command-line options,
- have the running set size set as an easily changeable constant or as a configuration parameter,

and return the list of tests -- let's say for now, in the form of <test suite>.<test name>, blank-separated, e.g.
main.select innodb.create-index ...

 
I am almost done 'translating' the code into a solution that divides it into 'core' and 'wrapper' (a rough sketch of the split is below). There are a few bugs that I still haven't figured out, but I believe I can iron those out pretty soon. I will also incorporate the percentage-based running set rather than a fixed running_set size.
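For reference, here is a minimal sketch in Python of the core/wrapper split as I understand the description above. The file names, the constant, and the placeholder ranking are my own assumptions; the real selection logic goes where the comments indicate:

# core.py -- hypothetical interface of the core module
def choose_tests(candidate_tests, running_set_pct, branch=None, platform=None):
    """Return a subset of candidate_tests whose size is running_set_pct percent
    of the input list. The placeholder ordering below stands in for the actual
    prioritisation (failure frequency, file correlations, etc.)."""
    size = max(1, int(len(candidate_tests) * running_set_pct / 100.0))
    ranked = sorted(candidate_tests)  # placeholder for the real ranking
    return ranked[:size]

# wrapper.py -- hypothetical wrapper module
import argparse
from core import choose_tests

RUNNING_SET_PCT = 10  # easily changeable constant / configuration parameter

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('test_list_file')    # file with the candidate tests
    parser.add_argument('--branch')
    parser.add_argument('--platform')
    args = parser.parse_args()

    with open(args.test_list_file) as f:
        candidates = f.read().split()        # <suite>.<name>, blank-separated

    chosen = choose_tests(candidates, RUNNING_SET_PCT, args.branch, args.platform)
    print(' '.join(chosen))                  # e.g. "main.select innodb.create-index"

if __name__ == '__main__':
    main()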

Now, regarding the state of the project (and the recall numbers that I am able to achieve so far), here are some observations:

  • Unfortunately, I am running out of ideas for improving recall. I tried tuning some parameters, giving more weight to some factors than others, etc. I still wasn't able to push recall beyond ~87% with the strategy that uses file correlations. From what I've seen, some failures are just extremely hard to predict.
  • The strategy that uses only a weighted average of the failure frequency achieves a higher recall, but only for a shorter time; the recall decays quickly afterwards. I may try to add some file correlations to this strategy, to see if the recall can be sustained over a longer term (a rough sketch of the weighted-average idea and of how I measure recall follows this list).
  • There is one problem that I see regarding the data and a potential real-world implementation of the program: by verifying recall against the historical data, we run the risk of overfitting, so the results obtained by comparing against the historical data and the results that a real-world implementation could obtain are potentially different. A possible way to address that issue would require modifying the buildbot to gather more data over a longer term.
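To make the second and third points concrete, here is a rough Python sketch of the weighted-average strategy and of the recall measurement, as I currently think of them. The decay constant and the function names are placeholders, not the exact values or code I use:

# Hypothetical sketch of the exponentially weighted failure-frequency idea:
# every test's score decays each run, and tests that just failed get a boost,
# so recent failures dominate the priority.
def update_priorities(priorities, failed_tests, decay=0.95):
    """priorities: dict mapping test name -> score, updated in place."""
    for test in priorities:
        priorities[test] *= decay
    for test in failed_tests:
        priorities[test] = priorities.get(test, 0.0) + 1.0
    return priorities

def recall(selected_tests, actual_failures):
    """Fraction of the historical failures in a test_run that the chosen
    running set would have caught."""
    actual = set(actual_failures)
    if not actual:
        return 1.0
    return len(actual & set(selected_tests)) / float(len(actual))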
That said, I am looking for some advice on the following points:
  • I will try to take a step back from the new strategy, and see how I can adapt the original strategy to keep recall from declining so sharply over time.
  • I will also spend some time shaping the codebase so that it matches the model we need for the implementation more closely. I will upload the code soon. All suggestions are welcome.
  • Nonetheless, I feel that more data would allow me to improve the algorithm greatly. Is it possible to add logging to the buildbot that would allow for more precise data collection? A slower, more iterative process, working more closely with the buildbot and doing more detailed data collection, might deliver better results. (I understand that this would probably affect the time scope of the project.)
Let me know what you think about my suggestions.
Regards
Pablo