First, about integration. I agree it makes much more sense to store metrics than to recalculate them from scratch each time. As an additional bonus, it no longer matters whether we make buildbot store lists of all executed tests, as we initially discussed: if the tool stores the metrics, it only needs to receive the list of tests at runtime, which seems much less invasive.
One thing I want to see there is a fully developed platform mode. I see the mode option is still there, so it should not be difficult. I actually did it myself while experimenting, but since my changes were hasty and crude, I don't expect them to be useful.
It doesn't matter in which order they fail or finish; the problem is that when builder2 starts, it has no information about builder1's results, and builder3 knows nothing about the first two. So the metric for test X cannot have been increased yet.
But in your current calculation, it is. So, naturally, if we happen to catch the failure on builder1, the metric rises dramatically, and the failure will definitely be caught on builders 2 and 3.
This is especially important now that you use incoming lists, so the running sets might not be identical for builders 1-3 even in standard mode.
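To make the difference concrete, here is a toy simulation (all names and data are made up; this is not the tool's actual code) contrasting an immediate metric update with one deferred until the whole push has been processed:

from collections import defaultdict

metric = defaultdict(int)   # per-test failure counts

def simulate_push(builders, defer_updates):
    # builders: the incoming test list of each builder; this push broke test X
    pending, caught = [], 0
    for incoming in builders:
        # run only the top half of the incoming set, ranked by metric
        ranked = sorted(incoming, key=lambda t: metric[t], reverse=True)
        if "X" in ranked[:len(ranked) // 2]:
            caught += 1
            if defer_updates:
                pending.append("X")
            else:
                metric["X"] += 1   # instantly visible to the next builder
                                   # of the SAME push: too optimistic
    for t in pending:
        metric[t] += 1             # realistic: applied only after the push
    return caught

builders = [["X", "A", "B", "C"], ["A", "B", "X", "C"], ["A", "X", "B", "C"]]
print(simulate_push(builders, defer_updates=False))   # 3 catches -- inflated
metric.clear()
print(simulate_push(builders, defer_updates=True))    # 2 catches -- realistic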
Over-pessimistic part:
It is similar to the previous one, but let's look at the same problem from a different angle. Suppose a push broke test X, and the test started failing on all builders (platforms). You then have 20 failures, one per test run, for the same push. Now suppose you caught it on one platform but not on the others. Your statistics will still show 19 failures missed vs 1 failure caught, and recall will be dreadful (~0.05). But in fact the goal is achieved: the failure has been caught for this push. It doesn't really matter whether you catch it 1 time or 20 times, so recall here should be 1.
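To illustrate the arithmetic, a minimal sketch with made-up data (not the tool's real bookkeeping), comparing per-run recall with per-push recall:

# One broken push, 20 test runs; the failure was caught on exactly one of them.
runs = [("push42", i == 0) for i in range(20)]   # (push, caught?) per failed run

# Per-run recall: every missed run counts against you.
per_run = sum(c for _, c in runs) / len(runs)    # 1/20 = 0.05

# Per-push recall: a push counts as caught if ANY of its runs caught it.
pushes = {}
for push, c in runs:
    pushes[push] = pushes.get(push, False) or c
per_push = sum(pushes.values()) / len(pushes)    # 1/1 = 1.0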
It should mainly affect the per-platform approach, but the standard one can probably also suffer if the running sets are not identical across builders.
Finally, a couple of small details.
I wonder if it's because of different versions or something else, but this didn't work for me:
exp = re.compile('([^, ]+) ?([^ ]*)? *.*\[ (fail|disabled|pass|skipped) \]')
It would give me an error. I had to modify it this way:
exp = re.compile('([^, ]+) ?([^ ]*) *.*\[ (fail|disabled|pass|skipped) \]')
From what I see, it should be the same. If you agree, please make the same change (or get rid of the error some other way).
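For what it's worth, here is a quick check (the sample line is made up in the MTR result style) showing that, on a Python version where both patterns compile, they capture identical groups:

import re

line = "rpl.rpl_semi_sync 'row'                         [ fail ]"
old = re.compile(r'([^, ]+) ?([^ ]*)? *.*\[ (fail|disabled|pass|skipped) \]')
new = re.compile(r'([^, ]+) ?([^ ]*) *.*\[ (fail|disabled|pass|skipped) \]')
print(old.search(line).groups())   # ('rpl.rpl_semi_sync', "'row'", 'fail')
print(new.search(line).groups())   # same groups, so the change looks safe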
Also, it appears that csv/test_fail_history.csv is the old file. I replaced it with csv/fails_ptest_run.csv in the code. It doesn't matter for the final version, but might be important for experiments.