First, about integration. I agree it makes much more sense to store metrics than to recalculate them from scratch each time. As an additional bonus, it no longer matters whether we make buildbot store lists of all executed tests, as we initially discussed: if the tool stores the metrics, it only needs to receive the list of tests at runtime, which seems much less invasive.
One thing I want to see there is a fully developed platform mode. I see the mode option is still there, so it should not be difficult. I actually did it myself while experimenting, but since my changes were hasty and crude, I don't expect them to be useful.
It doesn't matter in which order they fail or finish; the problem is that when builder2 starts, it has no information about builder1's results, and builder3 knows nothing about the first two. So the metric for test X cannot have been increased yet.
But in your current calculation, it is. So, naturally, if we happen to catch the failure on builder1, the metric rises dramatically, and the failure will definitely be caught on builders 2 and 3.
This is especially important now that you use incoming lists, so the running sets might not be identical for builders 1-3 even in standard mode.
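To make the difference concrete, here is a toy simulation (all names and data are made up; this is not the tool's actual code) contrasting an immediate metric update with one deferred until the whole push has been processed:

from collections import defaultdict

metric = defaultdict(int)   # per-test failure counts

def simulate_push(builders, defer_updates):
    # builders: the incoming test list of each builder; this push broke test X
    pending, caught = [], 0
    for incoming in builders:
        # run only the top half of the incoming set, ranked by metric
        ranked = sorted(incoming, key=lambda t: metric[t], reverse=True)
        if "X" in ranked[:len(ranked) // 2]:
            caught += 1
            if defer_updates:
                pending.append("X")
            else:
                metric["X"] += 1   # instantly visible to the next builder
                                   # of the SAME push: too optimistic
    for t in pending:
        metric[t] += 1             # realistic: applied only after the push
    return caught

builders = [["X", "A", "B", "C"], ["A", "B", "X", "C"], ["A", "X", "B", "C"]]
print(simulate_push(builders, defer_updates=False))   # 3 catches -- inflated
metric.clear()
print(simulate_push(builders, defer_updates=True))    # 2 catches -- realistic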
Over-pessimistic part:
It is similar to the previous one, but let's look at the same problem from a different angle. Suppose a push broke test X, and the test started failing on all builders (platforms). You then have 20 failures, one per test run, for the same push. Now suppose you caught it on one platform but not on the others. Your statistics will still show 19 failures missed vs 1 failure caught, and recall will be dreadful (~0.05). But in fact the goal is achieved: the failure has been caught for this push. It doesn't really matter whether you catch it 1 time or 20 times, so recall here should be 1.
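To illustrate the arithmetic, a minimal sketch with made-up data (not the tool's real bookkeeping), comparing per-run recall with per-push recall:

# One broken push, 20 test runs; the failure was caught on exactly one of them.
runs = [("push42", i == 0) for i in range(20)]   # (push, caught?) per failed run

# Per-run recall: every missed run counts against you.
per_run = sum(c for _, c in runs) / len(runs)    # 1/20 = 0.05

# Per-push recall: a push counts as caught if ANY of its runs caught it.
pushes = {}
for push, c in runs:
    pushes[push] = pushes.get(push, False) or c
per_push = sum(pushes.values()) / len(pushes)    # 1/1 = 1.0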
It should mainly affect the per-platform approach, but the standard one can probably also suffer if the running sets are not identical across builders.
Finally, a couple of small details.
I wonder if it's because of different versions or something else, but this didn't work for me:
exp = re.compile('([^, ]+) ?([^ ]*)? *.*\[ (fail|disabled|pass|skipped) \]')
It would give me an error. I had to modify it this way:
exp = re.compile('([^, ]+) ?([^ ]*) *.*\[ (fail|disabled|pass|skipped) \]')
From what I see, it should be the same. If you agree, please make the same change (or get rid of the error some other way).
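For what it's worth, here is a quick check (the sample line is made up in the MTR result style) showing that, on a Python version where both patterns compile, they capture identical groups:

import re

line = "rpl.rpl_semi_sync 'row'                         [ fail ]"
old = re.compile(r'([^, ]+) ?([^ ]*)? *.*\[ (fail|disabled|pass|skipped) \]')
new = re.compile(r'([^, ]+) ?([^ ]*) *.*\[ (fail|disabled|pass|skipped) \]')
print(old.search(line).groups())   # ('rpl.rpl_semi_sync', "'row'", 'fail')
print(new.search(line).groups())   # same groups, so the change looks safe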
Also, it appears that csv/test_fail_history.csv is the old file. I replaced it with csv/fails_ptest_run.csv in the code. It doesn't matter for the final version, but might be important for experiments.