Hi, Elena! Just one comment.

On Jun 16, Elena Stepanova wrote:
> 4. Failed tests vs executed tests
>
> Further, as I understand, you only calculate the metrics for tests which were either edited or failed at least once; and thus, only such tests can ever make it into a corresponding queue. Not only does it create a bubble, but it also makes the comparison of modes faulty and the whole simulation less efficient.
About the bubble. Why is it bad? Because it decreases the recall - there are test failures (namely, outside of the bubble) that we'll never see. And since the whole purpose of this task is to optimize for a *high recall* in a short testing time, everything that makes recall worse needs to be analyzed. I mean, this is important - the bubble isn't bad in itself, it's only bad because it reduces the recall. If no strategy for breaking this bubble improves the recall, we shouldn't break it at all!

On the other hand, perhaps you, Elena, think that missing a new test failure in a test file that wasn't touched by a particular revision is worse than missing some other test failure? Because that's a regression, and so on. If this is the case, Pablo needs to change the fitness function he's optimizing. Recall assigns equal weights to all test failures - missing one failure is equally bad for all tests. If some failures are worse than others, a different fitness function, let's call it "weighted recall", could be used to adequately map your expectations into the model. Again - if you think that optimizing the model doesn't do what we want it to do, the way to fix it is not to add artificial heuristics and rules to it, but to modify the model.
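To make the difference concrete, here's a minimal Python sketch - the names and the weighting rule are invented for illustration, it's not Pablo's actual code:

# Plain recall vs "weighted recall", as a toy model. 'failed' is the set
# of failures a revision actually produced, 'caught' is the subset our
# running set would have caught, and 'weight' is a hypothetical function
# assigning a cost to missing each failure.

def recall(caught, failed):
    # Every failure counts the same.
    return len(caught & failed) / len(failed) if failed else 1.0

def weighted_recall(caught, failed, weight):
    # A missed failure costs its weight. E.g. weight() could return 2.0
    # for a failure in a test file the revision didn't touch (a likely
    # regression) and 1.0 for everything else.
    total = sum(weight(t) for t in failed)
    return sum(weight(t) for t in caught & failed) / total if total else 1.0

The point being: the weight function is where your expectations go, and the optimizer itself stays unchanged.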
> It means that even though you set the running set to 500, in fact you'll only run 20 tests at most. It's not desirable -- if we say we can afford running 500 tests, we'd rather run 500 than 20, even if some of them never failed before. This will also help us break the bubble.
Same as above. The model optimizes the recall as a function of testing time (ideally, that is). If it shows that running 20 tests produces the same recall as running 500 tests, it should run 20 tests. Indeed, why should it run more if it doesn't improve the recall? Although I expect that running 500 tests *will* improve the recall, of course, even if only marginally.

Anyway, my whole point is - let's stay within the model and improve the fitness function (which is recall at the moment). It's the only way to see quantitatively what every strategy gives and whether it should be used at all. That said, Pablo should probably try to do something about the bubble. E.g. run more tests and randomize the tail (see the P.S. for a rough sketch)? And see whether it helps to improve the recall.

Regards,
Sergei
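P.S. By "randomize the tail" I mean something like the following rough sketch - again, all names are invented, and the 500 is just the running-set size from your example:

import random

def build_running_set(ranked_tests, all_tests, size=500):
    # Take the tests the model can actually rank (the ones with metrics)...
    head = ranked_tests[:size]
    # ...and pad the running set up to its full size with a random sample
    # of the tests outside the bubble, so they get a chance to fail and
    # feed their failures back into the metrics.
    seen = set(head)
    rest = [t for t in all_tests if t not in seen]
    tail = random.sample(rest, min(size - len(head), len(rest)))
    return head + tail

Whether this actually improves the recall over the plain ranked queue is exactly what the simulation should tell us.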