Hi Sergei,

On 16.06.2014 10:57, Sergei Golubchik wrote:
Hi, Elena!
Just one comment:
On Jun 16, Elena Stepanova wrote:
4. Failed tests vs executed tests
Further, as I understand it, you only calculate the metrics for tests which were either edited or failed at least once; thus, only such tests can ever make it into the corresponding queue. Not only does this create a bubble, it also makes the comparison of modes faulty and the whole simulation less efficient.
About the bubble. Why is it bad? Because it decreases the recall - there are test failures (namely, outside of the bubble) that we'll never see.
But because the whole purpose of this task is to optimize for a *high recall* in a short testing time, everything that makes recall worse needs to be analyzed.
I mean, this is important - the bubble isn't bad in itself, it's only bad because it reduces the recall. If no strategy to break this bubble helps to improve the recall, we shouldn't break it at all!
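Just to make sure we mean the same thing by the bubble, here is how I understand the queue ends up being built -- a toy sketch in Python, where the names and the scoring are mine, not Pablo's actual code:

    # Toy sketch of the "bubble" (hypothetical names, not Pablo's real code):
    # only tests with a failure history or a recent edit ever get a score,
    # so only they can be queued, no matter how large the running set is.

    def build_queue(all_tests, failure_counts, edited_tests):
        scored = [t for t in all_tests
                  if failure_counts.get(t, 0) > 0 or t in edited_tests]
        # Most-failing tests first; never-failed, never-edited tests are simply absent.
        return sorted(scored, key=lambda t: failure_counts.get(t, 0), reverse=True)

    all_tests = ["t%d" % i for i in range(1, 11)]   # 10 test files in the suite
    failure_counts = {"t1": 5, "t2": 1}             # only two of them have ever failed
    edited_tests = {"t3"}                           # and one was recently edited

    print(build_queue(all_tests, failure_counts, edited_tests))
    # ['t1', 't2', 't3'] -- the remaining 7 tests can never be selected,
    # so a new failure in any of them is invisible at any cutoff.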
Right, and I want to see proof that it really does *not* improve recall, because I think it should. Currently we treat recall as a function of the running set, and we say -- okay, after N=100 it flattens and doesn't improve much further. But it might well be that it flattens simply because the queue doesn't get filled -- of course there will be no difference between N=100 and N=500 if the queue holds fewer than 100 tests anyway. Then again, if recall is close to 100% either way, it might not be important, but

a) I doubt it is. As Pablo said, the previous results were not accurate, and from what I saw after we removed the dependencies between simulation runs, we should be somewhere below 50% with the mixed mode at N=500.

b) Unless I'm missing something, the bubble becomes critical if we add, let's say, a new platform, because it does not allow choosing tests which never failed on that platform; the queue will be empty and the platform won't be tested at all, at least until some tests get edited (assuming we use the editing factor).

In any case, right now the experiments provide results different from what we think they do. If we want to compare the "full queue" effect with the "non-full queue", let's make it another parameter.
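This is the effect I suspect behind the flattening -- again just a toy illustration with made-up numbers, not the real simulation data: once the queue holds fewer tests than the cutoff, every larger N returns exactly the same recall.

    # Toy illustration: if only 20 tests ever got a score, recall is identical for
    # every cutoff N >= 20, so the curve "flattens" for a reason unrelated to N.
    # All numbers here are invented for the example.

    def recall(queue, failing_tests, n):
        selected = set(queue[:n])            # we can never select more than len(queue) tests
        return len(selected & failing_tests) / len(failing_tests)

    queue = ["t%d" % i for i in range(20)]   # the scored queue: 20 entries
    failing = {"t3", "t7", "t150"}           # t150 never failed before, so it is outside the bubble

    for n in (10, 100, 500):
        print(n, round(recall(queue, failing, n), 2))
    # 10  0.67  (t3 and t7 caught)
    # 100 0.67  (same selection: the queue ends after 20 entries)
    # 500 0.67  (t150 is unreachable at any cutoff)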
On the other hand, perhaps you, Elena, think that missing a new test failure in one of the test files that wasn't touched by a particular revision - that missing such a failure is worse than missing some other test failure? Because that's a regression and so on. If this is the case, Pablo needs to change the fitness function he's optimizing. Recall assigns equal weights to all test failures; missing one failure is equally bad for all tests. If some failures are worse than others, a different fitness function, uhm, let's call it "weighted recall", could be used to adequately map your expectations onto the model.
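(For the record, I do understand the proposal -- something along these lines, where the weighting function below is just an arbitrary example, not a suggestion:)

    # Sketch of a "weighted recall" fitness function: a missed failure costs its weight
    # instead of costing 1. The example weighting is invented purely for illustration.

    def weighted_recall(caught_failures, all_failures, weight):
        total = sum(weight(f) for f in all_failures)
        caught = sum(weight(f) for f in caught_failures)
        return caught / total if total else 1.0

    def weight(failure):
        # Example only: a failure in a test file touched by the revision counts double.
        test_name, touched_by_revision = failure
        return 2.0 if touched_by_revision else 1.0

    all_failures = [("t1", True), ("t2", False), ("t3", False)]
    caught = [("t2", False), ("t3", False)]
    print(weighted_recall(caught, all_failures, weight))   # 0.5, while plain recall would be 0.67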
No, I wasn't thinking about it. I'm still staying within the same model, where all failures have equal weights. On the contrary, my notes regarding "first time failures" vs "sporadic failures" were supposed to say that we don't need to do anything specific about sporadic failures: if they are caught, they are caught; if not, not. Sorry if it wasn't clear. I do, however, think that abandoning a test forever because it hasn't failed for a long time is the wrong thing to do, but the tools for dealing with that are already in the model -- the time factor and the editing factor. They have been there from the beginning and just need to be tuned (the editing factor needs to be fixed, and possibly the time coefficient changed if the current value doesn't provide good results -- that's something to experiment with).
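To be explicit about what I mean by tuning, this is roughly the shape I have in mind -- the formula and the coefficients are placeholders to experiment with, not what Pablo actually has:

    import math

    # Sketch of a priority metric that already contains both tools: a time-decayed
    # failure score plus an editing bonus. DECAY and EDIT_BONUS are the knobs to tune;
    # the values below are arbitrary placeholders.

    DECAY = 0.05        # how quickly old failures lose weight
    EDIT_BONUS = 1.0    # flat boost for a recently edited test

    def priority(failure_ages, recently_edited):
        # failure_ages: for each past failure, how many test runs ago it happened
        score = sum(math.exp(-DECAY * age) for age in failure_ages)
        if recently_edited:
            score += EDIT_BONUS
        return score

    print(priority([2, 50], recently_edited=False))   # ~0.99: the recent failure dominates
    print(priority([], recently_edited=True))         # 1.0: an edited test re-enters the queue
                                                      #      even though it has never failed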
Again - if you think that optimizing the model doesn't do what we want it to do, the way to fix it is not to add artificial heuristics and rules into it, but to modify the model.
It means that even though you set the running set to 500, in fact you'll only run 20 tests at most. That's not desirable -- if we say we can afford running 500 tests, we'd rather run 500 than 20, even if some of them never failed before. This will also help us break the bubble.
Same as above. The model optimizes the recall as a function of test time (ideally, that is). If it shows that running 20 tests produces the same recall as running 500 tests - it should run 20 tests. Indeed, why should it run more if it doesn't improve the recall?
Same as above: I think it will improve the recall, most likely even substantially, and at the very least we need to see the difference so we can make an informed decision about it.
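The variant I'd like to see measured is roughly this -- a sketch only, where filling the tail with random never-scored tests is just one possible policy:

    import random

    # Sketch of the "full queue" variant: take the scored queue first, then pad the
    # running set up to N with randomly chosen tests that have no score at all.
    # Meant only to show the comparison I'm asking for, not a final design.

    def running_set(scored_queue, all_tests, n, rng=random):
        selected = list(scored_queue[:n])
        if len(selected) < n:
            leftovers = [t for t in all_tests if t not in set(selected)]
            rng.shuffle(leftovers)
            selected += leftovers[:n - len(selected)]
        return selected

    scored_queue = ["t1", "t2", "t3"]                  # the bubble: only 3 tests have a score
    all_tests = ["t%d" % i for i in range(1, 501)]
    print(len(running_set(scored_queue, all_tests, 100)))   # 100: never-failed tests get a chance too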
Although I expect that running 500 tests *will* improve the recall, of course, even if only marginally.
Anyway, my whole point is - let's stay within the model and improve the fitness function (which is recall at the moment). It's the only way to see quantitatively what every strategy gives and whether it should be used at all.
It is still the same model. The core of the model was to make recall a function of the cutoff, right? So let's try that first: let's make it a real cutoff and see the results. Not filling the queue completely (or in some cases having it empty) is an optimization over the initial model, which improves the execution time (marginally) but affects recall (even if only marginally). It can be considered, but the results should be compared to the basic ones. And if, let's say, we decide that N=100 (or N=10%) is the best cutoff value, and then find out that by not filling the queue completely we lose even 1% in recall, we might want to stay with the full queue.

What is the time difference between running 50 tests and 100 tests? Almost nothing, especially compared to what we spend on the preparation of the tests. So if 100 tests vs 50 tests adds 1% to recall, and also helps to solve the problem of never-running tests, I'd say it's better to stay within the initial model.

Regards,
Elena
That said, Pablo should try to do something about the bubble, I suppose. E.g. run more tests and randomize the tail? And see whether it helps to improve the recall.
Regards, Sergei