Hello Elena and all,

I have pushed the fixed code. There are a lot of changes, because I went through all the code making sure that it makes sense. The commit is here <https://github.com/pabloem/Kokiri/commit/7c47afc45a7b1f390e8737df58205fa53334ba09>, and although there are many changes, the main line where failures are caught or missed is this one <https://github.com/pabloem/Kokiri/blob/7c47afc45a7b1f390e8737df58205fa53334ba09/simulator.py#L496>.

The main results so far:

1. The test result file edit information helps improve recall, though only marginally.
2. The time since last run information barely improves recall at all - see [Weaknesses - 2].

A couple of concepts that I want to define before going on:

- *First failures*. These are failures that happen because of new bugs. They don't occur close in time as part of a chain of failures; they occur as a consequence of a transaction that introduces a bug, although they might show up soon or long after that transaction (usually soon rather than long). Some of them are correlated with the frequency of failure of a test (core or basic tests that fail often might be especially good at exposing bugs), but many are not (tests of a feature that don't fail often, but rather fail when that feature is modified).
- *Strict simulation mode*. This is the mode where, if a test is not part of the running set, its failure is not considered (the first sketch below illustrates this).

Weaknesses:

- It's very difficult to predict 'first failures'. With the current strategy, if it has been a long time since a test failed (or if it has never failed before), the relevancy of the test just goes down, and it never runs.
- Especially in database and parallel software, there are bugs that hide in the code for a long time until one test discovers them. Unfortunately, the analysis that I'm doing requires that the test runs exactly when the data indicates it will fail. If a test that would fail doesn't run in test run Z, the failure is just counted as missed, as if the bug was never encountered, even though the test might run in test run Z+1.
  - This affects the *time since last run* factor. This factor helps uncover 'hidden' bugs that can be exposed by tests that have not run in a while, but the available data makes it hard to measure that effect.
  - This also affects the *correlation* factor. If tests A and B often fail together, and in test run Z both of them would fail but only A runs, the heightened relevancy of B in the next test run would not make it catch anything (again, this is a limitation of the data, not of reality). The second sketch below shows the kind of co-failure bookkeeping I mean.
- Humans are probably a lot better at predicting first failures than the current strategy.

Some ideas:

- I need to be more strict with my testing, and with reviewing my code : )
- I need to improve prediction of 'first failures'. What would be a good way to do this?
  - Correlation between files changed and tests failed? Apparently Sergei tried this, but the results were not too good. However, that was before running in strict simulation mode; with strict simulation mode, anything that could help spot first failures could be considered (the third sketch below outlines this idea).

I am currently running tests to get the adjusted results. I will graph them and send them out in a couple of hours.
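To make the strict-mode accounting concrete, here is a minimal sketch of the caught/missed bookkeeping and the recall computation. This is illustrative only - the names and structure are hypothetical, not the actual code in simulator.py:

```python
# Minimal sketch of strict simulation mode accounting.
# All names are hypothetical; see simulator.py in the commit for the
# real logic.

def split_caught_missed(running_set, failing_tests):
    """Classify this run's real failures as caught or missed.

    running_set   -- tests selected to run in this test run
    failing_tests -- tests that actually failed according to the data
    """
    caught = [t for t in failing_tests if t in running_set]
    missed = [t for t in failing_tests if t not in running_set]
    return caught, missed


def recall(caught_total, missed_total):
    """Fraction of all historical failures that the strategy caught."""
    total = caught_total + missed_total
    return caught_total / total if total else 1.0
```

For example, with running_set = {'t1', 't2'} and failing_tests = ['t2', 't3'], 't3' is counted as missed even if it would run (and fail) in test run Z+1 - exactly the strictness described in the second weakness.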
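The co-failure bookkeeping behind the *correlation* factor could look roughly like this (again, hypothetical names, not the code from the commit):

```python
from collections import defaultdict
from itertools import combinations

# Illustrative sketch: count how often pairs of tests fail together,
# then boost the relevancy of a test whose frequent partners just
# failed.

co_failures = defaultdict(int)     # (test_a, test_b) -> co-failure count
failure_counts = defaultdict(int)  # test -> total failures seen


def record_run(failed_tests):
    """Update counters with the tests that failed in one test run."""
    for t in failed_tests:
        failure_counts[t] += 1
    for a, b in combinations(sorted(failed_tests), 2):
        co_failures[(a, b)] += 1


def correlation_boost(test, recent_failures):
    """Extra relevancy for `test`, given the tests that just failed."""
    boost = 0.0
    for other in recent_failures:
        pair = tuple(sorted((test, other)))
        if failure_counts[other]:
            boost += co_failures[pair] / failure_counts[other]
    return boost
```

In strict simulation mode, this boost only pays off if B actually makes it into the running set of a run where the data says B fails - which is precisely the limitation described above.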
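The 'files changed vs. tests failed' idea could be prototyped along these lines. This is a rough sketch of the general idea only; I don't know the details of what Sergei tried:

```python
from collections import defaultdict

# Illustrative sketch: remember which tests failed after commits that
# touched each file, then use that history to pre-select tests for a
# new change set.

file_to_failures = defaultdict(lambda: defaultdict(int))


def record_commit(changed_files, failed_tests):
    """Associate this commit's changed files with the failures seen."""
    for f in changed_files:
        for t in failed_tests:
            file_to_failures[f][t] += 1


def candidate_first_failures(changed_files, limit=20):
    """Tests most often seen failing after changes to these files."""
    scores = defaultdict(int)
    for f in changed_files:
        for t, n in file_to_failures[f].items():
            scores[t] += n
    return sorted(scores, key=scores.get, reverse=True)[:limit]
```

Even if this only spots a fraction of first failures, it targets exactly the case the relevancy decay gets wrong: tests that rarely fail, but are tied to a specific feature.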
Regards,
Pablo

On Fri, Jun 13, 2014 at 12:40 AM, Elena Stepanova <elenst@montyprogram.com> wrote:

Hi Pablo,
Thanks for the update.
On 12.06.2014 19:13, Pablo Estrada wrote:
Hello Sergei, Elena and all,

Today, while working on the script, I found and fixed an issue:
There was some faulty code in my script, in the part in charge of collecting statistics about whether a test failure was caught or not (here <https://github.com/pabloem/Kokiri/blob/master/basic_simulator.py#L393>). While fixing it, I came across another *problem*: the *recall numbers* that I had collected previously were *too high*.
The actual recall numbers, once we consider the test failures that are *not caught*, are disappointingly lower. I won't show you results yet, since I want to make sure that the code has been fixed and that I have accurate tests first.
This is all for now. The strategy I was using is a lot less effective than it initially seemed. I will send out a more detailed report with results, my opinion on the weak points of the strategy, and ideas, including a roadmap for trying to improve the results.
Regards. All feedback is welcome.
Please push your fixed code that triggered the new results, even if you are not ready to share the results themselves yet. It will be easier to discuss then.
Regards, Elena
Pablo