Hi Pablo,

On 13.06.2014 14:12, Pablo Estrada wrote:
Hello Elena and all, I have pushed the fixed code. There are a lot of changes in it because I went through all the code making sure that it made sense. The commit is here <https://github.com/pabloem/Kokiri/commit/7c47afc45a7b1f390e8737df58205fa53334ba09>, and although there are a lot of changes, the main line where failures are caught or missed is this <https://github.com/pabloem/Kokiri/blob/7c47afc45a7b1f390e8737df58205fa53334ba09/simulator.py#L496>.
I went through your code (the latest revision). I think I'll postpone detailed in-code comments till the next iteration, and instead will just list my major concerns and questions here.

1. Structure

The structure of the code is not quite what we need in the end. As you know, the goal is to return a set (list) of tests that MTR would run. I understand that you are currently experimenting and hence don't have the final algorithm to produce a single list. But what I would expect to see is:
- a core module which takes various parameters -- type of metric, running set, calculation mode, etc. -- does all the work and produces such a list (possibly with some statistical data which you need for the research, and which might be useful later anyway);
- a wrapper which feeds the core module with different sets of parameters, gets the results and compares them.
After the research is finished, the best parameters would become defaults, the wrapper would be abandoned or re-written to pass the resulting test list to MTR, while the core module would stay pretty much intact.
At first glance this seemed to be the case in your code, but it turned out it was not. run_basic_simulations.py looks like the wrapper described above, only it does the extra work of initializing the simulator, which it should not. On the other hand, simulator.py does not look like the described core module at all. It executes the logic for all modes regardless of the startup parameters, and this logic is very interleaved. After you choose the best approach, you will have to rewrite it heavily, which is not only a waste of time but also error-prone.

2. Cleanup

To prove the previous point: currently the experiments that you run are not independent. That is, if you call several simulations from run_basic_simulations.py, only the very first one will use the correct data and get real results. All subsequent ones will use the data modified by the previous ones, and their results will be totally irrelevant. It happens because there is an initial prepare in run_basic_simulations.py, but there is no cleanup between simulations. The whole test_info structure remains whatever it was at the end of the previous simulation, most importantly the metrics. Also, test_edit_factor cannot work for any simulation except the first one at all, because during the simulation you truncate the editions list but never restore it.

3. Modes

These flaws should be easy to fix by doing proper cleanup before each simulation. But there are also other fragments of code where, for example, the logic for 'standard' mode is assumed to always run and is relied upon, even if the desired mode is different. In fact, you build all the queues every time. It would be an understandable trade-off to save time on simulations, but you re-run them separately anyway and only return the requested queue.
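To illustrate the kind of separation and cleanup I mean, here is a very rough sketch. Everything in it (simulate(), build_queue(), the test_info layout) is a made-up placeholder rather than your actual code, and the real metric calculation goes where the stub is:

import copy

def build_queue(test_info, mode, use_edit_factor):
    """Stub for the real metric calculation, for ONE requested mode only."""
    return sorted(test_info, key=lambda name: test_info[name]['metric'], reverse=True)

def simulate(test_history, mode='standard', running_set=500, use_edit_factor=True):
    """Core module: takes all the parameters, does the work, returns the list
    of tests to run (plus whatever statistics the research needs)."""
    # Work on a private deep copy, so one simulation can never leak modified
    # metrics or truncated edition lists into the next one.
    local_info = copy.deepcopy(test_history)
    queue = build_queue(local_info, mode, use_edit_factor)
    return queue[:running_set]

if __name__ == '__main__':
    # Wrapper: prepare the data once, then feed the core module different
    # parameter sets and compare the results.
    history = {'main.select': {'metric': 0.7, 'editions': []},
               'innodb.alter': {'metric': 0.2, 'editions': []}}
    results = {mode: simulate(history, mode=mode, running_set=500)
               for mode in ('standard', 'platform', 'branch', 'mixed')}
    print(results)

The two properties that matter here are that the core function receives all the parameters and returns a plain list, and that it works on its own copy of the data, so the order of simulations in the wrapper stops mattering.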
4. Failed tests vs executed tests

Further, as I understand it, you only calculate the metrics for tests which were either edited or failed at least once; and thus, only such tests can ever make it into a corresponding queue. Not only does this create a bubble, it also makes the comparison of modes faulty and the whole simulation less efficient. Let's suppose for simplicity that we do not use the editing factor. In standard mode, the number of relevant failed tests for a single test run is obviously greater than, say, in mixed mode (because in standard mode all failures count, while in mixed mode only those that happened on that platform+branch do). So, when in standard mode you calculate metrics for, say, 1K tests, in mixed mode for a particular combination of platform+branch you'll do so for only 20 tests. It means that even though you set the running set to 500, in fact you'll run at most 20 tests. That's not desirable -- if we say we can afford running 500 tests, we'd rather run 500 than 20, even if some of them never failed before. This will also help us break the bubble, especially if we randomize the "tail" (tests with the minimal priority that we add to fill the queue). If some of them fail, they'll get a proper metric and will migrate to the meaningful part of the queue.

I know you don't have all the data about which tests were run or could be run in a certain test run; but for the initial simulation the information is fairly easy to obtain -- just use the corresponding stdio files, which you can obtain via the web interface, or run MTR to produce the lists; and in real life it should be possible to make MTR pass it over to your tool. To populate the queue, you don't really need to know which tests had ever been run; you only need to know which ones MTR *wants* to run if the running set is unlimited. If we assume that it passes the list to you and you iterate through it, you can use your metrics for the tests that failed or were edited before, and a default minimal metric for the other tests. Then, if the calculated tests are not enough to fill the queue, you'll choose randomly from the rest. It won't completely solve the problem of tests that never failed and were never edited, but at least it will make it less critical. (There is a small sketch of this queue-filling approach after these comments.)

5. Running set

It's a smaller issue, but getting back to the real usage of the algorithm: we cannot really set an absolute value for the running set. MTR options can be very different -- one builder may run a few hundred tests at most, another thousands. We should use a percentage instead.

6. Full / non-full simulation mode

I couldn't understand what the *non*-full simulation mode is for -- can you explain this?

7. Matching logic (get_test_file_change_history)

The logic where you try to match result file names to test names is not quite correct. Some highlights:

There can also be subsuites. Consider the example:
./mysql-test/suite/engines/iuds/r/delete_decimal.result

The result file can live not only in the /r dir, but also in the /t dir, together with the test file. It's not cool, but it happens, see for example mysql-test/suite/mtr/t/

Here are some other possible patterns for engine/plugin suites:
./storage/tokudb/mysql-test/suite/tokudb/r/rows-32m-1.result
./storage/innodb_plugin/mysql-test/innodb.result

Also, in release builds they can be in the mysql-test/plugin folder:
mysql-test/plugin/example/mtr/t

Be aware that the logic where you compare branch names doesn't currently work as expected. Your list of "fail branches" consists of clean names only, e.g. "10.0", while row[BRANCH] can be something like "lp:~maria-captains/maria/10.0". I'm not sure yet why it is sometimes stored this way, but it is. (One of the sketches after these comments shows a tolerant way to parse these result paths and normalize the branch names.)

I had more comments/questions, but let's address these ones first, and then we'll see what of the rest remains relevant. Comments on your notes from the email are below inline.
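A small sketch of the queue filling described in points 4 and 5 (build_running_set, the default percentage and the example metrics are all invented for illustration, not taken from your code):

import random

def build_running_set(full_test_list, metrics, percentage=0.3):
    """full_test_list: everything MTR *wants* to run with an unlimited running set.
    metrics: calculated priorities for tests that failed or were edited before."""
    size = max(1, int(len(full_test_list) * percentage))  # percentage, not an absolute number
    known = [t for t in full_test_list if t in metrics]
    known.sort(key=lambda t: metrics[t], reverse=True)
    # Tests we know nothing about share a minimal default priority; shuffle this
    # "tail" so that over many runs every such test eventually gets a chance to
    # fail (or be edited) and earn a real metric.
    tail = [t for t in full_test_list if t not in metrics]
    random.shuffle(tail)
    return (known + tail)[:size]

if __name__ == '__main__':
    all_tests = ['main.select', 'main.insert', 'innodb.alter', 'rpl.semisync']
    print(build_running_set(all_tests, {'main.select': 0.8, 'innodb.alter': 0.1},
                            percentage=0.5))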
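And a sketch for point 7 -- again only an illustration of the idea, not a reference implementation; the real get_test_file_change_history will need more corner cases (for instance, the old innodb_plugin layout above still falls through to 'main' here):

def parse_result_path(path):
    """Very rough: return (suite, test_name) for a *.result path, or None."""
    parts = path.lstrip('./').split('/')
    if not parts[-1].endswith('.result'):
        return None
    test = parts[-1][:-len('.result')]
    dirs = parts[:-1]
    # Result files may live in r/ or t/ -- drop that level if present.
    if dirs and dirs[-1] in ('r', 't'):
        dirs = dirs[:-1]
    # Whatever follows a 'suite' (or release-build 'plugin') component is the
    # suite name, possibly including a subsuite, e.g. engines + iuds.
    suite = 'main'
    for marker in ('suite', 'plugin'):
        if marker in dirs:
            suite = '.'.join(dirs[dirs.index(marker) + 1:]) or 'main'
            break
    return suite, test

def normalize_branch(branch):
    # "lp:~maria-captains/maria/10.0" and "10.0" should compare equal.
    return branch.rsplit('/', 1)[-1]

if __name__ == '__main__':
    for p in ('./mysql-test/suite/engines/iuds/r/delete_decimal.result',
              './storage/tokudb/mysql-test/suite/tokudb/r/rows-32m-1.result',
              './storage/innodb_plugin/mysql-test/innodb.result'):
        print(p, '->', parse_result_path(p))
    print(normalize_branch('lp:~maria-captains/maria/10.0'))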
1. The test result file edition information helps improve recall - if marginally
2. The time since last run information does not improve recall much at all - see [Weaknesses - 2]
Let's get back to both of them after the logic with dependent simulations is fixed; then we'll review it and see why it doesn't work, if it still doesn't. Right now any effect that file edition might have is rather coincidental, and possibly the other factor is broken as well.
A couple of concepts that I want to define before going on:
- *First failures*. These are failures that happen because of new bugs. They don't occur close in time as part of a chain of failures. They occur as a consequence of a transaction that introduces a bug, but they might occur soon or long after that transaction (usually soon rather than long). They might be correlated with the frequency of failure of a test (core or basic tests that fail often might be especially good at exposing bugs); but many of them are not (tests of a feature that don't fail often, but rather fail when that feature is modified).
- *Strict simulation mode.* This is the mode where, if a test is not part of the running set, its failure is not considered.
Weaknesses:
- It's very difficult to predict 'first failures'. With the current strategy, if it's been long since a test failed (or if it has never failed before), the relevancy of the test just goes down, and it never runs.
- Especially in database and parallel software, there are bugs that hide in the code for a long time until one test discovers them. Unfortunately, the analysis that I'm doing requires that the test runs exactly when the data indicates it will fail. If a test that would fail doesn't run in test run Z, even though it might run in test run Z+1, the failure is simply counted as missed, as if the bug was never encountered.
What you call "First failures" is the main target of the regression test suite. So, however difficult they are to predict, we should attempt to do so. On the bright side, we don't need to care that much about the other type, those that "hide in the code for a long time". There are indeed sporadic failures of either the code or a test, which happen every now and then, some often, some rarely; but they are not what the test suite is after. Ideally, they should not exist at all: the regression test suite is supposed to be totally deterministic, which means that a test that passed before may only fail if the related code or the test itself changed. So neither the "test edit" factor nor the "time" factor is really expected to improve recall a lot; their purpose is to help break the bubble. New and edited tests must run, that seems obvious. The time factor is less obvious, but it's our only realistic way to make sure that we don't forget some tests forever.
- This affects the *time since last run* factor. This factor helps encounter 'hidden' bugs that can be exposed by tests that have not run, but the available data makes it difficult to evaluate.
- This would also affect the *correlation* factor. If tests A and B often fail together, and on test_run Z both of them would fail but only A runs, the heightened relevancy of B on the next test_run would not make it catch anything (again, this is a limitation of the data, not of reality).
- Humans are probably a lot better at predicting first failures than the current strategy.
This is true; unfortunately, it's a full-time job which we can't afford to waste a human resource on.
Some ideas:
- I need to be more strict with my testing, and reviewing my code : )
- I need to improve prediction of 'first failures'. What would be a good way to improve this?
Putting aside code changes, which are too difficult to analyze, the only obvious realistic way is to combine test editing with the time factor, tune the time factor better, and also add randomization of the tests with equal priority that you put at the end of the queue.
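Purely as an illustration of combining the factors (the function names and especially the weights are arbitrary -- the weights are exactly what the simulations should tune):

import random

def priority(failure_metric, edited_recently, runs_since_last_executed,
             edit_weight=0.5, time_weight=0.01):
    """Combine the failure-based metric with the edit and time factors."""
    p = failure_metric
    if edited_recently:
        p += edit_weight                          # new/edited tests must run
    p += time_weight * runs_since_last_executed   # slowly resurface forgotten tests
    return p

def order_queue(tests_with_priority):
    """tests_with_priority: list of (name, priority) pairs. The random tie-breaker
    makes tests with equal (e.g. default) priority rotate through the tail of the
    queue instead of always occupying the same positions."""
    return sorted(tests_with_priority, key=lambda t: (-t[1], random.random()))

if __name__ == '__main__':
    tests = [('main.select', priority(0.3, True, 5)), ('main.insert', 0.0), ('rpl.semisync', 0.0)]
    print(order_queue(tests))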
- Correlation between files changed and tests failed? Apparently Sergei tried this, but the results were not too good.
- But that was before running in strict simulation mode. With strict simulation mode, anything that could help spot first failures could be considered.
As discussed before, it seems difficult to implement. Let's fix what we have now, and if the results are still not satisfactory, reconsider it later.
I am currently running tests to get the adjusted results. I will graph them, and send them out in a couple hours.
Please at least fix the dependent-simulations logic first. You should be able to see the problem easily by changing the order of modes in run_basic_simulations -- e.g. run standard / platform / branch / mixed for one running set, and then run again with mixed / branch / platform / standard.

Regards,
Elena
Regards
Pablo
On Fri, Jun 13, 2014 at 12:40 AM, Elena Stepanova <elenst@montyprogram.com> wrote:
Hi Pablo,
Thanks for the update.
On 12.06.2014 19:13, Pablo Estrada wrote:
Hello Sergei, Elena and all, Today while working on the script, I found and fixed an issue:
There is some faulty code in my script that is in charge of collecting the statistics about whether a test failure was caught or not (here <https://github.com/pabloem/Kokiri/blob/master/basic_simulator.py#L393>). I looked into fixing it, and then I saw another *problem*: the *recall numbers* that I had collected previously were *too high*.
The actual recall numbers, once we consider the test failures that are *not caught*, are disappointingly lower. I won't show you results yet, since I
want to make sure that the code has been fixed, and I have accurate tests first.
This is all for now. The strategy that I was using is a lot less effective than it seemed initially. I will send out a more detailed report with results, my opinion on the weak points of the strategy, and ideas, including a roadmap to try to improve results.
Regards. All feedback is welcome.
Please push your fixed code that triggered the new results, even if you are not ready to share the results themselves yet. It will be easier to discuss then.
Regards, Elena
Pablo
_______________________________________________
Mailing list: https://launchpad.net/~maria-developers
Post to     : maria-developers@lists.launchpad.net
Unsubscribe : https://launchpad.net/~maria-developers
More help   : https://help.launchpad.net/ListHelp