[Maria-developers] [GSoC] Optimize mysql-test-runs - Setback
Hello Sergei, Elena and all,

Today while working on the script, I found and fixed an issue: there is some faulty code in my script that is in charge of collecting the statistics about whether a test failure was caught or not (here <https://github.com/pabloem/Kokiri/blob/master/basic_simulator.py#L393>). I looked into fixing it, and then I could see another *problem*: the *recall numbers* that I had collected previously were *too high*. The actual recall numbers, once we consider the test failures that are *not caught*, are disappointingly lower. I won't show you results yet, since I want to make sure that the code has been fixed, and that I have accurate tests first.

This is all for now. The strategy that I was using is a lot less effective than it seemed initially. I will send out a more detailed report with results, my opinion on the weak points of the strategy, and ideas, including a roadmap to try to improve results.

Regards. All feedback is welcome.

Pablo
Hi Pablo, Thanks for the update. On 12.06.2014 19:13, Pablo Estrada wrote:
Hello Sergei, Elena and all, Today while working on the script, I found and fixed an issue:
There is some faulty code in my script that is in charge of collecting the statistics about whether a test failure was caught or not (here <https://github.com/pabloem/Kokiri/blob/master/basic_simulator.py#L393>). I looked into fixing it, and then I could see another *problem*: The *recall numbers* that I had collected previously were *too high*.
The actual recall numbers, once we consider the test failures that are *not caught*, are disappointingly lower. I won't show you results yet, since I want to make sure that the code has been fixed, and I have accurate tests first.
This is all for now. The strategy that I was using is a lot less effective than it seemed initially. I will send out a more detailed report with results, my opinion on the weak points of the strategy, and ideas, including a roadmap to try to improve results.
Regards. All feedback is welcome.
Please push your fixed code that triggered the new results, even if you are not ready to share the results themselves yet. It will be easier to discuss then. Regards, Elena
Pablo
_______________________________________________ Mailing list: https://launchpad.net/~maria-developers Post to : maria-developers@lists.launchpad.net Unsubscribe : https://launchpad.net/~maria-developers More help : https://help.launchpad.net/ListHelp
Hello Elena and all,

I have pushed the fixed code. There are a lot of changes in it because I went through all the code making sure that it made sense. The commit is here <https://github.com/pabloem/Kokiri/commit/7c47afc45a7b1f390e8737df58205fa53334ba09>, and although there are a lot of changes, the main line where failures are caught or missed is this <https://github.com/pabloem/Kokiri/blob/7c47afc45a7b1f390e8737df58205fa53334ba09/simulator.py#L496>.

1. The test result file edition information helps improve recall - if marginally
2. The time since last run information does not improve recall much at all - See [Weaknesses - 2]

A couple of concepts that I want to define before going on:

- *First failures*. These are failures that happen because of new bugs. They don't occur close in time as part of a chain of failures. They occur as a consequence of a transaction that introduces a bug, but they might occur soon or long after this transaction (usually soon rather than long). They might be correlated with the frequency of failure of a test (core or basic tests that fail often might be especially good at exposing bugs); but many of them are not (tests of a feature that don't fail often, but rather fail when that feature is modified).
- *Strict simulation mode.* This is the mode where, if a test is not part of the running set, its failure is not considered.

Weaknesses:

- It's very difficult to predict 'first failures'. With the current strategy, if it's been long since a test failed (or if it has never failed before), the relevancy of the test just goes down, and it never runs.
- Especially in database and parallel software, there are bugs that hide in the code for a long time until one test discovers them. Unfortunately, the analysis that I'm doing requires that the test runs exactly when the data indicates it will fail. If a test that would fail doesn't run in test run Z, even though it might run in test run Z+1, the failure is just considered as missed, as if the bug was never 'encountered'.
  - This affects the *time since last run* factor. This factor helps encounter 'hidden' bugs that can be exposed by tests that have not run, but the data available makes it difficult.
  - This would also affect the *correlation* factor. If test A and B fail together often, and on test_run Z both of them would fail, but only A runs, the heightened relevancy of B on the next test_run would not make it catch anything (again, this is a limitation of the data, not of reality).
- Humans are probably a lot better at predicting first failures than the current strategy.

Some ideas:

- I need to be more strict with my testing, and reviewing my code : )
- I need to improve prediction of 'first failures'. What would be a good way to improve this?
  - Correlation between files changed and tests failed? Apparently Sergei tried this, but the results were not too good - but this was before running in strict simulation mode. With strict simulation mode, anything that could help spot first failures could be considered.

I am currently running tests to get the adjusted results. I will graph them, and send them out in a couple of hours.

Regards
Pablo

On Fri, Jun 13, 2014 at 12:40 AM, Elena Stepanova <elenst@montyprogram.com> wrote:
Hi Pablo,
Thanks for the update.
On 12.06.2014 19:13, Pablo Estrada wrote:
Hello Sergei, Elena and all, Today while working on the script, I found and fixed an issue:
There is some faulty code in my script that is in charge of collecting the statistics about whether a test failure was caught or not (here <https://github.com/pabloem/Kokiri/blob/master/basic_simulator.py#L393>). I looked into fixing it, and then I could see another *problem*: The *recall numbers* that I had collected previously were *too high*.
The actual recall numbers, once we consider the test failures that are *not caught*, are disappointingly lower. I won't show you results yet, since I
want to make sure that the code has been fixed, and I have accurate tests first.
This is all for now. The strategy that I was using is a lot less effective than it seemed initially. I will send out a more detailed report with results, my opinion on the weak points of the strategy, and ideas, including a roadmap to try to improve results.
Regards. All feedback is welcome.
Please push your fixed code that triggered the new results, even if you are not ready to share the results themselves yet. It will be easier to discuss then.
Regards, Elena
Pablo
Hi Pablo, On 13.06.2014 14:12, Pablo Estrada wrote:
Hello Elena and all, I have pushed the fixed code. There are a lot of changes in it because I
I went through your code (the latest revision). I think I'll postpone detailed in-code comments till the next iteration, and instead will just list my major concerns and questions here.

1. Structure

The structure of the code seems to be not quite what we need at the end. As you know, the goal is to return a set (list) of tests that MTR would run. I understand that you are currently experimenting and hence don't have the final algorithm to produce a single list. But what I would expect to see is:

- a core module which takes various parameters -- type of metric, running set, calculation mode, etc. -- does all the work and produces such a list (possibly with some statistical data which you need for the research, and which might be useful later anyway);
- a wrapper which would feed the core module with different sets of parameters, get the results and compare them.

After the research is finished, the best parameters would become the defaults, the wrapper would be abandoned or re-written to pass the resulting test list to MTR, while the core module would stay pretty much intact (see the sketch at the end of these comments).

At first glance it seemed to be the case in your code, but it turned out it was not. run_basic_simulations.py looks like the wrapper described above, only it does the extra work of initializing the simulator, which it should not. On the other hand, simulator.py does not look like the described core module at all. It executes logic for all modes regardless of the start-up parameters, and this logic is very interleaved. After you choose the best approach, you will have to rewrite it substantially, which is not only a waste of time but is also error-prone.

2. Cleanup

To prove the previous point: currently the experiments that you run are not independent. That is, if you call several simulations from run_basic_simulations.py, only the very first one will use the correct data and get real results. All subsequent ones will use the data modified by the previous ones, and the results will be totally irrelevant. It happens because there is an initial prepare in run_basic_simulations.py, but there is no cleanup between simulations. The whole test_info structure remains what it was by the end of the previous simulation, most importantly the metrics. Also, test_edit_factor cannot work for any simulation except the first one at all, because during the simulation you truncate the editions list, but never restore it.

3. Modes

These flaws should be easy to fix by doing proper cleanup before each simulation. But there are also other fragments of code where, for example, logic for 'standard' mode is supposed to be always run and is relied upon, even if the desired mode is different. In fact, you build all queues every time. It would be an understandable trade-off to save the time on simulations, but you re-run them separately anyway, and only return the requested queue.

4. Failed tests vs executed tests

Further, as I understand it, you only calculate the metrics for tests which were either edited, or failed at least once; and thus, only such tests can ever make it to a corresponding queue. Not only does it create a bubble, but it also makes the comparison of modes faulty, and the whole simulation less efficient. Let's suppose for simplicity that we do not use the editing factor. In standard mode, the number of relevant failed tests for a single test run is obviously greater than, let's say, in mixed mode (because in standard mode all failures count, while in mixed mode only those that happened on the platform+branch combination).
So, when in standard mode you'll calculate metrics for, let's say, 1K tests, in mixed mode for a particular combination of platform+branch you'll do so only for 20 tests. It means that even though you set the running set to 500, in fact you'll only run 20 tests at most. It's not desirable -- if we say we can afford running 500 tests, we'd rather run 500 than 20, even if some of them never failed before. This will also help us break the bubble, especially if we randomize the "tail" (tests with the minimal priority that we add to fill the queue). If some of them fail, they'll get a proper metric and will migrate to the meaningful part of the queue.

I know you don't have all the data about which tests were run or can be run in a certain test run; but for the initial simulation the information is fairly easy to obtain -- just use the corresponding stdio files, which you can obtain via the web interface, or run MTR to produce the lists; and in real life it should be possible to make MTR pass it over to your tool. To populate the queue, you don't really need the information about which tests had ever been run; you only need to know which ones MTR *wants* to run if the running set is unlimited. If we assume that it passes the list to you, and you iterate through it, you can use your metrics for tests that failed or were edited before, and a default minimal metric for other tests. Then, if the calculated tests are not enough to fill the queue, you'll randomly choose from the rest. It won't completely solve the problem of tests that never failed and were never edited, but at least it will make it less critical.

5. Running set

It's a smaller issue, but back to the real usage of the algorithm: we cannot really set an absolute value for the running set. MTR options can be very different -- on one builder it can run at most hundreds of tests, on another thousands. We should use a percentage instead.

6. Full / non-full simulation mode

I couldn't understand what the *non*-full simulation mode is for, can you explain this?

7. Matching logic (get_test_file_change_history)

The logic where you are trying to match result file names to test names is not quite correct. There are some highlights:

There can also be subsuites. Consider the example:
./mysql-test/suite/engines/iuds/r/delete_decimal.result

The result file can live not only in /r dir, but also in /t dir, together with the test file. It's not cool, but it happens, see for example mysql-test/suite/mtr/t/

Here are some other possible patterns for engine/plugin suites:
./storage/tokudb/mysql-test/suite/tokudb/r/rows-32m-1.result
./storage/innodb_plugin/mysql-test/innodb.result

Also, in release builds they can be in the mysql-test/plugin folder:
mysql-test/plugin/example/mtr/t

Be aware that the logic where you compare branch names doesn't currently work as expected. Your list of "fail branches" consists of clean names only, e.g. "10.0", while row[BRANCH] can be like "lp:~maria-captains/maria/10.0". I'm not sure yet why it is sometimes stored this way, but it is.

I had more comments/questions, but let's address these ones first, and then we'll see what of the rest remains relevant. Comments on your notes from the email are below inline.
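Referring back to points 1 and 2: a rough sketch of the core/wrapper split described above, with every simulation getting a fresh core instance so no state leaks between runs. All names below are illustrative; this is not the existing Kokiri code.

    class TestSelector(object):
        """Core module: given its parameters, produce the list of tests to run."""
        def __init__(self, mode='standard', running_set=500,
                     use_edit_factor=True, use_time_factor=True):
            self.mode = mode
            self.running_set = running_set
            self.use_edit_factor = use_edit_factor
            self.use_time_factor = use_time_factor
            self.metrics = {}   # per-test relevancy, owned by this instance only

        def select_tests(self, test_run):
            # metric calculation for the chosen mode only would go here
            ranked = sorted(self.metrics, key=self.metrics.get, reverse=True)
            return ranked[:self.running_set]

    def run_experiments(history, param_sets):
        """Research wrapper: sweeps parameter sets; later replaced by MTR glue."""
        results = []
        for params in param_sets:
            selector = TestSelector(**params)   # fresh state for every simulation
            results.append((params, [selector.select_tests(tr) for tr in history]))
        return results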
went through all the code making sure that it made sense. The commit is here <https://github.com/pabloem/Kokiri/commit/7c47afc45a7b1f390e8737df58205fa53334ba09>, and although there are a lot of changes, the main line where failures are caught or missed is this <https://github.com/pabloem/Kokiri/blob/7c47afc45a7b1f390e8737df58205fa53334ba09/simulator.py#L496> .
1. The test result file edition information helps improve recall - if marginally
2. The time since last run information does not improve recall much at all - See [Weaknesses - 2]
Lets get back to it (both of them) after the logic with dependent simulations is fixed, after that we'll review it and see why it doesn't work if it still doesn't. Right now any effect that file edition might have is rather coincidental, possibly the other one is also broken.
A couple of concepts that I want to define before going on:
- *First failures*. These are failures that happen because of new bugs. They don't occur close in time as part of a chain of failures. They occur as a consequence of a transaction that introduces a bug, but they might occur soon or long after this transaction (usually soon rather than long). They might be correlated with the frequency of failure of a test (core or basic tests that fail often might be especially good at exposing bugs); but many of them are not (tests of a feature that don't fail often, but rather fail when that feature is modified).
- *Strict simulation mode.* This is the mode where, if a test is not part of the running set, its failure is not considered.
Weaknesses:
- It's very difficult to predict 'first failures'. With the current strategy, if it's been long since a test failed (or if it has never failed before), the relevancy of the test just goes down, and it never runs.
- Especially in database and parallel software, there are bugs that hide in the code for a long time until one test discovers them. Unfortunately, the analysis that I'm doing requires that the test runs exactly when the data indicates it will fail. If a test that would fail doesn't run in test run Z, even though it might run in test run Z+1, the failure is just considered as missed, as if the bug was never 'encountered'.
What you call "First failures" is the main target of the regression test suite. So, however difficult they are to predict, we should attempt to do so. On the bright side, we don't need to care that much about the other type, those that "hide in the code for a long time". There are indeed sporadic failures of either code or a test, which happen every now and then, some often, some rarely; but they are not what the test suite is after. Ideally, they should not exist at all, the regression test suite is supposed to be totally deterministic, which means that a test that passed before may only fail if the related code or the test itself changed. So, both "test edit" factor and "time" factor are not really expected to improve recall a lot, their purpose is to help to break the bubble. New and edited tests must run, it seems obvious. The time factor is less obvious but it's our only realistic way to make sure that we don't forget some tests forever.
- This affects the *time since last run* factor. This factor helps encounter 'hidden' bugs that can be exposed by tests that have not run, but the data available makes it difficult.
- This would also affect the *correlation* factor. If test A and B fail together often, and on test_run Z both of them would fail, but only A runs, the heightened relevancy of B on the next test_run would not make it catch anything (again, this is a limitation of the data, not of reality).
- Humans are probably a lot better at predicting first failures than the current strategy.
This is true, unfortunately it's a full time job which we can't afford to waste a human resource on.
Some ideas:
- I need to be more strict with my testing, and reviewing my code : )
- I need to improve prediction of 'first failures'. What would be a good way to improve this?
Putting aside code changes which are too difficult to analyze, the only obvious realistic way is to combine test editing with time factor, tune the time factor better, and also add randomization of the tests with equal priority that you put at the end of the queue.
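Purely as an illustration of the above (the shape of the formula and the coefficients are invented and would have to be tuned against the data):

    # Illustrative only: combine the failure metric with the edit and time factors.
    def relevancy(failure_metric, runs_since_last_run, edited_recently,
                  time_weight=0.01, edit_weight=0.5):
        score = failure_metric
        if edited_recently:
            score += edit_weight                        # new/edited tests should run
        score += time_weight * runs_since_last_run      # don't forget tests forever
        return score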
- Correlation between files changed - Tests failed? Apparently Sergei tried this, but the results were not too good - But this is before running in strict simulation mode. With strict simulation mode, anything that could help spot first failures could be considered.
As discussed before, it seems difficult to implement. Lets fix what we have now, and if the results are still not satisfactory, re-consider it later.
I am currently running tests to get the adjusted results. I will graph them, and send them out in a couple hours.
Please at least fix the dependent logic first. You should be able to see it easily by changing the order of modes in run_basic_simulations -- e.g. try to run standard / platform / branch / mixed for one running set, and then run again, with mixed / branch / platform / standard. Regards, Elena
Regards
Pablo
On Fri, Jun 13, 2014 at 12:40 AM, Elena Stepanova <elenst@montyprogram.com> wrote:
Hi Pablo,
Thanks for the update.
On 12.06.2014 19:13, Pablo Estrada wrote:
Hello Sergei, Elena and all, Today while working on the script, I found and fixed an issue:
There is some faulty code in my script that is in charge of collecting the statistics about whether a test failure was caught or not (here <https://github.com/pabloem/Kokiri/blob/master/basic_simulator.py#L393>). I looked into fixing it, and then I could see another *problem*: The *recall numbers* that I had collected previously were *too high*.
The actual recall numbers, once we consider the test failures that are *not caught*, are disappointingly lower. I won't show you results yet, since I
want to make sure that the code has been fixed, and I have accurate tests first.
This is all for now. The strategy that I was using is a lot less effective than it seemed initially. I will send out a more detailed report with results, my opinion on the weak points of the strategy, and ideas, including a roadmap to try to improve results.
Regards. All feedback is welcome.
Please push your fixed code that triggered the new results, even if you are not ready to share the results themselves yet. It will be easier to discuss then.
Regards, Elena
Pablo
Hi, Elena! Just one comment: On Jun 16, Elena Stepanova wrote:
4. Failed tests vs executed tests
Further, as I understand it, you only calculate the metrics for tests which were either edited, or failed at least once; and thus, only such tests can ever make it to a corresponding queue. Not only does it create a bubble, but it also makes the comparison of modes faulty, and the whole simulation less efficient.
About the bubble. Why is it bad? Because it decreases the recall - there are test failures (namely, outside of the bubble) that we'll never see. But because the whole purpose of this task is to optimize for a *high recall* in a short testing time, everything that makes recall worse needs to be analyzed. I mean, this is important - the bubble isn't bad in itself, it's only bad because it reduces the recall. If no strategy to break this bubble helps to improve the recall - we shouldn't break it at all!

On the other hand, perhaps you, Elena, think that missing a new test failure in one of the test files that wasn't touched by a particular revision is worse than missing some other test failure? Because that's a regression and so on. If this is the case, Pablo needs to change the fitness function he's optimizing. Recall assigns equal weights to all test failures; missing one failure is equally bad for all tests. If some failures are worse than others, a different fitness function - let's call it "weighted recall" - could be used to adequately map your expectations into the model.

Again - if you think that optimizing the model doesn't do what we want it to do, the way to fix it is not to add artificial heuristics and rules into it, but to modify the model.
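A minimal sketch of plain recall versus such a "weighted recall" (the weight function is hypothetical; it is whatever mapping of failure importance we decide on):

    def recall(caught, missed):
        total = len(caught) + len(missed)
        return len(caught) / float(total) if total else 1.0

    def weighted_recall(caught, missed, weight):
        # weight(failure) -> float, e.g. 2.0 for a first failure, 1.0 otherwise
        caught_w = sum(weight(f) for f in caught)
        total_w = caught_w + sum(weight(f) for f in missed)
        return caught_w / total_w if total_w else 1.0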
It means that even though you set the running set to 500, in fact you'll only run 20 tests at most. It's not desirable -- if we say we can afford running 500 tests, we'd rather run 500 than 20, even if some of them never failed before. This will also help us break the bubble
Same as above. The model optimizes the recall as a function of test time (ideally, that is). If it shows that running 20 tests produces the same recall as running 500 tests - it should run 20 tests. Indeed, why should it run more if it doesn't improve the recall? Although I expect that running 500 tests *will* improve the recall, of course, even if only marginally.

Anyway, my whole point is - let's stay within the model and improve the fitness function (which is recall at the moment). It's the only way to see quantitatively what every strategy gives and whether it should be used at all.

That said, Pablo should try to do something about the bubble, I suppose. E.g. run more tests and randomize the tail? And see whether it helps to improve the recall.

Regards, Sergei
Hi Sergei, On 16.06.2014 10:57, Sergei Golubchik wrote:
Hi, Elena!
Just one comment:
On Jun 16, Elena Stepanova wrote:
4. Failed tests vs executed tests
Further, as I understand it, you only calculate the metrics for tests which were either edited, or failed at least once; and thus, only such tests can ever make it to a corresponding queue. Not only does it create a bubble, but it also makes the comparison of modes faulty, and the whole simulation less efficient.
About the bubble. Why is it bad? Because it decreases the recall - there are test failures (namely, outside of the bubble) that we'll never see.
But because the whole purpose of this task is to optimize for a *high recall* in a short testing time, everything that makes recall worse needs to be analyzed.
I mean, this is important - the bubble isn't bad for itself, it's only bad because it reduces the recall. If no strategy to break this bubble will help to improve the recall - we shouldn't break it at all!
Right, and I want to see a proof that it really does *not* improve recall, because I think it should. Currently we think that our recall is a function of the running set, and we say -- okay, after N=100 it flattens and doesn't improve much further. But it might well be that it flattens simply because the queue doesn't get filled -- of course there will be no difference between N=100 and N=500 if the queue is less than 100 anyway.

Then again, if recall is close to 100% either way, it might not be important, but

a) I doubt it is: as Pablo said, the previous results were not accurate, and from what I saw after we remove dependencies between simulation runs, we should be somewhere below 50% with the mixed mode on N=500.

b) Unless I'm missing something, the bubble becomes critical if we add, let's say, a new platform, because it does not allow us to choose tests which never failed on this platform, so the queue will be empty and the platform won't be tested at all, at least until some tests get edited (assuming we use the editing factor).

In any case, right now the experiments provide results different from what we think they do. If we want to compare the "full queue" effect with the "non-full queue", let's make it another parameter.
On the other hand, perhaps you, Elena, think that missing a new test failure in one of the test files that wasn't touched by a particular revision is worse than missing some other test failure? Because that's a regression and so on. If this is the case, Pablo needs to change the fitness function he's optimizing. Recall assigns equal weights to all test failures; missing one failure is equally bad for all tests. If some failures are worse than others, a different fitness function - let's call it "weighted recall" - could be used to adequately map your expectations into the model.
No, I wasn't thinking about that. I'm still staying within the same model, where all failures have equal weights. On the contrary, my notes regarding "first time failures" vs "sporadic failures" were supposed to say that we don't need to do anything specific about sporadic failures: if they are caught then they are caught, if not then not. Sorry if it wasn't clear.

I do however think that abandoning a test forever because it hadn't failed for a long time is a wrong thing to do, but the tools for dealing with it are already in the model -- the time factor and the editing factor, and they have been there from the beginning; they just need to be tuned (editing needs to be fixed, and possibly the time coefficient changed if the current value doesn't provide good results -- that's something to experiment with).
Again - if you think that optimizing the model doesn't do what we want it to do, the way to fix it is not to add artificial heuristics and rules into it, but to modify the model.
It means that even though you set the running set to 500, in fact you'll only run 20 tests at most. It's not desirable -- if we say we can afford running 500 tests, we'd rather run 500 than 20, even if some of them never failed before. This will also help us break the bubble
Same as above. The model optimizes the recall as a function of test time (ideally, that is). If it shows that running 20 tests produces the same recall as running 500 tests - it should run 20 tests. Indeed, why should it run more if it doesn't improve the recall?
Same as above, I think it will improve the recall, and most likely even essentially, and at the very least we need to see the difference so we can make an informed decision about it.
Although I expect that running 500 tests *will* improve the recall, of course, even if only marginally.
Anyway, my whole point is - let's stay within the model and improve the fitness function (which is recall at the moment). It's the only way to see quantitatively what every strategy gives and whether it should be used at all.
It is still the same model. The core of the model was to make recall a function of cutoff, right? So let's try it first, let's make it a real cutoff and see the results. Not filling the queue completely (or in some cases having it empty) is an optimization over the initial model, which improves the execution time (marginally) but affects recall (even if only marginally). It can be considered, but the results should be compared to the basic ones.

And if, let's say, we decide that N=100 (or N=10%) is the best cutoff value, and then find out that by not filling the queue completely we lose even 1% in recall, we might want to stay with the full queue. What is the time difference between running 50 tests and 100 tests? Almost nothing, especially compared to what we spend on preparation of the tests. So, if 100 tests vs 50 tests adds 1% to recall, and also helps to solve the problem of never-running tests, I'd say it's better to stay within the initial model.

Regards, Elena
That said, Pablo should try to do something about the bubble, I suppose. E.g. run more tests and randomize the tail? And see whether it helps to improve the recall.
Regards, Sergei
Hi, Elena! On Jun 16, Elena Stepanova wrote:
It is still the same model. The core of the model was to make recall a function of cutoff, right? So
Yes, kind of. Ideally, it should be a recall as a function of the total testing time (taking into account number of tests to run, relative builders' speeds and individual test durations). For example, it's better to run two tests on a fast builder, than one test on a slow builder. Better to run three short tests, than one long test. Because that's what we care about - to get highest possible recall and as fast as possible. But, indeed, using number of tests, instead of time, makes the model simpler and only marginally affects the results (if at all). That's what I've seen in my experiments. So, yes, recall a function of cutoff it is.
lets try it first, lets make it real cutoff and see the results. Not filling the queue completely (or in some cases having it empty) is optimization over the initial model, which improves the execution time (marginally) but affects recall (even only marginally). It can be considered, but the results should be compared to the basic ones.
And if lets say we decide that N=100 (or N=10%) is the best cutoff value, and then find out that by not filling the queue completely we lose even 1% in recall, we might want to stay with the full queue. What is the time difference between running 50 tests and 100 tests? Almost nothing, especially comparing to what we spend on preparation of the tests.
I don't think this project will determine what the "best" value is. It can only find the "best set of model parameters" for recall as a function of cutoff (here the "best set" means that for any other set of parameters, recall - in the target recall range - will be less for any given value of cutoff).

In other words, this project will deliver a function recall(cutoff). And then we can decide what cutoff we want and how many failures we can afford to miss. For example, 80% recall can be achieved in 5% of the time, 90% recall in 10% of the time, 95% recall in 50% of the time, etc. And then you can say, "no, we cannot miss more than 5% of failures, so we'll have to live with 50% speedup only". But no experiments will tell you how many failures are acceptable for us to miss.

Regards, Sergei
Hi Sergei, On 16.06.2014 13:46, Sergei Golubchik wrote:
And if lets say we decide that N=100 (or N=10%) is the best cutoff value, and then find out that by not filling the queue completely we lose even 1% in recall, we might want to stay with the full queue. What is the time difference between running 50 tests and 100 tests? Almost nothing, especially comparing to what we spend on preparation of the tests.
I don't think this project will determine what the "best" value is. It can only find the "best set of model parameters" for recall as a function of cutoff (here the "best set" means that for any other set of parameters, recall - in the target recall range - will be less for any given value of cutoff).
In other words, this project will deliver a function recall(cutoff). And then we can decide what cutoff we want and how many failures we can afford to miss. For example, 80% recall can be achieved in 5% of the time, 90% recall in 10% of the time, 95% recall in 50% of the time, etc.
Of course... The example above with the numbers was merely to show that since our primary goal is to make the cutoff small enough, an attempt to artificially reduce it further by not using the whole queue won't necessarily pay off. In terms of execution time, it's an essential difference whether we run 100 vs 2000 tests, or even 1000 vs 2000 tests; it's hardly critical whether we run 50 or 100 tests.

But to be sure, we need to see whether filling/not-filling the queue makes such an obvious difference that only one approach should always be used, or whether the difference is marginal and we should make it a part of the "set of model parameters" and choose its value later as a part of our voluntary decision (and maybe adjust it later). Currently we only have the "non-filling" strategy in place, so we can't see the difference, and it will be a pity if we miss this chance to improve the recall at almost no cost.

/E
And then you can say, "no, we cannot miss more than 5% of failures, so we'll have to live with 50% speedup only". But no experiments will tell you how many failures are acceptable for us to miss.
Regards, Sergei
Hello everyone,

Thank you Elena, I really appreciate that you took the time to make a thorough review of the code. Answers to your comments are inline. Conclusion at the end : )

1. Structure
- When I started coding, I had no idea how the project would fit in the whole system, so, after asking a bit, I wrote a script much like Sergei's example: only with the simulations and statistics in mind and not considering the implementation phase. I tried to make it a bit more practical to use, but I did not have in mind using this code as part of the implementation - only the lessons learned from it.
- If you think it's a better idea, we can define exactly what the interfaces of my project should be, so that I can write something with views on the implementation, rather than a pure research-oriented script.
- Regarding the multi-mode logic: I decided to keep the logic from all modes because it's only a constant overhead (it doesn't affect the complexity of the original algorithm significantly), and at first I was considering all failures, not only those from tests that ran. I can easily add the code to run only the mode of interest.

2. Cleanup
if you call several simulations from run_basic_simulations.py, only the very first one will use the correct data and get real results. All subsequent ones will use the data modified by the previous ones, and the results will be totally irrelevant.
Yup, this is a huge mistake. When I implemented the series of simulations, I completely forgot about the metrics, and thought there was no extra cleanup necessary; but I was wrong. This is the first thing I'll fix.
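A minimal sketch of such a cleanup, assuming the per-test data lives in a test_info structure (class and attribute names are illustrative; keeping a pristine deep copy is just one way to do it):

    import copy

    class Simulator(object):
        def __init__(self, test_info):
            # keep a pristine copy so every simulation starts from the same data
            self._pristine_test_info = copy.deepcopy(test_info)
            self.test_info = copy.deepcopy(test_info)

        def cleanup(self):
            """Restore metrics and edition lists before starting a new simulation."""
            self.test_info = copy.deepcopy(self._pristine_test_info)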
3. Modes

These flaws should be easy to fix by doing proper cleanup before each simulation. But there are also other fragments of code where, for example, logic for 'standard' mode is supposed to be always run and is relied upon, even if the desired mode is different.
In fact, you build all queues every time. It would be an understandable trade-off to save the time on simulations, but you re-run them separately anyway, and only return the requested queue.
I decided to keep the logic of all modes because it does not affect the complexity of the algorithm... and yes, I relied on the standard mode because it always runs. This is not a mistake, but rather a 'cheat' that I was aware of, but indulged in. This is easy to fix - and I don't think it is too serious a problem. If I were to code something with a view towards the implementation, I would make sure that only the strictly necessary code would run.

4. Failed tests vs executed tests
- if we say we can afford running 500 tests, we'd rather run 500 than 20, even if some of them never failed before. This will also help us break the bubble, especially if we randomize the "tail" (tests with the minimal priority that we add to fill the queue). If some of them fail, they'll get a proper metric and will migrate to the meaningful part of the queue.
I don't understand what you mean by bubble, but I see your point. You are right. I will work on making sure that we run RUNNING_SET tests, even when the priority queue contains less than that.
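A minimal sketch of topping up the running set as described: tests with a computed metric first, in priority order, then a randomized tail of tests that MTR wants to run but that have no history yet (names are illustrative):

    import random

    def build_queue(wanted_tests, metrics, running_set_size):
        prioritized = sorted((t for t in wanted_tests if t in metrics),
                             key=lambda t: metrics[t], reverse=True)
        queue = prioritized[:running_set_size]
        if len(queue) < running_set_size:
            rest = [t for t in wanted_tests if t not in metrics]
            random.shuffle(rest)
            queue += rest[:running_set_size - len(queue)]
        return queue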
To populate the queue, You don't really need the information which tests had ever been run; you only need to know which ones MTR *wants* to run, if the running set is unlimited. If we assume that it passes the list to you, and you iterate through it, you can use your metrics for tests that failed or were edited before, and a default minimal metric for other tests. Then, if the calculated tests are not enough to fill the queue, you'll randomly choose from the rest. It won't completely solve the problem of tests that never failed and were never edited, but at least it will make it less critical.
This brings us back to the question of research-only vs preparing-for-implementation; yes, this can be done.

6. Full / non-full simulation mode
I couldn't understand what the *non*-full simulation mode is for, can you explain this?
This is from the beginning of the implementation, when I was considering all failures - even from tests that were not supposed to run. I can remove this already.
7. Matching logic (get_test_file_change_history)
The logic where you are trying to match result file names to test names is not quite correct. There are some highlights:
There can also be subsuites. Consider the example: ./mysql-test/suite/engines/iuds/r/delete_decimal.result
The result file can live not only in /r dir, but also in /t dir, together with the test file. It's not cool, but it happens, see for example mysql-test/suite/mtr/t/
Here are some other possible patterns for engine/plugin suites: ./storage/tokudb/mysql-test/suite/tokudb/r/rows-32m-1.result ./storage/innodb_plugin/mysql-test/innodb.result Also, in release builds they can be in mysql-test/plugin folder: mysql-test/plugin/example/mtr/t
When I coded that logic, I used the website that you had suggested, and thought that I had covered all possibilities. I see that I didn't. I will work on that too.
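A rough sketch of a matching function covering the patterns listed above; the regular expressions are only illustrative and would need to be checked against the real tree layouts:

    import re

    # result-file path -> (suite, test name); suite is None for the
    # plugin-style layouts that have no suite directory.
    RESULT_PATTERNS = [
        # mysql-test/suite/<suite>[/<subsuite>]/{r,t}/<test>.result
        re.compile(r'mysql-test/suite/(?P<suite>.+)/[rt]/(?P<test>[^/]+)\.result$'),
        # e.g. storage/innodb_plugin/mysql-test/innodb.result
        re.compile(r'mysql-test/(?P<test>[^/]+)\.result$'),
    ]

    def result_path_to_test(path):
        for pattern in RESULT_PATTERNS:
            match = pattern.search(path)
            if match:
                return match.groupdict().get('suite'), match.group('test')
        return None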
Be aware that the logic where you compare branch names doesn't currently work as expected. Your list of "fail branches" consists of clean names only, e.g. "10.0", while row[BRANCH] can be like "lp:~maria-captains/maria/10.0". I'm not sure yet why it is sometimes stored this way, but it is.
Yes, I am aware of this; but when I looked at the data, there were very, very few cases like this, so I decided to just ignore them. All the other cases (>27 thousand) are presented as clean names. I believe this is negligible, but if you think otherwise, I can add the logic to parse these few cases.

MariaDB [kokiri_apr23]> select branch, count(*) from changes where branch like 'lp%' group by 1 order by 2;
+-------------------------------------------------------+----------+
| branch                                                | count(*) |
+-------------------------------------------------------+----------+
| lp:~maria-captains/maria/5.3-knielsen                 |        1 |
| lp:~maria-captains/maria/5.5                          |        1 |
| lp:~maria-captains/maria/mariadb-5.1-mwl132           |        1 |
| lp:~maria-captains/maria/mariadb-5.1-mwl163           |        1 |
| lp:~maria-captains/maria/maria-5.1-table-elimination  |        3 |
| lp:~maria-captains/maria/10.0-FusionIO                |        7 |
| lp:~maria-captains/maria/5.1-converting               |        9 |
| lp:~maria-captains/maria/10.0-connect                 |       15 |
| lp:~maria-captains/maria/10.0                         |       45 |
+-------------------------------------------------------+----------+

Lets get back to it (both of them) after the logic with dependent
simulations is fixed, after that we'll review it and see why it doesn't work if it still doesn't. Right now any effect that file edition might have is rather coincidental, possibly the other one is also broken.
Yup!
- Humans are probably a lot better at predicting first failures than
the current strategy.
This is true, unfortunately it's a full time job which we can't afford to waste a human resource on.
I was not suggesting that we hire a person for this : ). What I meant is that a developer would be a lot better at choosing how to test their own changes than a script that just accounts for recent failures. This is information that a developer can easily leverage on their own, along with their code changes, etc... I was trying to say that, considering all this, I need to improve the algorithm.

In conclusion, this is what I will work on now:

- Add cleanup for every run - I just added (and pushed) this. Right now I am running tests with this logic.
- Get new results, graph and report them ASAP.
- Add randomized running of tests that have not run or been modified - I will make this optional.
- Fix the logic that converts test result file names to test names

Questions to consider:

- Should we define precisely what the implementation should look like, so that I can code the core, wrapper, etc.? Or should I focus on finishing the experiments and the 'research' first?
- I believe the logic to compare branch names should be good enough, but if you consider that it should be 100% fixed, I can do that. What do you think?

Regards
Pablo
Hi Pablo, On 17.06.2014 12:16, Pablo Estrada wrote:
Questions to consider:
- Should we define precisely what the implementation should look like, so that I can code the core, wrapper, etc? Or should I focus on finishing the experiments and the 'research' first?
I suggest focusing on finishing the research, unless Sergei prefers otherwise. The structure is a nuisance, but not a critical one; if you happen to have time at the end of the project, you can modify it, otherwise we will do it later.
- I believe the logic to compare branch names should be good enough, but if you consider that it should be 100% fixed, I can do that. What do you think?
Lets make a TODO note about it and postpone it till later. It shouldn't affect the research part, but we will probably need to fix it before we start using the script. Regards, Elena
Regards Pablo
Hi Pablo, One more note: On 17.06.2014 12:16, Pablo Estrada wrote:
- Add randomized running of tests that have not run or been modified - I will make this optional.
I think it will be very useful for the *research* part if you make the seed for this randomization deterministic for every particular test run (or make it configurable); e.g. you can use the timestamp of the test run as a seed. This way you will keep results invariant. Otherwise when you run consequent experiments with the same parameters, you will be getting different results, which might be confusing and difficult to interpret. Regards, Elena
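A minimal sketch of such deterministic seeding, assuming the test run's timestamp is available (names are illustrative):

    import random

    def shuffled_tail(candidates, test_run_timestamp):
        rng = random.Random(test_run_timestamp)   # same test run -> same shuffle
        tail = list(candidates)
        rng.shuffle(tail)
        return tail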
Hello everyone,

I'll start with the more trivial content: I am fixing the logic to match test result files with test names. I already fixed the following cases that you provided, Elena:

./mysql-test/suite/engines/iuds/r/delete_decimal.result
mysql-test/suite/mtr/t/
./storage/tokudb/mysql-test/suite/tokudb/r/rows-32m-1.result
./storage/innodb_plugin/mysql-test/innodb.result

I'm just not quite sure what you mean with this example: mysql-test/plugin/example/mtr/t
In this example, what is the test name? And what is exactly the path? (./mysql-test/...) or (./something/mysql-test/...)? I tried to look at some of the test result files but I couldn't find one certain example of this pattern (meaning that I'm not sure what would be a real instance of it). Can you be more specific please?

Now, the more *serious issue*: here are the results. They are both a bit counterintuitive and a bit strange; but that's what I received. Basically, it seems that less is more, and recall gets worse as we add more detailed metrics. I ran the simulations only for 5000 test_runs, but I can also run them with more, and see what comes out of that. By the way, just to be sure, I recreated the objects for every new simulation, so there is no doubt regarding the dependency between simulations in this case. (Although I implemented a cleanup function as well, and I'll push the final version soon.)

In any case, I don't think the results are good enough even in the best case... so maybe we'd want to explore more strategies. What do you guys think? I feel that, staying with the same model, we can get back to the correlations between test failures. I already have a local branch in my repository where I started experimenting with this, so it should not take long to perfect that code. I can try this. If we want to be a bit more adventurous, we can also observe the correlation between file changes and test failures.

If we want to incorporate the correlation between file changes and test failures, we can base everything on correlations and make it a more probabilistic model: in the learning phase, collect the probabilities that each test fails given the 'events' that occurred since the last test run (for example, files that changed, test files that changed, and tests that failed the last time). This builds a set of *probabilities*, and we run the tests that have the highest *probabilities* of failing in this run. The probabilities are updated as new test_runs happen, and the data and model are updated. (We can use a decay algorithm, or we can let the probabilities decay on their own as correlations between events and test failures shift.) A rough sketch of this idea follows after the quoted message below. Of course, this idea is more 'disruptive', would take a lot more time to code, and it has a bigger risk (since we don't have any guarantee of the results)... it is also probably more complex computationally. Maybe I should focus on fixing all possible bugs in my code right now, but on the other hand, I don't think that will improve results dramatically. What do you guys think?

In summary:

- I am fixing the logic to convert test result file names; can you clarify your last example, Elena? (See top of email)
- The results are close to 60% for 200 tests. How's that?
- Once I've fixed the issues pointed out by Elena, I will start working on correlation between test failures, to see if that helps at all.
- Considering a very different strategy to try to improve results. How's that?

Regards
Pablo

On Tue, Jun 17, 2014 at 5:40 PM, Elena Stepanova <elenst@montyprogram.com> wrote:
Hi Pablo,
One more note:
On 17.06.2014 12:16, Pablo Estrada wrote:
- Add randomized running of tests that have not run or been modified
- I will make this optional.
I think it will be very useful for the *research* part if you make the seed for this randomization deterministic for every particular test run (or make it configurable); e.g. you can use the timestamp of the test run as a seed. This way you will keep results invariant. Otherwise when you run consequent experiments with the same parameters, you will be getting different results, which might be confusing and difficult to interpret.
Regards, Elena
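A rough sketch of the probabilistic idea described above: count how often each test fails when a given 'event' (a changed file, an edited test file, a failure in the previous run) was seen since the last test run, and rank tests by the resulting estimate. All names are illustrative; this is not existing Kokiri code:

    from collections import defaultdict

    class EventFailureModel(object):
        """Illustrative sketch: estimate P(test fails | event seen since last run)."""
        def __init__(self):
            self.event_seen = defaultdict(int)                        # event -> times seen
            self.event_fail = defaultdict(lambda: defaultdict(int))   # event -> test -> co-failures

        def learn(self, events, failed_tests):
            for e in events:
                self.event_seen[e] += 1
                for t in failed_tests:
                    self.event_fail[e][t] += 1

        def rank(self, events, candidate_tests, running_set_size):
            score = {}
            for t in candidate_tests:
                # crude estimate: the best co-failure rate over the observed events
                score[t] = max([self.event_fail[e][t] / float(self.event_seen[e])
                                for e in events if self.event_seen[e]] or [0.0])
            ranked = sorted(candidate_tests, key=lambda t: score[t], reverse=True)
            return ranked[:running_set_size]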
Hi Pablo, I'll send a more detailed reply later, just a couple of quick comments/questions now. To your question
I'm just not quite sure what you mean with this example: mysql-test/plugin/example/mtr/t
In this example, what is the test name? And what is exactly the path? (./mysql-test/...) or (./something/mysql-test/...)? I tried to look at some of the test result files but I couldn't find one certain example of this pattern (Meaning that I'm not sure what would be a real instance of it). Can you be more specific please?
I meant that if you look into the folder <tree>/mysql-test/suite/mtr/t/ , you'll see an example of what I described as "The result file can live not only in /r dir, but also in /t dir, together with the test file":

ls mysql-test/suite/mtr/t/
combs.combinations  combs.inc  inc.inc  newcomb.result  newcomb.test
proxy.inc  self.result  self.test  simple,c2,s1.rdiff  simple.combinations
simple.result  simple,s2,c2.rdiff  simple,s2.result  simple.test
single.result  single.test  source.result  source.test
test2.result  test2.test  testsh.result  testsh.test

As far as I remember, your matching algorithm didn't cover that.
Here are the results. They are both a bit counterintuitive, and a bit strange
Have you already done anything regarding (not) populating the queue completely? I did expect that with the current logic, after adding full cleanup between simulations, the more restrictive configuration would have lower recall, because it generally runs much fewer tests. It would be interesting to somehow indicate in the results how many tests were *actually* run. But if you don't have this information, please don't re-run the full set just for the sake of it, maybe run only one running set for standard/platform/branch/mixed, and let us see the results. No need to spend time on graphs for that, a text form will be ok. Either way, please push the current code, I'd like to see it before I come up with any suggestions about the next big moves. Regards, Elena
I have pushed my latest version of the code, and here is a test run that ran on this version of the code. It is quite different from the original expectation; so I'm taking a close look at the code for bugs, and will run another simulation ASAP (I'll use less data to make it faster). On Thu, Jun 19, 2014 at 5:16 PM, Elena Stepanova <elenst@montyprogram.com> wrote:
Hi Pablo,
I'll send a more detailed reply later, just a couple of quick comments/questions now.
To your question
I'm just not quite sure what you mean with this example: mysql-test/plugin/example/mtr/t
In this example, what is the test name? And what is exactly the path? (./mysql-test/...) or (./something/mysql-test/...)? I tried to look at some of the test result files but I couldn't find one certain example of this pattern (Meaning that I'm not sure what would be a real instance of it). Can you be more specific please?
I meant that if you look into the folder <tree>/mysql-test/suite/mtr/t/ , you'll see an example of what I described as "The result file can live not only in /r dir, but also in /t dir, together with the test file":
ls mysql-test/suite/mtr/t/ combs.combinations combs.inc inc.inc newcomb.result newcomb.test proxy.inc self.result self.test simple,c2,s1.rdiff simple.combinations simple.result simple,s2,c2.rdiff simple,s2.result simple.test single.result single.test source.result source.test test2.result test2.test testsh.result testsh.test
As far as I remember, your matching algorithm didn't cover that.
Here are the results. They are both a bit counterintuitive, and a bit
strange
Have you already done anything regarding (not) populating the queue completely? I did expect that with the current logic, after adding full cleanup between simulations, the more restrictive configuration would have lower recall, because it generally runs much fewer tests.
It would be interesting to somehow indicate in the results how many tests were *actually* run. But if you don't have this information, please don't re-run the full set just for the sake of it, maybe run only one running set for standard/platform/branch/mixed, and let us see the results. No need to spend time on graphs for that, a text form will be ok.
Either way, please push the current code, I'd like to see it before I come up with any suggestions about the next big moves.
Regards, Elena
Hello everyone,

I ran the tests with randomization on Standard and Mixed mode, and here are the results.

1. Standard does not experience variation - the queue is always long enough.
2. Mixed does experience some variation - actually, the number of tests run changes dramatically, but I forgot to add the data in the chart. I can report it too, but yes, the difference is large.
3. In any case, the results are still not quite satisfactory, so we can think back to what I had mentioned earlier: how should we change our paradigm to try to improve our chances?

Regards
Pablo

On Fri, Jun 20, 2014 at 7:45 PM, Pablo Estrada <polecito.em@gmail.com> wrote:
I have pushed my latest version of the code, and here is a test run that ran on this version of the code. It is quite different from the original expectation; so I'm taking a close look at the code for bugs, and will run another simulation ASAP (I'll use less data to make it faster).
On Thu, Jun 19, 2014 at 5:16 PM, Elena Stepanova <elenst@montyprogram.com> wrote:
Hi Pablo,
I'll send a more detailed reply later, just a couple of quick comments/questions now.
To your question
I'm just not quite sure what you mean with this example: mysql-test/plugin/example/mtr/t
In this example, what is the test name? And what is exactly the path? (./mysql-test/...) or (./something/mysql-test/...)? I tried to look at some of the test result files but I couldn't find one certain example of this pattern (Meaning that I'm not sure what would be a real instance of it). Can you be more specific please?
I meant that if you look into the folder <tree>/mysql-test/suite/mtr/t/ , you'll see an example of what I described as "The result file can live not only in /r dir, but also in /t dir, together with the test file":
ls mysql-test/suite/mtr/t/ combs.combinations combs.inc inc.inc newcomb.result newcomb.test proxy.inc self.result self.test simple,c2,s1.rdiff simple.combinations simple.result simple,s2,c2.rdiff simple,s2.result simple.test single.result single.test source.result source.test test2.result test2.test testsh.result testsh.test
As far as I remember, your matching algorithm didn't cover that.
Here are the results. They are both a bit counterintuitive, and a bit
strange
Have you already done anything regarding (not) populating the queue completely? I did expect that with the current logic, after adding full cleanup between simulations, the more restrictive configuration would have lower recall, because it generally runs much fewer tests.
It would be interesting to somehow indicate in the results how many tests were *actually* run. But if you don't have this information, please don't re-run the full set just for the sake of it, maybe run only one running set for standard/platform/branch/mixed, and let us see the results. No need to spend time on graphs for that, a text form will be ok.
Either way, please push the current code, I'd like to see it before I come up with any suggestions about the next big moves.
Regards, Elena
Hi Pablo,

Thanks for the update. I'm looking into it right now, but meanwhile I have one quick suggestion.

Currently your experiments are being run on a small part of the historical data (5% or so). From all I see, you can't afford running on a bigger share even if you want to, because the script is slow. Since it's obvious that you will need to run it many more times before we achieve the results we hope for, it's worth investing a little bit of time into the performance.

For starters, please remove logger initialization from internal functions. Now you call getLogger from a couple of functions, including the one calculating the metric, which means that it's called literally millions of times even on a small part of the data set. Instead, make logger a member of the simulator class, initialize it once, e.g. in __init__; I expect you'll gain quite a lot by this no-cost change.

If it becomes faster, please run the same tests with e.g. ~50% of the data (learning set 47,000, max_count 50,000), or less if it's still not fast enough. No need to run all run_set values, do for example 100 and 500. It's interesting to see whether using the deeper history makes an essential difference; I expect it might, but I'm not sure.

Please also indicate which parameters the experiments were run with (editing and timing factors).

Regards, Elena

On 22.06.2014 18:13, Pablo Estrada wrote:
Hello everyone, I ran the tests with randomization on Standard and Mixed mode, and here are the results. 1. Standard does not experience variation - The queue is always long enough. 2. Mixed does experience some variation - Actually, the number of tests run changes dramatically, but I forgot to add the data in the chart. I can report it too, but yes, the difference is large. 3. In any case, the results are still not quite satisfactory, so we can think back to what I had mentioned earlier: How should we change our paradigm to try to improve our chances?
Regards Pablo
Hi Elena, I ran these tests using the time factor but not the test edit factor. I will make the code changes and then run the tests on a bigger scale. I will take a serious look through the code to try to squeeze out as much performance as possible as well : )

Regards Pablo
Hello Elena, Here's what I have to report:

1. I removed the getLogger call from every function where it was called. It did improve performance significantly.
2. I also made sure that only the priority queues that concern us are built. This did not improve performance much.
3. Here are my results with 50,000 test runs with randomization, test_edit_factor and time_factor. They are not much better. (Should I run them without randomization or other options?)
4. I started with concepts and a bit of code for a new strategy. I am still open to working with the current codebase.

Here are results using 47,000 test_run cycles as training, and another 3,000 for predictions. They don't improve much. My theory is that this happens because the metrics are really linear: they only leverage information from very recent runs, and older information becomes irrelevant quickly.

I started coding a bit of a new approach, looking at the correlation between events since the last test run and test failures. For instance:

- Correlation between files changed and tests failed
- Correlation between failures in the last test run and the new one (tests that fail several times in a row are more relevant)

I just started, so this is only the idea, and there is not much code in place. I believe I can code most of it in less than a week. Of course, I am still spending time thinking about how to improve the current strategy, and am definitely open to any kind of advice. Please let me know your thoughts.

Regards Pablo
Hi Pablo, On 23.06.2014 17:29, Pablo Estrada wrote:
Hello Elena, Here's what I have to report:
1. I removed the getLogger call from every function where it was called. It did improve performance significantly.
That's nice.
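For reference, the change amounts to something like the sketch below (the class and method names are illustrative only, not the actual Kokiri code):

    import logging

    class Simulator(object):
        def __init__(self):
            # Create the logger once, instead of calling getLogger() inside the
            # hot internal functions (e.g. the metric calculation), where it
            # would be invoked millions of times over the data set.
            self.logger = logging.getLogger('simulator')

        def calculate_metric(self, test_name):
            # Illustrative placeholder for the real metric calculation; the
            # point is that it reuses self.logger instead of re-fetching it.
            self.logger.debug('computing metric for %s', test_name)
            return 0.0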
2. I also made sure that only the priority queues that concern us are built. This did not improve performance much.
It's good that you removed unnecessary parts anyway. With bigger data sets it might make a bit of difference -- not big, but at some point everything might count.
3. Here are my results with 50,000 test runs with randomization, test_edit_factor and time_factor. They are not much better. (Should I run them without randomization or other options?)
Not just yet, let's analyze the results first.
4. I started with concepts and a bit of code for a new strategy. I am still open to working with the current codebase.
I think it's premature to switch to a new strategy. Before doing that, we need to be clear why the results of the current strategy are not satisfactory. I have some hypothetical ideas about it, but confirming or ruling them out requires more experimentation.
Here are results using 47,000 test_run cycles as training, and another 3,000 for predictions. They don't improve much. My theory is that this happens because the metrics are really linear: they only leverage information from very recent runs, and older information becomes irrelevant quickly.
This is an interesting result. The number of training cycles must matter; at least it defines which part of the algorithm is working (or failing). The lack of difference in recall actually means a lot. For simplicity, let's assume for the rest of this email that we are talking about the standard mode without any factors, unless specified otherwise.

There are two reasons for missing a failure: (1) the test is not in the queue at all (it never failed before, index is -1); (2) the test is in the queue, but too far from the head (index > running_set).

When you are running your simulation close to the head of the history (2000/5000), you only consider as many failures as happened in the first cycles, so your queue is rather shallow, even though it's still bigger than the running set. So, most likely the main part of the 'misses' is caused by reason (1) -- the tests are not in the queue at all. There isn't much you can do about it, as long as you only use previously failed tests. The time factor will also be of no help here, because the queue just doesn't contain the needed test. The test editing factor is important: even if it doesn't improve recall short-term, it should extend the queue.

But when you are running the simulation deeper in the history, your queue contains many more tests to choose from, and you have more material to work with. If currently it doesn't help to improve recall, it means that more 'misses' happen for reason (2): the queues are not properly sorted/filtered. That can and needs to be improved within the current strategy.

Unfortunately, the deeper into the history, the more time the script takes, which should also be taken into account -- we can't afford 30 minutes of script time in each test run; it would make the whole idea pointless. Here your last results become really important. If it doesn't matter whether you calculated metrics based on 2000 runs or 40,000 runs, maybe we won't need to use the complete logic on the whole dataset. Instead, we can very quickly populate the queue with test names and minimal initial metric values, and only do the calculation for the learning set.

Now, about improving sorting/filtering. For one, I once again suggest using an *incoming test set* as a starting point for choosing tests. This is important both for appropriate evaluation of the current strategy, and for real-life use. Here is what happens now:

- a test run ran some 3000 tests, and 10 of them failed;
- your queue contains 1500 failed tests, somehow arranged -- these are tests from all possible MTR configurations;
- you take the "first" 500 tests and compare them with the list of failures in the simulated test run;
- let's say you get 6 intersections, so your recall is 0.6.

The problem here is that the queue contains *any* tests, many of which this test run didn't use at all, so they couldn't possibly have failed. It can contain pbxt tests which are already gone; or unit tests which are only run in special MTR runs; or plugin tests for a plugin which is not present in this build; and so on. So, while your resulting queue contains 500 tests, there are only, let's say, 200 tests which were *really* run in that test run. If you had considered that, your recall of 0.6 would have been not for RS 500, but for RS 200, which is better. Or, if while populating the queue you had ignored irrelevant tests, the relevant ones would have ended up much closer to the head of the queue, and would probably have made it into the running set of 500; thus you would have "caught" more failures with RS 500, and recall would have been better.
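To make the filtering idea concrete, here is a minimal sketch of restricting the queue to the tests the incoming run actually contains before taking the running set and computing recall (the function and variable names are made up for illustration):

    def choose_and_score(queue, incoming_tests, failed_tests, running_set_size=500):
        # queue          -- list of test names, ordered by descending relevancy metric
        # incoming_tests -- set of test names this MTR run is configured to execute
        # failed_tests   -- set of test names that actually failed in this run

        # Drop queued tests that this run cannot possibly execute (removed suites,
        # unit tests, plugins not present in this build, and so on).
        relevant_queue = [t for t in queue if t in incoming_tests]

        running_set = relevant_queue[:running_set_size]

        caught = failed_tests.intersection(running_set)
        recall = float(len(caught)) / len(failed_tests) if failed_tests else 1.0
        return running_set, recall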
It will be even more important in real-life use, because when we want to pick N% of tests, we don't want *any* tests: each MTR run is configured to run a certain logical subset of the whole MTR suite, and we want to choose from it only. In real life, MTR creates the complete list of tests at the beginning, and it should be easy enough to pass it to your script.

For your experiments, while the test lists are not in the database, they can easily be extracted from the test logs, which I can upload for you for a certain number of test runs. In order to do that, though, you need to start using the end of your data set (the most recent test runs), because we might not have the old logs. It's an easy thing to do: you will just need to skip the first len(test_history) - max_limit runs. You'll need to send me the range of test run IDs for which you need the logs. The logs look like this: http://buildbot.askmonty.org/buildbot/builders/kvm-bintar-quantal-x86/builds... That is, they are text files which are easy enough to parse. You will need to choose the lines which contain [ pass ] or [ fail ] or [ skipped ] or [ disabled ] (yes, skipped and disabled too, because they will be on the list that MTR initially creates); see the small parsing sketch below.

Further, before you start rethinking the strategy of *choosing* tests, you should analyze why the current one isn't working. Did you try to see the dynamics of recall within a single experiment? I mean, you go through 2000 training runs where you take into account all available information and calculate metrics based on it. Then you run 3000 simulation sets where you calculate recall and re-calculate metrics, but now you only take into account the information that would be available if the test run used the simulation strategy. This is the right thing to do; but did you check what happens to recall over these 3000 runs? What I expect is that it's very good at the beginning of the simulation set, because you use the full and recent data, so recall will be close to 1. But then it will begin to deteriorate. If so, the real question is not how to improve the metrics and queuing algorithms, but how to preserve the accuracy. That's where the strategy might need some adjustments. I don't have ready-to-use suggestions; we need to understand how exactly it deteriorates. I'll look into it more.
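Regarding the log parsing mentioned above, a minimal sketch; the exact line layout is an assumption here (I'm assuming the test name is the first whitespace-separated token on each status line), so treat it as a starting point rather than the real format:

    import re

    # Status markers of interest; skipped and disabled tests are included
    # because they are part of the list MTR builds at the start of the run.
    STATUS_RE = re.compile(r'\[ (pass|fail|skipped|disabled) \]')

    def tests_in_log(log_text):
        # Collect the names of all tests that appear with a status marker.
        tests = set()
        for line in log_text.splitlines():
            if STATUS_RE.search(line):
                tests.add(line.split()[0])
        return tests

    # Hypothetical usage with a downloaded buildbot log file:
    # with open('kvm-bintar-quantal-x86_build.log') as f:
    #     incoming_tests = tests_in_log(f.read())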
I started coding a bit of a new approach, looking at the correlation between events since the last test run and test failures. For instance (a rough sketch follows the list):
- Correlation between files changed and tests failed
- Correlation between failures in the last test run and the new one (tests that fail several times in a row are more relevant)
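A rough sketch of the co-occurrence counting behind these two correlations (the data structures and names are assumptions, not existing Kokiri code):

    from collections import defaultdict

    # How often a changed file and a failing test have been observed together,
    # and how often a test fails again right after failing in the previous run.
    file_test_cooccurrence = defaultdict(int)
    consecutive_failures = defaultdict(int)

    def update_counts(changed_files, failed_now, failed_in_previous_run):
        # Update the counters after observing one test run.
        for test in failed_now:
            for path in changed_files:
                file_test_cooccurrence[(path, test)] += 1
            if test in failed_in_previous_run:
                consecutive_failures[test] += 1

    def relevancy(test, changed_files, failed_in_previous_run):
        # Toy relevancy score: correlation with the files changed since the
        # last run, plus a bonus for tests that failed in the previous run.
        score = sum(file_test_cooccurrence[(path, test)] for path in changed_files)
        if test in failed_in_previous_run:
            score += 1 + consecutive_failures[test]
        return score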
I just started, so this is only the idea, and there is not much code in place. I believe I can code most of it in less than a week.
Of course, I am still spending time thinking about how to improve the current strategy, and am definitely open to any kind of advice. Please let me know your thoughts.
See the above. I'm afraid making the correlation between code changes and test failures work accurately might take much longer than the initial coding, so I'd rather we focus on analyzing and improving the functionality we already have. That said, if you already have results, by all means share them; let's see how promising it looks. Regards, Elena