[Maria-developers] [GSoC] Optimize mysql-test-runs - Results of new strategy
Hello everyone,

I spent the last couple of days working on a new strategy to calculate the relevance of a test. The results are not sufficient by themselves, but I believe they point in an interesting direction. This strategy uses the rate of co-occurrence of events to estimate the relevance of a test, and the events it uses are the following:

- File edits since the last run
- Test failure in the last run

The strategy has two stages:

1. Training stage
2. Executing stage

In the training stage, it goes through the available data and does the following:

- If test A failed:
  - It counts and stores all the files that were edited since the last test_run (the last test_run depends on BRANCH, PLATFORM, and other factors)
  - If test A also failed in the previous test run, it counts that as well

In the executing stage, the training algorithm is still applied, but the decision of whether a test runs is based on its relevance. The relevance is calculated as the sum of the following (a rough sketch of the calculation is in the P.S. at the end of this message):

- The percentage of times the test has failed in two subsequent test_runs, multiplied by whether the test failed in the previous run (if the test didn't fail in the previous run, this quantity is 0)
- For each file that was edited since the last test_run, the percentage of times that the test has failed after this file was edited

(The explanation is a bit clumsy; I can clear it up if you wish.) The results have been neither too good nor too bad: with a running set of 200 tests, a training phase of 3000 test runs, and an executing stage of 2000 test runs, I achieved a recall of 0.50.

Nonetheless, while running tests, I found something interesting:

- I removed the first factor of the relevance, i.e. I decided not to care about whether a test failed in the previous test run and used only the file-change factor. Naturally, the recall decreased from 0.50 to 0.39 (not a big decrease)... and the distribution of failed tests in the priority queue had a good skew towards the front of the queue, so it seems that the file changes do help to indicate the likelihood of a failure. I attached this chart.

An interesting problem I encountered is that about 50% of the test_runs have no file changes and no test failures, so the relevance of all tests is zero. Here is where the original strategy (a weighted average of failures) could be useful: even if we have no information to guess which tests to run, we just go ahead and run the ones that have failed the most, recently.

I will work on mixing both strategies a bit in the next few days, and see what comes of that.

By the way, I pushed the code to GitHub. The code is completely different, so it may be better to wait until I have new results soon.

Regards!
Pablo
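P.S. To make the calculation concrete, here is a rough sketch in Python of the training counters and the relevance formula. The names and data layout are simplified for illustration; they are not the actual simulator.py structures.

    from collections import defaultdict

    # Co-occurrence counters filled during the training stage.
    fail_count      = defaultdict(int)    # test -> total failures seen
    fail_after_fail = defaultdict(int)    # test -> failures that followed a failure in the previous run
    file_edit       = defaultdict(int)    # file -> test_runs before which the file was edited
    file_fail       = defaultdict(int)    # (test, file) -> failures seen after an edit of the file

    def train(failed_tests, failed_in_previous_run, edited_files):
        """Update the counters with the outcome of one test_run."""
        for f in edited_files:
            file_edit[f] += 1
        for test in failed_tests:
            fail_count[test] += 1
            if test in failed_in_previous_run:
                fail_after_fail[test] += 1
            for f in edited_files:
                file_fail[(test, f)] += 1

    def relevance(test, failed_in_previous_run, edited_files):
        """Sum of the two factors described above."""
        rel = 0.0
        if test in failed_in_previous_run and fail_count[test]:
            rel += fail_after_fail[test] / float(fail_count[test])
        for f in edited_files:
            if file_edit[f]:
                rel += file_fail[(test, f)] / float(file_edit[f])
        return rel

In the executing stage, a test_run would then simply run the N tests with the highest relevance().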
Hi all,

Well, as I said, I have incorporated a very simple weighted failure rate into the strategy, and I have found quite encouraging results. The recall looks better than in earlier tests. I am attaching two charts with data compiled from runs with 3000 training rounds and 2000 simulation rounds (5000 test runs analyzed in total):

- The recall by running set size (as shown, it reaches 80% with 300 tests)
- The index of failures in the priority queue (running set: 500, training 3000, simulation 2000)

It is interesting to look at chart number 2: the first 10 or so places have a very high count of found failures. These most likely come from repeated failures (tests that failed in the previous run and were caught in the next one). The next ones have a skew to the right, and these come from the file-change model.

I am glad of these new results :). I have a couple of new ideas to try to push the recall a bit further up, but I wanted to show the progress first. Also, I will do a thorough code review before any new changes, to make sure that the results are valid. Interestingly enough, in this new strategy the code is simpler. I will also run a test over a longer period (20,000 training, 20,000 simulation) to see whether the recall degrades as time passes and we miss more failures.

Regards!
Pablo
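P.S. For reference, the "very simple weighted failure rate" only acts as a fallback ordering for the many test_runs where the co-occurrence relevance gives no signal. Roughly like this, building on the sketch from my previous mail (the decay constant is just an illustrative value, not the one I actually use):

    from collections import defaultdict

    DECAY = 0.95                            # illustrative decay constant
    weighted_failures = defaultdict(float)  # test -> decayed count of past failures

    def update_weighted_failures(failed_tests):
        """Decay every count a little, then bump the tests that just failed."""
        for t in list(weighted_failures):
            weighted_failures[t] *= DECAY
        for t in failed_tests:
            weighted_failures[t] += 1.0

    def priority_key(test, cooccurrence_relevance):
        """Sort key for the priority queue: the co-occurrence relevance decides
        first; when it is zero for everything (about half of the test_runs),
        the tests that failed the most, most recently, come out on top."""
        return (cooccurrence_relevance, weighted_failures[test])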
Hi Pablo,

Could you please explain why you consider the new results better? I don't see any obvious improvement.

As I understand from the defaults, previously you were running tests with 2000 training rounds and 3000 simulation rounds, and you already had ~70% on 300 runs and ~80% on 500 runs; see your email of June 19, no_options_simulation.jpg.

Now you have switched the limits: you are running with 3000 training and 2000 simulation rounds. It makes a big difference. If you re-run tests with the old algorithm with the new limits, you'll get +10% easily, so RS 300 will be around the same 80%, and RS 500 should be even higher, pushing 90%, while now you have barely 85%.

Before jumping onto the new algorithm, please provide a comparison of the old and new approaches with equal pre-conditions and parameters.

Thanks,
Elena
Hi Elena and all,

I guess I should admit that my excitement was a bit too much; but I am definitely not 'jumping' into this strategy either. As I said, I am trying to use the lessons learned from all the experiments to make the best predictions.

That being said, a strong point of the new strategy is that rather than purely using failure rate to predict failure rate, it uses more data to make its predictions, and its predictions are more consistent. On the 3k-training / 2k-prediction simulations its advantage is not so apparent (they fare similarly, with the 'standard' strategy being the best one), but it becomes more evident with longer prediction periods.

I ran tests with 20k training rounds and 20k prediction rounds, and the new strategy fared a lot better. I have attached charts comparing both of them. We can observe that with a running set of 500, the original algorithm had a very nice recall of almost 95% in shorter tests, but it falls to less than 50% with longer testing (and it must be a lot lower if we average only the last couple of thousand runs rather than all 20k simulation runs together).

Since the goal of the project is to provide consistent long-term test optimization, we want to take all we can learn from the new strategy and improve the consistency of the recall over long-term simulation. Nevertheless, I agree that there are important lessons in the original strategy, particularly that >90% recall in shorter prediction periods. That's why I'm still tuning and testing.

Again, all advice and observations are welcome. Hope everyone is having a nice weekend.

Pablo
Hi Pablo,
Okay, thanks, it looks much more convincing.

Indeed, as we already discussed before, the problem with the previous strategy or implementation is that the recall deteriorates quickly after you stop using complete results as the learning material, and start taking into account only simulated results (which is what would be happening in real life). If the new method helps to solve this problem, it's worth looking into.

You mentioned before that you pushed the new code, where is it located? I'd like to look at it before making any further conclusions.

Regards,
Elena
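P.S. Just so we are talking about the same thing, by "only simulated results" I mean a loop roughly like the one below, where the model only ever sees the outcome of the tests it actually chose to run. The names are illustrative, not your actual code; rank() and learn() stand for whatever scoring and updating functions the strategy uses.

    def run_simulation(test_runs, training_rounds, running_set_size, rank, learn):
        """rank(test, run) scores a test for a run; learn(run, observed_failures)
        updates the model. Both are placeholders for the real strategy."""
        caught = total = 0
        for i, run in enumerate(test_runs):
            if i < training_rounds:
                learn(run, run.failed_tests)           # full, real outcome
                continue
            ordered = sorted(run.candidate_tests,
                             key=lambda t: rank(t, run), reverse=True)
            selected = set(ordered[:running_set_size])
            observed = run.failed_tests & selected     # only what we actually ran
            caught += len(observed)
            total += len(run.failed_tests)
            learn(run, observed)                       # missed failures stay invisible
        return caught / float(total) if total else None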
Hi Elena,

It's on a new branch in the same repository; you can see it here:
https://github.com/pabloem/Kokiri/tree/file_correlation

I changed the whole simulator.py file. I made sure to comment the header of every function, but there are not too many in-line comments. Let me know if you need more clarification.

Regards,
Pablo
Hello everyone,

I spent these days observing some of the missed failures and trying to tune the strategy to improve recall. I looked at why the relevance of the missed failures was not high enough, and at what would be a good tune-up of the strategy to prevent these problems. In the current strategy, the relevance of a test depends mostly on one of three factors:

1. The test failed in the previous run. These tests have a high relevance, usually around 1, and tend to be at the very front of the priority queue. These are the high bars at positions 1, 2, 3 and 4 in the chart.
2. Files related to the test were changed. These tests have a rather low relevance, but tend to be near the beginning of the priority queue. Some of the missed failures come from here. We might be able to avoid missing these tests by being more precise when checking the relationship between changed files and failed tests.
3. No files were changed, but the weighted failure rate is higher for this test. These tests tend to have low relevance. Some of the missed failures are here. They are usually tests that failed too long ago and have become irrelevant (and since no files were changed, their relevance pales compared to tests that failed more recently). These failures are very hard to predict. Randomization can be helpful *in practice*, but with the data that we have now, randomization does not improve recall very much - it just makes it vary a tiny bit, up and down. I can go further into why I think this would be a good measure in practice but doesn't work for our data.

So here is what I think can be done to try to improve recall:

1. Tests that failed in the previous run: these are quite fine. No need to improve this metric.
2. Become more precise when assessing which file changes correspond to which test_run. Right now, we take EVERY file change that happened between previous_run and next_run; that includes files that went to other branches. Instead of doing this, I plan to use file_changes that are related to test_runs through the changes-sourcestamp-buildset-buildrequests-builds-test_run chain of relationships (see the sketch in the P.S. below). I still have not analyzed this data, but I believe it should be workable. *I will work on this over the next few days.*
3. The weighted failure rate and randomization are an interesting option, but I believe they would be more useful in practice, and so they would require an extra phase in the project, and time is limited (we would need a period of comparing predictions with results in buildbot). I am definitely willing to consider working on this, but I guess for now we should focus on the Aug 16 deadline.

Again, if anyone sees any 'area of opportunity', or has any advice, it's all welcome.

Regards,
Pablo
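P.S. To make point 2 concrete, the kind of query I have in mind looks roughly like the one below. The buildbot tables (change_files, sourcestamp_changes, buildsets, buildrequests, builds) are the standard ones, but the column linking builds to our test_run table is an assumption on my part and still needs to be checked against the actual schema.

    # Files whose changes actually belong to a given test_run, following the
    # changes -> sourcestamp -> buildset -> buildrequest -> build -> test_run
    # chain. Column names, especially the builds <-> test_run link, are assumptions.
    FILES_FOR_TEST_RUN = """
        SELECT DISTINCT cf.filename
        FROM test_run tr
        JOIN builds b               ON b.id = tr.build_id          -- assumed link
        JOIN buildrequests br       ON br.id = b.brid
        JOIN buildsets bs           ON bs.id = br.buildsetid
        JOIN sourcestamp_changes sc ON sc.sourcestampid = bs.sourcestampid
        JOIN change_files cf        ON cf.changeid = sc.changeid
        WHERE tr.id = %s
    """

    def files_for_test_run(cursor, test_run_id):
        """Return only the files tied to this test_run, instead of every file
        edited anywhere in the tree since the previous run."""
        cursor.execute(FILES_FOR_TEST_RUN, (test_run_id,))
        return [row[0] for row in cursor.fetchall()]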
Hi Pablo,

Can you give a rough estimate of the ratio of failures missed due to being low in the priority queue vs. those that were not in the queue at all? If you can't, don't waste time on getting this information, but since you have already analyzed the data, I thought you might have an answer right away.

Also, once again, I would like you to start using an incoming test list as the initial point of your test set generation. It must be done sooner or later, I already explained earlier why; and while it's not difficult to implement even after the end of your project, it might affect the result considerably, so we need to know if it makes things better or worse, and adjust the algorithm accordingly. You don't need to create an actual interface with MTR; any form of the test list will do, as long as it's the correct list.

Also, there is another dimension in the test suite which I think you didn't take into account before, and which might be useful, especially if you run out of ideas. There are several options which change the way the tests are executed; e.g. tests can be run in a "normal" mode, or in PS-protocol mode, or with valgrind, or with embedded server. And it might well be that some tests always fail e.g. with valgrind, but almost never fail otherwise. Information about these options is partially available in test_run.info, but it would require some parsing (see the P.S. at the end of this mail for the kind of parsing I mean). It would be perfect if you could analyze the existing data to understand whether using it can affect your results before spending time on actual code changes.

A couple more comments inline.

On 14.07.2014 19:37, Pablo Estrada wrote:
> In the current strategy, the relevance of a test depends mostly on one of three factors:
>
> 1. The test failed in the previous run. These tests have a high relevance, usually around 1, and tend to be at the very front of the priority queue. These are the high bars at positions 1, 2, 3 and 4 in the chart.
This is expected. It happens because usually it takes some time for a developer to notice a new failure, to fix it and push the fix; in the meantime, several more runs are executed. These failures are indeed easy to predict. Unfortunately, they are not the most interesting ones, but as we agreed, we won't categorize failures as "more important" and "less important". In any case, it's good to have them caught.
> 2. Files related to the test were changed. These tests have a rather low relevance, but tend to be near the beginning of the priority queue. Some of the missed failures come from here. We might be able to avoid missing these tests by being more precise when checking the relationship between changed files and failed tests.
Are you talking about test files (.test and .result, and maybe .inc), or any code changes? There might be a difference based on real-life practices. If a test/result file gets changed, it's almost certain that the person who changed it actually ran the test and made sure it passes. It can still happen that it fails on a different platform, but it's rather rare. So, using this information is not expected to provide a big short-term gain. However, it is still very important to take it into account, because tests that get changed are the most "modern" ones, and hence should be run. Luckily, it's easy enough to do.

When we are trying to watch all code changes and find correlations with test failures, if it's done well, it should actually provide immediate gain; however, it's very difficult to do it right: there is way too much noise in the statistical data to get a reliable picture. So, while it will be nice if you get it to work (since you already started doing it), don't take it as a defeat if you eventually find out that it doesn't work very well.
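If you do decide to treat the two kinds of changes differently, a very simple classification is probably enough; something like the sketch below, where the path prefix is an assumption about how the changed files are listed in the change records:

    TEST_SUFFIXES = ('.test', '.result', '.inc')

    def is_test_file(path):
        """Rough split between test-suite changes and server code changes.
        Assumes changed files are listed with tree-relative paths like
        'mysql-test/...'."""
        return path.startswith('mysql-test/') or path.endswith(TEST_SUFFIXES)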
> 3. No files were changed, but the weighted failure rate is higher for this test. These tests tend to have low relevance. Some of the missed failures are here. They are usually tests that failed too long ago and have become irrelevant. These failures are very hard to predict. Randomization can be helpful *in practice*, but with the data that we have now, randomization does not improve recall very much.
If you are using your former "standard" method (where no branch/platform information is taken into account), and if you only randomize the empty tail of the queue, then randomization should make no difference at all: you'll have more failures in the queue than the target size of the test set, so there is simply no room for randomization. Or maybe you are talking about some other kind of randomization; I'll look at the code (please make sure it's up-to-date in git).

Regards,
/E
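P.S. Regarding the execution options: the parsing I have in mind is nothing fancy, roughly like the sketch below. The exact contents of test_run.info vary, so the option tokens here are only examples, not a verified list.

    # Options that change how tests are executed; the strings are examples of
    # what might appear in test_run.info, not an exhaustive or verified list.
    KNOWN_OPTIONS = ('valgrind', 'ps-protocol', 'embedded', 'big')

    def options_from_info(info):
        """Extract a normalized set of execution options from a test_run.info
        string, e.g. to keep separate failure statistics per option."""
        if not info:
            return frozenset()
        text = info.lower()
        return frozenset(opt for opt in KNOWN_OPTIONS if opt in text)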
Hello Elena,

> Can you give a raw estimation of a ratio of failures missed due to being low in the priority queue vs those that were not in the queue at all?

I sent this information in a previous email, here: https://lists.launchpad.net/maria-developers/msg07482.html

> Also, once again, I would like you to start using an incoming test list as an initial point of your test set generation. It must be done sooner or later, I already explained earlier why; and while it's not difficult to implement even after the end of your project, it might affect the result considerably, so we need to know if it makes it better or worse, and adjust the algorithm accordingly.

You are right. I understand that this information is not fully available for all the test_runs, so can you upload the information going back as far as possible? I can parse these files and adjust the program to work with them. I will get to work on this; I think it should significantly improve results, and it might even push my current strategy from promising results into attractive ones.

> There are several options which change the way the tests are executed; e.g. tests can be run in a "normal" mode, or in PS protocol mode, or with valgrind, or with embedded server. And it might well be that some tests always fail e.g. with valgrind, but almost never fail otherwise. Information about these options is partially available in test_run.info, but it would require some parsing. It would be perfect if you could analyze the existing data to understand whether using it can affect your results before spending time on actual code changes.

I will keep this in consideration, but for now I will focus on these two main things:
- Improving the precision of selecting code changes to estimate correlation with test failures
- Adding the use of an incoming test list

> When we are trying to watch all code changes and find correlation with test failures, if it's done well, it should actually provide immediate gain; however, it's very difficult to do it right, there is way too much noise in the statistical data to get a reliable picture. So, while it will be nice if you get it to work (since you already started doing it), don't take it as a defeat if you eventually find out that it doesn't work very well.

Well, actually, this is the only big difference between the original strategy, which used just a weighted average of failures, and the new strategy, which performs *significantly better* in longer testing settings. It has been working for a few weeks, and is up on github.

Either way, as I said before, I will, from today, focus on improving the precision of selecting code changes to estimate correlation with test failures.

Regards,
Pablo
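P.S. To make the combined relevance concrete, here is a rough sketch of the two co-occurrence factors plus the weighted-failure fallback (the names are only illustrative; the actual code on github differs):

    def relevance(test, edited_files, failed_in_last_run,
                  repeat_fail_rate, fail_rate_after_edit):
        # repeat_fail_rate: dict test -> rate of failing again right after a failure
        # fail_rate_after_edit: dict (test, file) -> rate of failing after that file was edited
        score = 0.0
        if failed_in_last_run:
            score += repeat_fail_rate.get(test, 0.0)
        for f in edited_files:
            score += fail_rate_after_edit.get((test, f), 0.0)
        return score

    def rank(tests, edited_files, last_failures, repeat_fail_rate,
             fail_rate_after_edit, weighted_fail_rate):
        scores = {t: relevance(t, edited_files, t in last_failures,
                               repeat_fail_rate, fail_rate_after_edit)
                  for t in tests}
        if not any(scores.values()):
            # No recent failures and no file-change signal (about half of the
            # test_runs): fall back to the simple weighted failure rate.
            scores = {t: weighted_fail_rate.get(t, 0.0) for t in tests}
        return sorted(tests, key=scores.get, reverse=True)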
Hi Elena,

A small progress report: I was able to quickly make the changes related to selecting code changes to measure correlations with test failures. Recall is still around 80% with a running set of 300 and short prediction stages. I can now focus on the input file list, since I believe this will make the results more realistic and (I expect) help push recall further up.

Can you please upload the report files from MTR, so that I can include the logic of an input test list?

Also, since I am going to incorporate this logic, it might be good to define (even if just roughly) the "core module" and the "wrapper module" that you had mentioned earlier, rather than just incorporating the list and making the code that I have now even more bloated with mixed-up functionalities. What do you think?

Regards,
Pablo
Hello Elena,

It took me a while to figure out how the files and the test_runs correspond to each other, and there might still be some hard-to-solve inconsistencies with them: there were a few cases where it is not easy to determine, automatically, which file corresponds to which test_run (some cases where there are more platform+build test_runs than files)... but excluding those cases, yes, there are about 28k files that can be matched to test_runs appropriately. The distribution of these is quite random: they start matching around test_run #10,000, and from then on they match sometimes and sometimes not.

What I'm doing is the following (a sketch is in the P.S. below):

1. If there is a file that matches this test_run: parse the file, and return the tests in the file as the input list. I am not considering 'skipped' tests, because it seems that they are skipped because they can't be run.
2. If there is no file matching the test_run: consider ALL known tests as being in the input list.

I would like to get some of your feedback on a couple of things:

- I would still like to define some structure for the interfaces, even if a bit loose.
- You mentioned earlier that, rather than a specific running_set, you wanted to use a percentage. We can work like this.
- Do you have any feedback on points 1 and 2 regarding the handling of the input test lists?

And one more thing:

- I have not incorporated the test variant into the data, but I'll spend some time thinking about how to do this.

That's it for now. Thanks,
Pablo
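P.S. In code, points 1 and 2 amount to roughly the following (the names are placeholders, and the parsing of the MTR report files itself is not shown):

    def input_test_list(platform, bbnum, parsed_reports, all_known_tests):
        # parsed_reports: dict (platform, bbnum) -> list of (test_name, status)
        # pairs extracted from the matched report file.
        entries = parsed_reports.get((platform, bbnum))
        if entries is not None:
            # Point 1: a report file matches this test_run -> use the tests it
            # lists, excluding the 'skipped' ones.
            return [name for name, status in entries if status != 'skipped']
        # Point 2: no matching file -> fall back to all known tests.
        return list(all_known_tests)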
Hi Pablo, On 17.07.2014 16:17, Pablo Estrada wrote:
> Hello Elena, It took me a while to figure out how the files and the test_runs correspond to each other, and there might still be some hard-to-solve inconsistencies with them: there were a few cases where it is not easy to determine -automatically- which file corresponds to which test_run (some cases where there are more platform+build test_runs than files)... but excluding those cases, yes, there are about 28k files that can be matched to test_runs appropriately.
As I said in the private email, the files are determined by a pair of platform / bbnum. It will definitely happen that there are test runs in the dump which don't have corresponding output files, and that there are files which don't have corresponding records in the data dump (because the files are fresher than the dump). Both of these are expected.

Looking at the dump, I see it can also happen that the dump contains several records for a platform/bbnum pair. I am not sure why it happens, I think it shouldn't; it might be a bug in buildbot and/or configuration, or environmental problems. Anyway, due to the way we store output files, they can well override each other in this case, so for several platform/bbnum records you will have only one file. I suppose that's what was hard to resolve, sorry about that.

Anyway, if you got 28K files, it should be more than enough for experiments, since you are normally running them on ~5K test runs.
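For illustration, the matching amounts to roughly this (assuming you can already extract platform and bbnum for each file; the names are only placeholders):

    def index_report_files(file_records):
        # file_records: iterable of (platform, bbnum, path) tuples, one per stored file.
        reports = {}
        for platform, bbnum, path in file_records:
            # A later file for the same platform/bbnum pair simply overrides the
            # earlier one, mirroring how the stored output files override each other.
            reports[(platform, bbnum)] = path
        return reports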
> The distribution of these is quite random. They start matching around test_run #10,000 and then, they go on matching sometimes and sometimes not.
> What I'm doing, is the following:
> 1. If there is a file that matches this test_run: Parse the file, and return the tests in the file as the input list. I am not considering 'skipped' tests, because it seems that they are skipped because they can't be run.
You should consider skipped tests, at least for now. Your logic that they are skipped because they can't be run is generally correct; unfortunately, MTR first produces the *full* list of tests to run, and determines whether a test can be run or not at a later stage, when it starts running the tests. Your tool will receive the initial test list, and I'm not sure it's realistic to rewrite MTR so that it takes into account the limitations that cause skipping tests before creating the list.
> 2. If there is no file matching test_run: Consider ALL known tests as being in the input list.
I need to think about it. Possibly it's better to skip a test run altogether if there is no input list for it. It would definitely be best if there were 5K (or whatever slice you are currently using) continuous test runs with input lists; if it so happens that there are lists for some branches but not others, you can skip the branch entirely.
> I would like to get some of your feedback on a couple of things:
> - I would still like to define some structure for the interfaces -even if a bit loose.
If you mean the separation of the core module / wrapper, it should go like that.

The core module should take as parameters
- the list of tests to choose from,
- the size of the running set (%),
- branch/platform (if we use them in the end),
and produce a new list of tests of the size of the running set.

The wrapper module should
- read the list of tests from the outside world (for now, from a file),
- receive branch/platform as command-line options,
- have the running set size set as an easily changeable constant or as a configuration parameter,
and return the list of tests -- let's say for now, in the form of <test suite>.<test name>, blank-separated, e.g.
main.select innodb.create-index ...
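As a rough illustration of the split (module, function, and option names are only placeholders, and the relevance ranking itself is left out):

    import argparse

    RUNNING_SET_PCT = 50  # easily changeable constant / configuration parameter (value arbitrary here)

    def choose_tests(candidate_tests, running_set_pct, branch=None, platform=None):
        # Core module: return the top running_set_pct percent of candidate_tests.
        size = max(1, len(candidate_tests) * running_set_pct // 100)
        ranked = sorted(candidate_tests)  # placeholder; the real ranking by relevance goes here
        return ranked[:size]

    def main():
        # Wrapper module: all interaction with the outside world happens here.
        parser = argparse.ArgumentParser()
        parser.add_argument('test_list_file')   # file with the incoming test list
        parser.add_argument('--branch')
        parser.add_argument('--platform')
        args = parser.parse_args()
        with open(args.test_list_file) as f:
            candidates = f.read().split()       # blank-separated <test suite>.<test name> entries
        chosen = choose_tests(candidates, RUNNING_SET_PCT, args.branch, args.platform)
        print(' '.join(chosen))                 # e.g. "main.select innodb.create-index ..."

    if __name__ == '__main__':
        main()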
> - You mentioned earlier that rather than a specific running_set, you wanted to use a percentage. We can work like this.
Yes, we should. Now if you look at those input files, you can see that the number of tests run differs considerably. Grep the files for 'Completed: All ' (this will exclude unsuccessful runs where test execution just stopped for whatever reason), and you'll see that there are runs with 3.5K tests, and with 1.5K tests, and with 150 tests... So any constant running_set size you choose will be meaningless for one run or another. Of course, we can get smart with the percentage and, let's say, not apply it to the smallest test runs (or make it flexible), but for now just a single percentage will do.
> - Do you have any feedback on points 1 and 2 regarding the handling of the input test lists?
> And one more thing:
> - I have not incorporated test variant into the data, but I'll spend some time thinking about how to do this.
It can be difficult, so it would be better to analyze the data first and see if it makes any (useful) difference.

Regards,
Elena
> That's it for now. Thanks
> Pablo