Hi Elena and all,
I guess I should admit that my excitement was a bit premature; but I'm definitely not 'jumping' into this strategy. As I said, I am trying to use the lessons learned from all the experiments to make the best predictions.

That being said, a strong point of the new strategy is that rather than purely using failure rate to predict failures, it draws on more data to make predictions - and its predictions are more consistent. On the 3k-training and 2k-prediction simulations its advantage is not so apparent (the two fare similarly, with the 'standard' strategy being the best one), but it becomes more evident over longer prediction periods.

I ran tests with 20k training rounds and 20k prediction rounds, and the new strategy fared a lot better. I have attached charts comparing the two. We can observe that with a running set of 500, the original algorithm had a very nice almost-95% recall in the shorter tests, but it falls to less than 50% with longer testing (and it must be a lot lower if we average only the last couple of thousand runs, rather than all 20k simulation runs together).

Since the goal of the project is to provide consistent long-term test optimization, we want to take all we can learn from the new strategy and improve the consistency of its recall over long-term simulation.

Nevertheless, I agree that there are important lessons in the original strategy, particularly that >90% recall in shorter prediction periods. That's why I'm still tuning and testing.

Again, all advice and observations are welcome.
Hope everyone is having a nice weekend.
Pablo


On Sun, Jun 29, 2014 at 12:53 AM, Elena Stepanova <elenst@montyprogram.com> wrote:
Hi Pablo,

Could you please explain why you consider the new results better? I don't see any obvious improvement.

As I understand from the defaults, previously you were running tests with 2000 training rounds and 3000 simulation rounds, and you already had ~70% on 300 runs and ~80% on 500 runs; see your email of June 19, no_options_simulation.jpg.

Now you have switched the limits: you are running with 3000 training and 2000 simulation rounds. That makes a big difference; if you re-run tests with the old algorithm under the new limits, you'll easily gain +10%, so RS 300 will be around the same 80%, and RS 500 should be even higher, pushing 90%, while now you barely have 85%.

Before jumping onto the new algorithm, please provide a comparison of the old and new approaches with equal pre-conditions and parameters.

Thanks,
Elena



On 28.06.2014 6:44, Pablo Estrada wrote:
Hi all,
well, as I said, I have incorporated a very simple weighted failure rate
into the strategy, and I have found quite encouraging results. The recall
looks better than in earlier tests. I am attaching two charts with data
compiled from runs with 3000 training rounds and 2000 simulation rounds
(5000 test runs analyzed in total):

    - The recall by running set size (as shown, it reaches 80% with 300
    tests)
    - The index of failure in the priority queue (running set 500,
    training 3000, simulation 2000)

It is interesting to look at chart number 2:
The first 10 or so places have a very high count of found failures. These
most likely come from repeated failures (tests that failed in the previous
run and were caught again in the next one). The following positions show a
skew to the right, and those come from the file-change model.

I am glad of these new results : ). I have a couple of new ideas to try to
push the recall a bit further up, but I wanted to show the progress first.
Also, I will do a thorough code review before any new changes, to make sure
that the results are valid. Interestingly enough, the code for this new
strategy is simpler.
Also, I will run a test over a longer period (20,000 training rounds,
20,000 simulation rounds), to see if the recall degrades as time passes and
we miss more failures.

Regards!
Pablo


On Fri, Jun 27, 2014 at 4:48 PM, Pablo Estrada <polecito.em@gmail.com>
wrote:

Hello everyone,
I spent the last couple of days working on a new strategy to calculate the
relevance of a test. The results are not sufficient by themselves, but I
believe they point in an interesting direction. This strategy uses the
rate of co-occurrence of events to estimate the relevance of a test, and
the events that it uses are the following:

    - File edits since the last run
    - Test failure in last run


The strategy has also two stages:

    1. Training stage
    2. Executing stage


In the training stage, it goes through the available data, and does the
following:

    - If test A failed:
        - It counts and stores all the files that were edited since the
        last test_run (the last test_run depends on BRANCH, PLATFORM, and
        other factors)
        - If test A also failed in the previous test_run, it counts that
        as well

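For what it's worth, the training-stage counting could be sketched roughly like this (a minimal sketch, not the actual code; `test_runs` and all the counter names are illustrative assumptions):

```python
from collections import defaultdict

# Minimal sketch of the training-stage counting described above.
# `test_runs` is assumed to be an iterable of (failed_tests, edited_files)
# pairs, oldest first, already filtered by BRANCH, PLATFORM, etc.
def train(test_runs):
    fail_after_edit = defaultdict(lambda: defaultdict(int))  # test -> file -> count
    repeat_fail = defaultdict(int)  # test -> failures that followed a failure
    fail_count = defaultdict(int)   # test -> total failures seen
    prev_failed = set()
    for failed_tests, edited_files in test_runs:
        for test in failed_tests:
            fail_count[test] += 1
            for f in edited_files:
                # file f was edited since the last test_run and the test failed
                fail_after_edit[test][f] += 1
            if test in prev_failed:
                # the test also failed in the previous test_run
                repeat_fail[test] += 1
        prev_failed = set(failed_tests)
    return fail_after_edit, repeat_fail, fail_count
```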

In the executing stage, the training algorithm is still applied, but the
decision of whether a test runs is based on its relevance. The relevance is
calculated as the sum of the following:

    - The percentage of times the test has failed in two subsequent
    test_runs, multiplied by whether the test failed in the previous run
    (if the test didn't fail in the previous run, this quantity is 0)
    - For each file that was edited since the last test_run, the
    percentage of times that the test has failed after that file was
    edited
(The explanation is a bit clumsy, I can clear it up if you wish so)
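Maybe a sketch helps more than the prose. Assuming the training stage produced per-test counters, the relevance sum could look roughly like this (normalizing by the test's total failure count is my reading of "percentage of times"; all names are illustrative, not the actual code):

```python
def relevance(test, edited_files, failed_in_prev_run,
              fail_after_edit, repeat_fail, fail_count):
    # fail_after_edit: test -> file -> failures that followed an edit of that file
    # repeat_fail:     test -> failures that immediately followed a failure
    # fail_count:      test -> total failures seen during training
    total = fail_count.get(test, 0)
    if total == 0:
        return 0.0
    score = 0.0
    if failed_in_prev_run:
        # first factor: fraction of failures that were back-to-back repeats
        score += repeat_fail.get(test, 0) / total
    for f in edited_files:
        # second factor: fraction of failures that followed an edit of file f
        score += fail_after_edit.get(test, {}).get(f, 0) / total
    return score
```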
The results have been neither too good nor too bad. With a running set of
200 tests, a training phase of 3000 test runs, and an executing stage of
2000 test runs, I have achieved a recall of 0.50.
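(Just to be explicit about the metric: by recall I mean the fraction of all failures observed during the simulated period that landed inside the running set, i.e.:)

```python
def recall(failures_caught, failures_total):
    # Fraction of all observed failures that were caught by the running set.
    return failures_caught / failures_total if failures_total else 0.0
```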

Nonetheless, while running tests, I found something interesting:

    - I removed the first factor of the relevance: I decided not to care
    about whether a test failed in the previous test run, and used only
    the file-change factor. Naturally, the recall decreased from 0.50 to
    0.39 (the decrease was not too big)... and the distribution of failed
    tests in the priority queue had a good skew towards the front of the
    queue (so it seems that the files help somewhat to indicate the
    likelihood of a failure). I attached this chart.

An interesting problem that I encountered is that about 50% of the
test_runs have neither file changes nor test failures, so the relevance of
all tests is zero. This is where the original strategy (a weighted average
of failures) could be useful: even if we don't have any information to
guess which tests to run, we just go ahead and run the ones that have
failed the most recently.
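One possible shape for that fallback (a sketch only; the exponential weighting and the decay constant are my assumptions, not the strategy's actual weights):

```python
def weighted_failure_rate(failure_history, decay=0.9):
    # failure_history: booleans, oldest first; True means the test failed
    # in that test_run. Each older result is discounted by `decay`, so
    # recent failures count more towards the score.
    score = 0.0
    for failed in failure_history:
        score = decay * score + (1.0 if failed else 0.0)
    return score
```

Tests whose relevance comes out as zero could then be ranked by this score instead.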

I will work on mixing the two strategies a bit in the next few days, and
see what comes of that.

By the way, I pushed the code to GitHub. The code is completely different,
so it may be better to wait until I have new results soon.

Regards!
Pablo