Hello Elena,

Can you give a rough estimate of the ratio of failures missed because they were low in the priority queue vs. those that were not in the queue at all?

I sent this information in a previous email, here:
https://lists.launchpad.net/maria-developers/msg07482.html

Also, once again, I would like you to start using an incoming test list as the initial point of your test set generation. It must be done sooner or later, as I explained earlier; and while it would not be difficult to implement even after the end of your project, it might affect the results considerably, so we need to know whether it makes them better or worse and adjust the algorithm accordingly.

You are right. I understand that this information is not fully available for all the test_runs, so could you upload whatever is available, going back as far as possible? I can parse these files and adjust the program to work with them. I will get to work on this; I think it should improve the results significantly, and it might even push my current strategy from promising to attractive.
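To be precise about the change I have in mind, here is a rough Python sketch of how the incoming list could be plugged in once it is available; the function and parameter names below are only illustrative, not the actual code:

    def select_from_incoming(incoming_tests, scores, running_set_size):
        # incoming_tests: the test names offered by the current test run
        # scores: dict mapping test name -> priority score (e.g. a weighted failure average)
        # Rank only the incoming tests instead of every test ever seen,
        # and keep the highest-scoring running_set_size of them.
        ranked = sorted(incoming_tests, key=lambda t: scores.get(t, 0.0), reverse=True)
        return ranked[:running_set_size]

The rest of the prioritization logic would stay the same; only the candidate set changes.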
 
There are several options which change the way the tests are executed; e.g. tests can be run in "normal" mode, in PS-protocol mode, with valgrind, or with the embedded server. And it might well be that some tests always fail, e.g. with valgrind, but almost never fail otherwise.
Information about these options is partially available in test_run.info, but it would require some parsing. It would be ideal if you could analyze the existing data to see whether using it affects your results before spending time on actual code changes.
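For illustration, something along these lines might be enough to start with; the token names and the data layout below are assumptions for the sketch, not the actual test_run.info format:

    from collections import defaultdict

    KNOWN_MODES = ("ps-protocol", "valgrind", "embedded")

    def extract_modes(info_string):
        # Return the recognized mode tokens mentioned in the info string,
        # or "normal" if none of them appear.
        lowered = info_string.lower()
        modes = {mode for mode in KNOWN_MODES if mode in lowered}
        return modes or {"normal"}

    def count_failures_by_mode(test_runs):
        # test_runs: iterable of dicts like {"info": "...", "failed_tests": [...]}
        # (an assumed layout, only for this sketch).
        counts = defaultdict(lambda: defaultdict(int))
        for run in test_runs:
            for mode in extract_modes(run.get("info", "")):
                for test_name in run.get("failed_tests", []):
                    counts[test_name][mode] += 1
        return counts

With per-mode counts like these it should already be visible whether some tests fail almost exclusively under one option.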

I will keep this in mind, but for now I will focus on these two main things:
  • Improving the precision of selecting code changes to estimate their correlation with test failures
  • Adding the use of an incoming test list
 
Watching all code changes and correlating them with test failures should, if done well, provide an immediate gain; however, it is very difficult to do right, as there is far too much noise in the statistical data to get a reliable picture. So, while it will be nice if you get it to work (since you have already started doing it), don't take it as a defeat if you eventually find that it doesn't work very well.

Well, actually, this is the only big difference between the original strategy, which uses just a weighted average of failures, and the new strategy, which performs significantly better in longer testing settings. It has been working for a few weeks and is up on GitHub.
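For reference, the weighted average I mean is essentially the following minimal sketch; the decay factor and the input format are simplified assumptions, not the exact code in the repository:

    def weighted_failure_score(outcomes, decay=0.9):
        # outcomes: list of 0/1 values, one per past run of the test, oldest first
        # (a simplified input format for this sketch).
        score = 0.0
        for failed in outcomes:
            score = decay * score + failed  # each step discounts the older history
        return score

    def rank_tests(history):
        # history: dict mapping test name -> list of 0/1 outcomes.
        # Returns test names sorted from highest to lowest priority.
        return sorted(history, key=lambda t: weighted_failure_score(history[t]), reverse=True)

Recent failures dominate the score, which is what makes the metric cheap to maintain but blind to anything the failure history alone cannot explain.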
 
Either way, as I said before, starting today I will focus on improving the precision of selecting code changes to estimate their correlation with test failures.
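To be concrete about what I mean by that, the direction I am exploring looks roughly like the sketch below; the data layout and the min_changes cutoff are assumptions for illustration, not the final implementation:

    from collections import defaultdict

    def build_cooccurrence(history):
        # history: iterable of (changed_files, failed_tests) pairs, one per test run
        # (an assumed layout for this sketch).
        cooc = defaultdict(lambda: defaultdict(int))
        changes = defaultdict(int)
        for changed_files, failed_tests in history:
            for path in changed_files:
                changes[path] += 1
                for test in failed_tests:
                    cooc[path][test] += 1
        return cooc, changes

    def score_tests(changed_files, cooc, changes, min_changes=5):
        # Score each test by the fraction of runs in which it failed
        # when a given file changed, taking the best file as the score.
        scores = defaultdict(float)
        for path in changed_files:
            if changes.get(path, 0) < min_changes:
                continue  # too little history for this file to be meaningful
            for test, together in cooc[path].items():
                scores[test] = max(scores[test], together / changes[path])
        return scores

The cutoff on rarely-changed files is one simple way to keep the noisiest statistics out of the picture; improving that filtering is exactly the precision work I mean.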

Regards
Pablo