Hello Sergei, Elena and all,
Today while working on the script, I found and fixed an issue:

There is some faulty code in my script, in the part that collects statistics about whether a test failure was caught or not (here). When I looked into fixing it, I found another problem: the recall numbers that I had collected previously were too high.

The actual recall numbers, once we count the test failures that are not caught, are disappointingly low. I won't share results yet, since I first want to make sure that the code is fixed and that my tests are accurate.

This is all for now. The strategy that I was using is a lot less effective than it seemed initially. I will send out a more detailed report with results, my opinion on the weak points of the strategy, and ideas for improving the results, including a roadmap.

Regards. All feedback is welcome.
Pablo