Hello everyone,
Thank you Elena, I really appreciate that you took the time to do such a thorough review of the code. Answers to your comments are inline; conclusion at the end : )


1. Structure
  • When I started coding, I had no idea how the project would fit into the whole system, so, after asking around a bit, I wrote a script much like Sergei's example: only with the simulations and statistics in mind, not the implementation phase. I tried to make it a bit more practical to use, but I never intended this code to be part of the implementation, only the lessons learned from it.
  • If you think it's a better idea, we can define exactly what the interfaces of my project should be, so that I can write something with the implementation in view rather than a pure research-oriented script.
  • Regarding the multi-mode logic: I decided to keep the logic from all modes because it's only a constant overhead (it doesn't significantly affect the complexity of the original algorithm), and at first I was considering all failures, not only those from tests that ran. I can easily add code to run only the mode of interest.
2. Cleanup

if you call several simulations from run_basic_simulations.py, only the very first one will use the correct data and get real results. All subsequent ones will use the data modified by the previous ones, and the results will be totally irrelevant.

Yup, this is a huge mistake. When I implemented the series of simulations, I completely forgot about the metrics and thought no extra cleanup was necessary; I was wrong. This is the first thing I'll fix.
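For reference, this is a minimal sketch of the kind of fix I have in mind; the function and argument names are placeholders, not the ones actually used in run_basic_simulations.py. Each simulation gets a fresh deep copy of the input data and starts with empty metrics, so runs cannot contaminate each other.

    import copy

    def run_simulations(base_data, simulation_params, run_one):
        """Run each simulation on a fresh copy of the shared input data.

        base_data         -- the parsed test-run history (placeholder name)
        simulation_params -- one options dict per simulation
        run_one           -- callable performing a single simulation (placeholder)
        """
        results = []
        for params in simulation_params:
            data = copy.deepcopy(base_data)  # previous runs cannot have modified it
            metrics = {}                     # per-test metrics rebuilt from scratch
            results.append(run_one(data, metrics, params))
        return results

Deep-copying could be wasteful if the history grows large; reloading the data from the source for each run would achieve the same isolation.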

3. Modes

These flaws should be easy to fix by doing proper cleanup before each simulation. But there are also other fragments of code where, for example, logic for 'standard' mode is supposed to be always run and is relied upon, even if the desired mode is different.

In fact, you build all queues every time. It would be an understandable trade-off to save the time on simulations, but you re-run them separately anyway, and only return the requested queue.

I decided to keep the logic of all modes because it does not affect the complexity of the algorithm... and yes, I relied on the standard mode because it always runs. This is not a mistake so much as a 'cheat' that I was aware of but indulged in. It is easy to fix, and I don't think it is too serious a problem. If I were coding something with the implementation in view, I would make sure that only the strictly necessary code runs.
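To show what I mean by running only the strictly necessary code, here is a sketch with made-up names (the real modes and metrics in the script are different): the simulation dispatches on the requested mode, and nothing falls back to 'standard' implicitly.

    def build_standard_queue(history, size):
        """Stand-in for the real 'standard' metric: most recent failures first."""
        ranked = sorted(history, key=lambda test: test['last_failure'], reverse=True)
        return [test['name'] for test in ranked][:size]

    MODE_BUILDERS = {
        'standard': build_standard_queue,
        # other modes would be registered here
    }

    def build_queue(mode, history, size):
        """Build only the queue for the requested mode; no work is spent on
        queues that are never returned."""
        if mode not in MODE_BUILDERS:
            raise ValueError('unknown simulation mode: %r' % (mode,))
        return MODE_BUILDERS[mode](history, size)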

4. Failed tests vs executed tests
- if we say we can afford running 500 tests, we'd rather run 500 than 20, even if some of them never failed before. This will also help us break the bubble, especially if we randomize the "tail" (tests with the minimal priority that we add to fill the queue). If some of them fail, they'll get a proper metric and will migrate to the meaningful part of the queue.

I don't understand what you mean by the bubble, but I see your point. You are right. I will work on making sure that we run RUNNING_SET tests even when the priority queue contains fewer than that.
 
To populate the queue, you don't really need the information about which tests have ever been run; you only need to know which ones MTR *wants* to run if the running set is unlimited. If we assume that it passes the list to you, and you iterate through it, you can use your metrics for tests that failed or were edited before, and a default minimal metric for other tests. Then, if the calculated tests are not enough to fill the queue, you'll randomly choose from the rest. It won't completely solve the problem of tests that never failed and were never edited, but at least it will make it less critical.

This brings us back to the question of research-only vs. preparing for implementation; yes, this can be done.
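Concretely, this is roughly how I picture the queue population, assuming MTR hands over its full list of wanted tests. The names, the default metric value, and the interface are my assumptions for the sketch, not a settled design:

    import random

    DEFAULT_METRIC = 0.0  # assumed floor for tests that never failed and were never edited

    def fill_running_set(wanted_tests, metrics, running_set_size):
        """Return up to running_set_size tests from the list MTR wants to run.

        wanted_tests -- tests MTR would run if the running set were unlimited
        metrics      -- {test_name: priority} for tests that failed or were edited
        """
        scored = [t for t in wanted_tests if metrics.get(t, DEFAULT_METRIC) > DEFAULT_METRIC]
        scored.sort(key=lambda t: metrics[t], reverse=True)
        queue = scored[:running_set_size]

        # Randomize the tail so tests without any history still get a chance
        # to fail and earn a real metric.
        if len(queue) < running_set_size:
            chosen = set(queue)
            rest = [t for t in wanted_tests if t not in chosen]
            random.shuffle(rest)
            queue.extend(rest[:running_set_size - len(queue)])
        return queue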

6. Full / non-full simulation mode

I couldn't understand what the *non*-full simulation mode is for, can you explain this?

This is from the beginning of the implementation, when I was considering all failures, even from tests that were not supposed to run. I can remove it right away.
 

7. Matching logic (get_test_file_change_history)

The logic where you are trying to match result file names to test names is not quite correct. There are some highlights:

There can also be subsuites. Consider the example:
./mysql-test/suite/engines/iuds/r/delete_decimal.result

The result file can live not only in /r dir, but also in /t dir, together with the test file. It's not cool, but it happens, see for example mysql-test/suite/mtr/t/

Here are some other possible patterns for engine/plugin suites:
./storage/tokudb/mysql-test/suite/tokudb/r/rows-32m-1.result
./storage/innodb_plugin/mysql-test/innodb.result
Also, in release builds they can be in mysql-test/plugin folder:
mysql-test/plugin/example/mtr/t

When I coded that logic, I used the website that you had suggested, and thought that I had covered all possibilities. I see that I didn't. I will work on that too.
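To be concrete about the fix, this is roughly the matcher I plan to write, driven by the patterns you listed. The regular expressions, and especially the suite names I derive for the engine/plugin layouts, are my assumptions and still need to be checked against the real trees:

    import re

    # Heuristic patterns derived from the examples above.
    RESULT_FILE_PATTERNS = [
        # mysql-test/suite/<suite>[/<subsuite>]/{r,t}/<test>.result
        re.compile(r'mysql-test/suite/(?P<suite>.+?)/[rt]/(?P<test>[^/]+)\.result$'),
        # mysql-test/plugin/<plugin>/<suite>/{r,t}/<test>.result (release builds)
        re.compile(r'mysql-test/plugin/[^/]+/(?P<suite>[^/]+)/[rt]/(?P<test>[^/]+)\.result$'),
        # storage|plugin/<name>/mysql-test/<test>.result (no suite directory at all)
        re.compile(r'(?:storage|plugin)/(?P<suite>[^/]+)/mysql-test/(?P<test>[^/]+)\.result$'),
    ]

    def result_file_to_test_name(path):
        """Map a .result file path to '<suite>.<test>', or None if no pattern fits."""
        for pattern in RESULT_FILE_PATTERNS:
            match = pattern.search(path)
            if match:
                # Subsuite separator is an assumption: engines/iuds -> engines.iuds
                suite = match.group('suite').replace('/', '.')
                return '%s.%s' % (suite, match.group('test'))
        return None

With these patterns, the first example above would map to engines.iuds.delete_decimal (or engines/iuds.delete_decimal, depending on which subsuite separator we decide to keep) and the tokudb one to tokudb.rows-32m-1.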
 

Be aware that the logic where you compare branch names doesn't currently work as expected. Your list of "fail branches" consists of clean names only, e.g. "10.0", while row[BRANCH] can be like "lp:~maria-captains/maria/10.0". I'm not sure yet why it is sometimes stored this way, but it is.

Yes, I am aware of this; but when I looked at the data, there were very few cases like this, so I decided to just ignore them. All the other cases (more than 27 thousand) use clean names. I believe this is negligible, but if you think otherwise, I can add logic to parse these few cases.

 MariaDB [kokiri_apr23]> select branch, count(*) from changes where branch like 'lp%' group by 1 order by 2;
+------------------------------------------------------+----------+
| branch                                               | count(*) |
+------------------------------------------------------+----------+
| lp:~maria-captains/maria/5.3-knielsen                |        1 |
| lp:~maria-captains/maria/5.5                         |        1 |
| lp:~maria-captains/maria/mariadb-5.1-mwl132          |        1 |
| lp:~maria-captains/maria/mariadb-5.1-mwl163          |        1 |
| lp:~maria-captains/maria/maria-5.1-table-elimination |        3 |
| lp:~maria-captains/maria/10.0-FusionIO               |        7 |
| lp:~maria-captains/maria/5.1-converting              |        9 |
| lp:~maria-captains/maria/10.0-connect                |       15 |
| lp:~maria-captains/maria/10.0                        |       45 |
+------------------------------------------------------+----------+
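If we do decide to handle these rows, I think the fix is just normalizing the branch before comparing. A sketch, assuming the branch name is always the last component of the lp: path (which holds for every row in the output above):

    def normalize_branch(branch):
        """Reduce 'lp:~maria-captains/maria/10.0' to '10.0'; clean names pass through."""
        if branch.startswith('lp:'):
            return branch.rstrip('/').rsplit('/', 1)[-1]
        return branch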

Let's get back to it (both of them) after the logic with dependent simulations is fixed; after that we'll review it and see why it doesn't work, if it still doesn't. Right now any effect that file editing might have is rather coincidental, and possibly the other one is also broken.

Yup!
 
    - Humans are probably a lot better at predicting first failures than the
    current strategy.

This is true; unfortunately it's a full-time job that we can't afford to spend a human resource on.

I was not suggesting hiring a person for this : ). What I meant is that a developer would be much better at choosing how to test their own changes than a script that only accounts for recent failures. A developer can easily leverage that information on their own, along with their code changes, etc. My point was that, taking all this into account, I need to improve the algorithm.

In conclusion, this is what I will work on now:
  • Add cleanup for every run - I just added (and pushed) this. Right now I am running tests with this logic.
  • Get new results, graph and report them ASAP.
  • Add randomized running of tests that have not run or been modified - I will make this optional.
  • Fix the logic that converts test result file names to test names.
Questions to consider:
  • Should we define precisely what the implementation should look like, so that I can code the core, wrapper, etc? Or should I focus on finishing the experiments and the 'research' first?
  • I believe the logic to compare branch names is good enough as it is, but if you think it should be fully fixed, I can do that. What do you think?
Regards
Pablo