Hello everyone,
Well, I have now familiarized myself with the data, and I will start trying to simulate some scenarios. In this email, I will summarize my roadmap for the next few days. The text is quite complicated (and my writing is kind of clumsy too), so it's not necessary to read it in detail. This email is mostly to keep a record of what I'll do.

Anyhow, if anybody has questions, comments or feedback, I will be happy to hear and address them. I will start coding the simulations in the next few days. Also, if you want more details of what I'm doing, I can write that up too.

So here's what I'll do for the simulations:

1. Calculating the "Relevancy index" for a test. I have considered two simple options so far:
The Relevancy Index can be calculated by grouping test results in several ways: coarsely, by test_name+variation, or at a finer granularity, by also including branch and platform. I'll add some thoughts regarding these options at the bottom of the email.

2. To run the simulation, I'll gather data from the first few thousand test_run entries, and then start simulating results. Here's what I'll do:
  1. Gather data from the first few thousand test_run entries (e.g. 4 thousand).
  2. After those N thousand test_runs, I'll go through the remaining test_run entries one by one and, using the data gathered up to that point, select a 'running set' of 100 test suites to run for each test_run entry. (The number can be adjusted.)
  3. If the list of failed tests in this test_run entry contains tests that are NOT part of the running set, those failures will be ignored, so the information about them will be lost (not used as part of the relevancy index). (See Comment 2)
  4. If the set of failed tests in the test_run entry intersects with the running set, those failures count as caught, which improves recall. This information will be used to keep updating the relevancy index.
According to the results obtained from the simulations, we can adjust the algorithm (e.g. to consider the relevancy index by platform and branch).
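
To make the steps above a bit more concrete, here is a rough Python sketch of the simulation loop. It is only an illustration under assumptions of mine: the relevancy index is replaced by a plain count of observed failures (a placeholder, not the final metric), each test_run is a hypothetical dict with "branch", "platform" and "failed" fields, and the names simulate, key_for, training_size and running_set_size are made up for this example.

```python
from collections import Counter


def key_for(test, run, fine):
    # Coarse grouping: the test itself (e.g. a (test_name, variation) tuple).
    # Fine grouping: additionally split by the run's branch and platform.
    return (test, run["branch"], run["platform"]) if fine else test


def simulate(test_runs, all_tests, training_size=4000,
             running_set_size=100, fine=False):
    """Return the recall of the simulated running sets, or None if no
    failures occurred after the learning phase."""
    relevancy = Counter()  # placeholder index: number of observed failures
    caught = missed = 0

    for i, run in enumerate(test_runs):
        failed = set(run["failed"])
        if i < training_size:
            # Learning phase: every failure is observed.
            observed = failed
        else:
            # Pick the tests with the highest index for this run.
            ranked = sorted(
                all_tests,
                key=lambda t: relevancy[key_for(t, run, fine)],
                reverse=True,
            )
            running_set = set(ranked[:running_set_size])
            # Failures outside the running set are lost (step 3);
            # failures inside it are caught (step 4).
            observed = failed & running_set
            caught += len(observed)
            missed += len(failed - running_set)

        # Only observed failures feed back into the index.
        for test in observed:
            relevancy[key_for(test, run, fine)] += 1

    total = caught + missed
    return caught / total if total else None
```

Calling simulate(..., fine=True) switches to the finer grouping, so both options can be compared on the same data by looking at the recall each one returns.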

Comments about the relevancy index:

Comments regarding buildbot infrastructure:
These comments are outside the scope of this project, but they describe features that would be very desirable for the buildbot infrastructure.
Thanks to all,

On Mon, Apr 28, 2014 at 9:57 PM, Sergei Golubchik <serg@mariadb.org> wrote:
Hi, Kristian!

On Apr 28, Kristian Nielsen wrote:
> Sergei Golubchik <serg@mariadb.org> writes:
> > note, that two *different* revisions got the same revno! And the changes
> > from the first revision are completely and totally lost, there is no way
> > to retrieve them from anywhere.
> Indeed.
> But note that in main trees (5.1, 5.2, 5.3, 5.5, and 10.0), this cannot occur,
> since we have set the append_revision_only option (or "append_revisions_only",
> can't remember). This prevents a revision number from changing, once pushed.
> So in main trees, the revision number _should_ in fact be unique.

Yes. I omitted that detail, because I hope that we can find a solution
that works for all trees without checks that only work for main trees.
But, of course, as the last resort we can rely on append_revisions_only.

> > Revision-id is the only unique identifier for a revision, unfortunately,
> > it's not logged in these tables. I believe we'll change buildbot so that
> > revid would be logged in the future. But so far it wasn't needed, and
> > this is one of the defects in the data.
> I actually wanted to log it when I wrote the code. The problem is that the
> revision-id is not available to buildbot when the change is received from
> Launchpad. I even asked the bzr/launchpad developers to provide the revid so
> it could be logged. The answer I got was that it is a deliberate feature to
> hide the revision id :-(
>     https://bugs.launchpad.net/launchpad/+bug/419057
> So I don't think we will get revid in Buildbot. Of course, if we go to git, we
> will not have this problem anymore, as it always uses a consistent, stable
> revision identifier.

Oh, I see, thanks.

Git - yes, that's not an issue. Bzr - perhaps we could figure out
something regardless. Maybe get the revid on the tarbake builder - it
needs the tree anyway. Or use fake revids. Or something. It is not a
showstopper for this project, we can think about it later, when we
finish the research part and get to the integration.
