Re: [Maria-developers] [GSoC] Optimize mysql-test-runs - Results of new strategy

19 Aug 2014

Hi Pablo,

Thanks for the great work.

Just one thing -- In RESULTS.md, paragraphs "The Fail Frequency 
algorithm" and "The File-change correlation algorithm" are unfinished. 
It's not a big deal, but I want to be sure there wasn't anything 
important in the lost part. Could you please double-check?

Regards,
Elena

On 17.08.2014 16:32, Pablo Estrada wrote:
...
Hello Elena and all,
I have submitted the concluding commit to the project with a very
short 'RESULTS' file that explains briefly the project, the different
strategies and the results. It includes a chart with updated results
for both strategies and different modes. If you think I should add
anything else, please let me know.
Here it is:
https://github.com/pabloem/Kokiri/blob/master/RESULTS.md
Thank you very much.
Regards
Pablo
On 8/13/14, Elena Stepanova <elenst@montyprogram.com> wrote:
...
Hi Pablo,
On 10.08.2014 9:31, Pablo Estrada wrote:
...
Hello Elena,
You raise good points. I have just rewritten the save_state and load_state
functions. Now they work with a MySQL database and a table that looks like
this:
create table kokiri_data  ( dict varchar(20), labels varchar(200), value
varchar(100), primary key (dict,labels));
Since I wanted to store many dicts in the database, I decided to try this
format. The 'dict' field names the dictionary that the data belongs to
('upd_count', 'pred_count' or 'test_info'). The 'labels' field holds the
space-separated list of labels in the dictionary (for a more detailed
explanation, check the README and the code). The 'value' field contains the
value of the datum (count of runs, relevance, etc.).
Since the labels are space-separated, this assumes we are not using the
mixed mode. If we use mixed mode, we may change the separator (, or & or %
or $ are good alternatives).
Let me know what you think about this strategy for storing data in the
database. I felt it was the simplest one, while still allowing some querying
on the database (like loading only one metric or one 'unit'
(platform/branch/mix), etc.). It may also allow storing many configurations
if necessary.
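A minimal sketch of how nested dicts could be flattened into that table layout and read back. It uses Python's built-in sqlite3 module as a stand-in for MySQL so that it runs without a server. The in-memory layout (dictionary name, then a tuple of labels, then a value) and the example labels are assumptions for illustration, not the actual Kokiri code.

```python
import sqlite3

SEP = " "  # label separator from the message above; assumes non-mixed mode


def save_state(conn, state):
    """Flatten {dict_name: {label_tuple: value}} into kokiri_data rows."""
    rows = [
        (dict_name, SEP.join(labels), str(value))
        for dict_name, entries in state.items()
        for labels, value in entries.items()
    ]
    with conn:  # commit on success, roll back on error
        conn.executemany(
            "REPLACE INTO kokiri_data (dict, labels, value) VALUES (?, ?, ?)",
            rows,
        )


def load_state(conn, dict_name=None):
    """Rebuild the nested dicts; optionally load only one dictionary."""
    query = "SELECT dict, labels, value FROM kokiri_data"
    params = ()
    if dict_name is not None:
        query += " WHERE dict = ?"
        params = (dict_name,)
    state = {}
    for name, labels, value in conn.execute(query, params):
        state.setdefault(name, {})[tuple(labels.split(SEP))] = value
    return state


# sqlite3 in-memory database stands in for the MySQL server (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE kokiri_data (dict VARCHAR(20), labels VARCHAR(200), "
    "value VARCHAR(100), PRIMARY KEY (dict, labels))"
)
example_state = {  # hypothetical labels and values
    "upd_count": {("linux-x86",): 12},
    "test_info": {("linux-x86", "main.select"): 0.75},
}
save_state(conn, example_state)
loaded = load_state(conn, "test_info")
```

Values come back as strings because the 'value' column is a varchar, so a real implementation would need to convert them according to the dictionary they belong to.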
Okay, let's have it this way. We can change it later if we want to.
In the remaining time, you can do the cleanup, check documentation, and
maybe run some last clean experiments with the existing data and
different parameters (modes, metrics etc.), to have the statistical
results with the latest code, which we'll use later to decide on the
final configuration.
Regards,
Elena
...
Regards
Pablo
On Sat, Aug 9, 2014 at 8:26 AM, Elena Stepanova <elenst@montyprogram.com>
wrote:
...
Hi Pablo,
Thanks for the update. Couple of comments inline.
On 08.08.2014 18:17, Pablo Estrada wrote:
...
Hello Elena,
I just pushed a transaction, with the following changes:
1. Added an internal counter to the kokiri class, and a function to expose
it. This function can show how many update result runs and prediction runs
have been run in total, or per unit (a unit being a platform, a branch or
a mix of both). Using this counter, one can decide to add logic for extra
learning rounds for new platforms (I added it to the wrapper class as an
example).
2. Added functions to load and store status into temporary storage. They
are very simple - they only serialize to a JSON file, but they can be
easily modified to fit the requirements of the implementation. I can add
this in the README. If you'd like for me to add the capacity to connect to
a database and store the data in a table, I can do that too (I think it
would be easiest to store the dicts as json data in text fields). Let me
know if you'd prefer that.
Yes, I think we'll have to have it stored in the database.
Chances are, the scripts will run on buildbot slaves rather than on the
master, so storing data in a file just won't do any good.
I don't like the idea of storing the entire dicts as json. It doesn't seem
to be justified by... well... anything, except for saving a tiny bit of
time on writing queries. But that's a one-time effort, while this way we
won't be able to [easily] join the statistical data with, let's say,
existing buildbot tables; and it generally won't be efficient and easy to
read.
Besides, keep in mind that for real use, if, let's say, we are running in
'platform' mode, for each call we don't need the whole dict, we only need
the part of the dict which relates to this platform, and possibly the
standard one. So, there is really no point in loading the other 20
platforms' data, which you will almost inevitably do if you store it in a
single json.
The real (not json-ed) data structure seems quite suitable for SQL, so it
makes sense to store it as such.
If you think it will take you long to do that, it's not critical: just
create an example interface for connecting to a database and running *some*
queries to store/read the data, and we'll tune it later.
Regards,
Elena
...
By the way, these functions allow the two parts of the algorithm to be
called separately, e.g.:
Predicting phase (can be done depending on the count of training rounds for
the platform, etc.):
1. Create kokiri instance
2. Load status (call load_status)
3. Input test list, get smaller output
4. Eliminate instance from memory (no need to save state since nothing
changes until results are updated)
Training phase:
1. Create kokiri instance
2. Load status (call load_status)
3. Feed new information
4. Save status (call save_status)
5. Eliminate instance from memory
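The two phases above can be sketched as follows. This is only an illustration: KokiriSketch, update_results, choose_tests and the failure-count metric are hypothetical stand-ins, not the project's real class or API. Only the load_status/save_status step names and the JSON-file serialization come from the message.

```python
import json
import os
import tempfile


class KokiriSketch:
    """Hypothetical stand-in for the kokiri class; not the project's real API."""

    def __init__(self):
        self.relevance = {}  # test name -> relevance score

    def load_status(self, path):
        # A missing file simply means there is no saved state yet.
        if os.path.exists(path):
            with open(path) as fh:
                self.relevance = json.load(fh)

    def save_status(self, path):
        with open(path, "w") as fh:
            json.dump(self.relevance, fh)

    def update_results(self, failed_tests):
        # Placeholder metric (assumption): count failures per test.
        for test in failed_tests:
            self.relevance[test] = self.relevance.get(test, 0) + 1

    def choose_tests(self, test_list, running_set_size):
        ranked = sorted(
            test_list, key=lambda t: self.relevance.get(t, 0), reverse=True
        )
        return ranked[:running_set_size]


def training_phase(path, failed_tests):
    kokiri = KokiriSketch()              # 1. create instance
    kokiri.load_status(path)             # 2. load status
    kokiri.update_results(failed_tests)  # 3. feed new information
    kokiri.save_status(path)             # 4. save status
    # 5. the instance goes out of scope here


def predicting_phase(path, test_list, running_set_size):
    kokiri = KokiriSketch()              # 1. create instance
    kokiri.load_status(path)             # 2. load status
    # 3. input test list, get smaller output; nothing to save afterwards
    return kokiri.choose_tests(test_list, running_set_size)


state_file = os.path.join(tempfile.mkdtemp(), "kokiri_state.json")
training_phase(state_file, ["main.join", "main.select"])
training_phase(state_file, ["main.select"])
selected = predicting_phase(
    state_file, ["main.alter", "main.join", "main.select"], 2
)
```

Because each phase builds its own instance and communicates only through the saved state, the two can run in separate processes or on separate machines as long as they share the storage.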
I added tests that check the new features to the wrapper. Both features
seem to be working okay. Of course, with more prediction rounds for new
platforms, the platform mode improves a bit, but not too dramatically, from
what I've seen. I'll test it a bit more.
I will also add these features to the file_change_correlations branch, and
document everything in the README file.
Regards
Pablo
On Wed, Aug 6, 2014 at 8:04 PM, Elena Stepanova
<elenst@montyprogram.com>
wrote:
(sorry, forgot the list in my reply, resending)
...
Hi Pablo,
On 03.08.2014 17:51, Pablo Estrada wrote:
> Hi Elena,
>
>> One thing that I want to see there is fully developed platform mode. I
>> see that mode option is still there, so it should not be difficult. I
>> actually did it myself while experimenting, but since I only made hasty
>> and crude changes, I don't expect them to be useful.
>
> I'm not sure what code you are referring to. Can you be more specific on
> what seems to be missing? I might have missed something when migrating
> from the previous architecture...
>
I was mainly referring to the learning stage. Currently, the learning
stage is "global". You go through X test runs, collect data, distribute it
between platform-specific queues, and from the X+1 test run you start
predicting based on whatever platform-specific data you have at the moment.
But this is bound to cause rather sporadic quality of prediction, because
it could happen that out of 3000 learning runs, 1000 belong to platform A,
while platform B only had 100, and platform C was introduced later, after
your learning cycle. So, for platform B the statistical data will be very
limited, and for platform C there will be none -- you will simply start
randomizing tests from the very beginning (or using data from other
platforms as you suggest below, which is still not quite the same as a pure
platform-specific approach).
It seems more reasonable, if the platform-specific mode is used, to do
learning per platform too. It is not just about current investigation
activity, but about the real-life implementation too.
Let's suppose tomorrow we start collecting the data and calculating the
metrics.
Some platforms will run more often than others, so let's say in 2 weeks you
will have X test runs on these platforms so you can start predicting for
them; while other platforms will run less frequently, and it will take 1
month to collect the same amount of data.
And 2 months later there will be Ubuntu Utopic Unicorn which will have no
statistical data at all, and it will be cruel to jump into predicting there
right away.
It sounds more complicated than it is; in fact, pretty much all you need to
add to your algorithm is making 'count' in your run_simulation a dict
rather than a constant.
So, I imagine that when you store your metrics after a test run, you will
also store the number of test runs per platform, and only start predicting
for a particular platform when the count for it reaches the configured
number.
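A minimal sketch of that per-platform counter, assuming a hypothetical LEARNING_RUNS threshold and helper names. The actual run_simulation code is not shown in this thread, so this only illustrates replacing a single constant with a dict keyed by platform.

```python
LEARNING_RUNS = 3  # assumed configurable threshold; a real value would be larger

run_counts = {}  # platform -> number of test runs recorded so far


def record_test_run(platform):
    """Call after a full test run's results have been stored for the platform."""
    run_counts[platform] = run_counts.get(platform, 0) + 1


def can_predict(platform):
    """Predict only once this platform has had enough learning runs."""
    return run_counts.get(platform, 0) >= LEARNING_RUNS


for platform in ["A", "A", "A", "B"]:
    record_test_run(platform)

# Platform C has never been seen, so it still needs full learning runs.
status = {p: can_predict(p) for p in ["A", "B", "C"]}
```

A platform introduced later starts from a count of zero and goes through its own learning runs before any prediction is attempted for it.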
> Of the code that's definitely not there, there are a couple things that
> could be added:
> 1. When we calculate the relevance of a test on a given platform, we might
> want to set the relevance to 0, or we might want to derive a default
> relevance from other platforms (An average, the 'standard', etc...).
> Currently, it's just set to 0.
>
I think you could combine this idea with what was described above. While
it makes sense to run *some* full learning cycles on a new platform, it
does not have to be thousands, especially since some non-LTS platforms come
and go awfully fast. So, we run these not-too-many cycles, get clean
platform-specific data, and if necessary enrich it with the other
platforms' data.
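One way the enrichment could look, as a sketch under assumptions: the nested layout (platform, then test name, then relevance) and the choice of a plain average are illustrative only. The thread mentions an average or the 'standard' value as options without settling on one.

```python
def relevance_for(test, platform, relevance, default=0.0):
    """relevance maps platform -> {test name -> relevance score}."""
    own = relevance.get(platform, {})
    if test in own:
        return own[test]  # platform-specific data wins when it exists
    # Otherwise enrich with other platforms' data: average their scores.
    others = [
        scores[test]
        for name, scores in relevance.items()
        if name != platform and test in scores
    ]
    return sum(others) / len(others) if others else default


relevance = {  # hypothetical platforms, tests and scores
    "A": {"main.select": 0.75},
    "B": {"main.select": 0.25, "main.join": 0.5},
    "C": {},
}
borrowed = relevance_for("main.select", "C", relevance)
own_value = relevance_for("main.select", "A", relevance)
unknown = relevance_for("main.alter", "C", relevance)
```

Falling back to 0 when no platform has data matches the current behaviour described in the quoted message.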
> 2. We might also, just in case, want to keep the 'standard' queue for when
> we don't have the data for this platform (related to the previous
> point).
>
If we do what's described above, we should always have data for the
platform.
But if you mean calculating and storing the standard metrics, then yes --
since we are going to store the values rather than re-calculate them every
time, there is no reason to be greedy about it. It might even make sense to
calculate both metrics that you developed, too. Who knows, maybe one day
we'll find out that the other one gives us better results.
>
>> It doesn't matter in which order they fail/finish; the problem is, when
>> builder2 starts, it doesn't have information about builder1 results, and
>> builder3 doesn't know anything about the first two. So, the metric for
>> test X could not be increased yet.
>>
>> But in your current calculation, it is. So, naturally, if we happen to
>> catch the failure on builder1, the metric rises dramatically, and the
>> failure will be definitely caught on builders 2 and 3.
>>
>> It is especially important now, when you use incoming lists, and the
>> running sets might be not identical for builders 1-3 even in standard
>> mode.
>
> Right, I see your point. Although if test_run 1 would catch the error,
> test_run 2, although it would be using the same data, might not catch the
> same errors if the running set makes it such that they are pushed out due
> to lower relevance. The effect might not be too big, but it definitely has
> potential to affect the results.
>
>> Over-pessimistic part:
>>
>> It is similar to the previous one, but look at the same problem from a
>> different angle. Suppose the push broke test X, and the test started
>> failing on all builders (platforms). So, you have 20 failures, one per
>> test run, for the same push. Now, suppose you caught it on one platform
>> but not on others. Your statistics will still show 19 failures missed vs
>> 1 failure caught, and recall will be dreadful (~0.05). But in fact, the
>> goal is achieved: the failure has been caught for this push. It doesn't
>> really matter whether you catch it 1 time or 20 times. So, recall here
>> should be 1.
>>
>> It should mainly affect per-platform approach, but probably the standard
>> one can also suffer if running sets are not identical for all builders.
>
> Right. It seems that solving these two issues is non-trivial (the test_run
> table does not contain duration of the test_run, or anything). But we can
> keep in mind these issues.
>
Right. At this point it doesn't even make sense to solve them -- in
real-life application, the first one will be gone naturally, just because
there will be no data from unfinished test runs.
The second one only affects recall calculation, in other words --
evaluation of the algorithm. It is interesting from a theoretical point of
view, but not critical for real-life application.
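To make the recall point concrete, here is a small sketch comparing the two ways of counting for the 20-builder example discussed above. The tuple layout and function names are assumptions for illustration; they are not taken from the project's evaluation code.

```python
def recall_per_test_run(failures):
    """failures: list of (push_id, builder, caught), one per failed test run."""
    caught = sum(1 for _, _, was_caught in failures if was_caught)
    return caught / len(failures)


def recall_per_push(failures):
    """Count a push's failure as caught if any builder caught it."""
    caught_by_push = {}
    for push_id, _, was_caught in failures:
        caught_by_push[push_id] = caught_by_push.get(push_id, False) or was_caught
    return sum(caught_by_push.values()) / len(caught_by_push)


# One push breaks test X on 20 builders; only builder 0 has X in its running set.
failures = [("push-1", builder, builder == 0) for builder in range(20)]

per_run = recall_per_test_run(failures)  # 1 caught out of 20 failed runs
per_push = recall_per_push(failures)     # the single push counts as caught
```

Counting per test run gives 1/20 = 0.05, while counting per push gives 1, which is the gap between the measured and the effective recall described above.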
> I fixed up the repositories with updated versions of the queries, as well
> as instructions in the README on how to generate them.
>
> Now I am looking a bit at the buildbot code, just to try to suggest some
> design ideas for adding the statistician and the pythia into the MTR
> related classes.
>
As you know, we have the soft pencil-down in a few days, and the hard one
a week later. At this point, there isn't much reason to keep frantically
improving the algorithm (which is never perfect), so you are right not to
plan on it.
In the remaining time I suggest that you
- address the points above;
- make sure that everything that should be configurable is configurable
(algorithm, mode, learning set, db connection details);
- create structures to store the metrics, and code for reading them from
and writing them to the database;
- make sure the predicting and the calculating part can be called
separately;
- update documentation, clean up logging and code in general.
As long as we have these two parts easily callable, we will find a place
in buildbot/MTR to put them, so don't waste too much time on it.
Regards,
Elena
> Regards
> Pablo
>
>