[MariaDB developers] Re: Debugging MDEV-28776, the rare rpl.rpl_mark_optimize_tbl_ddl test failure

21 Jul 2023

      Kristian, salve.

To respond now to just one of the points raised by you,
...
Andrei via developers <developers@lists.mariadb.org> writes:
...
...
...
It's a complicated matter. MDEV-28776 was not really neglected. There's
What I meant is - a number of GA releases were made over several years
despite this bug being visible as a failure in the buildbot/CI.
I think there's a perception that sporadic failures in buildbot are somehow
"too difficult to debug". What is important to understand is how much _MUCH_
more difficult a bug like this is to debug in the wild in user's production
environment. Or even for the user to report it.
I can't agree more, having dealt with the difference (between finding a bug
in BB and on site) for years.
That must apply to most of the engineering developers.
From my personal experience, some of the "sporadic" BB failures seemed to be
within grasp, but I failed to tackle them in a timely manner for a number of
reasons, technical and not only technical ^ (you acked the priority
list point).
...
In my experience, these bugs can always be tracked down with reasonable
efforts if using a systematic approach. First get the failure to reproduce
in some environment, typically with ./mtr --repeat=100000 or something like
that, as the last resort by running the exact same build that fails on the
actual buildbot machine. Then add debug printouts step by step until the
cause is identified. The process can take some time if the failure is rare,
but it can be done as a background task.
In many cases --repeat has been a reliable tool, though in a number of
cases all the BB env properties simply had to match, and I had to push
"printout" commits to that specific BB.
This is a tedious process, and getting satisfactory info could take weeks.
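
The reproduce-first step described above can also be driven from outside
the test harness. As a rough sketch only (not the procedure used for
MDEV-28776; the helper name run_until_failure and its parameters are made
up for illustration), a loop that reruns a command until it first fails and
keeps that run's output might look like this:

```python
import subprocess


def run_until_failure(cmd, max_runs):
    """Rerun cmd up to max_runs times.

    Returns (attempt_number, combined_output) for the first run that exits
    with a non-zero status, or None if every run succeeded.
    """
    for attempt in range(1, max_runs + 1):
        # Capture stdout and stderr so the failing run's output is preserved.
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            return attempt, result.stdout + result.stderr
    return None
```

Called as run_until_failure(["./mtr", "rpl.rpl_mark_optimize_tbl_ddl"], 100000)
from the mysql-test directory, and assuming mtr exits non-zero when the test
fails, it would stop at the first failing run and hand back its output for
inspection.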
...
And again, asking for advice can help.
...
Also, recognizing the seriousness of that bug may take more than ordinary
skills (which we may rely on much more than before :-) :pray:).
This failure did stand out as likely quite serious, since it results in the
error "Slave worker thread retried transaction 10 time(s) in vain, giving
up" from just a normal query pattern in the test. Normally it

...
should never
be necessary with more than one retry from conflicts in parallel replication
itself.
I must say this conclusion did not occur to me at once.
Sadly, too, I did not raise it with you a few years back (being "convinced" it was a feature).
...
In general, it is a fact that a lot of sporadic test failures turn out to be
"false positives"; caused by problems of the test, not of the server code.
But because of the extreme difficulty of debugging some of these problems in
production environments, the end result is still a lot of time saved.

...
caused by problems of the test
That's a part of the actual "too difficult to debug" 'perception'. And
no smile intended.
...
...
I'd take this on the chin.
I hope it's clear I'm not playing any blame game here.
And why should you, when we're in the same boat...
...
I wrote most of the
MariaDB replication code; if not this one, then many of the other mistakes
in replication are mine. I speak openly about what I think can be done
better, hoping to constantly improve things.
Thank you for this piece of sobering feedback!
That should prompt some thinking about a more reasonable balance for
handling test failures and resolving them faster.

Cheers,

Andrei

andrei.elkin＠pp.inet.fi