Kristian, salve. To respond now to just one of the points raised by you,
Andrei via developers <developers@lists.mariadb.org> writes:
...
It's a complicated matter. MDEV-28776 was not really neglected. There's
What I meant is - a number of GA releases were made over several years despite this bug being visible as a failure in the buildbot/CI.
I think there's a perception that sporadic failures in buildbot are somehow "too difficult to debug". What is important to understand is how much _MUCH_ more difficult a bug like this is to debug in the wild, in a user's production environment. Or even for the user to report it.
I can't agree more, having dealt with the difference (between finding a bug in BB and on site) for years. That must apply to most of the engineering developers. From my personal experience, some of the "sporadic" BB failures seemed to be within grasp, but I failed to tackle them in time for a number of technical reasons, and not only technical ones ^ (you acknowledged the priority list point).
In my experience, these bugs can always be tracked down with reasonable effort if using a systematic approach. First get the failure to reproduce in some environment, typically with ./mtr --repeat=100000 or something like that, or, as a last resort, by running the exact same build that fails on the actual buildbot machine. Then add debug printouts step by step until the cause is identified. The process can take some time if the failure is rare, but it can be done as a background task.
In many cases --repeat has been a reliable tool, though in a number of cases all the BB environment properties simply had to match, and I had to push "printout" commits to that specific BB. This is a tedious process, and getting satisfactory info could take weeks.
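For illustration only, here is a minimal sketch of the "rerun until it fails, then keep the output" step discussed above. mtr's own --repeat option already covers the common case; this generic wrapper is just an assumption about how one might stop at the first failure and capture its log for any command. The function name repeat_until_failure and its parameters are hypothetical and not part of mtr or buildbot.

```python
import subprocess
from typing import Optional, Sequence, Tuple


def repeat_until_failure(
    cmd: Sequence[str], max_runs: int
) -> Optional[Tuple[int, str]]:
    """Run cmd up to max_runs times and stop at the first failing run.

    Returns (run_number, combined_output) for the first run that exits
    with a non-zero status, or None if every run succeeded.
    Hypothetical helper for illustration; not part of any MariaDB tool.
    """
    for run in range(1, max_runs + 1):
        proc = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # merge stderr into the captured log
            text=True,
        )
        if proc.returncode != 0:
            # Keep the output of the failing run so the debug printouts
            # added to the server or test are not lost.
            return run, proc.stdout
    return None
```

A call might look like repeat_until_failure(["./mtr", "<test name>"], max_runs=100000), where the test name is whatever sporadically fails; the returned output is then the place to look for the printouts added in each debugging round.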
And again, asking for advice can help.
Also, recognizing the seriousness of that bug may take more than just out-of-the-ordinary skills (which we may rely on much more than before :-) :pray:).
This failure did stand out as likely quite serious, since it results in the error "Slave worker thread retried transaction 10 time(s) in vain, giving up" from just a normal query pattern in the test. Normally it should never be necessary to retry more than once because of conflicts in parallel replication itself.
I must say this conclusion did not occur to me at once. Sadly, I also did not raise it with you a few years back (being "convinced" it was a feature).
In general, it is a fact that a lot of sporadic test failures turn out to be "false positives"; caused by problems of the test, not of the server code. But because of the extreme difficulty of debugging some of these problems in production environments, the end result is still a lot of time saved.
caused by problems of the test
That's a part of the actual "too difficult to debug" 'perception'. And no smile intended.
I'd take this on the chin.
I hope it's clear I'm not playing any blame game here.
And why should you, when we're in the same boat.
I wrote most of the MariaDB replication code; if not this one, then many of the other mistakes in replication are mine. I speak openly about what I think can be done better, hoping to constantly improve things.
Thank you for this piece of sobering feedback! That should trigger some thinking about what a more reasonable balance is between tolerating test failures and resolving them faster.
Cheers, Andrei