Andrei via developers <developers@lists.mariadb.org> writes:
>> The root cause of these failures is MDEV-31655, https://jira.mariadb.org/browse/MDEV-31655. Sad that this serious bug was actually visible as test failures in the CI for years without being fixed. What can we do to improve this in the future?

> There's always a priority list of what we - the "corporate" folks - are to be busy with.
Exactly, priority is one key point. What I hear from users is that it has become very difficult to upgrade even minor versions of MariaDB due to the risk of regressions. Everyone will say that avoiding regressions is important. But only we experienced developers really understand which priorities are needed to minimise the risk of regressions in releases, and most importantly in minor updates to stable releases.

Another key point is communication and discussion. I don't recall ever seeing any discussion of the code merge that erroneously removed the code in question. Later, the actual implementation was removed as the only change in a commit titled "remove dead code". Just checking `git blame` and asking whether this code should really be dead would have immediately caught this problem.

I think it is very important to raise awareness among all developers with push access of how critical a piece of software MariaDB is, and how important it is _never_ to push without fully understanding how the affected parts of the server code work. Better to ask one time too many than one time too few.
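To make the `git blame` check concrete: something as simple as the following, run before pushing a "remove dead code" commit, would have raised the question (the file name and identifier below are only placeholders for illustration, not the actual commit in question):

    # Who last touched these lines, and in which commit?
    git blame -L 100,160 sql/rpl_parallel.cc

    # When was this identifier introduced or removed, and why?
    git log --oneline -S 'some_identifier' -- sql/rpl_parallel.cc

If the answers are surprising, that is exactly the moment to ask on the list or ping the original author before pushing.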
> It's a complicated matter. MDEV-28776 was not really neglected.
What I meant is that a number of GA releases were made over several years despite this bug being visible as a failure in the buildbot/CI. I think there's a perception that sporadic failures in buildbot are somehow "too difficult to debug". What is important to understand is how much _MUCH_ more difficult a bug like this is to debug in the wild in a user's production environment, or even for the user to report it at all.

In my experience, these bugs can always be tracked down with reasonable effort using a systematic approach. First get the failure to reproduce in some environment, typically with ./mtr --repeat=100000 or something like that, as a last resort by running the exact same build that fails on the actual buildbot machine. Then add debug printouts step by step until the cause is identified. The process can take some time if the failure is rare, but it can be done as a background task. And again, asking for advice can help.
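As a concrete sketch of that first step (the test name below is only an example, not the actual failing test):

    cd mysql-test

    # Re-run the suspect test until the sporadic failure reproduces locally.
    ./mtr --repeat=10000 rpl.rpl_parallel_optimistic

    # Once it reproduces, add temporary debug printouts in the suspected code
    # path, rebuild, and run the loop again, narrowing down step by step.

Even with a rare failure, such a loop can run unattended in the background while other work continues.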
> Also, recognizing the seriousness of that bug may take not just ordinary skills (which we may rely on much more than before :-) :pray:).
This failure did stand out as likely quite serious, since it results in the error "Slave worker thread retried transaction 10 time(s) in vain, giving up" from just a normal query pattern in the test. Normally, conflicts in parallel replication itself should never require more than one retry. In general, it is a fact that a lot of sporadic test failures turn out to be "false positives", caused by problems in the test rather than in the server code. But because debugging the real ones in a user's production environment is so extremely difficult, the time spent tracking failures down in buildbot still ends up as a large net saving.
> I'd take this on the chin.
I hope it's clear I'm not playing any blame game here. I wrote most of the MariaDB replication code; if not this one, then many of the other mistakes in replication are mine. I speak openly about what I think can be done better, hoping to constantly improve things.

 - Kristian.