Debugging MDEV-28776, the rare rpl.rpl_mark_optimize_tbl_ddl test failure
Hi Brandon,

The test failure in https://jira.mariadb.org/browse/MDEV-28776 (the one with 10 failed retries after deadlock, not the freebsd failures) seems like it could be something serious. But it has been very difficult to track down. I believe you have tried different things, and I also tried hard to reproduce it, so far without success. But it's still failing occasionally in actual buildbot runs.

What do you think about pushing the below to the different branches? It makes the test case log InnoDB information about deadlocks encountered. This way, when it fails the next time in buildbot we can see 1) what kind of InnoDB deadlock is triggering the problem, if any; and 2) if it actually gets a deadlock inside InnoDB 10 times in a row, or if it is something else.

 - Kristian.

-----------------------------------------------------------------------
Author: Kristian Nielsen <knielsen@knielsen-hq.org>
Date:   Sun Jul 9 15:18:03 2023 +0200

    MDEV-28776: rpl.rpl_mark_optimize_tbl_ddl fails with timeout on
    sync_with_master

    This commit just extends the testcase to include some more information
    when the test failure occurs. The failure has so far only been
    reproducible in buildbot runs.

    Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
-----------------------------------------------------------------------
diff --git a/mysql-test/suite/rpl/t/rpl_mark_optimize_tbl_ddl.opt b/mysql-test/suite/rpl/t/rpl_mark_optimize_tbl_ddl.opt
new file mode 100644
index 00000000000..3c53d5257b4
--- /dev/null
+++ b/mysql-test/suite/rpl/t/rpl_mark_optimize_tbl_ddl.opt
@@ -0,0 +1 @@
+--innodb-print-all-deadlocks=1
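(As an aside, and not part of the patch: the same deadlock logging can also be switched on at runtime on an already running server, since innodb_print_all_deadlocks is a dynamic variable. The bare client invocation below is only a sketch; connection options are omitted.)

    # Log every InnoDB deadlock to the server error log, without a restart.
    mysql -e "SET GLOBAL innodb_print_all_deadlocks = ON;"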
Hi Kristian,

That sounds like a good idea to me. Did you mean to use the option for both primary and replica though, or did you just want `...-slave.opt`?

Brandon
Brandon Nesterenko via developers <developers@lists.mariadb.org> writes:

> That sounds like a good idea to me.

Thanks Brandon. I actually now managed to reproduce the failure once, after several days running on the buildbot machines. I'll see if I can get any progress that way, otherwise I'll try to push this patch (it will have to be done to all the main trees to be effective, as the buildbot failures are in all the versions...).

> Did you mean to use the option for both primary and replica though, or did
> you just want `...-slave.opt`?

Just the slave is fine, as that's where the deadlocks occur. But thanks for checking!

 - Kristian.
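(For reference: mtr picks which server an option file applies to from its name. A plain <testname>.opt, as in the patch above, is applied to every server started for the test, while <testname>-master.opt and <testname>-slave.opt apply only to that side. So a replica-only variant would just use a different file name; the command below is a sketch of that, not something from this thread.)

    # Hypothetical replica-only version of the same patch.
    echo '--innodb-print-all-deadlocks=1' \
        > mysql-test/suite/rpl/t/rpl_mark_optimize_tbl_ddl-slave.opt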
Kristian Nielsen <knielsen@knielsen-hq.org> writes:

> I actually now managed to reproduce the failure once, after several days
> running on the buildbot machines. I'll see if I can get any progress that
> way, otherwise I'll try to push this patch (it will have to be done to all

I was able to solve this by running on the buildbot machines, so I won't have to push this debugging patch to main trees, which is good.

The root cause of these failures is MDEV-31655, https://jira.mariadb.org/browse/MDEV-31655. Sad that this serious bug was actually visible as test failures in the CI for years without being fixed. What can we do to improve this in the future?

 - Kristian.
Howdy Kristian,

Kristian Nielsen <knielsen@knielsen-hq.org> writes:

>> I actually now managed to reproduce the failure once, after several days
>> running on the buildbot machines. I'll see if I can get any progress that
>> way, otherwise I'll try to push this patch (it will have to be done to all
>
> I was able to solve this by running on the buildbot machines, so I won't
> have to push this debugging patch to main trees, which is good.
>
> The root cause of these failures is MDEV-31655,

My congratulations on, and thanks for, MDEV-28776 having been unriddled by you.

> https://jira.mariadb.org/browse/MDEV-31655. Sad that this serious bug was
> actually visible as test failures in the CI for years without being fixed.
> What can we do to improve this in the future?

It's a complicated matter. MDEV-28776 was not really neglected. There's always a priority list of what we - the "corporate" folks - are to be busy with. Also, recognising the seriousness of that bug may take more than ordinary skills (which we may rely on much more than before :-) :pray:). I'd take this on the chin. Needless to say, it would be perfect to catch this sort of bug with mtr tests at the time the code is removed, which argues for self-protecting features.

Cheers,

Andrei
Andrei via developers <developers@lists.mariadb.org> writes:

>> The root cause of these failures is MDEV-31655,
>> https://jira.mariadb.org/browse/MDEV-31655. Sad that this serious bug was
>> actually visible as test failures in the CI for years without being fixed.
>> What can we do to improve this in the future?
>
> It's a complicated matter. MDEV-28776 was not really neglected. There's
> always a priority list of what we - the "corporate" folks - are to be
> busy with.
Exactly, priority is one key point. What I hear from users is that it has become very difficult to upgrade even minor versions of MariaDB due to the risk of regressions. Everyone will say that avoiding regressions is important. But only we experienced developers really understand which priorities are needed to truly minimise the risk of regressions in releases, and most importantly in minor updates to stable releases.

Another key point is communication and discussion. I don't recall ever seeing any discussion of the code merge that erroneously removed the code in question. Later, the actual implementation was removed as the only change in a commit titled "remove dead code". Just checking `git blame` and asking whether this code should really be dead would have immediately caught the problem. I think it is very important to raise awareness among all developers with push access of how critical a piece of software MariaDB is, and how important it is _never_ to push without fully understanding how the affected parts of the server code work. Better to ask one time too many than one time too few.
> It's a complicated matter. MDEV-28776 was not really neglected. There's
What I meant is that a number of GA releases were made over several years despite this bug being visible as a failure in the buildbot/CI.

I think there is a perception that sporadic failures in buildbot are somehow "too difficult to debug". What is important to understand is how much, _MUCH_ more difficult a bug like this is to debug in the wild in a user's production environment. Or even for the user to report it at all.

In my experience, these bugs can always be tracked down with reasonable effort by using a systematic approach. First get the failure to reproduce in some environment, typically with ./mtr --repeat=100000 or something like that, as a last resort by running the exact same build that fails on the actual buildbot machine. Then add debug printouts step by step until the cause is identified. The process can take some time if the failure is rare, but it can be done as a background task. And again, asking for advice can help.
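(Concretely, for this particular test such a reproduction run boils down to something like the following, run from a built source tree; the command is only a sketch and the repeat count is just the example figure from above.)

    # Run the failing test over and over until it trips.
    cd mysql-test
    ./mtr --repeat=100000 rpl.rpl_mark_optimize_tbl_ddl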
> Also, recognising the seriousness of that bug may take more than ordinary
> skills (which we may rely on much more than before :-) :pray:).
This failure did stand out as likely quite serious, since it results in the error "Slave worker thread retried transaction 10 time(s) in vain, giving up" from just a normal query pattern in the test. Normally, more than one retry should never be necessary for conflicts arising from parallel replication itself.

In general, it is a fact that a lot of sporadic test failures turn out to be "false positives", caused by problems in the test rather than in the server code. But because of the extreme difficulty of debugging some of these problems in production environments, the end result of tracking them down is still a lot of time saved.
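(For context, the limit of 10 presumably comes from the replica's slave_transaction_retries setting, whose default is 10. On the replica, the setting and the number of retries that have actually happened can be checked with something like the commands below; the bare client invocation is just an example.)

    # Current retry limit and cumulative retry counter on the replica.
    mysql -e "SHOW GLOBAL VARIABLES LIKE 'slave_transaction_retries';"
    mysql -e "SHOW GLOBAL STATUS LIKE 'Slave_retried_transactions';"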
> I'd take this on the chin.
I hope it's clear I'm not playing any blame game here. I wrote most of the MariaDB replication code; if not this one, then many of the other mistakes in replication are mine. I speak openly about what I think can be done better, hoping to constantly improve things.

 - Kristian.
Kristian, salve. To respond now to just one of the points you raised:

> Andrei via developers <developers@lists.mariadb.org> writes:
>
> ...
>
>> It's a complicated matter. MDEV-28776 was not really neglected. There's
>
> What I meant is that a number of GA releases were made over several years
> despite this bug being visible as a failure in the buildbot/CI.
>
> I think there is a perception that sporadic failures in buildbot are
> somehow "too difficult to debug". What is important to understand is how
> much, _MUCH_ more difficult a bug like this is to debug in the wild in a
> user's production environment. Or even for the user to report it at all.

I can't agree more, having dealt with that difference (between finding a bug in buildbot and finding it on site) for years. The same must apply to most of the engineering developers. From my personal experience, some of the "sporadic" BB failures seemed to be within grasp, but I failed to tackle them in time for a number of technical reasons, and not only technical ones ^ (you acknowledged the priority-list point).

> In my experience, these bugs can always be tracked down with reasonable
> effort by using a systematic approach. First get the failure to reproduce
> in some environment, typically with ./mtr --repeat=100000 or something like
> that, as a last resort by running the exact same build that fails on the
> actual buildbot machine. Then add debug printouts step by step until the
> cause is identified. The process can take some time if the failure is rare,
> but it can be done as a background task.

In many cases --repeat has been a reliable tool, though in a number of cases all the BB environment properties must match exactly, and I have had to push "printout" commits to that specific BB. This is a tedious process, and getting satisfactory info could take weeks.

> And again, asking for advice can help.
>
>> Also, recognising the seriousness of that bug may take more than ordinary
>> skills (which we may rely on much more than before :-) :pray:).
>
> This failure did stand out as likely quite serious, since it results in the
> error "Slave worker thread retried transaction 10 time(s) in vain, giving
> up" from just a normal query pattern in the test. Normally, more than one
> retry should never be necessary for conflicts arising from parallel
> replication itself.

I must say this conclusion did not occur to me at once. Sadly, I also did not raise it with you a few years back (being "convinced" it was a feature).

> In general, it is a fact that a lot of sporadic test failures turn out to
> be "false positives", caused by problems in the test rather than in the
> server code. But because of the extreme difficulty of debugging some of
> these problems in production environments, the end result of tracking them
> down is still a lot of time saved.

> caused by problems in the test

That is part of where the actual "too difficult to debug" 'perception' comes from. And no smile intended.

>> I'd take this on the chin.
>
> I hope it's clear I'm not playing any blame game here.

And why should you, when we're in the same boat...

> I wrote most of the MariaDB replication code; if not this one, then many of
> the other mistakes in replication are mine. I speak openly about what I
> think can be done better, hoping to constantly improve things.

Thank you for this piece of sobering feedback! It should trigger some thinking about a more reasonable balance between test failures and their faster resolution.

Cheers,

Andrei
Participants (3):
 - andrei.elkin@pp.inet.fi
 - Brandon Nesterenko
 - Kristian Nielsen