[PATCH 0/6] Testcase fixes
A handful of fixes for test failures in buildbot.

Kristian Nielsen (6):
  MDEV-34696: do_gco_wait() completes too early on InnoDB dict stats updates
  Restore skiping rpl.rpl_mdev6020 under Valgrind
  Fix sporadic test failure in rpl.rpl_create_drop_event
  Fix sporadic failure of test case rpl.rpl_old_master
  Skip mariabackup.slave_provision_nolock in --valgrind, it uses a lot of CPU
  Fix sporadic failure of test case rpl.rpl_start_stop_slave

 .../mariabackup/slave_provision_nolock.test   |  2 ++
 mysql-test/suite/rpl/r/rpl_old_master.result  |  3 ---
 .../suite/rpl/t/rpl_create_drop_event.test    |  6 ++++++
 mysql-test/suite/rpl/t/rpl_mdev6020.test      |  2 ++
 mysql-test/suite/rpl/t/rpl_old_master.test    |  7 -------
 .../suite/rpl/t/rpl_start_stop_slave.test     | 12 ++++++++++-
 sql/rpl_parallel.cc                           | 20 +++++++++++++++----
 sql/rpl_rli.cc                                | 17 ++++++++++++++++
 8 files changed, 54 insertions(+), 15 deletions(-)

-- 
2.39.2
Before doing mark_start_commit(), check that there is no pending deadlock
kill. If there is a pending kill, we won't commit (we will abort, roll back,
and retry). Then we should not mark the commit as started, since that could
potentially make the following GCO start too early, before we completed the
commit after the retry.
This condition could trigger in some corner cases, where InnoDB would
temporarily take table/row locks that are released again immediately, not held
until the transaction commits. This happens with dict_stats updates and
possibly auto-increment locks.
Such locks can be passed to thd_rpl_deadlock_check() and cause a deadlock
kill to be scheduled in the background. But since the blocking locks are
held only temporarily, they can be released before the background kill
happens. This way, the kill can be delayed until after mark_start_commit()
has been called. Thus we need to check the synchronous indication
rgi->killed_for_retry, not just the asynchronous thd->killed.
Signed-off-by: Kristian Nielsen
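The race described in this commit message can be illustrated with a small toy model. This is only a sketch and not the server code: ToyRgi, schedule_deadlock_kill(), background_kill_runs() and both try_mark_* functions are hypothetical names, and RETRY_KILL_PENDING is an assumed second enum value next to the RETRY_KILL_NONE that appears in the patch. The background kill is modeled as a plain function call rather than a real thread.

```cpp
#include <cassert>

// Toy model only. Names echo the commit message, but every type and
// function here is a hypothetical simplification.
enum RetryKill { RETRY_KILL_NONE, RETRY_KILL_PENDING /* assumed value */ };

struct ToyRgi {
  RetryKill killed_for_retry = RETRY_KILL_NONE; // set synchronously
  bool thd_killed = false;                      // set later, asynchronously
  bool marked_start_commit = false;
};

// The deadlock check schedules a kill: the synchronous flag is set at once,
// while thd_killed stays false until the background kill actually runs.
void schedule_deadlock_kill(ToyRgi &rgi) {
  rgi.killed_for_retry = RETRY_KILL_PENDING;
}

// Stand-in for the background kill finally being delivered.
void background_kill_runs(ToyRgi &rgi) { rgi.thd_killed = true; }

// Checks only the asynchronous flag, so a scheduled but not yet delivered
// kill is missed and the commit is marked as started.
bool try_mark_async_only(ToyRgi &rgi) {
  if (rgi.thd_killed) return false;
  rgi.marked_start_commit = true;
  return true;
}

// Checks the synchronous flag first and waits for the pending kill
// (modeled by delivering it here) before looking at thd_killed.
bool try_mark_with_sync_check(ToyRgi &rgi) {
  if (rgi.killed_for_retry != RETRY_KILL_NONE)
    background_kill_runs(rgi); // stands in for waiting on the pending kill
  if (rgi.thd_killed) return false; // abort, roll back, retry
  rgi.marked_start_commit = true;
  return true;
}
```

In this model, a transaction with a scheduled kill passes try_mark_async_only() and marks its commit as started, while try_mark_with_sync_check() refuses to mark it.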
Howdy Kristian.
Before doing mark_start_commit(), check that there is no pending deadlock kill. If there is a pending kill, we won't commit (we will abort, roll back, and retry). Then we should not mark the commit as started, since that could potentially make the following GCO start too early, before we completed the commit after the retry.
This condition could trigger in some corner cases, where InnoDB would temporarily take table/row locks that are released again immediately, not held until the transaction commits. This happens with dict_stats updates and possibly auto-increment locks.
Such locks can be passed to thd_rpl_deadlock_check() and cause a deadlock kill to be scheduled in the background. But since the blocking locks are held only temporarily, they can be released before the background kill happens. This way, the kill can be delayed until after mark_start_commit() has been called. Thus we need to check the synchronous indication rgi->killed_for_retry, not just the asynchronous thd->killed.
I think I understood what's going on, also thanks to a verbose Jira ticket description.
Signed-off-by: Kristian Nielsen
---
 sql/rpl_parallel.cc | 20 ++++++++++++++++----
 sql/rpl_rli.cc      | 17 +++++++++++++++++
 2 files changed, 33 insertions(+), 4 deletions(-)

diff --git a/sql/rpl_parallel.cc b/sql/rpl_parallel.cc
index 1cfdf96ee3b..9c4222d7817 100644
--- a/sql/rpl_parallel.cc
+++ b/sql/rpl_parallel.cc
@@ -1450,11 +1450,23 @@ handle_rpl_parallel_thread(void *arg)
             after mark_start_commit(), we have to unmark, which has at least a
             theoretical possibility of leaving a window where it looks like all
             transactions in a GCO have started committing, while in fact one
-            will need to rollback and retry. This is not supposed to be possible
-            (since there is a deadlock, at least one transaction should be
-            blocked from reaching commit), but this seems a fragile ensurance,
-            and there were historically a number of subtle bugs in this area.
+            will need to rollback and retry.
+
+            Normally this will not happen, since the kill is there to resolve a
+            deadlock that is preventing at least one transaction from proceeding.
+            One case it can happen is with InnoDB dict stats update, which can
+            temporarily cause transactions to block each other, but locks are
+            released immediately, they don't linger until commit. There could be
+            other similar cases, there were historically a number of subtle bugs
+            in this area.
+
+            But once we start the commit, we can expect that no new lock
+            conflicts will be introduced. So by handling any lingering deadlock
+            kill at this point just before mark_start_commit(), we should be
+            robust even towards spurious deadlock kills.
           */
+          if (rgi->killed_for_retry != rpl_group_info::RETRY_KILL_NONE)
+            wait_for_pending_deadlock_kill(thd, rgi);
Assuming my understanding is right (please correct me if it is not), I left a question on this block in https://github.com/MariaDB/server/commit/df0c36a354ffde2766ebed2615642c74b79... quoted further.

... this guard would not let a victim of an optimistic conflict proceed until it has cleared up for itself whether it is such a victim. For instance, in a binlog sequence like T_1, T_2, D_3, where T_2 is the victim and D_3 is a DDL, the victim T_2 will no longer release the next GCO member D_3 into execution.

But what if at this point T_2 is only about to be marked in rgi->killed_for_retry by T_1? That is, with this patch as well, T_2 is apparently still able to go through with a ...NONE killed status. T_1 would reach this very point later, and if that happens before T_2 finds itself killed, T_1 would release D_3, and only afterward would T_2 finally see itself KILLed into retry.

Cheers,
Andrei
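The interleaving asked about above can be replayed step by step in a toy model. This is a sketch under stated assumptions: ToyGco, ToyTrx, guard_and_mark() and questioned_interleaving() are hypothetical names, the GCO is assumed to hold exactly T_1 and T_2, and D_3 is assumed to be released once both have marked their commit as started. It only replays the order described in the question and says nothing about whether the server can actually reach that order.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy model only; all names are hypothetical.
struct ToyGco {
  int members = 2;            // T_1 and T_2
  int started_commit = 0;     // how many members marked start-of-commit
  bool next_released = false; // whether D_3 may start
};

struct ToyTrx {
  bool killed_for_retry = false;
};

// The guard from the patch, reduced to a flag check, followed by marking
// the commit as started. The last member to mark releases the next GCO.
bool guard_and_mark(const ToyTrx &t, ToyGco &gco) {
  if (t.killed_for_retry) return false;
  if (++gco.started_commit == gco.members) gco.next_released = true;
  return true;
}

// Replays the order from the question: T_2 passes the guard before T_1
// has marked it as a victim.
std::vector<std::string> questioned_interleaving() {
  ToyGco gco;
  ToyTrx t1, t2;
  std::vector<std::string> log;
  if (guard_and_mark(t2, gco)) log.push_back("T_2 passes guard");
  if (guard_and_mark(t1, gco)) log.push_back("T_1 passes guard");
  if (gco.next_released) log.push_back("D_3 released");
  t2.killed_for_retry = true; // the kill arrives only now
  log.push_back("T_2 killed for retry");
  return log;
}
```

In this replay the "D_3 released" entry is logged before "T_2 killed for retry", which is the ordering the question is concerned about.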
(Revert a change done by mistake when XtraDB was removed.)
Signed-off-by: Kristian Nielsen
Depending on timing, an extra event run could start just when the event
scheduler is shut down and delay running until after the table has been
dropped; this would cause the test to fail with a "table does not exist"
error in the log.
Signed-off-by: Kristian Nielsen
Remove the test for MDEV-14528. This is supposed to test that parallel
replication from pre-10.0 master will update Seconds_Behind_Master. But
after MDEV-12179 the SQL thread is blocked from even beginning to fetch
events from the relay log due to FLUSH TABLES WITH READ LOCK, so the test
case is no longer testing what it was intended to. And pre-10.0 versions are
long since out of support, so it does not seem worthwhile to try to rewrite the
test to work another way.
The root cause of the test failure is MDEV-34778. Briefly, depending on
exact timing during slave stop, the rli->sql_thread_caught_up flag may end
up with a different value. If it ends up as "true", this causes
Seconds_Behind_Master to be 0 during next slave start; and this caused test
case timeout as the test was waiting for Seconds_Behind_Master to become
non-zero.
Signed-off-by: Kristian Nielsen
Signed-off-by: Kristian Nielsen
The test was expecting the I/O thread to be in a specific state, but thread
scheduling may cause it to not yet have reached that state. So just have a
loop that waits for the expected state to occur.
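The actual fix is written in the mysql-test language, which is not shown in this thread. As a rough C++ analog of the same idea, the sketch below polls a condition instead of asserting it once. wait_for_state() and its parameters are hypothetical and are not part of the test suite.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical helper: polls `reached` until it returns true or until
// `max_attempts` checks have been made, sleeping `interval` between checks.
// Returns true if the expected state was observed, false on timeout.
bool wait_for_state(const std::function<bool()> &reached, int max_attempts,
                    std::chrono::milliseconds interval) {
  for (int attempt = 0; attempt < max_attempts; ++attempt) {
    if (reached()) return true;
    std::this_thread::sleep_for(interval);
  }
  return false;
}
```

A caller passes a predicate that inspects the thread state, so a thread that has not yet been scheduled only delays the check rather than failing it.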
Signed-off-by: Kristian Nielsen
participants (2)
- andrei.elkin@pp.inet.fi
- Kristian Nielsen