[PATCH 0/6] Testcase fixes
A handful of fixes for test failures in buildbot.

Kristian Nielsen (6):
  MDEV-34696: do_gco_wait() completes too early on InnoDB dict stats updates
  Restore skipping rpl.rpl_mdev6020 under Valgrind
  Fix sporadic test failure in rpl.rpl_create_drop_event
  Fix sporadic failure of test case rpl.rpl_old_master
  Skip mariabackup.slave_provision_nolock in --valgrind, it uses a lot of CPU
  Fix sporadic failure of test case rpl.rpl_start_stop_slave

 .../mariabackup/slave_provision_nolock.test  |  2 ++
 mysql-test/suite/rpl/r/rpl_old_master.result |  3 ---
 .../suite/rpl/t/rpl_create_drop_event.test   |  6 ++++++
 mysql-test/suite/rpl/t/rpl_mdev6020.test     |  2 ++
 mysql-test/suite/rpl/t/rpl_old_master.test   |  7 -------
 .../suite/rpl/t/rpl_start_stop_slave.test    | 12 ++++++++++-
 sql/rpl_parallel.cc                          | 20 +++++++++++++++----
 sql/rpl_rli.cc                               | 17 ++++++++++++++++
 8 files changed, 54 insertions(+), 15 deletions(-)

-- 
2.39.2
Before doing mark_start_commit(), check that there is no pending deadlock
kill. If there is a pending kill, we won't commit (we will abort, roll back,
and retry). Then we should not mark the commit as started, since that could
potentially make the following GCO start too early, before we completed the
commit after the retry.

This condition could trigger in some corner cases, where InnoDB would take
temporary table/row locks that are released again immediately, not held
until the transaction commits. This happens with dict_stats updates and
possibly auto-increment locks.

Such locks can be passed to thd_rpl_deadlock_check() and cause a deadlock
kill to be scheduled in the background. But since the blocking locks are
held only temporarily, they can be released before the background kill
happens. This way, the kill can be delayed until after mark_start_commit()
has been called. Thus we need to check the synchronous indication
rgi->killed_for_retry, not just the asynchronous thd->killed.

Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
---
 sql/rpl_parallel.cc | 20 ++++++++++++++++----
 sql/rpl_rli.cc      | 17 +++++++++++++++++
 2 files changed, 33 insertions(+), 4 deletions(-)

diff --git a/sql/rpl_parallel.cc b/sql/rpl_parallel.cc
index 1cfdf96ee3b..9c4222d7817 100644
--- a/sql/rpl_parallel.cc
+++ b/sql/rpl_parallel.cc
@@ -1450,11 +1450,23 @@ handle_rpl_parallel_thread(void *arg)
        after mark_start_commit(), we have to unmark, which has at least
        a theoretical possibility of leaving a window where it looks like
        all transactions in a GCO have started committing, while in fact one
-       will need to rollback and retry. This is not supposed to be possible
-       (since there is a deadlock, at least one transaction should be
-       blocked from reaching commit), but this seems a fragile ensurance,
-       and there were historically a number of subtle bugs in this area.
+       will need to rollback and retry.
+
+       Normally this will not happen, since the kill is there to resolve a
+       deadlock that is preventing at least one transaction from proceeding.
+       One case it can happen is with InnoDB dict stats update, which can
+       temporarily cause transactions to block each other, but locks are
+       released immediately, they don't linger until commit. There could be
+       other similar cases, there were historically a number of subtle bugs
+       in this area.
+
+       But once we start the commit, we can expect that no new lock
+       conflicts will be introduced. So by handling any lingering deadlock
+       kill at this point just before mark_start_commit(), we should be
+       robust even towards spurious deadlock kills.
      */
+     if (rgi->killed_for_retry != rpl_group_info::RETRY_KILL_NONE)
+       wait_for_pending_deadlock_kill(thd, rgi);
      if (!thd->killed)
      {
        DEBUG_SYNC(thd, "rpl_parallel_before_mark_start_commit");
diff --git a/sql/rpl_rli.cc b/sql/rpl_rli.cc
index 9d4e09a362c..8284b07f7ff 100644
--- a/sql/rpl_rli.cc
+++ b/sql/rpl_rli.cc
@@ -2518,6 +2518,23 @@ rpl_group_info::unmark_start_commit()
 
   e= this->parallel_entry;
   mysql_mutex_lock(&e->LOCK_parallel_entry);
+  /*
+    Assert that we have not already wrongly completed this GCO and signalled
+    the next one to start, only to now unmark and make the signal invalid.
+    This is to catch problems like MDEV-34696.
+
+    The error inject rpl_parallel_simulate_temp_err_xid is used to test this
+    precise situation, that we handle it gracefully if it somehow occurs in a
+    release build. So disable the assert in this case.
+  */
+#ifndef DBUG_OFF
+  bool allow_unmark_after_complete= false;
+  DBUG_EXECUTE_IF("rpl_parallel_simulate_temp_err_xid",
+                  allow_unmark_after_complete= true;);
+  DBUG_ASSERT(!gco->next_gco ||
+              gco->next_gco->wait_count > e->count_committing_event_groups ||
+              allow_unmark_after_complete);
+#endif
   --e->count_committing_event_groups;
   mysql_mutex_unlock(&e->LOCK_parallel_entry);
 }
-- 
2.39.2
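To summarize the fix at a glance, the control flow around mark_start_commit()
in handle_rpl_parallel_thread() now looks roughly as below. This is a
condensed sketch based on the diff above, not the literal code; in particular,
the call to rgi->mark_start_commit() after the DEBUG_SYNC point is implied by
the commit message rather than shown in the quoted hunk.

    /* Resolve any lingering deadlock kill before marking the commit started. */
    if (rgi->killed_for_retry != rpl_group_info::RETRY_KILL_NONE)
      wait_for_pending_deadlock_kill(thd, rgi);
    if (!thd->killed)
    {
      DEBUG_SYNC(thd, "rpl_parallel_before_mark_start_commit");
      /* Only now signal that this event group has reached the commit phase,
         allowing the following GCO to proceed. */
      rgi->mark_start_commit();
    }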
Howdy Kristian.
> Before doing mark_start_commit(), check that there is no pending deadlock
> kill. If there is a pending kill, we won't commit (we will abort, roll
> back, and retry). Then we should not mark the commit as started, since
> that could potentially make the following GCO start too early, before we
> completed the commit after the retry.
>
> This condition could trigger in some corner cases, where InnoDB would take
> temporary table/row locks that are released again immediately, not held
> until the transaction commits. This happens with dict_stats updates and
> possibly auto-increment locks.
>
> Such locks can be passed to thd_rpl_deadlock_check() and cause a deadlock
> kill to be scheduled in the background. But since the blocking locks are
> held only temporarily, they can be released before the background kill
> happens. This way, the kill can be delayed until after mark_start_commit()
> has been called. Thus we need to check the synchronous indication
> rgi->killed_for_retry, not just the asynchronous thd->killed.
I think I understood what's going on, also thanks to a verbose Jira ticket description.
> Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
> ---
>  sql/rpl_parallel.cc | 20 ++++++++++++++++----
>  sql/rpl_rli.cc      | 17 +++++++++++++++++
>  2 files changed, 33 insertions(+), 4 deletions(-)
>
> diff --git a/sql/rpl_parallel.cc b/sql/rpl_parallel.cc
> index 1cfdf96ee3b..9c4222d7817 100644
> --- a/sql/rpl_parallel.cc
> +++ b/sql/rpl_parallel.cc
> @@ -1450,11 +1450,23 @@ handle_rpl_parallel_thread(void *arg)
>         after mark_start_commit(), we have to unmark, which has at least
>         a theoretical possibility of leaving a window where it looks like
>         all transactions in a GCO have started committing, while in fact one
> -       will need to rollback and retry. This is not supposed to be possible
> -       (since there is a deadlock, at least one transaction should be
> -       blocked from reaching commit), but this seems a fragile ensurance,
> -       and there were historically a number of subtle bugs in this area.
> +       will need to rollback and retry.
> +
> +       Normally this will not happen, since the kill is there to resolve a
> +       deadlock that is preventing at least one transaction from proceeding.
> +       One case it can happen is with InnoDB dict stats update, which can
> +       temporarily cause transactions to block each other, but locks are
> +       released immediately, they don't linger until commit. There could be
> +       other similar cases, there were historically a number of subtle bugs
> +       in this area.
> +
> +       But once we start the commit, we can expect that no new lock
> +       conflicts will be introduced. So by handling any lingering deadlock
> +       kill at this point just before mark_start_commit(), we should be
> +       robust even towards spurious deadlock kills.
>       */
> +     if (rgi->killed_for_retry != rpl_group_info::RETRY_KILL_NONE)
> +       wait_for_pending_deadlock_kill(thd, rgi);
Assuming my understanding is right (please correct me if anything is off), I
left a question on this block in
https://github.com/MariaDB/server/commit/df0c36a354ffde2766ebed2615642c74b79...
quoted below.

  ... this guard would not let a victim of an optimistic conflict proceed
  until it has cleared up for itself whether it is such a victim. For
  instance, in a binlog sequence like T_1, T_2, D_3, where T_2 is the victim
  and D_3 is a DDL, the victim T_2 will no longer release the next GCO
  member D_3 into execution.

  But what if, at this point, T_2 has not yet been marked in
  rgi->killed_for_retry by T_1? That is, even with this patch, T_2 can
  apparently pass this check with a ...NONE killed status and go through.
  T_1 would reach this very point later, and if that happens before T_2
  finds itself killed, T_1 would release D_3, and only afterward would T_2
  finally see itself KILLed into retry.

Cheers,

Andrei
(Revert a change done by mistake when XtraDB was removed.)

Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
---
 mysql-test/suite/rpl/t/rpl_mdev6020.test | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/mysql-test/suite/rpl/t/rpl_mdev6020.test b/mysql-test/suite/rpl/t/rpl_mdev6020.test
index ec3fd92f817..314059a5a49 100644
--- a/mysql-test/suite/rpl/t/rpl_mdev6020.test
+++ b/mysql-test/suite/rpl/t/rpl_mdev6020.test
@@ -1,3 +1,5 @@
+# Test applies a large binlog, takes long under Valgrind with little benefit.
+--source include/not_valgrind.inc
 --source include/have_innodb.inc
 --source include/have_partition.inc
 --source include/have_binlog_format_mixed_or_row.inc
-- 
2.39.2
Depending on timing, an extra event run could start just when the event
scheduler is shut down and delay running until after the table has been
dropped; this would cause the test to fail with a "table does not exist"
error in the log.

Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
---
 mysql-test/suite/rpl/t/rpl_create_drop_event.test | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/mysql-test/suite/rpl/t/rpl_create_drop_event.test b/mysql-test/suite/rpl/t/rpl_create_drop_event.test
index 96a7e82d6f7..79bb0ffec90 100644
--- a/mysql-test/suite/rpl/t/rpl_create_drop_event.test
+++ b/mysql-test/suite/rpl/t/rpl_create_drop_event.test
@@ -14,6 +14,12 @@ SET GLOBAL event_scheduler=on;
 let $wait_condition= SELECT count(*)>0 FROM t1;
 --source include/wait_condition.inc
 SET GLOBAL event_scheduler=off;
+# If the time rolls to the next whole second just at this point, a new event
+# run may be scheduled. Wait for this to disappear, otherwise we see occasional
+# test failures if the table gets dropped before the extra event run completes.
+# Expect 5 connections: default, master, master1, server_1, binlog dump thread
+--let $wait_condition= SELECT COUNT(*) = 5 FROM INFORMATION_SCHEMA.PROCESSLIST;
+--source include/wait_condition.inc
 SELECT DISTINCT a FROM t1;
 
 DELETE FROM t1;
-- 
2.39.2
Remove the test for MDEV-14528. This is supposed to test that parallel
replication from a pre-10.0 master will update Seconds_Behind_Master. But
after MDEV-12179, the SQL thread is blocked from even beginning to fetch
events from the relay log due to FLUSH TABLES WITH READ LOCK, so the test
case is no longer testing what it was intended to. And pre-10.0 versions are
long since out of support, so it does not seem worthwhile to try to rewrite
the test to work another way.

The root cause of the test failure is MDEV-34778. Briefly, depending on
exact timing during slave stop, the rli->sql_thread_caught_up flag may end
up with different values. If it ends up as "true", this causes
Seconds_Behind_Master to be 0 during the next slave start; and this caused a
test case timeout as the test was waiting for Seconds_Behind_Master to
become non-zero.

Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
---
 mysql-test/suite/rpl/r/rpl_old_master.result | 3 ---
 mysql-test/suite/rpl/t/rpl_old_master.test   | 7 -------
 2 files changed, 10 deletions(-)

diff --git a/mysql-test/suite/rpl/r/rpl_old_master.result b/mysql-test/suite/rpl/r/rpl_old_master.result
index f985bee6832..dd3de4d327b 100644
--- a/mysql-test/suite/rpl/r/rpl_old_master.result
+++ b/mysql-test/suite/rpl/r/rpl_old_master.result
@@ -9,10 +9,7 @@ connection slave;
 SET @old_parallel= @@GLOBAL.slave_parallel_threads;
 SET GLOBAL slave_parallel_threads=10;
 CHANGE MASTER TO master_host='127.0.0.1', master_port=SERVER_MYPORT_1, master_user='root', master_log_file='master-bin.000001', master_log_pos=4;
-FLUSH TABLES WITH READ LOCK;
 include/start_slave.inc
-include/wait_for_slave_param.inc [Seconds_Behind_Master]
-UNLOCK TABLES;
 connection master;
 CREATE TABLE t2 (a INT PRIMARY KEY) ENGINE=InnoDB;
 INSERT INTO t2 VALUES (1);
diff --git a/mysql-test/suite/rpl/t/rpl_old_master.test b/mysql-test/suite/rpl/t/rpl_old_master.test
index 6ddc227fc14..97721d5387a 100644
--- a/mysql-test/suite/rpl/t/rpl_old_master.test
+++ b/mysql-test/suite/rpl/t/rpl_old_master.test
@@ -28,14 +28,7 @@ SET GLOBAL slave_parallel_threads=10;
 
 --replace_result $SERVER_MYPORT_1 SERVER_MYPORT_1
 eval CHANGE MASTER TO master_host='127.0.0.1', master_port=$SERVER_MYPORT_1, master_user='root', master_log_file='master-bin.000001', master_log_pos=4;
-# Block execution yet when the blocked query timestamp has been already accounted
-FLUSH TABLES WITH READ LOCK;
 --source include/start_slave.inc
---let $slave_param = Seconds_Behind_Master
---let $slave_param_value = 1
---let $slave_param_comparison= >=
---source include/wait_for_slave_param.inc
-UNLOCK TABLES;
 
 --connection master
 CREATE TABLE t2 (a INT PRIMARY KEY) ENGINE=InnoDB;
-- 
2.39.2
Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
---
 mysql-test/suite/mariabackup/slave_provision_nolock.test | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/mysql-test/suite/mariabackup/slave_provision_nolock.test b/mysql-test/suite/mariabackup/slave_provision_nolock.test
index 618f313290c..0253a6c0c2d 100644
--- a/mysql-test/suite/mariabackup/slave_provision_nolock.test
+++ b/mysql-test/suite/mariabackup/slave_provision_nolock.test
@@ -1,5 +1,7 @@
 --source include/have_innodb.inc
 --source include/have_log_bin.inc
+# Test does a lot of queries that take a lot of CPU under Valgrind.
+--source include/not_valgrind.inc
 
 call mtr.add_suppression("Can't init tc log");
 call mtr.add_suppression("Aborting");
-- 
2.39.2
The test was expecting the I/O thread to be in a specific state, but thread
scheduling may cause it to not yet have reached that state. So just have a
loop that waits for the expected state to occur.

Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
---
 mysql-test/suite/rpl/t/rpl_start_stop_slave.test | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/mysql-test/suite/rpl/t/rpl_start_stop_slave.test b/mysql-test/suite/rpl/t/rpl_start_stop_slave.test
index 23b25b1bf85..ce7d51ca43d 100644
--- a/mysql-test/suite/rpl/t/rpl_start_stop_slave.test
+++ b/mysql-test/suite/rpl/t/rpl_start_stop_slave.test
@@ -19,7 +19,17 @@
 --source include/master-slave.inc
 
 connection slave;
---let $connection_id=`SELECT id FROM information_schema.processlist where state LIKE 'Waiting for master to send event'`
+--let $i= 100
+while ($i > 0) {
+  dec $i;
+  --let $connection_id=`SELECT id FROM information_schema.processlist where state LIKE 'Waiting for master to send event'`
+  if ($connection_id) {
+    let $i= 0;
+  }
+  if ($i > 0) {
+    --sleep 0.1
+  }
+}
 
 if(!$connection_id)
 {
-- 
2.39.2