Re: [Commits] cb06612a9da: MDEV-31655: Parallel replication deadlock victim preference code erroneously removed
Hi Marko,

I'd like to ask you to review the below patch for MDEV-31655.

https://jira.mariadb.org/browse/MDEV-31655
https://github.com/MariaDB/server/commit/cb06612a9da09a7981ada84768793f2ff3e...

This patch restores code in InnoDB that was originally part of the parallel
replication implementation. It makes InnoDB choose the right victim when two
transactions in parallel replication deadlock with each other. The code was
erroneously removed as part of the VATS work in InnoDB; I saw that you later
deprecated and eventually removed the VATS code.

The InnoDB deadlock code is substantially changed in 10.6, so I will need to
rework the patch for that version. But I wanted to ask your opinion about
this 10.4 version first.

In the patch, note in particular the modification to has_higher_priority(),
which I think has the effect of choosing which of two waiting transactions to
grant a lock to. I don't remember adding this myself, so it might have been
added as part of the VATS work. But it seems appropriate: if T1 and T2 are
both waiting for a lock, there is no point in granting it to T2 over T1, as
T2 would then just be immediately killed and rolled back by replication.

 - Kristian.

Kristian Nielsen via commits <commits@lists.mariadb.org> writes:
revision-id: cb06612a9da09a7981ada84768793f2ff3ef842c (mariadb-10.4.30-3-gcb06612a9da)
parent(s): dbe5c20b755b87f67d87990c3baabc866667e41b
author: Kristian Nielsen
committer: Kristian Nielsen
timestamp: 2023-07-11 00:31:29 +0200
message:
MDEV-31655: Parallel replication deadlock victim preference code errorneously removed
Restore code to make InnoDB choose the second transaction as a deadlock victim if two transactions deadlock that need to commit in-order for parallel replication. This code was erroneously removed when VATS was implemented in InnoDB.
Also add a test case for InnoDB choosing the right deadlock victim.
Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
---
 .../suite/binlog_encryption/rpl_parallel.result | 42 ++++++++++++-
 mysql-test/suite/rpl/r/rpl_parallel.result      | 42 ++++++++++++-
 mysql-test/suite/rpl/t/rpl_parallel.test        | 71 +++++++++++++++++++++-
 sql/rpl_parallel.cc                             |  7 ++-
 sql/sql_class.cc                                | 43 +++++++++++++
 storage/innobase/lock/lock0lock.cc              | 12 ++++
 storage/innobase/trx/trx0trx.cc                 | 12 ++++
 7 files changed, 225 insertions(+), 4 deletions(-)
diff --git a/mysql-test/suite/binlog_encryption/rpl_parallel.result b/mysql-test/suite/binlog_encryption/rpl_parallel.result
index b75a66a634a..b24ff7ba53d 100644
--- a/mysql-test/suite/binlog_encryption/rpl_parallel.result
+++ b/mysql-test/suite/binlog_encryption/rpl_parallel.result
@@ -2,6 +2,7 @@ include/master-slave.inc
 [connection master]
 connection server_2;
 SET @old_parallel_threads=@@GLOBAL.slave_parallel_threads;
+SET @old_parallel_mode=@@GLOBAL.slave_parallel_mode;
 SET GLOBAL slave_parallel_threads=10;
 ERROR HY000: This operation cannot be performed as you have a running slave ''; run STOP SLAVE '' first
 include/stop_slave.inc
@@ -1680,13 +1681,52 @@ a
 2000
 SELECT * FROM t2 WHERE a>=2000 ORDER BY a;
 a
+MDEV-31655: Parallel replication deadlock victim preference code erroneously removed
+connection server_1;
+CREATE TABLE t7 (a INT PRIMARY KEY, b INT) ENGINE=InnoDB;
+BEGIN;
+COMMIT;
+include/save_master_gtid.inc
+connection server_2;
+include/sync_with_master_gtid.inc
+include/stop_slave.inc
+set @@global.slave_parallel_threads= 5;
+set @@global.slave_parallel_mode= conservative;
+SET @old_dbug= @@GLOBAL.debug_dbug;
+SET GLOBAL debug_dbug= "+d,rpl_mdev31655_zero_retries";
+connection master;
+SET @old_dbug= @@SESSION.debug_dbug;
+SET SESSION debug_dbug="+d,binlog_force_commit_id";
+SET @commit_id= 1+1000;
+SET @commit_id= 2+1000;
+SET @commit_id= 3+1000;
+SET @commit_id= 4+1000;
+SET @commit_id= 5+1000;
+SET @commit_id= 6+1000;
+SET @commit_id= 7+1000;
+SET @commit_id= 8+1000;
+SET @commit_id= 9+1000;
+SET @commit_id= 10+1000;
+SET SESSION debug_dbug= @old_dbug;
+SELECT COUNT(*), SUM(a*100*b) FROM t7;
+COUNT(*)	SUM(a*100*b)
+10	225000
+include/save_master_gtid.inc
+connection server_2;
+include/start_slave.inc
+include/sync_with_master_gtid.inc
+SET GLOBAL debug_dbug= @old_dbug;
+SELECT COUNT(*), SUM(a*100*b) FROM t7;
+COUNT(*)	SUM(a*100*b)
+10	225000
 connection server_2;
 include/stop_slave.inc
 SET GLOBAL slave_parallel_threads=@old_parallel_threads;
+SET GLOBAL slave_parallel_mode=@old_parallel_mode;
 include/start_slave.inc
 SET DEBUG_SYNC= 'RESET';
 connection server_1;
 DROP function foo;
-DROP TABLE t1,t2,t3,t4,t5,t6;
+DROP TABLE t1,t2,t3,t4,t5,t6,t7;
 SET DEBUG_SYNC= 'RESET';
 include/rpl_end.inc
diff --git a/mysql-test/suite/rpl/r/rpl_parallel.result b/mysql-test/suite/rpl/r/rpl_parallel.result
index 9b2e68d366e..ef89d954faa 100644
--- a/mysql-test/suite/rpl/r/rpl_parallel.result
+++ b/mysql-test/suite/rpl/r/rpl_parallel.result
@@ -2,6 +2,7 @@ include/master-slave.inc
 [connection master]
 connection server_2;
 SET @old_parallel_threads=@@GLOBAL.slave_parallel_threads;
+SET @old_parallel_mode=@@GLOBAL.slave_parallel_mode;
 SET GLOBAL slave_parallel_threads=10;
 ERROR HY000: This operation cannot be performed as you have a running slave ''; run STOP SLAVE '' first
 include/stop_slave.inc
@@ -1679,13 +1680,52 @@ a
 2000
 SELECT * FROM t2 WHERE a>=2000 ORDER BY a;
 a
+MDEV-31655: Parallel replication deadlock victim preference code erroneously removed
+connection server_1;
+CREATE TABLE t7 (a INT PRIMARY KEY, b INT) ENGINE=InnoDB;
+BEGIN;
+COMMIT;
+include/save_master_gtid.inc
+connection server_2;
+include/sync_with_master_gtid.inc
+include/stop_slave.inc
+set @@global.slave_parallel_threads= 5;
+set @@global.slave_parallel_mode= conservative;
+SET @old_dbug= @@GLOBAL.debug_dbug;
+SET GLOBAL debug_dbug= "+d,rpl_mdev31655_zero_retries";
+connection master;
+SET @old_dbug= @@SESSION.debug_dbug;
+SET SESSION debug_dbug="+d,binlog_force_commit_id";
+SET @commit_id= 1+1000;
+SET @commit_id= 2+1000;
+SET @commit_id= 3+1000;
+SET @commit_id= 4+1000;
+SET @commit_id= 5+1000;
+SET @commit_id= 6+1000;
+SET @commit_id= 7+1000;
+SET @commit_id= 8+1000;
+SET @commit_id= 9+1000;
+SET @commit_id= 10+1000;
+SET SESSION debug_dbug= @old_dbug;
+SELECT COUNT(*), SUM(a*100*b) FROM t7;
+COUNT(*)	SUM(a*100*b)
+10	225000
+include/save_master_gtid.inc
+connection server_2;
+include/start_slave.inc
+include/sync_with_master_gtid.inc
+SET GLOBAL debug_dbug= @old_dbug;
+SELECT COUNT(*), SUM(a*100*b) FROM t7;
+COUNT(*)	SUM(a*100*b)
+10	225000
 connection server_2;
 include/stop_slave.inc
 SET GLOBAL slave_parallel_threads=@old_parallel_threads;
+SET GLOBAL slave_parallel_mode=@old_parallel_mode;
 include/start_slave.inc
 SET DEBUG_SYNC= 'RESET';
 connection server_1;
 DROP function foo;
-DROP TABLE t1,t2,t3,t4,t5,t6;
+DROP TABLE t1,t2,t3,t4,t5,t6,t7;
 SET DEBUG_SYNC= 'RESET';
 include/rpl_end.inc
diff --git a/mysql-test/suite/rpl/t/rpl_parallel.test b/mysql-test/suite/rpl/t/rpl_parallel.test
index 9ba7a30f2eb..d43cec4df34 100644
--- a/mysql-test/suite/rpl/t/rpl_parallel.test
+++ b/mysql-test/suite/rpl/t/rpl_parallel.test
@@ -13,6 +13,7 @@
 --connection server_2
 SET @old_parallel_threads=@@GLOBAL.slave_parallel_threads;
+SET @old_parallel_mode=@@GLOBAL.slave_parallel_mode;
 --error ER_SLAVE_MUST_STOP
 SET GLOBAL slave_parallel_threads=10;
 --source include/stop_slave.inc
@@ -2203,16 +2204,84 @@ SELECT * FROM t1 WHERE a>=2000 ORDER BY a;
 SELECT * FROM t2 WHERE a>=2000 ORDER BY a;
+--echo MDEV-31655: Parallel replication deadlock victim preference code erroneously removed
+# The problem was that InnoDB would choose the wrong deadlock victim.
+# Create a lot of transactions that can cause deadlocks, and use error
+# injection to check that the first transactions in each group is never
+# selected as deadlock victim.
+--let $rows= 10
+--let $transactions= 5
+--let $gcos= 10
+
+--connection server_1
+CREATE TABLE t7 (a INT PRIMARY KEY, b INT) ENGINE=InnoDB;
+BEGIN;
+--disable_query_log
+--let $i= 0
+while ($i < 10) {
+  eval INSERT INTO t7 VALUES ($i, 0);
+  inc $i;
+}
+--enable_query_log
+COMMIT;
+--source include/save_master_gtid.inc
+
+--connection server_2
+--source include/sync_with_master_gtid.inc
+--source include/stop_slave.inc
+eval set @@global.slave_parallel_threads= $transactions;
+set @@global.slave_parallel_mode= conservative;
+SET @old_dbug= @@GLOBAL.debug_dbug;
+# This error injection will allow no retries for GTIDs divisible by 1000.
+SET GLOBAL debug_dbug= "+d,rpl_mdev31655_zero_retries";
+
+--connection master
+SET @old_dbug= @@SESSION.debug_dbug;
+SET SESSION debug_dbug="+d,binlog_force_commit_id";
+
+--let $j= 1
+while ($j <= $gcos) {
+  eval SET @commit_id= $j+1000;
+  --let $i= 0
+  while ($i < $transactions) {
+    --disable_query_log
+    eval SET SESSION gtid_seq_no= 1000 + 1000*$j + $i;
+    BEGIN;
+    --let $k= 0
+    while ($k < $rows) {
+      eval UPDATE t7 SET b=b+1 WHERE a=(($i+$k) MOD $rows);
+      inc $k;
+    }
+    COMMIT;
+    --enable_query_log
+    inc $i;
+  }
+  inc $j;
+}
+
+SET SESSION debug_dbug= @old_dbug;
+SELECT COUNT(*), SUM(a*100*b) FROM t7;
+
+--source include/save_master_gtid.inc
+
+--connection server_2
+--source include/start_slave.inc
+--source include/sync_with_master_gtid.inc
+SET GLOBAL debug_dbug= @old_dbug;
+SELECT COUNT(*), SUM(a*100*b) FROM t7;
+
+
 # Clean up.
 --connection server_2
 --source include/stop_slave.inc
 SET GLOBAL slave_parallel_threads=@old_parallel_threads;
+SET GLOBAL slave_parallel_mode=@old_parallel_mode;
 --source include/start_slave.inc
 SET DEBUG_SYNC= 'RESET';
 --connection server_1
 DROP function foo;
-DROP TABLE t1,t2,t3,t4,t5,t6;
+DROP TABLE t1,t2,t3,t4,t5,t6,t7;
 SET DEBUG_SYNC= 'RESET';
 --source include/rpl_end.inc
diff --git a/sql/rpl_parallel.cc b/sql/rpl_parallel.cc
index b550315d69f..1aeb1257c4a 100644
--- a/sql/rpl_parallel.cc
+++ b/sql/rpl_parallel.cc
@@ -1385,8 +1385,13 @@ handle_rpl_parallel_thread(void *arg)
               err= dbug_simulate_tmp_error(rgi, thd););
       if (unlikely(err))
       {
+        ulong max_retries= slave_trans_retries;
         convert_kill_to_deadlock_error(rgi);
-        if (has_temporary_error(thd) && slave_trans_retries > 0)
+        DBUG_EXECUTE_IF("rpl_mdev31655_zero_retries",
+                        if ((rgi->current_gtid.seq_no % 1000) == 0)
+                          max_retries= 0;
+                        );
+        if (has_temporary_error(thd) && max_retries > 0)
           err= retry_event_group(rgi, rpt, qev);
       }
     }
diff --git a/sql/sql_class.cc b/sql/sql_class.cc
index 8ed3f8a9c5e..e6ed7ca1cc4 100644
--- a/sql/sql_class.cc
+++ b/sql/sql_class.cc
@@ -5247,6 +5247,49 @@ thd_need_ordering_with(const MYSQL_THD thd, const MYSQL_THD other_thd)
   return 0;
 }
+
+/*
+  If the storage engine detects a deadlock, and needs to choose a victim
+  transaction to roll back, it can call this function to ask the upper
+  server layer for which of two possible transactions is prefered to be
+  aborted and rolled back.
+
+  In parallel replication, if two transactions are running in parallel and
+  one is fixed to commit before the other, then the one that commits later
+  will be prefered as the victim - chosing the early transaction as a victim
+  will not resolve the deadlock anyway, as the later transaction still needs
+  to wait for the earlier to commit.
+
+  The return value is -1 if the first transaction is prefered as a deadlock
+  victim, 1 if the second transaction is prefered, or 0 for no preference (in
+  which case the storage engine can make the choice as it prefers).
+*/
+extern "C" int
+thd_deadlock_victim_preference(const MYSQL_THD thd1, const MYSQL_THD thd2)
+{
+  rpl_group_info *rgi1, *rgi2;
+
+  if (!thd1 || !thd2)
+    return 0;
+
+  /*
+    If the transactions are participating in the same replication domain in
+    parallel replication, then request to select the one that will commit
+    later (in the fixed commit order from the master) as the deadlock victim.
+  */
+  rgi1= thd1->rgi_slave;
+  rgi2= thd2->rgi_slave;
+  if (rgi1 && rgi2 &&
+      rgi1->is_parallel_exec &&
+      rgi1->rli == rgi2->rli &&
+      rgi1->current_gtid.domain_id == rgi2->current_gtid.domain_id)
+    return rgi1->gtid_sub_id < rgi2->gtid_sub_id ? 1 : -1;
+
+  /* No preferences, let the storage engine decide. */
+  return 0;
+}
+
+
 extern "C" int thd_non_transactional_update(const MYSQL_THD thd)
 {
   return(thd->transaction.all.modified_non_trans_table);
diff --git a/storage/innobase/lock/lock0lock.cc b/storage/innobase/lock/lock0lock.cc
index b1cf2152cd6..469d03eaa06 100644
--- a/storage/innobase/lock/lock0lock.cc
+++ b/storage/innobase/lock/lock0lock.cc
@@ -49,6 +49,8 @@ Created 5/7/1996 Heikki Tuuri
 #include <mysql/service_wsrep.h>
 #endif /* WITH_WSREP */
+extern "C" int thd_deadlock_victim_preference(const MYSQL_THD thd1, const MYSQL_THD thd2);
+
 /** Lock scheduling algorithm */
 ulong innodb_lock_schedule_algorithm;
@@ -1538,6 +1540,16 @@ static bool has_higher_priority(lock_t *lock1, lock_t *lock2)
 	} else if (!lock_get_wait(lock2)) {
 		return false;
 	}
+	// Ask the upper server layer if any of the two trx should be prefered.
+	int preference = thd_deadlock_victim_preference(lock1->trx->mysql_thd,
+							lock2->trx->mysql_thd);
+	if (preference == -1) {
+		// lock1 is preferred as a victim, so lock2 has higher priority
+		return false;
+	} else if (preference == 1) {
+		// lock2 is preferred as a victim, so lock1 has higher priority
+		return true;
+	}
 	return lock1->trx->start_time_micro <= lock2->trx->start_time_micro;
 }
diff --git a/storage/innobase/trx/trx0trx.cc b/storage/innobase/trx/trx0trx.cc
index 7cd95878b0c..0771d764fb6 100644
--- a/storage/innobase/trx/trx0trx.cc
+++ b/storage/innobase/trx/trx0trx.cc
@@ -52,6 +52,9 @@ Created 3/26/1996 Heikki Tuuri
 #include <set>
 #include <new>
+extern "C"
+int thd_deadlock_victim_preference(const MYSQL_THD thd1, const MYSQL_THD thd2);
+
 /** The bit pattern corresponding to TRX_ID_MAX */
 const byte trx_id_max_bytes[8] = {
 	0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff
@@ -1906,6 +1909,15 @@ trx_weight_ge(
 {
 	ibool	a_notrans_edit;
 	ibool	b_notrans_edit;
+	int	pref;
+
+	/* First ask the upper server layer if it has any preference for which
+	to prefer as a deadlock victim. */
+	pref= thd_deadlock_victim_preference(a->mysql_thd, b->mysql_thd);
+	if (pref < 0)
+		return FALSE;
+	else if (pref > 0)
+		return TRUE;

 	/* If mysql_thd is NULL for a transaction we assume that it has not
 	edited non-transactional tables. */

_______________________________________________
commits mailing list -- commits@lists.mariadb.org
To unsubscribe send an email to commits-leave@lists.mariadb.org
Hi Kristian,

On Tue, Jul 11, 2023 at 1:35 AM Kristian Nielsen <knielsen@knielsen-hq.org> wrote:
Hi Marko,
I'd like to ask you to review the below patch for MDEV-31655.
https://jira.mariadb.org/browse/MDEV-31655 https://github.com/MariaDB/server/commit/cb06612a9da09a7981ada84768793f2ff3e...
I am sorry for the delay; last month I was mostly on vacation. I sent some minor comments that apply to the InnoDB code changes. I think that this needs to be reviewed by the replication team as well. Whoever reviews that should pay attention to the stability of the tests. We have a few replication tests that fail rather often.
This patch restores code in InnoDB that was originally part of the parallel replication implementation. It makes InnoDB choose the right victim when two transactions in parallel replication deadlock with each other. The code was erroneously removed as part of the VATS work in InnoDB. I saw that you later deprecated and eventually removed the VATS code.
As far as I can tell, innodb_lock_schedule_algorithm=VATS consistently reproduced debug assertion failures during stress testing. It was finally removed from 10.6. We might introduce something similar later, provided that it passes tests. In MySQL 8.0.20, https://dev.mysql.com/worklog/task/?id=13468 reimplemented this as CATS (Contention-Aware Transaction Scheduling).
I saw that the InnoDB deadlock code is substantially changed in 10.6, so I will need to rework the patch in that version. But I wanted to ask your opinion about this 10.4 version first.
I must trust your judgement here; I have no fundamental objection. For better testing of this, I think that we really need a 10.6 based branch that contains both this and MDEV-31482.
In the patch, note in particular the modification to has_higher_priority(), which I think has the effect of choosing which of two waiting transactions to grant a lock to. I don't remember adding this myself, so it might have been added as part of VATS work. But it sounds appropriate; if T1 and T2 are both waiting for a lock, there is no point in granting it to T2 over T1, as then T2 will just be immediately killed and rolled back by replication.
Because innodb_lock_schedule_algorithm=VATS is known to be broken even without any involvement of replication, I do not think that we have to care too much about it.

Marko

-- 
Marko Mäkelä, Lead Developer InnoDB
MariaDB plc
Marko Mäkelä via developers <developers@lists.mariadb.org> writes:
For better testing of this, I think that we really need a 10.6 based branch that contains both this and MDEV-31482.
I did an initial merge to 10.6. The victim selection code is in
Deadlock::report() in lock0lock.cc. I noticed one thing in the current 10.6
code:

    if (next == victim)
      trx_pos= l;
    ...
    if (trx_pos && trx_weight == victim_weight)
    {
      victim= trx;
      victim_pos= trx_pos;

IIUC, the intention with this code is that if the current transaction and
another transaction are both candidates for being the deadlock victim, we
select the current transaction as the preferred candidate?

If so, it seems there's a small mistake in the code. The current condition
will always be true for each new value of victim, so trx_pos will always be
set != 0. The first condition should probably have been "if (trx ==
victim)"?

In theory I think this means we could wrongly select the current trx as the
deadlock victim in the case of a "P-shaped" path, and thus the deadlock
would not be broken. I thought about coming up with a test case to show
this, but it looks like it's not easy to get the same TRX_WEIGHT for trx
and victim. If so, maybe this part of the code can just be removed, or what
do you think?

Another important issue I noticed, though not directly related to
MDEV-31655, is this:

    /* Even though lock_wait_rpl_report() has nothing to do with
    deadlock detection, it was always disabled by innodb_deadlock_detect=OFF.
    We will keep it in that way, because unfortunately
    thd_need_wait_reports() will hold even if parallel (or any) replication
    is not being used. We want to be allow the user to skip
    lock_wait_rpl_report(). */
    const bool rpl= !(type_mode & LOCK_AUTO_INC) && trx->mysql_thd &&
      innodb_deadlock_detect && thd_need_wait_reports(trx->mysql_thd);

So this is a bug I introduced in my original implementation. The reporting
of lock waits is essential for optimistic parallel replication to work at
all. Without it, the following will happen:

 - Transactions T1 and T2 replicate in parallel, run out-of-order.
 - T2 updates some row, goes to commit and wait_for_prior_commit(T1).
 - T1 goes to update the same row, gets a lock wait on T2.
 - T1 eventually gets a lock wait timeout and is retried.
 - The retry of T1 will get the same problem, and replication eventually
   breaks when the --slave-transactions-retries value is reached.

I think for parallel replication, we have to do the lock wait reporting
always to avoid this. Since we are anyway going to do a lock wait, an
expensive operation, I think it is reasonable to accept the small overhead
of a lock wait report to the server layer.

However, the reason thd_need_wait_reports() also returns true for
non-replication threads (when the binlog is enabled) is to write a hint in
the GTID event that a transaction had a wait on the master. This is
non-critical and could be avoided if we want. Maybe something we should
look into/discuss.
I sent some minor comments that apply to the InnoDB code changes.
Thanks! Fixed, except for this one:
+ extern "C" int thd_deadlock_victim_preference(const MYSQL_THD thd1, const MYSQL_THD thd2);
Usually, server function prototypes for InnoDB are declared in storage/innobase/include/ha_prototypes.h. Please move this there
The other similar functions (thd_rpl_deadlock_check()) are declared in a section in lock0lock.cc, so I moved this declaration there as well for consistency (but let me know if you still prefer ha_prototypes.h for this, or maybe for all of them).
I think that this needs to be reviewed by the replication team as well.
Yes, Andrei said he would review also.
Whoever reviews that should pay attention to the stability of the tests. We have a few replication tests that fail rather often.
Tell me about it ... replication tests, and in general any test where many parallel threads of operation are involved, are notoriously hard to get stable. I'll make sure to check for any failures in Buildbot. I 100% agree with the need to be careful with this. In fact, this problem was found because I investigated a sporadic failure in the test case rpl_mark_optimize_tbl_ddl, which was caused by the erroneous removal of this code.
In the patch, note in particular the modification to has_higher_priority(), which I think has the effect of choosing which of two waiting transactions to grant a lock to. I don't remember adding this myself, so it might have
Because innodb_lock_schedule_algorithm=VATS is known to be broken even without any involvement of replication, I do not think that we have to care too much about it.
Ok.
I must trust your judgement here; I have no fundamental objection.
Thanks, Marko.

I pushed an initial merge to 10.6 of this, but I'll refine it once I fully
understand the code around trx_pos:

https://github.com/MariaDB/server/tree/knielsen_bugfix_10.6
https://github.com/MariaDB/server/commit/b475257dc2297a8a7271367ad4a8d659297...

The code with review comments fixed for 10.4 is here:

https://github.com/MariaDB/server/tree/knielsen_bugfix
https://github.com/MariaDB/server/commit/24679f3daf57c8c62faf019f732ef263472...

 - Kristian.
Hi Kristian,

On Sun, Aug 6, 2023 at 6:59 PM Kristian Nielsen <knielsen@knielsen-hq.org> wrote:
I noticed one thing in the current 10.6 code:
    if (next == victim)
      trx_pos= l;
    ...
    if (trx_pos && trx_weight == victim_weight)
    {
      victim= trx;
      victim_pos= trx_pos;
IIUC, the intention with this code is that if the current transaction and another transaction are both candidates for being the deadlock victim, select the current transaction as the preferred candidate?
If so, seems there's a small mistake in the code. The current condition will always be true for each new value of victim, so trx_pos will always be set != 0. The first condition should probably have been "if (trx == victim)" ?
As far as I remember, the original reason for the weight calculation was that deadlock resolution would roll back the transaction that has done the least amount of work, expressed as the "weight" of the transaction. The weight is computed based on whether thd_has_edited_nontrans_tables() holds (the transaction has modified any non-transactional tables, such as MyISAM tables), on the number of undo log records written so far (trx_t::undo_no), and on the number of InnoDB locks held by the transaction.

I rewrote the deadlock detector in https://jira.mariadb.org/browse/MDEV-24738 (MariaDB Server 10.6). The old one seemed to assume that all deadlocks are simple cycles (P-shaped with an empty "foot", that is, O-shaped), and the reporting assumed that deadlocks involve only 2 transactions. The variable "victim_pos" identifies the "victim" transaction in the deadlock report.

If we have a P-shaped path from the current transaction to a deadlock of transactions, then Deadlock::find_cycle() should have returned the O-shaped portion of it, which the current transaction would not be part of. According to a comment in find_cycle(), this probably means that the cycle had been formed while innodb_deadlock_detect=OFF had been in effect in the recent past (before innodb_lock_wait_timeout has broken the cycle) and the current transaction is waiting for locks held by one participant of the cycle.
In theory I think this means we could wrongly select the current trx as the deadlock victim in the case of a "P-shaped" path, and thus the deadlock would not be broken. I thought about coming up with a test case to show this, but it looks like it's not easy to get the same TRX_WEIGHT for trx and victim. If so, maybe this part of the code can just be removed, or what do you think?
Can you check it once again?
Another important issue I noticed, though not directly related to MDEV-31655, is this:
    /* Even though lock_wait_rpl_report() has nothing to do with
    deadlock detection, it was always disabled by innodb_deadlock_detect=OFF.
    We will keep it in that way, because unfortunately
    thd_need_wait_reports() will hold even if parallel (or any) replication
    is not being used. We want to be allow the user to skip
    lock_wait_rpl_report(). */
    const bool rpl= !(type_mode & LOCK_AUTO_INC) && trx->mysql_thd &&
      innodb_deadlock_detect && thd_need_wait_reports(trx->mysql_thd);
So this is a bug I introduced in my original implementation. The reporting of lock waits is essential for optimistic parallel replication to work at all. Without it, the following will happen:
 - Transactions T1 and T2 replicate in parallel, run out-of-order.
 - T2 updates some row, goes to commit and wait_for_prior_commit(T1).
 - T1 goes to update the same row, gets a lock wait on T2.
 - T1 eventually gets a lock wait timeout and is retried.
 - The retry of T1 will get the same problem, and replication eventually
   breaks when the --slave-transactions-retries value is reached.
I think for parallel replication, we have to do the lock wait reporting always to avoid this. Since we are anyway going to do a lock wait, an expensive operation, I think it is reasonable to accept the small overhead of a lock wait report to the server layer.
However, the reason thd_need_wait_reports() also returns true for non-replication threads (when the binlog is enabled) is to write a hint in the GTID event that a transaction had a wait on the master. This is non-critical and could be avoided if we want. Maybe something we should look into/discuss.
Thank you for the explanation. Can you please also post it to
https://jira.mariadb.org/browse/MDEV-24948 for future reference?

It would also be useful if you could check the following changes in MySQL
8.0 that are linked from MDEV-24948:

https://github.com/mysql/mysql-server/commit/30ead1f6966128cbcd32c7b6029ea21...
https://github.com/mysql/mysql-server/commit/3859219875b62154b921e8c6078c751...
The other similar functions (thd_rpl_deadlock_check()) are declared in a section in lock0lock.cc, so I moved this declaration there as well for consistency (but let me know if I you still prefer ha_prototypes.h for this or maybe all of them).
That makes sense too. The main thing for me would be that the function declaration is not duplicated in multiple *.cc files.
I 100% agree with the need to be careful with this. In fact, this problem was found because I investigated a sporadic failure in the test case rpl_mark_optimize_tbl_ddl, which was caused by the erroneous removal of this code.
I really appreciate your recent efforts of investigating the root causes of sporadic test failures. In InnoDB, apart from locking, a prominent source of sporadic test failures has been statistics: https://jira.mariadb.org/browse/MDEV-10682 https://jira.mariadb.org/browse/MDEV-21136
In the patch, note in particular the modification to has_higher_priority(), which I think has the effect of choosing which of two waiting transactions to grant a lock to. I don't remember adding this myself, so it might have
Because innodb_lock_schedule_algorithm=VATS is known to be broken even without any involvement of replication, I do not think that we have to care too much about it.
Ok.
I must trust your judgement here; I have no fundamental objection.
Thanks, Marko.
I pushed an initial merge to 10.6 of this, but I'll refine it once I fully understand the code around trx_pos:
https://github.com/MariaDB/server/tree/knielsen_bugfix_10.6 https://github.com/MariaDB/server/commit/b475257dc2297a8a7271367ad4a8d659297...
I think that thd_deadlock_victim_preference() needs to be invoked on an array of THD* and return a pointer to the victim THD, or nullptr if there is no preference. In your current 10.6 patch, we may have victim==nullptr for a large portion of the loop, and therefore we might choose something else than the actual preferred victim.

Do you have an idea how to stress test this at a larger scale than is feasible within the regression test suite?

Marko

-- 
Marko Mäkelä, Lead Developer InnoDB
MariaDB plc
Marko Mäkelä <marko.makela@mariadb.com> writes:
On Sun, Aug 6, 2023 at 6:59 PM Kristian Nielsen <knielsen@knielsen-hq.org> wrote:
If so, seems there's a small mistake in the code. The current condition will always be true for each new value of victim, so trx_pos will always be set != 0. The first condition should probably have been "if (trx == victim)" ?
As far as I remember, the original reason for the weight calculation was that the deadlock resolution would roll back the transaction that has done the least amount of work, expressed as the "weight" of the transaction. The weight is computed based on whether thd_has_edited_nontrans_tables() holds (the transaction has modified any non-transactional tables, such as MyISAM tables), on the number of undo log records written so far (trx_t::undo_no), and the number of InnoDB locks held by the transaction.
Yes. Consider this P-shaped path, with trx=T1 and cycle=T3:

    T1 -> T2 -> T3 -> T2

    T1  undo_no=100  lock.trx_locks length= 4  TRX_WEIGHT(T1)=104
    T2  undo_no=101  lock.trx_locks length=10  TRX_WEIGHT(T2)=115
    T3  undo_no=102  lock.trx_locks length= 2  TRX_WEIGHT(T3)=104

In the first loop iteration:

 - next=T2
 - 115 < ~0ULL, so we assign victim=T2 and victim_weight=115
 - next==victim, so we assign trx_pos=1 (this is wrong, I think).

In the second iteration:

 - next=T3
 - 104 < 115, so we assign victim=T3 and victim_weight=104
 - next==victim, so we assign trx_pos=2 (wrong)
 - next==cycle, so we end the loop.

At the end of the loop, trx_pos != 0 and trx_weight==victim_weight (==104),
so we assign victim=T1. The result is that we select T1 as the deadlock
victim, even though this does not break the deadlock.

I think it's just a small mistake in this code:

    if (next == victim)
      trx_pos= l;

which should have been this:

    if (next == trx)
      trx_pos= l;

My question is, if instead of fixing it we should just remove it along with
this code?

    if (trx_pos && trx_weight == victim_weight)
    {
      victim= trx;
      victim_pos= trx_pos;

I'm thinking it will be rare to have the same value of
(undo_no + LENGTH(lock.trx_locks)) for two transactions, which is the only
way this code can trigger. But I'm not familiar with the original reason
for this piece of the code, that's why I asked.
would not be broken. I thought about coming up with a test case to show this, but it looks like it's not easy to get the same TRX_WEIGHT for trx and victim. If so, maybe this part of the code can just be removed, or what do you think?
Can you check it once again?
My problem is, I'm not sufficiently familiar with InnoDB internals to know how to arrange the values of lock.trx_locks and undo_no in particular to make the condition (trx_weight == victim_weight) be true in a situation where we have a P-shaped path. It would be fun to try, I might try it, but it will take some time.
In your current 10.6 patch, we may have victim==nullptr for a large portion of the loop, and therefore, we might choose something else than the actual preferred victim.
No, victim_weight is initialized to ~0ULL, so the condition (next_weight < victim_weight) will be true and victim will be initialized =next (and =cycle) on the first iteration. Thus victim!=nullptr on every iteration except the first.
I think that thd_deadlock_victim_preference() needs to be invoked on an array of THD* and return a pointer to the victim THD, or nullptr if there is no preference.
I do not think that is necessary. The fundamental issue is in-order parallel replication, where we have a series of transactions T1, T2, T3, ... that we know a priori _must_ commit in this order. In case of a deadlock between them, we must avoid T1 being chosen as a deadlock victim and immediately causing the same deadlock again, repeatedly. This is the bug seen here. But as long as we prevent T1 being chosen as the victim, replication will be able to proceed. Choosing T3 is better, but T2 also works. Thus, we do not need InnoDB to choose the _best_ deadlock victim. We just need to avoid it choosing the _worst_ victim, and the thd_deadlock_victim_preference() facility should ensure that.

That said, the current thd_deadlock_victim_preference() mechanism is not ideal. The dependencies involved are split between InnoDB and the replication layer, and neither can solve it fully on its own. Similar problems must exist e.g. for deadlocks involving cross-engine transactions. Maybe a full solution requires server-wide deadlock detection.

(For in-order parallel replication itself, the deadlock detection problem is much simpler, because we _know_ that any T_(i+1) will eventually have to wait for T_(i). So a deadlock will occur if-and-only-if there is some T_j that waits for T_k, with j < k. But this is not true when considering transactions other than those in the in-order sequence within one replication domain_id.)

The thd_deadlock_victim_preference() is really just a comparison of thd->rgi_slave->gtid_sub_id. Maybe it would be simpler and more efficient to expose that to InnoDB, and simply use the gtid_sub_id as the TRX_WEIGHT, if present (with a few bit operations to make the comparison work out, similar to the (1ULL << 63) in the current code)? It exposes internal replication detail to InnoDB, but sometimes that is necessary to get the simplest and most efficient code...
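As a thought experiment only (not the actual patch), folding gtid_sub_id into the weight could look something like the following. It assumes the detector victimizes the transaction with the smallest weight; `encode_weight` and its bit layout are entirely hypothetical:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical: fold the replication commit order into the weight so a
// plain unsigned comparison picks the right victim.  sub_id == 0 means
// "not an ordered slave transaction".
static uint64_t encode_weight(uint64_t trx_weight, uint64_t gtid_sub_id)
{
  if (gtid_sub_id == 0)
    return (1ULL << 63) | trx_weight;   // heavier than any ordered trx
  // Invert the sub_id so that a transaction later in commit order
  // compares as lighter, i.e. is preferred as the victim.
  return ~gtid_sub_id & ~(1ULL << 63);
}
```

With this encoding, any non-replication transaction compares heavier than any ordered slave transaction, and among ordered ones the later sub_id always loses, which is exactly the preference replication needs.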
I think that for parallel replication we always have to do the lock wait reporting to avoid this. Since we are going to do a lock wait anyway, which is an expensive operation, I think it is reasonable to accept the small overhead of reporting the lock wait to the server layer.
Thank you for the explanation. Can you please also post it to https://jira.mariadb.org/browse/MDEV-24948 for future reference?
Ok, will do.
It would also be useful if you could check the following changes in MySQL 8.0 that are linked from MDEV-24948: https://github.com/mysql/mysql-server/commit/30ead1f6966128cbcd32c7b6029ea21... https://github.com/mysql/mysql-server/commit/3859219875b62154b921e8c6078c751...
Ok, I will check it.
I pushed an initial merge to 10.6 of this, but I'll refine it once I fully understand the code around trx_pos.
Do you have an idea how to stress test this at a larger scale than is feasible within the regression test suite?
We can set --slave-transaction-retries=1 and run optimistic parallel replication on a workload with many row lock conflicts. If a transaction gets a temporary error (like a deadlock), it will roll back and wait for all prior transactions to commit before retrying. Thus, on retry, all other transactions will be preferred as deadlock victims. If more than one retry is required, it indicates a bug in the deadlock victim selection. The test rpl_parallel_deadlock_victim.test is kind of a stress test, but it uses error injection to check that the wrong victim is not selected. - Kristian.
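A minimal replica configuration for such a stress run might look like this (these are the standard MariaDB replica option names; the thread count is illustrative):

```ini
# Replica-side settings for the stress scenario described above
[mysqld]
slave-parallel-mode       = optimistic
slave-parallel-threads    = 8
# With only one retry allowed, any second retry of the same
# transaction indicates a victim-selection bug.
slave-transaction-retries = 1
```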
Kristian Nielsen via developers <developers@lists.mariadb.org> writes:
(For in-order parallel replication itself, the deadlock detection problem is much simpler, because we _know_ that any T_(i+1) will eventually have to wait for T_(i). So a deadlock will occur if-and-only-if there is some T_j that waits for T_k, with j < k. But this is not true when considering transactions other than those in the in-order sequence within one replication domain_id).
It's good to consider these issues again. It's a pity we didn't have this discussion in 10.0/10.1 when I implemented parallel replication, or later in 10.6 when you improved the InnoDB locking. Maybe we could do things much more simply?

If we expose the gtid_sub_id directly, the InnoDB deadlock detector can just check whether the sub_id increases while traversing the wait-for path. If it does, it can stop there, declare a deadlock, and choose the transaction with the higher sub_id as the victim. We will have to roll back that transaction anyway at some point to preserve commit order, and doing so now will break any cycle if one is there. Then we get rid of trying to hint InnoDB which victim to prefer with thd_deadlock_victim_preference(). And we don't need the thd_rpl_deadlock_check() for every wait, which currently has to queue a background task to later kill the waited-for transaction.

This needs some more thought to be sure all the details are right. I think I should still fix just the regression with the missing thd_deadlock_victim_preference() for now, but it is something to think about to make the code both simpler and more efficient.
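The "stop at the first sub_id increase" idea can be sketched as a toy model (not InnoDB code): represent each wait along the traversed path as a pair of gtid_sub_id values, and declare the later transaction the victim as soon as an earlier one is seen waiting on it:

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Each element is (waiter_sub_id, waited_for_sub_id) along the
// wait-for path being traversed.  Returns the sub_id of the
// transaction to roll back, or 0 if commit order is not violated.
static uint64_t first_order_violation(
    const std::vector<std::pair<uint64_t, uint64_t>> &waits)
{
  for (const auto &w : waits)
    if (w.second > w.first)   // some T_j waits for a later T_k (j < k)
      return w.second;        // rolling back T_k breaks any cycle
  return 0;
}
```

Since T_(i+1) must eventually wait for T_(i) anyway, rolling back the first later-in-order transaction found waiting "backwards" both breaks any cycle and costs nothing extra: that transaction would have to be rolled back to preserve commit order regardless.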
Thank you for the explanation. Can you please also post it to https://jira.mariadb.org/browse/MDEV-24948 for future reference?
I added some more explanation there, and ideas for how to mostly eliminate the overhead.
It would also be useful if you could check the following changes in MySQL 8.0 that are linked from MDEV-24948:
https://github.com/mysql/mysql-server/commit/30ead1f6966128cbcd32c7b6029ea21...
Interesting. This seems similar to https://jira.mariadb.org/browse/MDEV-31840, which I discovered recently. The MySQL code uses a function thd_report_lock_wait(), which looks very similar to the MariaDB thd_rpl_deadlock_check().
https://github.com/mysql/mysql-server/commit/3859219875b62154b921e8c6078c751...
This is a large patch, I did not yet manage to read through it all. But from reading some of the comments, it sounds like they move the deadlock detection to be done asynchronously as a background task, rather than synchronously whenever a wait is needed. This is quite interesting, and something I've earlier idly speculated could reduce the cost of deadlock detection. - Kristian.
Kristian Nielsen via developers <developers@lists.mariadb.org> writes:
I'm thinking it will be rare to have the same value of (undo_no + LENGTH(lock.trx_locks)) for two transactions, which is the only way this code can trigger. But I'm not familiar with the original reason for this piece of the code, that's why I asked.
Ok, now I understand that the TRX_WEIGHT is approximately the number of rows modified plus the number of locks held, so it will not be _that_ rare for two weights to be equal. So I have kept the functionality that, all else being equal, the current trx is preferred over another transaction in the cycle. I assume this is because it's cheaper to roll back the current transaction than to mark another transaction as the victim, go to sleep, and then wake back up once the other transaction(s) have released the lock waited for?

I refined my patch with bitwise operations to keep using a single comparison between weights in the cycle loop. I use the top bit for the thd_deadlock_victim_preference() and the lowest bit to prefer the current transaction over another with the same weight. The result is here. If you agree and Andrei also reviews ok, I will use this as the 10.6 version of the 10.4 patch (which won't merge to 10.6). This should fix several sporadic Buildbot failures as well.

https://github.com/MariaDB/server/tree/knielsen_bugfix_10.6
https://github.com/MariaDB/server/commit/5216296b2f85290ee5ed760c62e18b075e3...

- Kristian.
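For reference, the packing described can be sketched like this. The bit positions and the assumption that the cycle loop victimizes the smallest weight are mine, not necessarily what the actual 10.6 commit does:

```cpp
#include <cassert>
#include <cstdint>

// Assumed sketch: the deadlock loop victimizes the smallest weight.
//  - bit 63 set   => thd_deadlock_victim_preference() says this trx
//                    must not be the victim (it must commit earlier)
//  - bit 0 clear  => this is the current trx, so on an otherwise equal
//                    weight it compares smaller and gets rolled back,
//                    which is the cheaper choice
static uint64_t pack_weight(uint64_t trx_weight, bool rpl_must_survive,
                            bool is_current_trx)
{
  uint64_t w= (trx_weight << 1) | (is_current_trx ? 0 : 1);
  if (rpl_must_survive)
    w|= 1ULL << 63;
  return w;
}
```

This keeps the loop body a single unsigned comparison: the replication preference dominates via the top bit, and ties on the real TRX_WEIGHT fall to the current transaction via the low bit.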
Hi Kristian, Sorry for the late reply. On Wed, Aug 9, 2023 at 4:25 PM Kristian Nielsen <knielsen@knielsen-hq.org> wrote:
I assume this is because it's cheaper to roll back the current transaction, than to mark another transaction as the victim, go to sleep, and then wake back up once the other transaction(s) has released the lock waited for?
Yes, definitely, rolling back the current transaction is easier and should be preferred. Actually, some time ago we tried to simplify the asynchronous rollback in https://jira.mariadb.org/browse/MDEV-29860 but in the end, decided against it.
I refined my patch with bitwise operations to keep using a single comparison between weights in the cycle loop. I use the top bit for the thd_deadlock_victim_preference() and the lowest bit to prefer the current transaction over another with same weight.
This makes sense, thank you.

For the non-HAVE_REPLICATION code path, I might have defined the dummy variables victim_not_pref, next_not_pref as constexpr. But the oldest compiler that I found on godbolt.org (GCC 4.1.2) is able to optimize away the bitwise OR with a non-const variable that is 0, starting with -O1. The oldest compiler that we aim to support is GCC 4.8.5.

Marko

--
Marko Mäkelä, Lead Developer InnoDB
MariaDB plc
participants (2)
- Kristian Nielsen
- Marko Mäkelä