Hi Kristian,

On Sun, Aug 6, 2023 at 6:59 PM Kristian Nielsen <knielsen@knielsen-hq.org> wrote:
I noticed one thing in the current 10.6 code:
  if (next == victim)
    trx_pos= l;
...
  if (trx_pos && trx_weight == victim_weight)
  {
    victim= trx;
    victim_pos= trx_pos;
IIUC, the intention with this code is that if the current transaction and another transaction are both candidates for being the deadlock victim, select the current transaction as the preferred candidate?
If so, it seems there's a small mistake in the code. The current condition will always be true for each new value of victim, so trx_pos will always be set != 0. The first condition should probably have been "if (trx == victim)"?
As far as I remember, the original reason for the weight calculation was that the deadlock resolution would roll back the transaction that has done the least amount of work, expressed as the "weight" of the transaction. The weight is computed based on whether thd_has_edited_nontrans_tables() holds (the transaction has modified any non-transactional tables, such as MyISAM tables), on the number of undo log records written so far (trx_t::undo_no), and on the number of InnoDB locks held by the transaction.

I rewrote the deadlock detector in https://jira.mariadb.org/browse/MDEV-24738 (MariaDB Server 10.6). The old one seemed to assume that all deadlocks are simple cycles (P-shaped with an empty "foot", that is, O-shaped), and the reporting assumed that deadlocks involve only 2 transactions. The variable "victim_pos" identifies the "victim" transaction in the deadlock report.

If we have a P-shaped path from the current transaction to a deadlock of transactions, then Deadlock::find_cycle() should have returned the O-shaped portion of it, which the current transaction would not be part of. According to a comment in find_cycle(), this probably means that the cycle had been formed while innodb_deadlock_detect=OFF was in effect in the recent past (before innodb_lock_wait_timeout has had a chance to break the cycle) and the current transaction is waiting for locks held by one participant of the cycle.
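For reference, the weight idea can be modelled in isolation roughly like this (a standalone sketch with invented names, not the actual lock0lock.cc code; undo_count, lock_count and modified_nontrans stand in for trx_t::undo_no, the length of the transaction's lock list and thd_has_edited_nontrans_tables(), and folding the non-transactional flag into the most significant bit is just one way of making such transactions the least preferred victims):

    #include <cstdint>

    // Standalone model of the victim-weight computation described above;
    // all names are invented for illustration.
    struct TrxModel
    {
      uint64_t undo_count;      // stands in for trx_t::undo_no
      uint64_t lock_count;      // stands in for the number of locks held
      bool modified_nontrans;   // stands in for thd_has_edited_nontrans_tables()
    };

    // Lower weight = less work done = preferred deadlock victim.
    // A transaction that modified non-transactional tables becomes the
    // "heaviest", so it is rolled back only as a last resort.
    static uint64_t trx_weight(const TrxModel &t)
    {
      uint64_t w= t.undo_count + t.lock_count;
      if (t.modified_nontrans)
        w|= uint64_t{1} << 63;
      return w;
    }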
In theory I think this means we could wrongly select the current trx as the deadlock victim in the case of a "P-shaped" path, and thus the deadlock would not be broken. I thought about coming up with a test case to show this, but it looks like it's not easy to get the same TRX_WEIGHT for trx and victim. If so, maybe this part of the code can just be removed, or what do you think?
Can you check it once again?
Another important issue I noticed, though not directly related to MDEV-31655, is this:
  /* Even though lock_wait_rpl_report() has nothing to do with deadlock
  detection, it was always disabled by innodb_deadlock_detect=OFF.
  We will keep it in that way, because unfortunately
  thd_need_wait_reports() will hold even if parallel (or any) replication
  is not being used. We want to allow the user to skip
  lock_wait_rpl_report(). */
  const bool rpl= !(type_mode & LOCK_AUTO_INC) && trx->mysql_thd &&
    innodb_deadlock_detect &&
    thd_need_wait_reports(trx->mysql_thd);
So this is a bug I introduced in my original implementation. The reporting of lock waits is essential for optimistic parallel replication to work at all. Without it, the following will happen:
- Transactions T1 and T2 replicate in parallel, run out-of-order.
- T2 updates some row, goes to commit and wait_for_prior_commit(T1).
- T1 goes to update the same row, gets a lock wait on T2.
- T1 eventually gets a lock wait timeout and is retried.
- The retry of T1 will run into the same problem, and replication eventually breaks when the --slave-transaction-retries value is reached.
I think that for parallel replication, we always have to do the lock wait reporting to avoid this. Since we are anyway going to do a lock wait, which is an expensive operation, I think it is reasonable to accept the small overhead of a lock wait report to the server layer.
However, the reason thd_need_wait_reports() also returns true for non-replication threads (when the binlog is enabled) is to write a hint in the GTID event that a transaction had a wait on the master. This is non-critical and could be avoided if we want. Maybe something we should look into/discuss.
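If the reporting is decoupled from innodb_deadlock_detect only for replication threads, the condition could become something along these lines (just a sketch of the idea under discussion, not a tested patch; thd_is_replication_applier() is a hypothetical stand-in for whatever server-layer predicate identifies a replication worker thread):

    // Sketch: keep the innodb_deadlock_detect opt-out for ordinary client
    // connections, where the report only feeds the "waited on master"
    // binlog hint, but always report lock waits for replication applier
    // threads, where optimistic parallel replication depends on it.
    // thd_is_replication_applier() is hypothetical, not an existing API.
    const bool rpl= !(type_mode & LOCK_AUTO_INC) && trx->mysql_thd &&
      thd_need_wait_reports(trx->mysql_thd) &&
      (innodb_deadlock_detect ||
       thd_is_replication_applier(trx->mysql_thd));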
Thank you for the explanation. Can you please also post it to https://jira.mariadb.org/browse/MDEV-24948 for future reference?

It would also be useful if you could check the following changes in MySQL 8.0 that are linked from MDEV-24948:

https://github.com/mysql/mysql-server/commit/30ead1f6966128cbcd32c7b6029ea21...
https://github.com/mysql/mysql-server/commit/3859219875b62154b921e8c6078c751...
The other similar functions (thd_rpl_deadlock_check()) are declared in a section in lock0lock.cc, so I moved this declaration there as well for consistency (but let me know if you still prefer ha_prototypes.h for this, or maybe for all of them).
That makes sense too. The main thing for me would be that the function declaration is not duplicated in multiple *.cc files.
I 100% agree with the need to be careful with this. In fact, this problem was found because I investigated a sporadic failure in the test case rpl_mark_optimize_tbl_ddl, which was caused by the erroneous removal of this code.
I really appreciate your recent efforts to investigate the root causes of sporadic test failures. In InnoDB, apart from locking, a prominent source of sporadic test failures has been statistics:

https://jira.mariadb.org/browse/MDEV-10682
https://jira.mariadb.org/browse/MDEV-21136
In the patch, note in particular the modification to has_higher_priority(), which I think has the effect of choosing which of two waiting transactions to grant a lock to. I don't remember adding this myself, so it might have been added by someone else.
Because innodb_lock_schedule_algorithm=VATS is known to be broken even without any involvement of replication, I do not think that we have to care too much about it.
Ok.
I must trust your judgement here; I have no fundamental objection.
Thanks, Marko.
I pushed an initial merge of this to 10.6, but I'll refine it once I fully understand the code around trx_pos:
https://github.com/MariaDB/server/tree/knielsen_bugfix_10.6
https://github.com/MariaDB/server/commit/b475257dc2297a8a7271367ad4a8d659297...
I think that thd_deadlock_victim_preference() needs to be invoked on an array of THD* and return a pointer to the victim THD, or nullptr if there is no preference. In your current 10.6 patch, we may have victim==nullptr for a large portion of the loop, and therefore we might choose something other than the actual preferred victim.

Do you have an idea how to stress test this at a larger scale than is feasible within the regression test suite?

Marko

--
Marko Mäkelä, Lead Developer InnoDB
MariaDB plc
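To make the suggested interface above concrete, it could look something along these lines (a hypothetical signature only, with invented parameter names; not existing code):

    // Hypothetical array-based variant of thd_deadlock_victim_preference():
    // given all transactions on the deadlock cycle, the server layer either
    // returns the THD it prefers to see rolled back, or nullptr when it has
    // no preference, in which case InnoDB falls back to its weight-based
    // choice.
    THD *thd_deadlock_victim_preference(const THD *const *candidates,
                                        size_t n_candidates);

This way the preference would be computed once over the whole cycle instead of pairwise inside the loop, avoiding the stretch of the loop where victim is still nullptr.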