Marko Mäkelä <marko.makela@mariadb.com> writes:
On Sun, Aug 6, 2023 at 6:59 PM Kristian Nielsen <knielsen@knielsen-hq.org> wrote:
If so, seems there's a small mistake in the code. The current condition will always be true for each new value of victim, so trx_pos will always be set != 0. The first condition should probably have been "if (trx == victim)" ?
As far as I remember, the original reason for the weight calculation was that the deadlock resolution would roll back the transaction that has done the least amount of work, expressed as the "weight" of the transaction. The weight is computed based on whether thd_has_edited_nontrans_tables() holds (the transaction has modified any non-transactional tables, such as MyISAM tables), on the number of undo log records written so far (trx_t::undo_no), and the number of InnoDB locks held by the transaction.
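A rough sketch of that weight computation, to make the comparison concrete (a simplified illustration, not the exact InnoDB source; the function name and the exact bit used for the non-transactional flag are assumptions):

    #include <cstdint>
    #include <cstddef>

    // Simplified illustration of the deadlock "weight": more undo log records
    // or more locks held means more work done, hence a heavier transaction.
    // The deadlock victim is the transaction with the smallest weight.  A
    // transaction that modified non-transactional tables gets a high-order
    // bit set so that it is effectively never chosen.
    static uint64_t example_trx_weight(uint64_t undo_no, size_t n_locks,
                                       bool edited_nontrans_tables)
    {
      uint64_t weight= undo_no + n_locks;
      if (edited_nontrans_tables)
        weight|= 1ULL << 63;
      return weight;
    }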
Yes. Consider this P-shaped path, with trx=T1 and cycle=T3:

    T1 -> T2 -> T3 -> T2

    T1  undo_no=100  lock.trx_locks length= 4  TRX_WEIGHT(T1)=104
    T2  undo_no=101  lock.trx_locks length=10  TRX_WEIGHT(T2)=111
    T3  undo_no=102  lock.trx_locks length= 2  TRX_WEIGHT(T3)=104

In the first loop iteration:
 - next=T2
 - 111 < ~0ULL, so we assign victim=T2 and victim_weight=111
 - next==victim, so we assign trx_pos=1 (this is wrong, I think).

In the second iteration:
 - next=T3
 - 104 < 111, so we assign victim=T3 and victim_weight=104
 - next==victim, so we assign trx_pos=2 (wrong)
 - next==cycle, so we end the loop.

At the end of the loop, trx_pos != 0 and trx_weight==victim_weight (==104), so
we assign victim=T1. The result is that we select T1 as the deadlock victim,
even though this does not break the deadlock.

I think it's just a small mistake in this code:

    if (next == victim)
      trx_pos= l;

which should have been this:

    if (next == trx)
      trx_pos= l;

My question is whether, instead of fixing it, we should just remove it along
with this code:

    if (trx_pos && trx_weight == victim_weight)
    {
      victim= trx;
      victim_pos= trx_pos;

I'm thinking it will be rare to have the same value of
(undo_no + LENGTH(lock.trx_locks)) for two transactions, which is the only way
this code can trigger. But I'm not familiar with the original reason for this
piece of the code, that's why I asked.
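To make the walkthrough concrete, here is a minimal, self-contained simulation of the selection loop as described above (the struct and field names only loosely mirror the InnoDB code; this is a sketch, not the actual lock0lock.cc source):

    #include <cstdint>
    #include <cstdio>

    struct trx_t
    {
      const char *name;
      uint64_t weight;   // stands in for TRX_WEIGHT(): undo_no + lock count
      trx_t *wait_trx;   // the transaction this one is waiting for
    };

    int main()
    {
      trx_t T3{"T3", 104, nullptr}, T2{"T2", 111, &T3}, T1{"T1", 104, &T2};
      T3.wait_trx= &T2;                  // the cycle: T2 -> T3 -> T2
      trx_t *trx= &T1, *cycle= &T3;      // the P-shaped path: T1 -> T2 -> T3 -> T2

      trx_t *victim= nullptr;
      uint64_t victim_weight= ~0ULL;
      unsigned victim_pos= 0, trx_pos= 0, l= 0;
      for (trx_t *next= cycle;;)
      {
        next= next->wait_trx;
        l++;
        if (next->weight < victim_weight)
        {
          victim_weight= next->weight;
          victim= next;
          victim_pos= l;
        }
        if (next == victim)              // the suspected bug: should be next == trx?
          trx_pos= l;
        if (next == cycle)
          break;
      }
      if (trx_pos && trx->weight == victim_weight)
      {
        victim= trx;                     // T1 is picked even though it is outside the cycle
        victim_pos= trx_pos;
      }
      std::printf("victim=%s pos=%u\n", victim->name, victim_pos);  // prints: victim=T1 pos=2
      return 0;
    }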
would not be broken. I thought about coming up with a test case to show this, but it looks like it's not easy to get the same TRX_WEIGHT for trx and victim. If so, maybe this part of the code can just be removed, or what do you think?
Can you check it once again?
My problem is that I'm not sufficiently familiar with InnoDB internals to know how to arrange the values of lock.trx_locks and undo_no, in particular, to make the condition (trx_weight == victim_weight) be true in a situation where we have a P-shaped path. It would be fun to try, and I might try it, but it will take some time.
In your current 10.6 patch, we may have victim==nullptr for a large portion of the loop, and therefore, we might choose something else than the actual preferred victim.
No, victim_weight is initialized to ~0ULL, so the condition (next_weight < victim_weight) will be true and victim will be initialized =next (and =cycle) on the first iteration. Thus victim!=nullptr on every iteration except the first.
I think that thd_deadlock_victim_preference() needs to be invoked on an array of THD* and return a pointer to the victim THD, or nullptr if there is no preference.
I do not think that is necessary. The fundamental issue is in-order parallel replication, where we have a series of transactions T1, T2, T3, ... that we know a priori _must_ commit in this order. In case of a deadlock between them, we must avoid T1 being chosen as a deadlock victim and immediately causing the same deadlock again, repeatedly. This is the bug seen here. But as long as we prevent T1 being chosen as the victim, replication will be able to proceed. Choosing T3 is better, but T2 also works. Thus, we do not need InnoDB to choose the _best_ deadlock victim. We just need to avoid it choosing the _worst_ victim, and the thd_deadlock_victim_preference() facility should ensure that.

That said, the current thd_deadlock_victim_preference() mechanism is not ideal. The dependencies involved are split between InnoDB and the replication layer; neither can solve it fully on their own. Similar problems must exist e.g. for deadlocks involving cross-engine transactions. Maybe a full solution requires server-wide deadlock detection.

(For in-order parallel replication itself, the deadlock detection problem is much simpler, because we _know_ that any T_(i+1) will eventually have to wait for T_i. So a deadlock will occur if and only if there is some T_j that waits for T_k, with j < k. But this is not true when considering transactions other than the in-order sequence within one replication domain_id.)

The thd_deadlock_victim_preference() is really just a comparison of thd->rgi_slave->gtid_sub_id. Maybe it would be simpler and more efficient to expose that to InnoDB, and simply use the gtid_sub_id as the TRX_WEIGHT, if present (with a few bit operations to make the comparison work out, similar to the (1ULL << 63) in the current code)? It exposes internal replication detail to InnoDB, but sometimes that is necessary to get the simplest and most efficient code...
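If it helps, here is one possible shape of that idea (a sketch only: the function name, the bit layout, and the policy of where replication weights sit relative to ordinary ones are my own assumptions, not existing MariaDB code):

    #include <cstdint>

    // Hypothetical: derive the deadlock weight from gtid_sub_id for in-order
    // parallel replication workers, so that an earlier transaction (smaller
    // sub_id) always compares as heavier and is therefore never preferred as
    // the victim over a later transaction in the same commit order.
    static uint64_t deadlock_weight(uint64_t trx_weight,    // undo_no + lock count
                                    uint64_t gtid_sub_id)   // 0 if not a replication worker
    {
      if (gtid_sub_id == 0)
        return trx_weight;                 // ordinary transaction: unchanged
      // Invert the sub_id so that a smaller sub_id yields a larger weight;
      // the high bit keeps these weights separate from ordinary TRX_WEIGHTs.
      return (1ULL << 62) | (~gtid_sub_id & ((1ULL << 62) - 1));
    }

Whether replication workers should also outweigh all ordinary transactions (the high bit here) is a separate policy question; the bit is only there to keep the two kinds of weights from colliding.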
I think for parallel replication, we have to always do the lock wait reporting to avoid this. Since we are anyway going to do a lock wait, an expensive operation, I think it is reasonable to accept the small overhead of a lock wait report to the server layer.
Thank you for the explanation. Can you please also post it to https://jira.mariadb.org/browse/MDEV-24948 for future reference?
Ok, will do.
It would also be useful if you could check the following changes in MySQL 8.0 that are linked from MDEV-24948: https://github.com/mysql/mysql-server/commit/30ead1f6966128cbcd32c7b6029ea21... https://github.com/mysql/mysql-server/commit/3859219875b62154b921e8c6078c751...
Ok, I will check it.
I pushed an initial merge of this to 10.6, but I'll refine it once I fully understand the code around trx_pos.
Do you have an idea how to stress test this at a larger scale than is feasible within the regression test suite?
We can set --slave-transaction-retries=1 and run optimistic parallel replication on a workload with many row lock conflicts. If a transaction gets a temporary error (like a deadlock), it will roll back and wait for all prior transactions to commit before re-trying. Thus, on retry, all other transactions will be preferred as deadlock victims. If more than one retry is required, it indicates a bug in the deadlock victim selection.

The test rpl_parallel_deadlock_victim.test is kind of a stress test, but using error injection to check against the wrong victim being selected.

 - Kristian.
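For reference, a sketch of replica settings for such a stress run (only --slave-transaction-retries=1 is from the description above; the parallel mode and worker count are illustrative assumptions):

    [mysqld]
    slave-transaction-retries = 1      # a second deadlock on retry indicates a bug
    slave-parallel-mode       = optimistic
    slave-parallel-threads    = 16     # any worker count > 1 that creates enough conflicts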