Hi Andrei, thanks a lot for looking at this patch!

andrei.elkin@pp.inet.fi writes:
>> +      if (rgi->killed_for_retry != rpl_group_info::RETRY_KILL_NONE)
>> +        wait_for_pending_deadlock_kill(thd, rgi);
>
> Per my understanding (please correct me if I am wrong), I left a
> question on this block in
> https://github.com/MariaDB/server/commit/df0c36a354ffde2766ebed2615642c74b79...,
> quoted below.
>
> ... this guard would not let a victim of an optimistic conflict
> proceed until it has cleared up whether it is such a victim. For
> instance, in a binlog sequence like
>
>   T_1, T_2, D_3
>
> where T_2 is the victim and D_3 is a DDL, the victim T_2 will no
> longer release the next GCO member D_3 into execution.
>
> But what if, at this point, T_2 is only about to be marked in
> rgi->killed_for_retry by T_1? That is, with this patch as well, T_2 is
> apparently able to go through with a ...NONE kill status. T_1 would
> reach this very point later, and if that happens before T_2 finds
> itself killed, T_1 would release D_3, and only afterward would T_2
> finally see itself KILLed into retry.
I am not entirely sure I understood your question correctly, hopefully I got
it right. I think your question is that this patch only helps if T_1 sets
rgi->killed_for_retry _before_ T_2 executes the above code that does
wait_for_pending_deadlock_kill(). So what if T_1 does so a bit later, after
T_2 has already proceeded beyond this check?

This is a very good question, and the answer is complex; hopefully I can
explain it. The problem addressed by this patch in fact only occurs in the
first case, where T_1 sets rgi->killed_for_retry _before_ T_2 executes the
above wait_for_pending_deadlock_kill(). What happens when the problem occurs
is the following:

1. T_2 gets a lock inside InnoDB, on the dict_stats table in this case.
2. T_1 requests a conflicting lock on dict_stats. It calls
   thd_rpl_deadlock_check(), which sets rgi->killed_for_retry and schedules
   a task for the manager thread to kill T_2. But the manager thread does
   not service the request yet.
3. T_1 goes to wait on the lock.
4. T_2 later releases the lock on dict_stats - this is a special case,
   normally DML locks are held until COMMIT.
5. T_2 now gets ready to commit and reaches the above code added in the
   patch. The manager thread has not yet serviced the request, so
   thd->killed is not set. But rgi->killed_for_retry _is_ set, as this is
   done synchronously in thd_rpl_deadlock_check().
6. T_1 now wakes up, since the lock on dict_stats was released earlier by
   T_2.
7. T_1 can now proceed to commit and call mark_start_commit().
8. Only now does the manager thread wake up to service the request, and it
   kills T_2, setting thd->killed.

Before the patch, at this point T_2 could run mark_start_commit(). This
could allow D_3 to start too early, if T_2 later sees the thd->killed flag
being set and rolls back and retries. With the patch, this is no longer
possible, as T_2 will in (5) run wait_for_pending_deadlock_kill(). This will
wait for the manager thread to complete, and then thd->killed will be set
and T_2 will roll back and retry _before_ calling mark_start_commit().

Now consider the second case, where T_1 only sets rgi->killed_for_retry
_after_ T_2 has passed the above code point:

1. T_2 acquires some lock inside InnoDB.
2. T_2 passes the above code point with the check for
   rgi->killed_for_retry; it calls mark_start_commit() and starts the
   commit process.
3. T_1 only now gets a lock conflict with T_2. It calls
   thd_rpl_deadlock_check(), sets rgi->killed_for_retry, and schedules a
   task for the manager thread to kill T_2.
4. T_1 goes to wait on the lock.

In this situation, the lock in InnoDB was still held by T_2 when it ran
mark_start_commit(). This will now prevent T_1 from waking up and itself
doing mark_start_commit() until one of two things happens: (A) T_2 commits;
or (B) T_2 sees thd->killed and rolls back, in which case T_2 will first
call unmark_start_commit(), which avoids D_3 starting too early.

So the problem here is the special case of InnoDB locks on dict_stats, which
are held only temporarily and released _before_ commit/rollback (autoinc
locks are another such case). Normal locks held until commit/rollback are
not the problem, as these will prevent the other transaction T_1 from
calling mark_start_commit() too early. And the patch works under the
assumption that such special-case locks, which are released before
commit/rollback, can no longer be acquired after this point in the code,
where only the commit step of the transaction remains.
This part of the parallel replication code is something that I was always
unhappy with. It feels very fragile to rely on this property: that _if_ T_2
gets killed after mark_start_commit(), we can be sure that another
transaction will always be blocked on some lock held by T_2 until after T_2
has had time to do unmark_start_commit(). The patch doesn't fundamentally
change this, but it does solve a real bug that occurs in the scenario
explained above.

 - Kristian.