andrei.elkin@pp.inet.fi writes:
Now consider the second case, where T_1 only sets rgi->killed_for_retry _after_ T_2 has passed the above code point:
1. T_2 acquires some lock inside InnoDB
2. T_2 passes the above code point with the check for rgi->killed_for_retry, it calls mark_start_commit() and starts the commit process.
3. T_1 only now gets a lock conflict with T_2, it calls thd_rpl_deadlock_check(), sets rgi->killed_for_retry, schedules a task for the manager thread to kill T_2
4. T_1 goes to wait on the lock.
In this situation, the lock in InnoDB was still held by T_2 when it ran mark_start_commit().
Hmm, this is unclear. I thought you had a case in which T_2 releases an innodb stats lock at the end of its current statement. And T_1 would try to acquire the lock within that timeframe.
I was describing two different cases, "first case" and "second case".
The "first case" is the case where the bug occurred, and the one my patch fixes. It is just as you describe, T_2 releases an innodb stats lock at the end of its current statement.
The above "second case" describes a different scenario, trying to explain why this scenario does not cause a bug. This case differs from the "first case" in that the lock taken by T_2 is _not_ released at the end of its current statement, it is only released at commit/rollback.
Sorry if this was not clear.
Am I confused in the above? It'd be great to have a T_2 worker stack at time of the lock is granted to it, but it's not provided..
I have included below a stack trace from my notes, where an insert goes to take a dict stats lock.
Obvisoulsy under this perception T_2 would execute mark_start_commit() after the lock is released.
For the scenario where T_2 executes mark_start_commit() after the lock is released, and the bug occurred, you should look at the "first case", described in points 1-8.
The "second case" that you found unclear you can just ignore.
Got it. Thanks for spelling out! I see I should've asked specifically on the first case. I meant the following. What if p.4 lock release by T_2 happened within p.2, and before T_2::rgi->killed_for_retry is changed. Something like 1. T_2 gets a lock inside InnoDB, on the dict_stats table in this case. 2. T_1 requests a conflicting lock on dict_stats. It calls thd_rpl_deadlock_check() ... + 4 ... which sets rgi->killed_for_retry and schedules a task for the manager thread to kill T_2. But the manager thread does not service the request yet. - 3. T_1 goes to wait on the lock. + 3. T_1 is granted the lock immediately. - 4. T_2 later releases the lock on dict_stats - this is a special case, normally DML locks are held until COMMIT. But I see now that the step 4 relocation is impossible due to Innodb locking protocol. So that proves T_2 can't miss to find its killing status at mark_start_commit().
that was to explain what happens in a different scenario, where T_2 does not release its lock until the transaction is rolled back. In this case we avoid starting D_3 too early, because T_2 does unmark_start_commit() before rolling back and releasing its lock, so there is no bug (with or without the patch).
Please clear it out.
I hope this helps.
No more question, as we're done :-). Thanks for this piece of analysis and work! Andrei
- Kristian.