Sergei Golubchik <serg@mariadb.org> writes:
I'm kind of lost here. I really don't like requiring engines to do more obligatory things, but I don't see a good way around it either.
Agree - that's how I felt while trying to think of _any_ way to solve these problems. I ended up with the current solution, but I'm not entirely happy with it. At least we now have some more ideas for improving things.

What should the next step be? I suggest that I:

1. Change the API to use the new thd_get_commit_order() instead (see below for a detailed suggestion).

2. Try to implement the method where InnoDB gets the information about commit order and does the kill of the other transaction itself.

3. Read up on services and use them for the new API.

4. Prepare a new patch for you to look at.

Does this sound reasonable? Or do you have another suggestion?

More detailed discussion below:
I thought that if a transaction is rolled back, you can, technically, force *all* later transactions to roll back too and redo them all. But that'd be kind of expensive.
It's not the expense that worries me. The problem is that some of the following transactions may not be possible to roll back.

Suppose we have T1, T2, and T3. T1 and T2 are InnoDB DML, T3 is DDL or MyISAM. If we now get a deadlock between T1 and T2 and kill T3 to resolve it, we can seriously corrupt replication, because we will be left with T3 half-way executed. This is made worse by the fact that there might not really have been any deadlock at all - it could just be a lock wait timeout set too low.

This is the reason I did the fix the way I did. In my patch, we kill and retry only T2, so T3 is unaffected.
If we could know whether an engine uses this new API, we could - like I wrote above - default to rolling back all transactions in a group after the failed one. And then the new API would've been merely an optimization. *That* would've been a big improvement.
Suppose, with the current patch, that the engine does not support the new API. T1 will get a lock timeout. It will retry, but get the lock timeout again, and after slave_transaction_retries it will give up, and replication will stop.

So it seems we have two alternative default behaviours (that is, when the storage engine does not implement the new API):

1. As in the current patch: replication stops, and the DBA is made aware of the issue and has to handle it manually.

2. As you suggest: we kill T3, and the slave continues automatically (with some lock-timeout delays), but we might have silently corrupted the slave state.

(1) seems safer, while (2) will be a lot more convenient for the DBA in many cases. Which do you think we should prefer?

I wonder if we could somehow check whether T3 is safe to roll back? If it is, (2) is both convenient and safe, and could be used. Many users use only InnoDB and run DDL rarely. But I'm not sure how hard such a check would be in the general case. Maybe it would be worth looking into, as you said it would be much preferable if the new storage engine API were just an optional optimisation. A sketch of one possible check is below.
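Something like this, perhaps. This is purely a sketch: is_safe_to_rollback() is a hypothetical helper, not something in the patch, and whether these two conditions capture all the unsafe cases is exactly the part I'm unsure about:

  /* Hypothetical check: can this transaction be killed and retried
     without leaving the slave in a half-applied state? */
  static bool is_safe_to_rollback(THD *thd)
  {
    /* Changes to non-transactional engines (e.g. MyISAM) cannot be
       undone, so killing mid-statement leaves them half-applied. */
    if (thd->transaction.all.modified_non_trans_table ||
        thd->transaction.stmt.modified_non_trans_table)
      return false;
    /* DDL and similar statements commit implicitly in the server and
       cannot be rolled back either. */
    if (thd->lex &&
        (sql_command_flags[thd->lex->sql_command] & CF_AUTO_COMMIT_TRANS))
      return false;
    return true;
  }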
2. thd_deadlock_victim_preference() is only used when the _storage engine_ detects a deadlock, replication related or not.
3. thd_need_ordering_with() is used by the storage engine to decide on lock wait semantics.
Which is, basically, the same as 2; the concept is the same. One function is used when a deadlock is detected, the other is used earlier, but that doesn't make them sufficiently different.
I do not think it is the same; let me try to explain once more, and maybe it will become clearer one way or the other.

With just (1), thd_report_wait_for() or whatever we replace it with, the basic problem is solved. In case of a deadlock between T1 and T2, we will kill T2, let T1 continue, and later re-try T2. This works even without thd_need_ordering_with().

But in some cases the deadlock, kill, and retry are unnecessary. This is about the issue in MDEV-5914, for example:

  T1: UPDATE t1 SET secondary=NULL WHERE primary=1
  T2: DELETE FROM t1 WHERE secondary <= 3

If T1 and T2 were group-committed together on the master, we know that it is safe to run them in parallel, so we can relax the gap lock taken by T2 that can block T1. Then we avoid having to roll back and retry T2. But if T2 committed in a different group commit on the master, we do not know that relaxing the gap locks is safe. Because T2 may still run in parallel with the COMMIT of T1, we need to keep the gap locks, so that T2 will wait if it needs to access any row modified by T1.

So thd_need_ordering_with() really _is_ just an optimisation. We might omit it completely from the patch, or we could keep it as a maybe-useful optimisation.

One way to use the same function for both purposes might be thd_get_commit_order(thd, other_thd), returning:

  -2  thd commits before, and it is not in the same group commit as other_thd
  -1  thd commits before, but it is in the same group commit as other_thd
   0  thd and other_thd can commit in either order
   1  other_thd commits before, but it is in the same group commit as thd
   2  other_thd commits before, and it is not in the same group commit as thd

Is this something like what you want?
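To make that concrete, here is roughly how I imagine an engine consulting it at a lock-wait point. Only a sketch - relax_gap_locks(), still_blocked(), and resolve_replication_deadlock() are hypothetical helper names, and the exact signature is open:

  /* Proposed service function (sketch): returns -2..2 as listed above. */
  extern "C" int thd_get_commit_order(const MYSQL_THD thd,
                                      const MYSQL_THD other_thd);

  /* Engine side, when trx is about to wait on a lock held by other_trx: */
  int order= thd_get_commit_order(trx->mysql_thd, other_trx->mysql_thd);
  if (order == -1 || order == 1)
  {
    /* Same group commit on the master: the two ran in parallel there,
       so the gap lock can safely be relaxed, often avoiding the wait. */
    relax_gap_locks(trx, other_trx);
  }
  if (order < 0 && still_blocked(trx, other_trx))
  {
    /* trx must commit before other_trx, yet it would wait for other_trx:
       under the enforced commit order that is a deadlock, so other_trx
       must be killed and later retried. */
    resolve_replication_deadlock(other_trx);
  }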
I think it'll be much simpler with only one function. Why? I think it's quite the opposite...
Why? Because there's only one function with very simple and clear semantics - the server informs the engine which transaction must be committed first. The engine uses this information to choose deadlock victims, to resolve waits, and so on. This is just one high-level function that doesn't require the concepts of gap locks, statement-based replication, locks that are released before commit, locks that don't cause waits, and whatever else one must keep in mind when using the lower-level API.
I think it's much simpler to use. It is only more complex if you think of everything it might mean and do in different situations. Compare a modern smartphone and an old punch-card computer: the old computer is simpler internally, but the smartphone is easier to use.
So I'm fine with whatever way we agree on in the end.

The reason I think my original API is simpler is that there is just one very simple thing the engine must do: call thd_report_wait_for(T1, T2) whenever T1 is about to wait until T2 commits. Everything else is optional optimisations that can be safely ignored. It seems that any conceivable engine must already have easy access to the knowledge that one transaction needs to wait for another; indeed, in InnoDB only a couple of lines need to be added to handle this. So it is very simple to use.

With your suggestion, however, things are more complex for the engine. It needs to implement additional deadlock detection, based on some enforced commit order. The very concept of an "enforced commit order" is probably something new. And it needs to implement some way to resolve deadlocks with another transaction T2 that may or may not be running inside the engine. So again, a new concept is introduced: "is transaction T running inside this engine?". So I still do not understand why you think the second is simpler, or easier to use?

But I can try to see if I can implement the second idea. The main complication I see is how to be sure whether T2 is running inside the engine or not. Maybe like this (a code sketch follows after this list):

1. Call thd_get_commit_order(T1, T2); we learn that we have a deadlock, so we need to kill T2.

2. Set a flag in T2 marking it as a deadlock victim.

3. Call the new thd_plugin_awake(); it will kill T2 if T2 is outside of the engine.

4. It could be that T2 was executing outside the engine at point (2), but had time to enter the engine while we were traversing the list of handlertons in (3). But then we can add code at appropriate places in the engine to check for the flag set in (2), and make sure T2 gets killed in this case too. (Maybe the necessary logic is already present in innobase_kill_query(); I will check.)

Have I understood your suggestions correctly now, do you think?
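In code, I imagine steps (2)-(4) looking roughly like this - again only a sketch, where thd_plugin_awake() is the proposed new service call and deadlock_victim_flag is an invented per-transaction flag:

  /* Engine side, after thd_get_commit_order() says T1 must commit
     before T2, yet T1 would have to wait for T2 (i.e. a deadlock). */

  /* (2) Mark T2 first, so that if it enters the engine concurrently,
     the engine's own checks can notice the flag and kill it there. */
  t2_trx->deadlock_victim_flag= TRUE;

  /* (3) Ask the server to kill T2 wherever it currently runs; this
     traverses the handlertons and kills T2 outside the engine if
     necessary. */
  thd_plugin_awake(t1_trx->mysql_thd, t2_trx->mysql_thd);

  /* (4) If T2 slipped into the engine between (2) and (3), the
     lock-wait and similar checkpoints inside the engine must test
     deadlock_victim_flag and kill T2 there as well (possibly
     innobase_kill_query() already contains the needed logic). */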
So, either one needs two functions (thd_deadlock_victim_preference and thd_need_ordering_with), or modified_non_trans_table should be moved back out to where it was. I'd prefer two functions; it was good that you put modified_non_trans_table inside thd_deadlock_victim_preference.
Ok, right. modified_non_trans_table is unrelated, but it felt natural to move it out of the storage engine and into a more general function.
This is O(N**2) function calls that immediately return; it's negligible compared to the total cost of the graph traversal. And anyway, it's only called before waits - the engine is going to wait anyway, so one function call won't make that wait noticeably longer.
You are assuming that the engine will need to do this O(N**2) graph traversal anyway. That's true in InnoDB, which has a rather primitive algorithm for deadlock detection, but it might not be true in another engine.

The idea with thd_need_wait_for() was that it would be used by a storage engine that would not normally need to loop over all other transactions. Such an engine could avoid needlessly implementing that loop for _all_ transactions, when it would only be needed for replication transactions (which will usually have few waits, if any); roughly as in the sketch below.

But let's omit it for now; we do not know of any engine that might need it, and if someday we find one, we can easily add something at that point.
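For clarity, this is how I imagined the (now omitted) call being used; the exact signature was never settled, so take it as a sketch:

  /* At the point where waiting_trx is about to block on blocking_trx:
     ask the server first whether this wait is interesting at all (it
     only is for replication worker threads), and only then pay the
     cost of reporting it. */
  if (thd_need_wait_for(waiting_trx->mysql_thd))
    thd_report_wait_for(waiting_trx->mysql_thd, blocking_trx->mysql_thd);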
One way would be to document that hton->kill_query must be safe to call from within thd_report_wait_for(). Maybe that is the simplest way. It should not be hard to fix the InnoDB code to make this work. For example, InnoDB could set a flag saying that lock_sys->mutex is already held, thus avoiding the recursive mutex lock; see the sketch below.
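Hypothetically, something like the following. The email suggests a per-transaction flag; this sketch uses a thread-local one for simplicity, thread_owns_lock_mutex is an invented name, and the kill path is heavily simplified:

  /* Hypothetical thread-local flag: does the current thread already
     hold lock_sys->mutex? */
  static __thread bool thread_owns_lock_mutex= false;

  /* In the lock-wait code, where lock_sys->mutex is already held: */
  thread_owns_lock_mutex= true;
  thd_report_wait_for(trx->mysql_thd, other_trx->mysql_thd);
  thread_owns_lock_mutex= false;

  /* In innobase_kill_query(), which may now be re-entered on this
     same thread from inside thd_report_wait_for(): */
  if (!thread_owns_lock_mutex)
    lock_mutex_enter();
  /* ... cancel the victim's lock wait here ... */
  if (!thread_owns_lock_mutex)
    lock_mutex_exit();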
Yes, but it'd make a pretty difficult-to-document API; I'd rather avoid it if possible.
Agree, it's not nice; I'd also like to avoid it.

Thanks,

 - Kristian.