Hi, Kristian! On Jul 01, Kristian Nielsen wrote:
+ The storage engine can report such conflicting locks using this call. This
+ will allow parallel replication to detect such conflicts and resolve the
+ deadlock (by killing the second transaction to release the locks that the
+ first is waiting for, and then later re-try the second killed transaction).
What will happen if the engine doesn't? Deadlocks will time out. Which is annoying, but it only affects the performance, not the correctness.
No, unfortunately that is not so.
Let's say we have T1 and T2. T1 goes to wait on a row lock held by T2. Then T2 completes its prepare phase, goes to the group commit queue, and waits for T1 to complete and commit for both of them. T1 gets a lock wait timeout, so T1 will roll back and try again. But T1 will just go to wait on the same row lock again, still held by T2. So T1 will retry over and over until slave_transaction_retries is exceeded, at which point replication will abort with an error.
There is no lock wait timeout when T2 waits in the group commit queue.
Right. I thought that if a transaction is rolled back, you can, technically, force *all* later transactions to roll back too and redo them all. But that'd be kind of expensive. On the other hand, requiring *every* engine to report waits... Not good either. And you cannot really enforce that; there's no way to know whether an engine does it (compare with handler methods - a required method can simply be declared abstract, and the engine cannot possibly miss it).

I'm kind of lost here. I really don't like requiring engines to do more obligatory things, but I don't see a good way around it either.

If we could know whether an engine uses this new API, we could - like I wrote above - default to rolling back all transactions in a group after the failed one. And then the new API would've been merely an optimization. *That* would've been a big improvement. Can we deduce this somehow, without requiring the engine to declare a flag, like HA_CAN_SUPPORT_TRX_COMMIT_ORDER_IN_PARALLEL_REPLICATION?
+void thd_report_wait_for(const MYSQL_THD thd, MYSQL_THD other_thd);
You kill the second thread in the background. What if instead you convey the ordering to the engine and let it handle the deadlock? Like, if thd_report_wait_for() returns 1, that means a deadlock, and the engine kills the transaction just as it would for any normal deadlock case.
The problem is that T2 could be executing anywhere in the server, including server code or another storage engine, where InnoDB does not really have any possibility to kill it.
This is different from a normal InnoDB deadlock, which only happens if both T1 and T2 are doing an actual row lock wait inside the storage engine itself. But in this case, only T1 is waiting, T2 is continuing whatever processing it is doing, eventually it will end up in the group commit queue in binlog or in wait_for_prior_commit(), and wait for T1 to commit first.
I don't know whether you've seen what I suggested on irc, so I'll repeat. Provide a thd_kill(thd) helper that an engine can call to kill a specific thd. It will *not* call ha_kill_query for this particular engine, but will call for other engines. The logic is - the engine knows how to kill a thread that's stuck somewhere in this very engine. But it needs help to kill a thread that's somewhere else. Also it'll help to avoid mutex issues that you were seeing.
+int thd_need_wait_for(const MYSQL_THD thd);
I'd suggest removing it. If you think it's useful, you can call it from your thd_report_wait_for() and spare the storage engine the knowledge of yet another API function.
See discussion above. The intention is to avoid lots of calls to thd_report_wait_for() from the storage engine for non-replication threads. So calling it from inside thd_report_wait_for() makes no sense.
If it's on the very top of thd_report_wait_for() then it'll be one function that immediately returns.
I don't understand the benefits of reducing the functionality available in the API when that functionality is completely optional?
The benefit is the keeping API as simple as possible, while still fully functional. It's very important to have the simple API, even if it's doing something complex underneath. See pluggable authentication, for example, the big comment inside native_password_authenticate() function.
I included this function because without it, there seems no way for a storage engine to avoid introducing O(N**2) overhead for N transactions contending for a row lock. That seemed unacceptable to me.
This is O(N**2) of function calls that immediately return; it's negligible next to the total cost of the graph traversal. And anyway, it's only called before waits - the engine is going to wait anyway, so one function call won't make that wait noticeably longer.
+int thd_deadlock_victim_preference(const MYSQL_THD thd1, const MYSQL_THD thd2);
Okay, I see why this one is needed.
On the other hand... It seems like it's the only function that you really need. Because

  thd_report_wait_for(const MYSQL_THD thd, MYSQL_THD other_thd);

can be replaced with

  if (thd_deadlock_victim_preference(thd, other_thd) < 0) goto deadlock;

And thd_need_wait_for(const MYSQL_THD thd); can be removed (replaced with a noop). And thd_need_ordering_with(const MYSQL_THD thd, const MYSQL_THD other_thd); can be replaced with

  thd_deadlock_victim_preference(thd, other_thd) != 0
I don't understand the purpose of using the same function for these? They do completely different things, using a single function will only cause confusion, it seems to me:
1. thd_report_wait_for() is used to allow the _replication code_ to detect a deadlock related to ordering.
Yes, but my suggestion was to remove it, remember? To let the engine kill the other thread.
2. thd_deadlock_victim_preference() is only used when the _storage engine_ detects a deadlock, replication related or not.
Right. But my suggestion was to make it replication-specific. I didn't know you'd put modified_non_trans_table in there too. I agree that modified_non_trans_table affects deadlock victim preferences, but it does not affect commit order. So either one needs two functions (thd_deadlock_victim_preference and thd_need_ordering_with), or modified_non_trans_table should be moved back out to where it was. I'd prefer two functions; it was good that you put modified_non_trans_table inside thd_deadlock_victim_preference.
3. thd_need_ordering_with() is used by the storage engine to decide on lock wait semantics.
Which is, basically, the same as 2; the concept is the same. One function is used when a deadlock is detected, the other earlier, but that doesn't make them sufficiently different.
Although in that case thd_deadlock_victim_preference() should better be renamed, because it won't be a generic "victim preference", but specifically a victim preference depending on the fixed transaction commit ordering. A better name could be, for example,
  thd_get_commit_order(thd, other_thd)
    -1  - thd first
     0  - unspecified
    +1  - other_thd first
How can the storage engine know the difference between T1 and T2 being in the same group commit, or in different group commits?
Why the engine needs to know that?
How can the storage engine inquire about need for gap locks, without accidentally reporting a lock wait to the upper layer?
thd_get_commit_order() does not report locks waits to the upper layer, that was the point, see above - no reporting, the engine does the killing.
I think it'll be much simpler with only one function.

Why? I think it's quite the opposite...
Why? Because there's only one function with very simple and clear semantics: the server informs the engine which transaction must be committed first. The engine uses this information to choose deadlock victims, to resolve waits, etc. This is just one high-level function that doesn't use the concepts of gap locks, statement-based replication, locks that are released before commit, locks that don't cause waits, and whatever else one must keep in mind when using the lower-level API. I think it's much simpler to use.

It's only more complex if you think of everything it might mean and do in different situations. Compare a modern smartphone and an old punch-card computer: the old computer is simpler internally, but the smartphone is easier to use.
Hm. So my idea was to make the new API simple and generic, the upper layer would handle most of the detail about commit ordering or whatever. The storage engine only needs to think about simple stuff like lock waits or deadlock victims, both of which it likely already needs to deal with anyway.
But if I understand correctly, another idea could be to make the new API specific just to this particular problem of enforced commit order in parallel replication. So the storage engine would be able to know of such ordering, and should then handle any necessary actions on this by itself?
Eh, not quite. The idea was to make the new API simple and generic *and* high-level, where the engine would be able to know about the enforced commit order and act accordingly, instead of being given lots of low-level orders about what it should do in every specific internal situation.
And you won't need background killing for slave threads.
The need for background killing comes because it does not work to call THD::awake() from the context of where the InnoDB deadlock detector calls thd_report_wait_for(). The problem is that innobase_kill_query() does a recursive mutex lock on lock_sys->mutex.
We discussed on IRC to fix this so background kill is not necessary.
One way would be to document that hton->kill_query must be safe to call from within thd_report_wait_for(). Maybe that is the simplest way. It should not be hard to fix the InnoDB code to make this work. For example, InnoDB could just set a flag in the transaction that it already has lock_sys->mutex held, thus avoiding the recursive mutex lock.
Yes, but it'd make a pretty difficult-to-document API, I'd rather avoid it, if possible.
You also had a different suggestion:
here's another idea. To have thd_get_commit_order() as I thought originally *and* allow innodb to kill other threads, that is, add a function for plugins to invoke THD::awake. This function should not call into the engine that invoked it. That is, supposedly the engine knows how to kill threads stuck in that engine, to kill other threads it should use this new helper.
In that case innodb can detect the deadlock and can kill the other thread and it won't break on lock mutex
The underlying logic of this killer helper would be "an engine knows how to kill a transaction waiting in this engine, so we wouldn't bother doing that. but to kill a transaction waiting elsewhere the engine needs help"
It seems better to me if all the details of enforcing commit order and killing worker threads in case of deadlocks could be kept inside the replication code? Otherwise every storage engine would potentially need to re-implement this. I think that is what I tried to do with my proposed storage engine API: cleanly separate the storage engine's responsibilities, which are about row-level locks and internal deadlock victims, from the replication code's responsibilities.
But the engine already has all that code. The wait-for graph shows which transaction needs to wait for which, and the suggested function simply helps the engine resolve waits, by specifying which transaction can *not* wait for the other. Then it goes through the normal deadlock resolution in the engine.
Wouldn't it be simpler to do it like this?
1. InnoDB deadlock detector reports lock waits to whatever function serves the thd_report_wait_for() purpose.
2. We document that thd_report_wait_for() may need to call hton->kill_query().
3. We fix InnoDB to handle (2) correctly (assuming this is possible to do in a simple way).
Then you'd need to document the exact semantics of thd_report_wait_for(), with all its "should not report false positives, such as ... and ...". And document that thd_report_wait_for() might call ha_kill_query, so one would always need to keep these mutex semantics in mind; errors won't be easy to find, because it's not a use case that happens often. Fixing InnoDB, on the other hand, is pretty easy. If we only needed an internal fix, without thinking of the API part, I'd use this approach.
And anyway, this absolutely must be in a service, not in the global plugin.h.
Why? My understanding was that we do not support ABI stability for storage engines.
Because anything you put in the global plugin.h is not for storage engines only. It's for every and any plugin out there - up for grabs. The server has no idea whether any plugin uses that or not. If we would have a separate header for storage engine API functions only, that would've not been an issue.
See? In this patch you've removed a function from the global plugin API
Right, the thd_rpl_is_parallel(). That should never have been allowed into the code in the first place; it was a temporary hack only. :-(
Sure. I mean that, formally, removing a function from the API creates a new API incompatible with the old one. Which, again formally, should invalidate *all* plugins using the old API. If the API is global, it invalidates every plugin possible.
Strictly speaking, when you remove a function from the global plugin header you must increase the API version - the major number - making all older plugins incompatible with the new server. This is a bit extreme, taking into account that no plugin actually uses thd_rpl_is_parallel(). To avoid this, please only add new APIs as services. See libservices/HOWTO.
I can do that, if you want. It just seems a needless complication to me...
The point is to move these functions from a global API (that affects all plugins) to a small API that affects only the plugins that explicitly use it. Services really make it much easier to change APIs in the future, because such changes don't affect other plugins.
Are there any performance penalties for using a service over a normal function call (for built-in storage engines)? I seem to recall there is not?
None whatsoever. For built-in storage engines this is #ifdef-ed, and they get normal function calls. Regards, Sergei