Sergei Golubchik <vuvova@gmail.com> writes:
This is what I've written in the patch:
I've tried to answer all points in detail, also to help myself better understand all the issues:
+  The storage engine can report such conflicting locks using this call. This
+  will allow parallel replication to detect such conflicts and resolve the
+  deadlock (by killing the second transaction to release the locks that the
+  first is waiting for, and then later re-try the second killed transaction).
What will happen if the engine doesn't? Deadlocks will time out. Which is annoying, but it only affects the performance, not the correctness.
No, unfortunately that is not so. Let's say we have T1 and T2. T1 goes to wait on a row lock held by T2. Then T2 completes its prepare phase, it goes to the group commit queue and waits for T1 to complete and commit for both of them. T1 gets lock wait timeout, so T1 will roll back and try again. But T1 will just go to wait on the same row lock again that is held by T2. So T1 will retry over and over until slave_transaction_retries is exceeded, at which point replication will abort with an error. There is no lock wait timeout when T2 waits in the group commit queue.
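The retry livelock described above can be made concrete with a toy, single-threaded model (the constant name and the function are illustrative, not server code): T2 never releases the row lock, because its own wait in the group commit queue has no timeout, so each of T1's retries hits the same lock until the retry budget is exhausted.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the slave_transaction_retries server variable. */
#define SLAVE_TRANSACTION_RETRIES 10

static int run_t1_with_retries(void)
{
    /* T2 holds the row lock and waits in the group commit queue for T1;
       it never releases the lock, and that wait has no timeout. */
    bool t2_holds_row_lock = true;
    int retries = 0;

    while (retries < SLAVE_TRANSACTION_RETRIES) {
        if (!t2_holds_row_lock)
            return 0;          /* T1 would acquire the lock and commit */
        retries++;             /* lock wait timeout: roll back and retry */
    }
    return 1;                  /* retries exhausted: replication aborts */
}
```

The point of the model is that no number of retries helps: the condition T1 is waiting for can only be satisfied after T1 itself commits.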
But can it cause the first transaction to be killed, not the second? Then it's worse - it'll violate the strict commit ordering.
No, this can never happen. Commit order is enforced by the wait_for_commit facility. T2 will not be allowed to commit before T1. If T1 is killed and retried, T2 will continue waiting for T1 to commit the second time. And if T1 fails, it will call wakeup_subsequent_commits() with an error, so that T2 can abort its commit.
And, again, the engine needs to do it for every lock wait. Even while most of them won't cause any deadlocks.
Right. This is only needed for parallel replication worker threads. That is why I introduced thd_need_wait_for(). The storage engine can call this once for a transaction or even for a THD, and save the result. And then it won't need to call thd_report_wait_for() except for the parallel replication worker threads. And for those, lock waits should be rare.
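This caching pattern can be sketched as follows. Everything here is a hypothetical engine-side stub (thd_stub, engine_trx, and the two API stubs stand in for the server's THD and the proposed calls); the shape of the idea is: query thd_need_wait_for() once at transaction start, then consult only the cached flag on every lock wait.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the server THD and the proposed API calls. */
typedef struct { bool is_parallel_repl_worker; } thd_stub;

static int waits_reported = 0;

static int thd_need_wait_for(const thd_stub *thd)
{ return thd->is_parallel_repl_worker; }

static void thd_report_wait_for(const thd_stub *thd, thd_stub *other_thd)
{ (void)thd; (void)other_thd; waits_reported++; }

/* The engine's transaction caches the answer once at start... */
typedef struct {
    thd_stub *thd;
    bool report_waits;          /* cached thd_need_wait_for() result */
} engine_trx;

static void engine_trx_start(engine_trx *trx, thd_stub *thd)
{
    trx->thd = thd;
    trx->report_waits = thd_need_wait_for(thd) != 0;
}

/* ...so ordinary connections skip the report call on every lock wait. */
static void engine_row_lock_wait(engine_trx *waiter, engine_trx *holder)
{
    if (waiter->report_waits)
        thd_report_wait_for(waiter->thd, holder->thd);
    /* ...enqueue the actual row lock wait here... */
}
```

Only transactions belonging to parallel replication workers ever reach thd_report_wait_for(); for everything else the cost is one cached boolean test per lock wait.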
+  The storage engine should not report false positives. That is, it should not
+  report any lock waits that do not actually require one transaction to wait
+  for the other. Nor should it report waits for locks that will be released
+  before the commit of the other transaction.
What will happen if the engine reports false positives? Lock waits that don't actually require one transaction to wait for another? What if it reports waits for locks that will be released before the commit of the other transaction?
If it reports that T1 will wait for T2, then we will kill T2 and retry it. This will cause unnecessary work if there is no actual deadlock, so mainly a performance issue.

A deeper problem is caused by the async kill processing. If T1 is reported as waiting for T2, we will kill T2 in the background thread. But if there is no such wait, or if the wait is resolved due to T2 releasing locks before commit, then T1 and T2 could both complete, and their worker threads move on to new transactions. Then the background thread will be trying to kill a T2 which no longer exists, which will access incorrect rpl_group_info objects. This is the part of the patch I am most unhappy about :-/

But we started to discuss on IRC how to possibly avoid the async kill, instead making it possible to call THD::awake() directly inside thd_report_wait_for(). This would be good, as it would solve the deeper problem. In that case, a false positive would cause only performance issues, I think, though I should think the problem through carefully to be sure.
+void thd_report_wait_for(const MYSQL_THD thd, MYSQL_THD other_thd);
You kill the second thread in the background. What if you convey the ordering to the engine and let it know about a deadlock? Like, if returning 1 from thd_report_wait_for() means a deadlock, then the engine will kill the transaction just normally, as for any normal deadlock case.
The problem is that T2 could be executing anywhere in the server, including server code or another storage engine, where InnoDB does not really have any possibility to kill it. This is different from a normal InnoDB deadlock, which only happens if both T1 and T2 are doing an actual row lock wait inside the storage engine itself. But in this case, only T1 is waiting, T2 is continuing whatever processing it is doing, eventually it will end up in the group commit queue in binlog or in wait_for_prior_commit(), and wait for T1 to commit first.
+int thd_need_wait_for(const MYSQL_THD thd);
I'd suggest to remove it. If you think it's useful, you can call it from your thd_report_wait_for() and spare storage engine the knowledge of yet another API function.
See discussion above. The intention is to avoid lots of calls to thd_report_wait_for() from the storage engine for non-replication threads. So calling it from inside thd_report_wait_for() makes no sense. I don't understand the benefit of reducing the functionality available in the API when that functionality is completely optional?

But on the other hand, I do not feel strongly about this function. I do not even use it in the patch, as InnoDB does a rather expensive graph traversal of all waiters anyway in its deadlock detector, so it can just be omitted. I included this function because without it, there seems to be no way for a storage engine to avoid introducing O(N**2) overhead for N transactions contending for a row lock. That seemed unacceptable to me.
+int thd_need_ordering_with(const MYSQL_THD thd, const MYSQL_THD other_thd);
Hmm, really? Couldn't you simply disable gap locks in some other already existing way? Like decrease the transaction isolation level or tell the engine that it's RBR?
No, that is not safe. Suppose we have two transactions:

  T1: INSERT INTO t1 VALUES (1);
  T2: INSERT INTO t2 SELECT * FROM t1;

T1 and T2 are not run in parallel, because they cannot group commit together on the master. However, on the slave, as soon as T1 _starts_ to commit, T2 is allowed to start. But T2 cannot see the changes done by T1 until T1 _completes_ its commit. Thus it is critical that gap locks are disabled _only_ among transactions within the same group commit, not between different group commits.

In fact there is a bug in the patch you got regarding this, as it did not check the commit_id before relaxing gap locks. I currently have the following extra patch in my tree to fix this:

=== modified file 'sql/sql_class.cc'
--- sql/sql_class.cc	2014-06-10 08:13:15 +0000
+++ sql/sql_class.cc	2014-06-27 11:47:53 +0000
@@ -4277,6 +4277,8 @@ thd_need_ordering_with(const MYSQL_THD t
     return 1;
   if (rgi->current_gtid.domain_id != other_rgi->current_gtid.domain_id)
     return 1;
+  if (rgi->commit_id != other_rgi->commit_id)
+    return 1;
   /*
     These two threads are doing parallel replication within the same
     replication domain. Their commit order is already fixed, so we do not need

On the other hand, if we were to omit this part of the patch completely and just keep the gap locks always, then things would still work, because thd_report_wait_for() would detect a deadlock and do a retry to resolve it. But we would get extra rollbacks. Maybe such extra rollbacks would be rare, though.
+int thd_deadlock_victim_preference(const MYSQL_THD thd1, const MYSQL_THD thd2); +
Okay, I see why this one is needed.
On the other hand... It seems like it's the only function that you realy need. Because
thd_report_wait_for(const MYSQL_THD thd, MYSQL_THD other_thd);
can be replaced with
if (thd_deadlock_victim_preference(thd, other_thd) < 0) goto deadlock;
And
thd_need_wait_for(const MYSQL_THD thd);
can be removed (replaced with noop). And
thd_need_ordering_with(const MYSQL_THD thd, const MYSQL_THD other_thd);
can be replaced with
thd_deadlock_victim_preference(thd, other_thd) != 0
I don't understand the purpose of using the same function for these? They do completely different things, so using a single function will only cause confusion, it seems to me:

1. thd_report_wait_for() is used to allow the _replication code_ to detect a deadlock related to ordering.
2. thd_deadlock_victim_preference() is only used when the _storage engine_ detects a deadlock, replication related or not.
3. thd_need_ordering_with() is used by the storage engine to decide on lock wait semantics.
Although in that case thd_deadlock_victim_preference() had better be renamed, because it won't be a generic "victim preference", but specifically a victim preference that depends on the fixed transaction commit ordering. A better name could be, for example,
thd_get_commit_order(thd, other_thd)
  -1 - thd first
   0 - unspecified
  +1 - other_thd first
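The proposed convention could look roughly like this sketch. The order_info struct and the sub_id field are hypothetical stand-ins for however the replication code numbers transactions within a domain's fixed commit order; the real function would take two MYSQL_THD arguments.

```c
#include <assert.h>

/* Hypothetical per-THD ordering info for the sketch. */
typedef struct {
    int in_parallel_repl;   /* transaction runs in a parallel repl worker */
    unsigned long sub_id;   /* position in the fixed commit order */
} order_info;

static int thd_get_commit_order(const order_info *thd,
                                const order_info *other_thd)
{
    if (!thd->in_parallel_repl || !other_thd->in_parallel_repl)
        return 0;                       /*  0: order unspecified */
    if (thd->sub_id < other_thd->sub_id)
        return -1;                      /* -1: thd commits first */
    if (thd->sub_id > other_thd->sub_id)
        return 1;                       /* +1: other_thd commits first */
    return 0;
}
```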
How can the storage engine know the difference between T1 and T2 being in the same group commit, or in different group commits? How can the storage engine inquire about need for gap locks, without accidentally reporting a lock wait to the upper layer?
I think it'll be much simpler with only one function.
Why? I think it's quite the opposite...
And also it becomes clear what happens if the engine doesn't use it. This function is used to enquire about the commit order; if it is not used, the commit order cannot be guaranteed.
No, that is not the case. On the contrary, the storage engine is free not to use this; commit order will be guaranteed in any case by wait_for_commit().

Hm. So my idea was to make the new API simple and generic: the upper layer handles most of the detail about commit ordering or whatever, and the storage engine only needs to think about simple stuff like lock waits or deadlock victims, both of which it likely already needs to deal with anyway. But if I understand correctly, another idea could be to make the new API specific to this particular problem of enforced commit order in parallel replication. So the storage engine would be able to know of such ordering, and should then handle any necessary actions on this by itself?
And you won't need background killing for slave threads
The need for background killing comes because it does not work to call THD::awake() from the context where the InnoDB deadlock detector calls thd_report_wait_for(). The problem is that innobase_kill_query() does a recursive mutex lock on lock_sys->mutex.

We discussed on IRC how to fix this so that background kill is not necessary. One way would be to document that hton->kill_query must be safe to call from within thd_report_wait_for(). Maybe that is the simplest way. It should not be hard to fix the InnoDB code to make this work. For example, InnoDB could just set a flag in the transaction that it already has lock_sys->mutex held, thus avoiding the recursive mutex lock.

You also had a different suggestion:
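The "set a flag" idea can be sketched like this. The names (lock_sys_enter, innobase_kill_query_sketch, etc.) are illustrative, not InnoDB's actual code, and the flag would be per-thread (or per-transaction) in a real implementation; the point is only that the kill path checks the flag instead of blindly re-locking a non-recursive mutex.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Illustrative stand-ins for lock_sys->mutex and its ownership flag. */
static pthread_mutex_t lock_sys_mutex = PTHREAD_MUTEX_INITIALIZER;
static bool lock_sys_mutex_owned = false;   /* per-thread in real code */

static void lock_sys_enter(void)
{
    pthread_mutex_lock(&lock_sys_mutex);
    lock_sys_mutex_owned = true;
}

static void lock_sys_exit(void)
{
    lock_sys_mutex_owned = false;
    pthread_mutex_unlock(&lock_sys_mutex);
}

/* hton->kill_query path: only take the mutex if not already held. */
static int innobase_kill_query_sketch(void)
{
    bool need_lock = !lock_sys_mutex_owned;
    if (need_lock)
        lock_sys_enter();
    /* ... cancel the victim transaction's lock wait here ... */
    if (need_lock)
        lock_sys_exit();
    return 0;
}

/* thd_report_wait_for() can then kill synchronously from the deadlock
   detector, which already holds lock_sys->mutex: */
static int report_wait_and_kill(void)
{
    lock_sys_enter();                        /* detector context */
    int err = innobase_kill_query_sketch();  /* no recursive lock attempt */
    lock_sys_exit();
    return err;
}
```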
here's another idea. To have thd_get_commit_order() as I thought originally *and* allow innodb to kill other threads, that is, add a function for plugins to invoke THD::awake. This function should not call into the engine that invoked it. That is, supposedly the engine knows how to kill threads stuck in that engine, to kill other threads it should use this new helper.
In that case innodb can detect the deadlock and can kill the other thread and it won't break on lock mutex
The underlying logic of this killer helper would be "an engine knows how to kill a transaction waiting in this engine, so we wouldn't bother doing that. but to kill a transaction waiting elsewhere the engine needs help"
It seems better to me if all the details of enforcing commit order and killing worker threads in case of deadlocks could be kept inside the replication code. Otherwise every storage engine would potentially need to re-implement this. I think that is what I tried to do with my proposed storage engine API: cleanly separate the storage engine's responsibilities, which are about row-level locks and internal deadlock victims, from the replication code's responsibilities.

Wouldn't it be simpler to do it like this?

1. InnoDB deadlock detector reports lock waits to whatever function serves the thd_report_wait_for() purpose.
2. We document that thd_report_wait_for() may need to call hton->kill_query().
3. We fix InnoDB to handle (2) correctly (assuming this is possible to do in a simple way).
And anyway, this absolutely must be in a service, not in the global plugin.h.
Why? My understanding was that we do not support ABI stability for storage engines. (And I think we should absolutely not provide such support for storage engines. Storage engines are all about performance and flexibility; ABI stability would make this harder to provide, without sufficient other benefits to justify it. ABI stability seems mainly suited for e.g. authentication plugins.)
See? In this patch you've removed a function from the global plugin API
Right, the thd_rpl_is_parallel(). That should never have been allowed into the code in the first place; it was a temporary hack only. :-(
header; strictly speaking, you must increase the API version (the major number of it), making all older plugins incompatible with the new server. This is a bit extreme, taking into account that no plugin actually uses thd_rpl_is_parallel(). To avoid this, please only add new APIs as services. See libservices/HOWTO.
I can do that, if you want. It just seems a needless complication to me... Are there any performance penalties for using a service over a normal function call (for built-in storage engines)? I seem to recall there is not? - Kristian.