Re: [Maria-developers] comments to the plugin.h
Sergei Golubchik <vuvova@gmail.com> writes:
This is what I've written in the patch:
I've tried to answer all points in detail, also to help myself better understand all the issues:
+ The storage engine can report such conflicting locks using this call. This
+ will allow parallel replication to detect such conflicts and resolve the
+ deadlock (by killing the second transaction to release the locks that the
+ first is waiting for, and then later re-try the second killed transaction).
What will happen if the engine doesn't? Deadlocks will time out. Which is annoying, but it only affects the performance, not the correctness.
No, unfortunately that is not so. Let's say we have T1 and T2. T1 goes to wait on a row lock held by T2. Then T2 completes its prepare phase, it goes to the group commit queue and waits for T1 to complete and commit for both of them. T1 gets lock wait timeout, so T1 will roll back and try again. But T1 will just go to wait on the same row lock again that is held by T2. So T1 will retry over and over until slave_transaction_retries is exceeded, at which point replication will abort with an error. There is no lock wait timeout when T2 waits in the group commit queue.
But can it cause the first transaction to be killed, not the second? Then it's worse - it'll violate the strict commit ordering.
No, this can never happen. Commit order is enforced by the wait_for_commit facility. T2 will not be allowed to commit before T1. If T1 is killed and retried, T2 will continue waiting for T1 to commit the second time. And if T1 fails, it will call wakeup_subsequent_commits() with an error, so that T2 can abort its commit.
And, again, the engine needs to do it for every lock wait. Even while most of them won't cause any deadlocks.
Right. This is only needed for parallel replication worker threads. That is why I introduced thd_need_wait_for(). The storage engine can call this once for a transaction or even for a THD, and save the result. And then it won't need to call thd_report_wait_for() except for the parallel replication worker threads. And for those, lock waits should be rare.
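To illustrate (this is only a sketch, not code from the patch; trx_t, trx_start() and lock_wait() are made-up engine-side names):

    #include <mysql/plugin.h>

    typedef struct trx_struct {
      MYSQL_THD thd;
      int need_wait_report;     /* cached result of thd_need_wait_for() */
    } trx_t;

    static void trx_start(trx_t *trx, MYSQL_THD thd)
    {
      trx->thd= thd;
      /* Ask once per transaction; non-replication threads return 0 here. */
      trx->need_wait_report= thd_need_wait_for(thd);
    }

    static void lock_wait(trx_t *waiter, trx_t *holder)
    {
      /* Report the wait only for parallel replication worker threads. */
      if (waiter->need_wait_report)
        thd_report_wait_for(waiter->thd, holder->thd);
      /* ... then block on the row lock as usual ... */
    }

So the per-wait overhead for normal clients is a single cached flag test.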
+ The storage engine should not report false positives. That is, it should not
+ report any lock waits that do not actually require one transaction to wait
+ for the other. Nor should it report waits for locks that will be released
+ before the commit of the other transactions.
What will happen if the engine reports false positives? Lock waits that don't require one transaction to wait for another? What if it reports waits for locks that will be released before the commit of the other transaction?
If it reports that T1 will wait for T2, then we will kill T2 and retry it. This will cause unnecessary work if there is no actual deadlock, so mainly a performance issue.

A deeper problem is caused by the async kill processing. If T1 is reported as waiting for T2, we will kill T2 in the background thread. But if there is no such wait, or if the wait is resolved due to T2 releasing locks before commit, then T1 and T2 could both complete, and their worker threads move on to new transactions. Then the background thread will be trying to kill a T2 which no longer exists, which will access incorrect rpl_group_info objects. This is the part of the patch I am most unhappy about :-/

But we started to discuss on IRC to possibly avoid the async kill, instead making it possible to call THD::awake() directly inside thd_report_wait_for(). This would be good, as it would solve the deeper problem. In this case, a false positive would then cause only performance issues, I think, though I should think the problem through carefully to be sure.
+void thd_report_wait_for(const MYSQL_THD thd, MYSQL_THD other_thd);
You kill the second thread in the background. What if you convey the ordering to the engine and let it know about a deadlock? Like, if you return 1 from thd_report_wait_for() it means a deadlock, and the engine will kill the transaction just as in any normal deadlock case.
The problem is that T2 could be executing anywhere in the server, including server code or another storage engine, where InnoDB does not really have any possibility to kill it. This is different from a normal InnoDB deadlock, which only happens if both T1 and T2 are doing an actual row lock wait inside the storage engine itself. But in this case, only T1 is waiting; T2 continues whatever processing it is doing, and eventually it will end up in the group commit queue in the binlog or in wait_for_prior_commit(), and wait for T1 to commit first.
+int thd_need_wait_for(const MYSQL_THD thd);
I'd suggest to remove it. If you think it's useful, you can call it from your thd_report_wait_for() and spare storage engine the knowledge of yet another API function.
See discussion above. The intention is to avoid lots of calls to thd_report_wait_for() from the storage engine for non-replication threads. So calling it from inside thd_report_wait_for() makes no sense.

I don't understand the benefits of reducing the functionality available in the API when that functionality is completely optional? But on the other hand, I do not feel strongly about this function. I do not even use it in the patch, as InnoDB does a rather expensive graph traversal of all waiters anyway in its deadlock detector, so it can be just omitted.

I included this function because without it, there seems to be no way for a storage engine to avoid introducing O(N**2) overhead for N transactions contending for a row lock. That seemed unacceptable to me.
+int thd_need_ordering_with(const MYSQL_THD thd, const MYSQL_THD other_thd);
Hmm, really? Couldn't you simply disable gap locks in some other already existing way? Like decrease the transaction isolation level or tell the engine that it's RBR?
No, that is not safe. Suppose we have two transactions:

  T1: INSERT INTO t1 VALUES (1);
  T2: INSERT INTO t2 SELECT * FROM t1;

T1 and T2 are not run in parallel, because they cannot group commit together on the master. However, on the slave, as soon as T1 _starts_ to commit, T2 is allowed to start. But T2 cannot see the changes done by T1 until T1 _completes_ its commit. Thus it is critical that gap locks are disabled _only_ among transactions within the same group commit, not between different group commits.

In fact there is a bug in the patch you got regarding this, as it did not check the commit_id before relaxing gap locks. I currently have the following extra patch in my tree to fix this:

=== modified file 'sql/sql_class.cc'
--- sql/sql_class.cc	2014-06-10 08:13:15 +0000
+++ sql/sql_class.cc	2014-06-27 11:47:53 +0000
@@ -4277,6 +4277,8 @@ thd_need_ordering_with(const MYSQL_THD t
     return 1;
   if (rgi->current_gtid.domain_id != other_rgi->current_gtid.domain_id)
     return 1;
+  if (rgi->commit_id != other_rgi->commit_id)
+    return 1;
   /*
     These two threads are doing parallel replication within the same
     replication domain. Their commit order is already fixed, so we do not
     need

On the other hand, if we were to omit this part of the patch completely and just keep the gap locks always, then things would still work, because thd_report_wait_for() would detect a deadlock and do a retry to resolve it. But we would get extra rollbacks. Maybe such extra rollbacks would be rare, though.
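To illustrate the intended use: this is how I imagine an engine consulting thd_need_ordering_with() when deciding whether a gap lock conflict must cause a wait (again only a sketch; lock_gap_has_to_wait() is a made-up name, trx_t as in the sketch above):

    static int lock_gap_has_to_wait(trx_t *requestor, trx_t *holder)
    {
      /*
        If the upper layer already enforces the commit order of the two
        transactions (parallel replication, same group commit), the gap
        lock is not needed for statement-based binlogging correctness,
        so the requestor does not have to wait.
      */
      if (!thd_need_ordering_with(requestor->thd, holder->thd))
        return 0;                /* relax the gap lock, no wait */
      return 1;                  /* normal locking rules apply */
    }

With the extra commit_id check above, this relaxes gap locks only within one group commit.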
+int thd_deadlock_victim_preference(const MYSQL_THD thd1, const MYSQL_THD thd2);
Okay, I see why this one is needed.
On the other hand... It seems like it's the only function that you really need. Because
thd_report_wait_for(const MYSQL_THD thd, MYSQL_THD other_thd);
can be replaced with
if (thd_deadlock_victim_preference(thd, other_thd) < 0) goto deadlock;
And
thd_need_wait_for(const MYSQL_THD thd);
can be removed (replaced with noop). And
thd_need_ordering_with(const MYSQL_THD thd, const MYSQL_THD other_thd);
can be replaced with
thd_deadlock_victim_preference(thd, other_thd) != 0
I don't understand the purpose of using the same function for these? They do completely different things, using a single function will only cause confusion, it seems to me:

 1. thd_report_wait_for() is used to allow the _replication code_ to detect a deadlock related to ordering.

 2. thd_deadlock_victim_preference() is only used when the _storage engine_ detects a deadlock, replication related or not.

 3. thd_need_ordering_with() is used by the storage engine to decide on lock wait semantics.
Although in that case thd_deadlock_victim_preference() should better be renamed, because it won't be a generic "victim preference", but specifically a victim preference depending on the fixed transaction commit ordering. A better name could be, for example,
  thd_get_commit_order(thd, other_thd)
    -1 - thd first
     0 - unspecified
    +1 - other_thd first
How can the storage engine know the difference between T1 and T2 being in the same group commit, or in different group commits? How can the storage engine inquire about need for gap locks, without accidentally reporting a lock wait to the upper layer?
I think it'll be much simpler with only one function.
Why? I think it's quite the opposite...
And also it becomes clear what happens if the engine doesn't use it. This function is used to enquire about the commit order; if it is not used, the commit order cannot be guaranteed.
No, that is not the case. On the contrary - the storage engine is free not to use this, commit order will be guaranteed in any case by wait_for_commit().

Hm. So my idea was to make the new API simple and generic, the upper layer would handle most of the detail about commit ordering or whatever. The storage engine only needs to think about simple stuff like lock waits or deadlock victims, both of which it likely already needs to deal with anyway.

But if I understand correctly, another idea could be to make the new API specific just to this particular problem of enforced commit order in parallel replication. So the storage engine would be able to know of such ordering, and should then handle any necessary actions on this by itself?
And you won't need background killing for slave threads
The need for background killing comes because it does not work to call THD::awake() from the context where the InnoDB deadlock detector calls thd_report_wait_for(). The problem is that innobase_kill_query() does a recursive mutex lock on lock_sys->mutex.

We discussed on IRC how to fix this so that background kill is not necessary. One way would be to document that hton->kill_query must be safe to call from within thd_report_wait_for(). Maybe that is the simplest way. It should not be hard to fix the InnoDB code to make this work. For example, InnoDB could just set a flag in the transaction that it already has lock_sys->mutex held, thus avoiding the recursive mutex lock.

You also had a different suggestion:
here's another idea. To have thd_get_commit_order() as I thought originally *and* allow innodb to kill other threads, that is, add a function for plugins to invoke THD::awake. This function should not call into the engine that invoked it. That is, supposedly the engine knows how to kill threads stuck in that engine, to kill other threads it should use this new helper.
In that case innodb can detect the deadlock and can kill the other thread and it won't break on lock mutex
The underlying logic of this killer helper would be "an engine knows how to kill a transaction waiting in this engine, so we wouldn't bother doing that. but to kill a transaction waiting elsewhere the engine needs help"
It seems better to me if all the details of enforcing commit order and killing worker threads in case of deadlocks could be kept inside the replication code? Otherwise every storage engine would potentially need to re-implement this? I think that is what I tried to do with my proposed storage engine API, to cleanly separate the storage engine's responsibilities, which are about row-level locks and internal deadlock victims, from the replication code's responsibilities.

Wouldn't it be simpler to do it like this?

 1. InnoDB deadlock detector reports lock waits to whatever function serves the thd_report_wait_for() purpose.

 2. We document that thd_report_wait_for() may need to call hton->kill_query().

 3. We fix InnoDB to handle (2) correctly (assuming this is possible to do in a simple way).
And anyway, this absolutely must be in a service, not in the global plugin.h.
Why? My understanding was that we do not support ABI stability for storage engines. (And I think we should absolutely not provide such support for storage engines. Storage engines are all about performance and flexibility - ABI stability would make this harder to provide, without sufficient other benefits to justify it. ABI stability seems mainly suited for e.g. authentication plugins.)
See? In this patch you've removed a function from the global plugin API
Right, the thd_rpl_is_parallel(). That should never have been allowed into the code in the first place, it was a temporary hack only. :-(
header, strictly speaking, you must increase the API version - the major number of it - making all older plugins incompatible with the new server. This is a bit extreme, taking into account that no plugin actually uses thd_rpl_is_parallel(). To avoid this, please only add new APIs as services. See libservices/HOWTO.
I can do that, if you want. It just seems a needless complication to me... Are there any performance penalties for using a service over a normal function call (for built-in storage engines)? I seem to recall there are not?

 - Kristian.
Hi, Kristian! On Jul 01, Kristian Nielsen wrote:
+ The storage engine can report such conflicting locks using this call. This
+ will allow parallel replication to detect such conflicts and resolve the
+ deadlock (by killing the second transaction to release the locks that the
+ first is waiting for, and then later re-try the second killed transaction).
What will happen if the engine doesn't? Deadlocks will time out. Which is annoying, but it only affects the performance, not the correctness.
No, unfortunately that is not so.
Let's say we have T1 and T2. T1 goes to wait on a row lock held by T2. Then T2 completes its prepare phase, it goes to the group commit queue and waits for T1 to complete and commit for both of them. T1 gets lock wait timeout, so T1 will roll back and try again. But T1 will just go to wait on the same row lock again that is held by T2. So T1 will retry over and over until slave_transaction_retries is exceeded, at which point replication will abort with an error.
There is no lock wait timeout when T2 waits in the group commit queue.
Right. I thought that if a transaction is rolled back, you can, technically force *all* later transactions to roll back too and redo them all. But that'd be kind of expensive.

On the other hand, requiring *every* engine to report waits... Not good either. And you cannot really enforce that, there's no way to know whether an engine does it (compare with handler methods - a required method can be simply declared abstract and the engine cannot possibly miss it).

I'm kind of lost here. I really don't like requiring engines to do more obligatory things, but I don't see a good way around it either.

If we could know whether an engine uses this new API, we could - like I wrote above - default to rolling back all transactions in a group after the failed one. And then the new API would've been merely an optimization. *That* would've been a big improvement. Can we deduce this somehow, without requiring the engine to declare a flag, like, HA_CAN_SUPPORT_TRX_COMMIT_ORDER_IN_PARALLEL_REPLICATION ?
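To illustrate what I mean (just a sketch; the flag bit and the helper are made up, and as said I'd prefer to deduce it without a declared flag):

    /* Hypothetical capability bit, declared by the engine: */
    #define HTON_REPORTS_TRX_WAITS (1 << 14)   /* made-up value */

    /* In the parallel replication code, after a transaction fails: */
    if (!(hton->flags & HTON_REPORTS_TRX_WAITS))
    {
      /*
        Engine does not report lock waits, so we cannot detect ordering
        deadlocks. Fall back: roll back all later transactions in the
        group and redo them, instead of retrying just the failed one.
      */
      rollback_and_retry_rest_of_group(rgi);   /* made-up helper */
    }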
+void thd_report_wait_for(const MYSQL_THD thd, MYSQL_THD other_thd);
You kill the second thread in the background. What if you convey the ordering to the engine and let it know about a deadlock? Like, if you return 1 from thd_report_wait_for() it means a deadlock, and the engine will kill the transaction just as in any normal deadlock case.
The problem is that T2 could be executing anywhere in the server, including server code or another storage engine, where InnoDB does not really have any possibility to kill it.
This is different from a normal InnoDB deadlock, which only happens if both T1 and T2 are doing an actual row lock wait inside the storage engine itself. But in this case, only T1 is waiting; T2 continues whatever processing it is doing, and eventually it will end up in the group commit queue in the binlog or in wait_for_prior_commit(), and wait for T1 to commit first.
I don't know whether you've seen what I suggested on irc, so I'll repeat. Provide a thd_kill(thd) helper that an engine can call to kill a specific thd. It will *not* call ha_kill_query for this particular engine, but will call it for other engines. The logic is: the engine knows how to kill a thread that's stuck somewhere in this very engine, but it needs help to kill a thread that's somewhere else. Also, it'll help to avoid the mutex issues that you were seeing.
+int thd_need_wait_for(const MYSQL_THD thd);
I'd suggest to remove it. If you think it's useful, you can call it from your thd_report_wait_for() and spare storage engine the knowledge of yet another API function.
See discussion above. The intention is to avoid lots of calls to thd_report_wait_for() from the storage engine for non-replication threads. So calling it from inside thd_report_wait_for() makes no sense.
If it's at the very top of thd_report_wait_for() then it'll be one function call that immediately returns.
I don't understand the benefits of reducing the functionality available in the API when that functionality is completely optional?
The benefit is keeping the API as simple as possible, while still fully functional. It's very important to have a simple API, even if it's doing something complex underneath. See pluggable authentication, for example: the big comment inside the native_password_authenticate() function.
I included this function because without it, there seems to be no way for a storage engine to avoid introducing O(N**2) overhead for N transactions contending for a row lock. That seemed unacceptable to me.
This is O(N**2) function calls that immediately return; it's negligible compared to the total cost of the graph traversal. And anyway, it's only called before waits - the engine is going to wait anyway, one function call won't make this wait noticeably longer.
+int thd_deadlock_victim_preference(const MYSQL_THD thd1, const MYSQL_THD thd2);
Okay, I see why this one is needed.
On the other hand... It seems like it's the only function that you really need. Because

  thd_report_wait_for(const MYSQL_THD thd, MYSQL_THD other_thd);

can be replaced with

  if (thd_deadlock_victim_preference(thd, other_thd) < 0) goto deadlock;

And

  thd_need_wait_for(const MYSQL_THD thd);

can be removed (replaced with noop). And

  thd_need_ordering_with(const MYSQL_THD thd, const MYSQL_THD other_thd);

can be replaced with

  thd_deadlock_victim_preference(thd, other_thd) != 0
I don't understand the purpose of using the same function for these? They do completely different things, using a single function will only cause confusion, it seems to me:
1. thd_report_wait_for() is used to allow the _replication code_ to detect a deadlock related to ordering.
Yes, but my suggestion was to remove it, remember? To let the engine kill the other thread.
2. thd_deadlock_victim_preference() is only used when the _storage engine_ detects a deadlock, replication related or not.
Right. But my suggestion was to make it replication related. I didn't know you'd put modified_non_trans_table in there too. I agree that modified_non_trans_table affects deadlock victim preferences, but it does not affect commit order. So, either one needs two functions (thd_deadlock_victim_preference and thd_need_ordering_with) or modified_non_trans_table should be moved out of it, back to where it was. I'd prefer two functions; it was good that you put modified_non_trans_table inside thd_deadlock_victim_preference.
3. thd_need_ordering_with() is used by the storage engine to decide on lock wait semantics.
Which is, basically, the same as 2, the concept is the same. One function is used when a deadlock is detected, the other earlier, but it doesn't make them sufficiently different.
Although in that case thd_deadlock_victim_preference() should better be renamed, because it won't be a generic "victim preference", but specifically a victim preference depending on the fixed transaction commit ordering. A better name could be, for example,
  thd_get_commit_order(thd, other_thd)
    -1 - thd first
     0 - unspecified
    +1 - other_thd first
How can the storage engine know the difference between T1 and T2 being in the same group commit, or in different group commits?
Why does the engine need to know that?
How can the storage engine inquire about need for gap locks, without accidentally reporting a lock wait to the upper layer?
thd_get_commit_order() does not report lock waits to the upper layer, that was the point, see above - no reporting, the engine does the killing.
I think it'll be much simpler with only one function. Why? I think it's quite the opposite...
Why? Because there's only one function with very simple and clear semantics - the server informs the engine what transaction must be committed first. The engine uses this information to choose deadlock victims, to resolve waits, etc.

This is just one high-level function, that doesn't use concepts of gap locks, statement-based replication, locks that are released before commit, locks that don't cause waits, and whatever else one must keep in mind when using lower-level API. I think it's much simpler to use.

It is more complex if you think of everything it might mean and do in different situations. Compare a modern smartphone and an old punch-card computer. An old computer is simpler internally, a smartphone is easier to use.
Hm. So my idea was to make the new API simple and generic, the upper layer would handle most of the detail about commit ordering or whatever. The storage engine only needs to think about simple stuff like lock waits or deadlock victims, both of which it likely already needs to deal with anyway.
But if I understand correctly, another idea could be to make the new API specific just to this particular problem of enforced commit order in parallel replication. So the storage engine would be able to know of such ordering, and should then handle any necessary actions on this by itself?
Eh, not quite. The idea was to make the new API simple and generic *and* high-level, where the engine would be able to know about the enforced commit order and act accordingly, instead of getting lots of low-level orders about what it should do in every specific internal situation.
And you won't need background killing for slave threads
The need for background killing comes because it does not work to call THD::awake() from the context where the InnoDB deadlock detector calls thd_report_wait_for(). The problem is that innobase_kill_query() does a recursive mutex lock on lock_sys->mutex.
We discussed on IRC how to fix this so that background kill is not necessary.
One way would be to document that hton->kill_query must be safe to call from within thd_report_wait_for(). Maybe that is the simplest way. It should not be hard to fix the InnoDB code to make this work. For example, InnoDB could just set a flag in the transaction that it already has lock_sys->mutex held, thus avoiding the recursive mutex lock.
Yes, but it'd make a pretty difficult-to-document API, I'd rather avoid it, if possible.
You also had a different suggestion:
here's another idea. To have thd_get_commit_order() as I thought originally *and* allow innodb to kill other threads, that is, add a function for plugins to invoke THD::awake. This function should not call into the engine that invoked it. That is, supposedly the engine knows how to kill threads stuck in that engine, to kill other threads it should use this new helper.
In that case innodb can detect the deadlock and can kill the other thread and it won't break on lock mutex
The underlying logic of this killer helper would be "an engine knows how to kill a transaction waiting in this engine, so we wouldn't bother doing that. but to kill a transaction waiting elsewhere the engine needs help"
It seems better to me if all the details of enforcing commit order and killing worker threads in case of deadlocks could be kept inside the replication code? Otherwise every storage engine would potentially need to re-implement this? I think that is what I tried to do with my proposed storage engine API, to cleanly separate the storage engine's responsibilities, which are about row-level locks and internal deadlock victims, from the replication code's responsibilities.
But the engine already has all that code. The wait-for graph shows what a transaction needs to wait for, and the suggested function simply helps the engine to resolve waits, by specifying what a transaction can *not* wait for. Then it goes to the normal deadlock resolution in the engine.
Wouldn't it be simpler to do it like this?
1. InnoDB deadlock detector reports lock waits to whatever function serves the thd_report_wait_for() purpose.
2. We document that thd_report_wait_for() may need to call hton->kill_query().
3. We fix InnoDB to handle (2) correctly (assuming this is possible to do in a simple way).
Then you'd need to document the exact semantics of thd_report_wait_for(), with all its "should not report false positives, such as ... and ...". And document that thd_report_wait_for() might call ha_kill_query, so one would always need to keep this mutex semantics in mind, because errors won't be easy to find - it's not a use case that happens often. Fixing InnoDB, on the other hand, is pretty easy. If we didn't need to think of the API part, but only do an internal fix, I'd use this approach.
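For what it's worth, the internal fix might be as small as a thread-local flag (a sketch only, with made-up helper names; the real InnoDB fix may look different):

    /* Set while the deadlock detector holds lock_sys->mutex. */
    static __thread int lock_sys_held_by_me;

    static void report_wait(trx_t *waiter, trx_t *holder)
    {
      lock_sys_held_by_me= 1;   /* we already own lock_sys->mutex here */
      thd_report_wait_for(waiter->mysql_thd, holder->mysql_thd);
      lock_sys_held_by_me= 0;
    }

    /* In innobase_kill_query(): */
    static void kill_query_sketch(trx_t *victim)
    {
      int already_held= lock_sys_held_by_me;
      if (!already_held)
        mutex_enter(&lock_sys->mutex);   /* the normal KILL path */
      cancel_lock_wait(victim);          /* made-up helper */
      if (!already_held)
        mutex_exit(&lock_sys->mutex);
    }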
And anyway, this absolutely must be in a service, not in the global plugin.h.
Why? My understanding was that we do not support ABI stability for storage engines.
Because anything you put in the global plugin.h is not for storage engines only. It's for every and any plugin out there - up for grabs. The server has no idea whether any plugin uses that or not. If we would have a separate header for storage engine API functions only, that would've not been an issue.
See? In this patch you've removed a function from the global plugin API
Right, the thd_rpl_is_parallel(). That should never have been allowed into the code in the first place, it was a temporary hack only. :-(
Sure. I mean that, formally, removing a function from the API creates a new API incompatible with the old one. Which, again formally, should invalidate *all* plugins using the old API. If the API is global - it invalidates all possible plugins.
header, strictly speaking, you must increase the API version - the major number of it - making all older plugins incompatible with the new server. This is a bit extreme, taking into account that no plugin actually uses thd_rpl_is_parallel(). To avoid this, please only add new APIs as services. See libservices/HOWTO.
I can do that, if you want. It just seems a needless complication to me...
The point is to move these functions from a global API (that affects all plugins) to a small API that affects only the plugins that explicitly use it. Services really make it much easier to change APIs in the future, because these changes don't affect other plugins.
Are there any performance penalties for using a service over a normal function call (for built-in storage engines)? I seem to recall there are not?
None whatsoever. For built-in storage engines this is #ifdef-ed and they get normal function calls.

Regards,
Sergei
Sergei Golubchik <serg@mariadb.org> writes:
I'm kind of lost here. I really don't like requiring engines to do more obligatory things, but I don't see a good way around it either.
Agree, that's how I felt while trying to think of _any_ way to solve these problems. I ended up with the current solution, but am not entirely happy about it. At least, we have some more ideas now to improve things.

What should be the next step? I suggest that I could change the API to use the new thd_get_commit_order() instead (see below for a detailed suggestion). And try to implement the method where InnoDB gets the information about commit order and does the kill of the other transaction itself. And I'll read up on services and use them for the new API. And then I could prepare a new patch for you to look at.

Does this sound reasonable? Or do you have another suggestion? More detailed discussions below:
I thought that if a transaction is rolled back, you can, technically force *all* later transactions to roll back too and redo them all. But that'd be kind of expensive.
It's not the expense that worries me. The problem is that some of the following transactions may not be possible to roll back. Suppose we have T1, T2, T3. T1 and T2 are InnoDB DML, T3 is DDL or MyISAM. If now we get a deadlock between T1 and T2, and resolving it means rolling back T2 and everything after it (including T3), we can seriously corrupt the replication, because we will be left with T3 half-way executed. This is worse because there might not really have been any deadlock - it could be just a too-low deadlock timeout.

This is the reason I did the fix as I did. In my patch, we will kill and retry only T2, so T3 will be unaffected.
If we could know whether an engine uses this new API, we could - like I wrote above - default to rolling back all transactions in a group after the failed one. And then the new API would've been merely an optimization. *That* would've been a big improvement.
Suppose with the current patch, the engine does not support the new API. T1 will get a lock timeout. It will retry, but get the lock timeout again, and after slave_transaction_retries it will give up, and replication will stop.

So it seems we have two alternative default behaviours (I mean if the storage engine does not implement the new API):

 1. As in the current patch, replication will stop and the DBA will be made aware of the issue and will have to handle it manually.

 2. As you suggest, we will kill T3, the slave will continue automatically (with some lock timeout delays), but we might have silently corrupted the slave state.

(1) seems safer, while (2) will be a lot more convenient for the DBA in many cases. Which do you think we should prefer?

I wonder if we could somehow check if T3 is safe to roll back? If it is, (2) is both convenient and safe, and could be used. Many users use only InnoDB normally and use DDL rarely. But I'm not sure how hard it would be to do this check in the general case. Maybe it would be worth looking into, as you said it would be much preferable if the new storage engine API was just an optional optimisation.
2. thd_deadlock_victim_preference() is only used when the _storage engine_ detects a deadlock, replication related or not.
3. thd_need_ordering_with() is used by the storage engine to decide on lock wait semantics.
Which is, basically, the same as 2, the concept is the same. One function is used when a deadlock is detected, the other earlier, but it doesn't make them sufficiently different.
I do not think it is the same, let me try to explain once more and maybe it will become more clear one way or the other.

With just (1), thd_report_wait_for() or whatever we replace it with, the basic problem is solved. In case of a deadlock between T1 and T2, we will kill T2, let T1 continue, and later re-try T2. This works even without thd_need_ordering_with().

But in some cases the deadlock, kill, and retry is unnecessary. This is about the issue in MDEV-5914, for example:

  T1: UPDATE t1 SET secondary=NULL WHERE primary=1
  T2: DELETE t1 WHERE secondary <= 3

If T1 and T2 were group-committed together on the master, we know that it is safe to run them in parallel, so we can relax the gap lock taken by T2 that can block T1. Then we avoid having to rollback and retry T2. But if T2 committed in a different group commit on the master, we do not know that this relaxing of gap locks is safe. Because T2 may still run in parallel with the COMMIT of T1, we need to keep the gap locks, so that T2 will wait if it needs to access any row modified by T1.

So thd_need_ordering_with() really _is_ just an optimisation. We might omit it completely from the patch, or we could keep it as a maybe useful optimisation. One way to use the same function might be (a usage sketch follows below):

  thd_get_commit_order(thd, other_thd)
    -2 - thd commits before, and it is not in same group commit as other_thd
    -1 - thd commits before, but it is in same group commit as other_thd
     0 - thd and other_thd can commit in either order
     1 - other_thd commits before, but it is in same group commit as thd
     2 - other_thd commits before, and it is not in same group commit as thd

Is this something like what you want?
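To make the intended use concrete, an engine could consume such a five-valued function like this (a rough sketch with made-up helper names, not patch code):

    /* Gap locks can be relaxed only within the same group commit. */
    static int need_gap_lock(MYSQL_THD thd, MYSQL_THD other_thd)
    {
      int c= thd_get_commit_order(thd, other_thd);
      return !(c == -1 || c == 1);
    }

    /* The transaction that must commit later is the preferred victim. */
    static int victim_preference(MYSQL_THD thd, MYSQL_THD other_thd)
    {
      int c= thd_get_commit_order(thd, other_thd);
      if (c < 0)
        return 1;      /* thd commits first: prefer other_thd as victim */
      if (c > 0)
        return -1;     /* other_thd commits first: prefer thd as victim */
      return 0;        /* no enforced order, no preference */
    }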
I think it'll be much simpler with only one function. Why? I think it's quite the opposite...
Why? Because there's only one function with very simple and clear semantics - the server informs the engine what transaction must be committed first. The engine uses this information to choose deadlock victims, to resolve waits, etc. This is just one high-level function, that doesn't use concepts of gap locks, statement-based replication, locks that are released before commit, locks that don't cause waits, and whatever else one must keep in mind when using lower-level API.
I think it's much simpler to use. It is more complex if you think of everything it might mean and do in different situations. Compare a modern smartphone and an old punch-card computer. An old computer is simpler internally, a smartphone is easier to use.
So I'm fine with whatever way we agree with in the end.

The reason I think my original API is simpler is that there is just one very simple thing that the engine must do: call thd_report_wait_for(T1,T2) whenever T1 is about to wait until T2 commits. Everything else is optional optimisations that can be safely ignored. It seems that any conceivable engine must already have easy access to knowledge about one transaction needing to wait for another. And indeed, in InnoDB only a couple of lines need to be added to handle this. So it is very simple to use.

With your suggestion, however, things are more complex for the engine. It needs to implement additional deadlock detection, based on some enforced commit order. The very concept of "enforced commit order" is probably something new. And it needs to implement some way to resolve deadlocks with another transaction T2 that may or may not be running inside the engine. So again, a new concept is introduced: "Is transaction T running inside this engine?"

So I still do not understand why you think the second is simpler, or easier to use? But I can try to see if I can implement the second idea. The main complication I see is how to be sure if T2 is running inside the engine or not. Maybe like this (a code sketch follows after the list):

 1. Call thd_get_commit_order(T1, T2), we learn that we have a deadlock, so we need to kill T2.

 2. Set a flag in T2 that it has become a deadlock victim.

 3. Call the new thd_plugin_awake(), it will kill T2 if it is outside of the engine.

 4. It could be that T2 was executing outside the engine at point (2), but had time to enter the engine while we were traversing the list of handlertons in (3). But then we can add code at appropriate places in the engine to check for the flag set in (2), and make sure T2 gets killed also in this case. (Maybe the necessary logic is already present in innobase_kill_query(), I will check.)

Have I understood your suggestions correctly now, do you think?
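In code, I imagine steps (1)-(4) roughly like this (a sketch only; thd_plugin_awake() is the helper discussed in this thread, and the trx_t flag and other names are made up):

    static void check_order_deadlock(trx_t *waiter, trx_t *blocker)
    {
      /* (1) A deadlock iff waiter is fixed to commit before blocker. */
      if (thd_get_commit_order(waiter->mysql_thd, blocker->mysql_thd) >= 0)
        return;

      /* (2) Mark the victim inside the engine first... */
      blocker->order_deadlock_victim= 1;

      /* (3) ...then wake it up if it is waiting outside this engine. */
      thd_plugin_awake(blocker->mysql_thd);

      /*
        (4) If blocker entered the engine between (2) and (3), the
        engine's lock wait loops must re-check order_deadlock_victim
        and abort the wait, as for normal kills.
      */
    }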
So, either one needs two functions (thd_deadlock_victim_preference and thd_need_ordering_with) or modified_non_trans_table should be moved out of it, back to where it was. I'd prefer two functions; it was good that you put modified_non_trans_table inside thd_deadlock_victim_preference.
Ok, right, modified_non_trans_table is unrelated but it felt natural to move it out of the storage engine and into a more general function.
This is O(N**2) function calls that immediately return; it's negligible compared to the total cost of the graph traversal. And anyway, it's only called before waits - the engine is going to wait anyway, one function call won't make this wait noticeably longer.
You are assuming that the engine will need to do this O(N**2) graph traversal anyway. That's true in InnoDB, which has a very primitive algorithm for deadlock detection. But it might not be true in another engine. The idea with thd_need_wait_for() was that it would be used by a storage engine that would not normally need to loop over all other transactions. So it could avoid needlessly implementing such a loop for _all_ transactions, where it would only be needed for replication transactions (which will usually have few waits, if any).

But let's omit it for now; we do not know of any engine that might need it, and if someday we find one, we can easily add something at that time.
One way would be to document that hton->kill_query must be safe to call from within thd_report_wait_for(). Maybe that is the simplest way. It should not be hard to fix the InnoDB code to make this work. For example, InnoDB could just set a flag in the transaction that it already has lock_sys->mutex held, thus avoiding the recursive mutex lock.
Yes, but it'd make a pretty difficult-to-document API, I'd rather avoid it, if possible.
Agree, it's not nice, I'd also like to avoid it.

Thanks,

 - Kristian.
Hi, Kristian! On Jul 03, Kristian Nielsen wrote:
Sergei Golubchik <serg@mariadb.org> writes:
I'm kind of lost here. I really don't like requiring engines to do more obligatory things, but I don't see a good way around it either.
Agree, that's how I felt while trying to think of _any_ way to solve these problems. I ended up with the current solution, but am not entirely happy about it.
At least, we have some more ideas now to improve things.
What should be the next step?
As you suggested on irc, it would make sense to make a smaller innodb/xtradb-only fix in 10.0 and a more engine-friendly one, with the new api, in 10.1.
I thought that if a transaction is rolled back, you can, technically force *all* later transactions to roll back too and redo them all. But that'd be kind of expensive.
It's not the expense that worries me. The problem is that some of the following transactions may not be possible to roll back.
Ah, yes, indeed. We could still

 1. rollback regardless and possibly break replication in this case. saying that a transactional engine will work without modifications in most cases, but not when it's mixed with non-trans updates

 2. as discussed, have a flag to mark non-trans-updates transactions and don't run them in parallel at all. then a transactional engine will work without modifications.

but that's for 10.1, if we do innodb-only fix in 10.0, it means we aren't concerned with other engines there.
But in some cases the deadlock, kill, and retry is unnecessary. This is about the issue in MDEV-5914, for example:
  T1: UPDATE t1 SET secondary=NULL WHERE primary=1
  T2: DELETE t1 WHERE secondary <= 3
If T1 and T2 were group-committed together on the master, we know that it is safe to run them in parallel, so we can relax the gap lock taken by T2 that can block T1. Then we avoid having to rollback and retry T2.
But if T2 committed in a different group commit on the master, we do not know that this relaxing of gap locks is safe. Because T2 may still run in parallel with the COMMIT of T1, we need to keep the gap locks, so that T2 will wait if it needs to access any row modified by T1.
How can T2 run in parallel with T1 if they're from different groups?
The reason I think my original API is simpler is that there is just one very simple thing that the engine must do: call thd_report_wait_for(T1,T2) whenever T1 is about to wait until T2 commits. Everything else is optional optimisations that can be safely ignored. It seems that any conceivable engine must already have easy access to knowledge about one transaction needing to wait for another. And indeed, in InnoDB only a couple of lines need to be added to handle this. So it is very simple to use.

With your suggestion, however, things are more complex for the engine. It needs to implement additional deadlock detection, based on some enforced commit order. The very concept of "enforced commit order" is probably something new. And it needs to implement some way to resolve deadlocks with another transaction T2 that may or may not be running inside the engine. So again, a new concept is introduced: "Is transaction T running inside this engine?"
Hmm, okay... When you put it this way, it does sound simpler. Alright, let's keep thd_report_wait_for() :)

Regards,
Sergei
Sergei Golubchik <serg@mariadb.org> writes:
As you suggested on irc, it would make sense to make a smaller innodb/xtradb-only fix in 10.0 and a more engine-friendly one, with the new api, in 10.1.
Hmm, okay... When you put it this way, it does sound simpler. Alright, let's keep thd_report_wait_for() :)
Right. So then, the plan seems to be:

 1. I remove the new calls from include/plugin.h, instead place them somewhere not part of a public API (maybe just "extern" declarations inside InnoDB/XtraDB, and whatever is needed to make it work correctly for ha_innodb.dll on Windows).

 2. I try to remove the kill-in-background, instead do it directly in the thread doing thd_report_wait_for() (I think that should be possible).

 3. I apply the other review comments that you sent in another mail.

 4. I file a Jira task for 10.1 about a general solution, with a good API and other ideas collected so far.

Other than this, the patch will be much the same as what I had initially. Is this ok with you? Or did I miss something?
It's not the expense that worries me. The problem is that some of the following transactions may not be possible to roll back.
Ah, yes, indeed. We could still
1. rollback regardless and possibly break replication in this case. saying that a transactional engine will work without modifications in most cases, but not when it's mixed with non-trans updates
2. as discussed, have a flag to mark non-trans-updates transactions and don't run them in parallel at all. then a transactional engine will work without modifications.
but that's for 10.1, if we do innodb-only fix in 10.0, it means we aren't concerned with other engines there.
Yes, agree. Seems like a reasonable solution. I share your concerns about the current solution, and some of these ideas seem able to solve most of the issues better, but they are better suited for a next major release.
How can T2 run in parallel with T1 if they're from different groups?
T2 can run in parallel with the commit step of T1, but not with any events of T1 prior to commit.

In more detail: Suppose we have 4 transactions in two group commits: (T1, T2) followed by (T3, T4). We will schedule T1, T2, T3, and T4 in parallel, each on their own thread (assuming @@slave_parallel_threads >= 4). T1 and T2 are in the same group commit, so they are allowed to start immediately. However, T3 and T4 are in a different group commit, so they are not ready to start - they might conflict with T1 or T2. So they wait.

Suppose T2 reaches its COMMIT (or XID) event first. It calls mark_start_commit(), however at this point it does not do anything. T2 has commit order after T1, so it goes to wait for T1 in wait_for_prior_commit(). Now T1 reaches its COMMIT/XID event, and calls mark_start_commit(). Now both T1 and T2 have completed all their modifications, and are ready to commit. This means that we can now start running T3 and T4. T3 or T4 might have conflicting rows with T1 or T2, but T1 and T2 have already done all their modifications, so it's ok. If there is a conflict, T3 and T4 will just wait. If not, T3 and T4 can run in parallel with the commit steps of T1 and T2.

Suppose both T3 and T4 have time to reach their COMMIT/XID event before T1 has time to complete commit. Then T1 can find T2, T3, and T4 all queued up for group commit. And T1 can do a single group commit for all four of them, sharing the fsync overhead among 4 transactions. This way, we get more opportunity for parallelism.

This optimisation (starting T3/T4 at _start_ of T1/T2 commit, rather than after) is particularly effective when commit is expensive, eg. with --sync-binlog=1 and --innodb-flush-log-at-trx-commit=1. It allows effective use of group commit. It also allows improved parallelism on slaves deeper down in the hierarchy, using --binlog-commit-wait-count. Without this, the group commit parallelism on a slave would always be less than (or equal to) that on the master.

Thanks,

 - Kristian.
Hi, Kristian! On Jul 04, Kristian Nielsen wrote:
So then, the plan seems to be:
1. I remove the new calls from include/plugin.h, instead place them somewhere not part of a public API (maybe just "extern" declarations inside InnoDB/XtraDB, and whatever is needed to make it work correctly for ha_innodb.dll on Windows).
2. I try to remove the kill-in-background, instead do it directly in the thread doing thd_report_wait_for() (I think that should be possible).
3. I apply the other review comments that you sent in another mail.
4. I file a Jira task for 10.1 about a general solution, with a good API and other ideas collected so far.
Other than this, the patch will be much the same as what I had initially. Is this ok with you? Or did I miss something?
Yes, ok, thanks.
How can T2 run in parallel with T1 if they're from different groups? ... This way, we get more opportunity for parallelism. This optimisation (starting T3/T4 at _start_ of T1/T2 commit, rather than after) is particularly effective when commit is expensive, eg. with --sync-binlog=1 and --innodb-flush-log-at-trx-commit=1. It allows effective use of group commit. It also allows improved parallelism on slaves deeper down in the hierarchy, using --binlog-commit-wait-count. Without this, the group commit parallelism on a slave would always be less than (or equal to) that on the master.
Oh, okay. I didn't know we did that. I thought that - as in the early idea - slave parallelism always follows the master's parallelism, and thus can never exceed it.

Regards,
Sergei
Kristian Nielsen <knielsen@knielsen-hq.org> writes:
So then, the plan seems to be:
1. I remove the new calls from include/plugin.h, instead place them somewhere not part of a public API (maybe just "extern" declarations inside InnoDB/XtraDB, and whatever is needed to make it work correctly for ha_innodb.dll on Windows).
2. I try to remove the kill-in-background, instead do it directly in the thread doing thd_report_wait_for() (I think that should be possible).
3. I apply the other review comments that you sent in another mail.
4. I file a Jira task for 10.1 about a general solution, with a good API and other ideas collected so far.
I have now done this. I have included the patch below, which is on top of the patch you reviewed earlier. You can also extract a complete diff as you wish from the tree lp:~maria-captains/maria/10.0-knielsen, where I have pushed the latest changes. - Kristian. revno: 4178 committer: Kristian Nielsen <knielsen@knielsen-hq.org> branch nick: tmp-10.0 timestamp: Tue 2014-07-08 12:54:47 +0200 message: MDEV-5262, MDEV-5914, MDEV-5941, MDEV-6020: Deadlocks during parallel replication causing replication to fail. After-review changes. For this patch in 10.0, we do not introduce a new public storage engine API, we just fix the InnoDB/XtraDB issues. In 10.1, we will make a better public API that can be used for all storage engines (MDEV-6429). Eliminate the background thread that did deadlock kills asynchroneously. Instead, we ensure that the InnoDB/XtraDB code can handle doing the kill from inside the deadlock detection code (when thd_report_wait_for() needs to kill a later thread to resolve a deadlock). (We preserve the part of the original patch that introduces dedicated mutex and condition for the slave init thread, to remove the abuse of LOCK_thread_count for start/stop synchronisation of the slave init thread). === modified file 'include/mysql/plugin.h' --- include/mysql/plugin.h 2014-06-10 08:13:15 +0000 +++ include/mysql/plugin.h 2014-07-08 10:54:47 +0000 @@ -622,6 +622,7 @@ void **thd_ha_data(const MYSQL_THD thd, void thd_storage_lock_wait(MYSQL_THD thd, long long value); int thd_tx_isolation(const MYSQL_THD thd); int thd_tx_is_read_only(const MYSQL_THD thd); +int thd_rpl_is_parallel(const MYSQL_THD thd); /** Create a temporary file. @@ -729,80 +730,6 @@ void thd_set_ha_data(MYSQL_THD thd, cons */ void thd_wakeup_subsequent_commits(MYSQL_THD thd, int wakeup_error); -/* - Used by a storage engine to report that one transaction THD is about to - go to wait for a transactional lock held by another transactions OTHER_THD. - - This is used for parallel replication, where transactions are required to - commit in the same order on the slave as they did on the master. If the - transactions on the slave can encounter lock conflicts on the slave that did - not exist on the master, this can cause deadlocks. - - The storage engine can report such conflicting locks using this call. This - will allow parallel replication to detect such conflicts and resolve the - deadlock (by killing the second transaction to release the locks that the - first is waiting for, and then later re-try the second killed transaction). - - The storage engine should not report false positives. That is, it should not - report any lock waits that do not actually require one transaction to wait - for the other. Nor should it report waits for locks that will be released - before the commit of the other transactions. -*/ -void thd_report_wait_for(const MYSQL_THD thd, MYSQL_THD other_thd); - -/* - This function can optionally be called to check if thd_report_wait_for() - needs to be called for waits done by a given transaction. - - If this function returns false for a given thd, there is no need to do any - calls to thd_report_wait_for() on that thd. - - This call is optional; it is safe to call thd_report_wait_for() in any case. - This call can be used to save some redundant calls to thd_report_wait_for() - if desired. (This is unlikely to matter much unless there are _lots_ of - waits to report, as the overhead of thd_report_wait_for() is small). 
-*/ -int thd_need_wait_for(const MYSQL_THD thd); - -/* - This function can be called by storage engines to check if the commit order - of two transactions has already been decided by the upper layer. This - happens in parallel replication, where the commit order is forced to be the - same on the slave as it was originally on the master. - - If this function returns false, it means that such commit order will be - enforced. This allows the storage engine to optionally omit gap lock waitss - or similar measures that would otherwise be needed to ensure that - transactions would be serialised in a way that would cause a commit order - that is correct for binlogging for statement-based replication. - - If this function returns true, normal locking should be done as required by - the binlogging and transaction isolation level in effect. -*/ -int thd_need_ordering_with(const MYSQL_THD thd, const MYSQL_THD other_thd); - -/* - If the storage engine detects a deadlock, and needs to choose a victim - transaction to roll back, it can call this function to ask the upper - server layer for which of two possible transactions is prefered to be - aborted and rolled back. - - In parallel replication, if two transactions are running in parallel and - one is fixed to commit before the other, then the one that commits later - will be prefered as the victim - chosing the early transaction as a victim - will not resolve the deadlock anyway, as the later transaction still needs - to wait for the earlier to commit. - - Otherwise, a transaction that uses only transactional tables, and can thus - be safely rolled back, will be prefered as a deadlock victim over a - transaction that also modified non-transactional (eg. MyISAM) tables. - - The return value is -1 if the first transaction is prefered as a deadlock - victim, 1 if the second transaction is prefered, or 0 for no preference (in - which case the storage engine can make the choice as it prefers). 
-*/ -int thd_deadlock_victim_preference(const MYSQL_THD thd1, const MYSQL_THD thd2); - #ifdef __cplusplus } #endif === modified file 'include/mysql/plugin_audit.h.pp' --- include/mysql/plugin_audit.h.pp 2014-06-10 08:13:15 +0000 +++ include/mysql/plugin_audit.h.pp 2014-07-08 10:54:47 +0000 @@ -303,6 +303,7 @@ void **thd_ha_data(const void* thd, cons void thd_storage_lock_wait(void* thd, long long value); int thd_tx_isolation(const void* thd); int thd_tx_is_read_only(const void* thd); +int thd_rpl_is_parallel(const void* thd); int mysql_tmpfile(const char *prefix); unsigned long thd_get_thread_id(const void* thd); void thd_get_xid(const void* thd, MYSQL_XID *xid); @@ -313,10 +314,6 @@ void *thd_get_ha_data(const void* thd, c void thd_set_ha_data(void* thd, const struct handlerton *hton, const void *ha_data); void thd_wakeup_subsequent_commits(void* thd, int wakeup_error); -void thd_report_wait_for(const void* thd, void *other_thd); -int thd_need_wait_for(const void* thd); -int thd_need_ordering_with(const void* thd, const void* other_thd); -int thd_deadlock_victim_preference(const void* thd1, const void* thd2); struct mysql_event_general { unsigned int event_subclass; === modified file 'include/mysql/plugin_auth.h.pp' --- include/mysql/plugin_auth.h.pp 2014-06-10 08:13:15 +0000 +++ include/mysql/plugin_auth.h.pp 2014-07-08 10:54:47 +0000 @@ -303,6 +303,7 @@ void **thd_ha_data(const void* thd, cons void thd_storage_lock_wait(void* thd, long long value); int thd_tx_isolation(const void* thd); int thd_tx_is_read_only(const void* thd); +int thd_rpl_is_parallel(const void* thd); int mysql_tmpfile(const char *prefix); unsigned long thd_get_thread_id(const void* thd); void thd_get_xid(const void* thd, MYSQL_XID *xid); @@ -313,10 +314,6 @@ void *thd_get_ha_data(const void* thd, c void thd_set_ha_data(void* thd, const struct handlerton *hton, const void *ha_data); void thd_wakeup_subsequent_commits(void* thd, int wakeup_error); -void thd_report_wait_for(const void* thd, void *other_thd); -int thd_need_wait_for(const void* thd); -int thd_need_ordering_with(const void* thd, const void* other_thd); -int thd_deadlock_victim_preference(const void* thd1, const void* thd2); #include <mysql/plugin_auth_common.h> typedef struct st_plugin_vio_info { === modified file 'include/mysql/plugin_ftparser.h.pp' --- include/mysql/plugin_ftparser.h.pp 2014-06-10 08:13:15 +0000 +++ include/mysql/plugin_ftparser.h.pp 2014-07-08 10:54:47 +0000 @@ -256,6 +256,7 @@ void **thd_ha_data(const void* thd, cons void thd_storage_lock_wait(void* thd, long long value); int thd_tx_isolation(const void* thd); int thd_tx_is_read_only(const void* thd); +int thd_rpl_is_parallel(const void* thd); int mysql_tmpfile(const char *prefix); unsigned long thd_get_thread_id(const void* thd); void thd_get_xid(const void* thd, MYSQL_XID *xid); @@ -266,10 +267,6 @@ void *thd_get_ha_data(const void* thd, c void thd_set_ha_data(void* thd, const struct handlerton *hton, const void *ha_data); void thd_wakeup_subsequent_commits(void* thd, int wakeup_error); -void thd_report_wait_for(const void* thd, void *other_thd); -int thd_need_wait_for(const void* thd); -int thd_need_ordering_with(const void* thd, const void* other_thd); -int thd_deadlock_victim_preference(const void* thd1, const void* thd2); enum enum_ftparser_mode { MYSQL_FTPARSER_SIMPLE_MODE= 0, === modified file 'mysql-test/suite/perfschema/r/threads_mysql.result' --- mysql-test/suite/perfschema/r/threads_mysql.result 2014-06-03 08:31:11 +0000 +++ 
mysql-test/suite/perfschema/r/threads_mysql.result 2014-07-08 10:54:47 +0000 @@ -44,16 +44,6 @@ processlist_info NULL unified_parent_thread_id unified parent_thread_id role NULL instrumented YES -name thread/sql/slave_background -type BACKGROUND -processlist_user NULL -processlist_host NULL -processlist_db NULL -processlist_command NULL -processlist_info NULL -unified_parent_thread_id unified parent_thread_id -role NULL -instrumented YES CREATE TEMPORARY TABLE t1 AS SELECT thread_id FROM performance_schema.threads WHERE name LIKE 'thread/sql%'; @@ -115,5 +105,4 @@ parent_thread_name child_thread_name thread/sql/event_scheduler thread/sql/event_worker thread/sql/main thread/sql/one_connection thread/sql/main thread/sql/signal_handler -thread/sql/main thread/sql/slave_background thread/sql/one_connection thread/sql/event_scheduler === modified file 'sql/log.cc' --- sql/log.cc 2014-06-10 08:13:15 +0000 +++ sql/log.cc 2014-07-08 10:54:47 +0000 @@ -4078,7 +4078,7 @@ int MYSQL_BIN_LOG::purge_first_log(Relay included= false; break; } - if (!included && 0 == strcmp(ir->name, rli->group_relay_log_name)) + if (!included && !strcmp(ir->name, rli->group_relay_log_name)) break; if (!next) { @@ -9332,7 +9332,7 @@ int TC_LOG_BINLOG::recover(LOG_INFO *lin file= -1; } - if (0 == strcmp(linfo->log_file_name, last_log_name)) + if (!strcmp(linfo->log_file_name, last_log_name)) break; // No more files to do if ((file= open_binlog(&log, linfo->log_file_name, &errmsg)) < 0) { === modified file 'sql/log_event.cc' --- sql/log_event.cc 2014-06-10 08:13:15 +0000 +++ sql/log_event.cc 2014-07-08 10:54:47 +0000 @@ -7324,6 +7324,13 @@ int Xid_log_event::do_apply_event(rpl_gr uint64 sub_id= 0; Relay_log_info const *rli= rgi->rli; + /* + XID_EVENT works like a COMMIT statement. And it also updates the + mysql.gtid_slave_pos table with the GTID of the current transaction. + + Therefore, it acts much like a normal SQL statement, so we need to do + mysql_reset_thd_for_next_command() as if starting a new statement. 
+  */
   mysql_reset_thd_for_next_command(thd);
   /*
     Record any GTID in the same transaction, so slave state is transactionally
=== modified file 'sql/mysqld.cc'
--- sql/mysqld.cc	2014-06-03 08:31:11 +0000
+++ sql/mysqld.cc	2014-07-08 10:54:47 +0000
@@ -361,7 +361,7 @@ static I_List<THD> thread_cache;
 static bool binlog_format_used= false;
 LEX_STRING opt_init_connect, opt_init_slave;
 static mysql_cond_t COND_thread_cache, COND_flush_thread_cache;
-mysql_cond_t COND_slave_background;
+mysql_cond_t COND_slave_init;
 static DYNAMIC_ARRAY all_options;

 /* Global variables */
@@ -700,7 +700,7 @@ mysql_mutex_t
   LOCK_crypt,
   LOCK_global_system_variables,
   LOCK_user_conn, LOCK_slave_list, LOCK_active_mi,
-  LOCK_connection_count, LOCK_error_messages, LOCK_slave_background;
+  LOCK_connection_count, LOCK_error_messages, LOCK_slave_init;
 mysql_mutex_t LOCK_stats, LOCK_global_user_client_stats,
               LOCK_global_table_stats, LOCK_global_index_stats;
@@ -875,7 +875,7 @@ PSI_mutex_key key_LOCK_stats,
   PSI_mutex_key key_LOCK_gtid_waiting;

 PSI_mutex_key key_LOCK_prepare_ordered, key_LOCK_commit_ordered,
-  key_LOCK_slave_background;
+  key_LOCK_slave_init;
 PSI_mutex_key key_TABLE_SHARE_LOCK_share;

 static PSI_mutex_info all_server_mutexes[]=
@@ -938,7 +938,7 @@ static PSI_mutex_info all_server_mutexes
   { &key_LOCK_error_messages, "LOCK_error_messages", PSI_FLAG_GLOBAL},
   { &key_LOCK_prepare_ordered, "LOCK_prepare_ordered", PSI_FLAG_GLOBAL},
   { &key_LOCK_commit_ordered, "LOCK_commit_ordered", PSI_FLAG_GLOBAL},
-  { &key_LOCK_slave_background, "LOCK_slave_background", PSI_FLAG_GLOBAL},
+  { &key_LOCK_slave_init, "LOCK_slave_init", PSI_FLAG_GLOBAL},
   { &key_LOG_INFO_lock, "LOG_INFO::lock", 0},
   { &key_LOCK_thread_count, "LOCK_thread_count", PSI_FLAG_GLOBAL},
   { &key_LOCK_thread_cache, "LOCK_thread_cache", PSI_FLAG_GLOBAL},
@@ -993,7 +993,7 @@ PSI_cond_key key_TC_LOG_MMAP_COND_queue_
 PSI_cond_key key_COND_rpl_thread_queue, key_COND_rpl_thread,
   key_COND_rpl_thread_pool,
   key_COND_parallel_entry, key_COND_group_commit_orderer,
-  key_COND_prepare_ordered, key_COND_slave_background;
+  key_COND_prepare_ordered, key_COND_slave_init;
 PSI_cond_key key_COND_wait_gtid, key_COND_gtid_ignore_duplicates;

 static PSI_cond_info all_server_conds[]=
@@ -1042,7 +1042,7 @@ static PSI_cond_info all_server_conds[]=
   { &key_COND_parallel_entry, "COND_parallel_entry", 0},
   { &key_COND_group_commit_orderer, "COND_group_commit_orderer", 0},
   { &key_COND_prepare_ordered, "COND_prepare_ordered", 0},
-  { &key_COND_slave_background, "COND_slave_background", 0},
+  { &key_COND_slave_init, "COND_slave_init", 0},
   { &key_COND_wait_gtid, "COND_wait_gtid", 0},
   { &key_COND_gtid_ignore_duplicates, "COND_gtid_ignore_duplicates", 0}
 };
@@ -1050,7 +1050,7 @@ static PSI_cond_info all_server_conds[]=
 PSI_thread_key key_thread_bootstrap, key_thread_delayed_insert,
   key_thread_handle_manager, key_thread_main,
   key_thread_one_connection, key_thread_signal_hand,
-  key_thread_slave_background, key_rpl_parallel_thread;
+  key_thread_slave_init, key_rpl_parallel_thread;

 static PSI_thread_info all_server_threads[]=
 {
@@ -1076,7 +1076,7 @@ static PSI_thread_info all_server_thread
   { &key_thread_main, "main", PSI_FLAG_GLOBAL},
   { &key_thread_one_connection, "one_connection", 0},
   { &key_thread_signal_hand, "signal_handler", PSI_FLAG_GLOBAL},
-  { &key_thread_slave_background, "slave_background", PSI_FLAG_GLOBAL},
+  { &key_thread_slave_init, "slave_init", PSI_FLAG_GLOBAL},
   { &key_rpl_parallel_thread, "rpl_parallel_thread", 0}
 };

@@ -2162,8 +2162,8 @@ static void clean_up_mutexes()
   mysql_mutex_destroy(&LOCK_prepare_ordered);
   mysql_cond_destroy(&COND_prepare_ordered);
   mysql_mutex_destroy(&LOCK_commit_ordered);
-  mysql_mutex_destroy(&LOCK_slave_background);
-  mysql_cond_destroy(&COND_slave_background);
+  mysql_mutex_destroy(&LOCK_slave_init);
+  mysql_cond_destroy(&COND_slave_init);
   DBUG_VOID_RETURN;
 }

@@ -4378,9 +4378,9 @@ static int init_thread_environment()
   mysql_cond_init(key_COND_prepare_ordered, &COND_prepare_ordered, NULL);
   mysql_mutex_init(key_LOCK_commit_ordered, &LOCK_commit_ordered,
                    MY_MUTEX_INIT_SLOW);
-  mysql_mutex_init(key_LOCK_slave_background, &LOCK_slave_background,
+  mysql_mutex_init(key_LOCK_slave_init, &LOCK_slave_init,
                    MY_MUTEX_INIT_SLOW);
-  mysql_cond_init(key_COND_slave_background, &COND_slave_background, NULL);
+  mysql_cond_init(key_COND_slave_init, &COND_slave_init, NULL);

 #ifdef HAVE_OPENSSL
   mysql_mutex_init(key_LOCK_des_key_file,
@@ -9458,8 +9458,6 @@ PSI_stage_info stage_waiting_for_room_in
 PSI_stage_info stage_master_gtid_wait_primary= { 0, "Waiting in MASTER_GTID_WAIT() (primary waiter)", 0};
 PSI_stage_info stage_master_gtid_wait= { 0, "Waiting in MASTER_GTID_WAIT()", 0};
 PSI_stage_info stage_gtid_wait_other_connection= { 0, "Waiting for other master connection to process GTID received on multiple master connections", 0};
-PSI_stage_info stage_slave_background_process_request= { 0, "Processing requests", 0};
-PSI_stage_info stage_slave_background_wait_request= { 0, "Waiting for requests", 0};

 #ifdef HAVE_PSI_INTERFACE

@@ -9579,9 +9577,7 @@ PSI_stage_info *all_server_stages[]=
   & stage_waiting_to_get_readlock,
   & stage_master_gtid_wait_primary,
   & stage_master_gtid_wait,
-  & stage_gtid_wait_other_connection,
-  & stage_slave_background_process_request,
-  & stage_slave_background_wait_request
+  & stage_gtid_wait_other_connection
 };

 PSI_socket_key key_socket_tcpip, key_socket_unix, key_socket_client_connection;
=== modified file 'sql/mysqld.h'
--- sql/mysqld.h	2014-06-03 08:31:11 +0000
+++ sql/mysqld.h	2014-07-08 10:54:47 +0000
@@ -309,8 +309,8 @@ extern PSI_cond_key key_COND_wait_gtid,
 extern PSI_thread_key key_thread_bootstrap, key_thread_delayed_insert,
   key_thread_handle_manager, key_thread_kill_server, key_thread_main,
-  key_thread_one_connection, key_thread_signal_hand,
-  key_thread_slave_background, key_rpl_parallel_thread;
+  key_thread_one_connection, key_thread_signal_hand, key_thread_slave_init,
+  key_rpl_parallel_thread;
 extern PSI_file_key key_file_binlog, key_file_binlog_index, key_file_casetest,
   key_file_dbopt, key_file_des_key_file, key_file_ERRMSG, key_select_to_file,
@@ -447,8 +447,6 @@ extern PSI_stage_info stage_waiting_for_
 extern PSI_stage_info stage_master_gtid_wait_primary;
 extern PSI_stage_info stage_master_gtid_wait;
 extern PSI_stage_info stage_gtid_wait_other_connection;
-extern PSI_stage_info stage_slave_background_process_request;
-extern PSI_stage_info stage_slave_background_wait_request;

 #ifdef HAVE_PSI_STATEMENT_INTERFACE
 /**
@@ -512,7 +510,7 @@ extern mysql_mutex_t
   LOCK_slave_list, LOCK_active_mi, LOCK_manager,
   LOCK_global_system_variables, LOCK_user_conn,
   LOCK_prepared_stmt_count, LOCK_error_messages, LOCK_connection_count,
-  LOCK_slave_background;
+  LOCK_slave_init;
 extern MYSQL_PLUGIN_IMPORT mysql_mutex_t LOCK_thread_count;
 #ifdef HAVE_OPENSSL
 extern mysql_mutex_t LOCK_des_key_file;
@@ -523,7 +521,7 @@ extern mysql_rwlock_t LOCK_grant, LOCK_s
 extern mysql_rwlock_t LOCK_system_variables_hash;
 extern mysql_cond_t COND_thread_count;
 extern mysql_cond_t COND_manager;
-extern mysql_cond_t COND_slave_background;
+extern mysql_cond_t COND_slave_init;
 extern int32 thread_running;
 extern int32 thread_count;
 extern my_atomic_rwlock_t thread_running_lock, thread_count_lock;
=== modified file 'sql/rpl_parallel.cc'
--- sql/rpl_parallel.cc	2014-06-10 08:13:15 +0000
+++ sql/rpl_parallel.cc	2014-07-08 10:54:47 +0000
@@ -240,7 +240,6 @@ convert_kill_to_deadlock_error(rpl_group
       rgi->killed_for_retry)
   {
     thd->clear_error();
-    thd->get_stmt_da()->reset_diagnostics_area();
     my_error(ER_LOCK_DEADLOCK, MYF(0));
     rgi->killed_for_retry= false;
     thd->reset_killed();
@@ -325,7 +324,7 @@ retry_event_group(rpl_group_info *rgi, r
   register_wait_for_prior_event_group_commit(rgi, entry);
   mysql_mutex_unlock(&entry->LOCK_parallel_entry);

-  strcpy(log_name, ir->name);
+  strmake_buf(log_name, ir->name);
   if ((fd= open_binlog(&rlog, log_name, &errmsg)) <0)
   {
     err= 1;
=== modified file 'sql/rpl_rli.cc'
--- sql/rpl_rli.cc	2014-06-10 08:13:15 +0000
+++ sql/rpl_rli.cc	2014-07-08 10:54:47 +0000
@@ -1302,7 +1302,7 @@ Relay_log_info::alloc_inuse_relaylog(con
     my_error(ER_OUTOFMEMORY, MYF(0), (int)sizeof(*ir));
     return 1;
   }
-  strcpy(ir->name, name);
+  strmake_buf(ir->name, name);

   if (!inuse_relaylog_list)
     inuse_relaylog_list= ir;
@@ -1504,6 +1504,7 @@ rpl_group_info::reinit(Relay_log_info *r
   trans_retries= 0;
   last_event_start_time= 0;
   gtid_sub_id= 0;
+  commit_id= 0;
   gtid_pending= false;
   worker_error= 0;
   row_stmt_start_timestamp= 0;
@@ -1548,6 +1549,7 @@ event_group_new_gtid(rpl_group_info *rgi
   rgi->current_gtid.server_id= gev->server_id;
   rgi->current_gtid.domain_id= gev->domain_id;
   rgi->current_gtid.seq_no= gev->seq_no;
+  rgi->commit_id= gev->commit_id;
   rgi->gtid_pending= true;
   return 0;
 }
=== modified file 'sql/rpl_rli.h'
--- sql/rpl_rli.h	2014-06-10 08:13:15 +0000
+++ sql/rpl_rli.h	2014-07-08 10:54:47 +0000
@@ -164,6 +164,7 @@ class Relay_log_info : public Slave_repo
   */
   inuse_relaylog *inuse_relaylog_list;
   inuse_relaylog *last_inuse_relaylog;
+  /* Lock used to protect inuse_relaylog::dequeued_count */
   my_atomic_rwlock_t inuse_relaylog_atomic_lock;

   /*
@@ -526,6 +527,7 @@ struct rpl_group_info
   */
   uint64 gtid_sub_id;
   rpl_gtid current_gtid;
+  uint64 commit_id;

   /*
     This is used to keep transaction commit order.
     We will signal this when we commit, and can register it to wait for the
=== modified file 'sql/slave.cc'
--- sql/slave.cc	2014-06-10 08:13:15 +0000
+++ sql/slave.cc	2014-07-08 10:54:47 +0000
@@ -287,22 +287,13 @@ static void init_slave_psi_keys(void)
 #endif /* HAVE_PSI_INTERFACE */


-static bool slave_background_thread_running;
-static bool slave_background_thread_stop;
-static bool slave_background_thread_gtid_loaded;
-
-struct slave_background_kill_t {
-  slave_background_kill_t *next;
-  THD *to_kill;
-} *slave_background_kill_list;
+static bool slave_init_thread_running;


 pthread_handler_t
-handle_slave_background(void *arg __attribute__((unused)))
+handle_slave_init(void *arg __attribute__((unused)))
 {
   THD *thd;
-  PSI_stage_info old_stage;
-  bool stop;

   my_thread_init();
   thd= new THD;
@@ -310,7 +301,7 @@ handle_slave_background(void *arg __attr
   mysql_mutex_lock(&LOCK_thread_count);
   thd->thread_id= thread_id++;
   mysql_mutex_unlock(&LOCK_thread_count);
-  thd->system_thread = SYSTEM_THREAD_SLAVE_BACKGROUND;
+  thd->system_thread = SYSTEM_THREAD_SLAVE_INIT;
   thd->store_globals();
   thd->security_ctx->skip_grants();
   thd->set_command(COM_DAEMON);
@@ -323,126 +314,49 @@ handle_slave_background(void *arg __attr
                       thd->get_stmt_da()->sql_errno(),
                       thd->get_stmt_da()->message());

-  mysql_mutex_lock(&LOCK_slave_background);
-  slave_background_thread_gtid_loaded= true;
-  mysql_cond_broadcast(&COND_slave_background);
-
-  THD_STAGE_INFO(thd, stage_slave_background_process_request);
-  do
-  {
-    slave_background_kill_t *kill_list;
-
-    thd->ENTER_COND(&COND_slave_background, &LOCK_slave_background,
-                    &stage_slave_background_wait_request,
-                    &old_stage);
-    for (;;)
-    {
-      stop= abort_loop || thd->killed || slave_background_thread_stop;
-      kill_list= slave_background_kill_list;
-      if (stop || kill_list)
-        break;
-      mysql_cond_wait(&COND_slave_background, &LOCK_slave_background);
-    }
-
-    slave_background_kill_list= NULL;
-    thd->EXIT_COND(&old_stage);
-
-    while (kill_list)
-    {
-      slave_background_kill_t *p = kill_list;
-      kill_list= p->next;
-
-      mysql_mutex_lock(&p->to_kill->LOCK_thd_data);
-      p->to_kill->awake(KILL_CONNECTION);
-      mysql_mutex_unlock(&p->to_kill->LOCK_thd_data);
-      my_free(p);
-    }
-    mysql_mutex_lock(&LOCK_slave_background);
-  } while (!stop);
-
-  slave_background_thread_running= false;
-  mysql_cond_broadcast(&COND_slave_background);
-  mysql_mutex_unlock(&LOCK_slave_background);
-
   mysql_mutex_lock(&LOCK_thread_count);
   delete thd;
   mysql_mutex_unlock(&LOCK_thread_count);
   my_thread_end();
-  return 0;
-}
-
+  mysql_mutex_lock(&LOCK_slave_init);
+  slave_init_thread_running= false;
+  mysql_cond_broadcast(&COND_slave_init);
+  mysql_mutex_unlock(&LOCK_slave_init);

-void
-slave_background_kill_request(THD *to_kill)
-{
-  slave_background_kill_t *p=
-    (slave_background_kill_t *)my_malloc(sizeof(*p), MYF(MY_WME));
-  if (p)
-  {
-    p->to_kill= to_kill;
-    to_kill->rgi_slave->killed_for_retry= true;
-    mysql_mutex_lock(&LOCK_slave_background);
-    p->next= slave_background_kill_list;
-    slave_background_kill_list= p;
-    mysql_mutex_unlock(&LOCK_slave_background);
-    mysql_cond_signal(&COND_slave_background);
-  }
+  return 0;
 }


 /*
-  Start the slave background thread.
-
-  This thread is currently used for two purposes:
+  Start the slave init thread.

-  1. To load the GTID state from mysql.gtid_slave_pos at server start; reading
-  from table requires valid THD, which is otherwise not available during
-  server init.
-
-  2. To kill worker thread transactions during parallel replication, when a
-  storage engine attempts to take an errorneous conflicting lock that would
-  cause a deadlock. Killing is done asynchroneously, as the kill may not
-  be safe within the context of a callback from inside storage engine
-  locking code.
+  This thread is used to load the GTID state from mysql.gtid_slave_pos at
+  server start; reading from table requires valid THD, which is otherwise not
+  available during server init.
 */
 static int
-start_slave_background_thread()
+run_slave_init_thread()
 {
   pthread_t th;

-  slave_background_thread_running= true;
-  slave_background_thread_stop= false;
-  slave_background_thread_gtid_loaded= false;
-  if (mysql_thread_create(key_thread_slave_background,
-                          &th, &connection_attrib, handle_slave_background,
-                          NULL))
+  slave_init_thread_running= true;
+  if (mysql_thread_create(key_thread_slave_init, &th, &connection_attrib,
+                          handle_slave_init, NULL))
   {
     sql_print_error("Failed to create thread while initialising slave");
     return 1;
   }

-  mysql_mutex_lock(&LOCK_slave_background);
-  while (!slave_background_thread_gtid_loaded)
-    mysql_cond_wait(&COND_slave_background, &LOCK_slave_background);
-  mysql_mutex_unlock(&LOCK_slave_background);
+  mysql_mutex_lock(&LOCK_slave_init);
+  while (slave_init_thread_running)
+    mysql_cond_wait(&COND_slave_init, &LOCK_slave_init);
+  mysql_mutex_unlock(&LOCK_slave_init);

   return 0;
 }


-static void
-stop_slave_background_thread()
-{
-  mysql_mutex_lock(&LOCK_slave_background);
-  slave_background_thread_stop= true;
-  mysql_cond_broadcast(&COND_slave_background);
-  while (slave_background_thread_running)
-    mysql_cond_wait(&COND_slave_background, &LOCK_slave_background);
-  mysql_mutex_unlock(&LOCK_slave_background);
-}
-
-
 /*
   Initialize slave structures
 */

 int init_slave()
@@ -454,7 +368,7 @@ int init_slave()
   init_slave_psi_keys();
 #endif

-  if (start_slave_background_thread())
+  if (run_slave_init_thread())
     return 1;

   /*
@@ -1084,9 +998,6 @@ void end_slave()
   master_info_index= 0;
   active_mi= 0;
   mysql_mutex_unlock(&LOCK_active_mi);
-
-  stop_slave_background_thread();
-
   global_rpl_thread_pool.destroy();
   free_all_rpl_filters();
   DBUG_VOID_RETURN;
=== modified file 'sql/slave.h'
--- sql/slave.h	2014-06-10 08:13:15 +0000
+++ sql/slave.h	2014-07-08 10:54:47 +0000
@@ -238,7 +238,6 @@ pthread_handler_t handle_slave_io(void *
 void slave_output_error_info(Relay_log_info *rli, THD *thd);
 pthread_handler_t handle_slave_sql(void *arg);
 bool net_request_file(NET* net, const char* fname);
-void slave_background_kill_request(THD *to_kill);

 extern bool volatile abort_loop;
 extern Master_info main_mi, *active_mi;      /* active_mi for multi-master */
=== modified file 'sql/sql_class.cc'
--- sql/sql_class.cc	2014-07-04 05:44:55 +0000
+++ sql/sql_class.cc	2014-07-08 10:54:47 +0000
@@ -4211,6 +4211,24 @@ extern "C" int thd_slave_thread(const MY
   return(thd->slave_thread);
 }

+/* Returns true for a worker thread in parallel replication. */
+extern "C" int thd_rpl_is_parallel(const MYSQL_THD thd)
+{
+  return thd->rgi_slave && thd->rgi_slave->is_parallel_exec;
+}
+
+/*
+  This function can optionally be called to check if thd_report_wait_for()
+  needs to be called for waits done by a given transaction.
+
+  If this function returns false for a given thd, there is no need to do any
+  calls to thd_report_wait_for() on that thd.
+
+  This call is optional; it is safe to call thd_report_wait_for() in any case.
+  This call can be used to save some redundant calls to thd_report_wait_for()
+  if desired.
+  (This is unlikely to matter much unless there are _lots_ of waits to
+  report, as the overhead of thd_report_wait_for() is small.)
+*/
 extern "C" int
 thd_need_wait_for(const MYSQL_THD thd)
 {
@@ -4224,6 +4242,31 @@ thd_need_wait_for(const MYSQL_THD thd)
   return rgi->is_parallel_exec;
 }

+/*
+  Used by InnoDB/XtraDB to report that one transaction THD is about to go to
+  wait for a transactional lock held by another transaction OTHER_THD.
+
+  This is used for parallel replication, where transactions are required to
+  commit in the same order on the slave as they did on the master. If the
+  transactions encounter lock conflicts on the slave that did not exist on
+  the master, this can cause deadlocks.
+
+  Normally, such conflicts will not occur, because the same conflict would
+  have prevented the two transactions from committing in parallel on the
+  master, thus preventing them from running in parallel on the slave in the
+  first place. However, it is possible when the optimizer chooses a
+  different plan on the slave than on the master (e.g. table scan instead
+  of index scan).
+
+  InnoDB/XtraDB reports lock waits using this call. If a lock wait causes a
+  deadlock with the pre-determined commit order, we kill the later
+  transaction, and later re-try it, to resolve the deadlock.
+
+  This call need only receive reports about waits for locks that will remain
+  until the holding transaction commits. InnoDB/XtraDB auto-increment locks
+  are released earlier, and so need not be reported. (Such false positives
+  are not harmful, but could lead to unnecessary kill and retry, so best
+  avoided.)
+*/
 extern "C" void
 thd_report_wait_for(const MYSQL_THD thd, MYSQL_THD other_thd)
 {
@@ -4254,12 +4297,51 @@ thd_report_wait_for(const MYSQL_THD thd,
     cause replication to rollback (and later re-try) the other transaction,
     releasing the lock for this transaction so replication can proceed.
   */
-
-#ifdef HAVE_REPLICATION
-  slave_background_kill_request(other_thd);
-#endif
+  other_rgi->killed_for_retry= true;
+  mysql_mutex_lock(&other_thd->LOCK_thd_data);
+  other_thd->awake(KILL_CONNECTION);
+  mysql_mutex_unlock(&other_thd->LOCK_thd_data);
 }

+/*
+  This function is called from InnoDB/XtraDB to check if the commit order of
+  two transactions has already been decided by the upper layer. This happens
+  in parallel replication, where the commit order is forced to be the same on
+  the slave as it was originally on the master.
+
+  If this function returns false, it means that such commit order will be
+  enforced. This allows the storage engine to optionally omit gap lock waits
+  or similar measures that would otherwise be needed to ensure that
+  transactions would be serialised in a way that would give a commit order
+  correct for binlogging in statement-based replication.
+
+  Since transactions are only run in parallel on the slave if they ran without
+  lock conflicts on the master, normally no lock conflicts on the slave happen
+  during parallel replication. However, there are a couple of corner cases
+  where it can happen, like these secondary-index operations:
+
+    T1: INSERT INTO t1 VALUES (7, NULL);
+    T2: DELETE FROM t1 WHERE b <= 3;
+
+    T1: UPDATE t1 SET secondary=NULL WHERE primary=1
+    T2: DELETE FROM t1 WHERE secondary <= 3
+
+  The DELETE takes a gap lock that can block the INSERT/UPDATE, but the row
+  locks set by INSERT/UPDATE do not block the DELETE. Thus, the execution
+  order of the transactions determines whether a lock conflict occurs or not.
Thus a lock conflict can occur on the slave where it did not on the + master. + + If this function returns true, normal locking should be done as required by + the binlogging and transaction isolation level in effect. But if it returns + false, the correct order will be enforced anyway, and InnoDB/XtraDB can + avoid taking the gap lock, preventing the lock conflict. + + Calling this function is just an optimisation to avoid unnecessary + deadlocks. If it was not used, a gap lock would be set that could eventually + cause a deadlock; the deadlock would be caught by thd_report_wait_for() and + the transaction T2 killed and rolled back (and later re-tried). +*/ extern "C" int thd_need_ordering_with(const MYSQL_THD thd, const MYSQL_THD other_thd) { @@ -4277,7 +4359,7 @@ thd_need_ordering_with(const MYSQL_THD t return 1; if (rgi->current_gtid.domain_id != other_rgi->current_gtid.domain_id) return 1; - if (rgi->commit_id != other_rgi->commit_id) + if (!rgi->commit_id || rgi->commit_id != other_rgi->commit_id) return 1; /* These two threads are doing parallel replication within the same @@ -4289,6 +4371,26 @@ thd_need_ordering_with(const MYSQL_THD t } +/* + If the storage engine detects a deadlock, and needs to choose a victim + transaction to roll back, it can call this function to ask the upper + server layer for which of two possible transactions is prefered to be + aborted and rolled back. + + In parallel replication, if two transactions are running in parallel and + one is fixed to commit before the other, then the one that commits later + will be prefered as the victim - chosing the early transaction as a victim + will not resolve the deadlock anyway, as the later transaction still needs + to wait for the earlier to commit. + + Otherwise, a transaction that uses only transactional tables, and can thus + be safely rolled back, will be prefered as a deadlock victim over a + transaction that also modified non-transactional (eg. MyISAM) tables. + + The return value is -1 if the first transaction is prefered as a deadlock + victim, 1 if the second transaction is prefered, or 0 for no preference (in + which case the storage engine can make the choice as it prefers). +*/ extern "C" int thd_deadlock_victim_preference(const MYSQL_THD thd1, const MYSQL_THD thd2) { === modified file 'sql/sql_class.h' --- sql/sql_class.h 2014-06-10 08:13:15 +0000 +++ sql/sql_class.h 2014-07-08 10:54:47 +0000 @@ -1358,7 +1358,7 @@ enum enum_thread_type SYSTEM_THREAD_EVENT_SCHEDULER= 16, SYSTEM_THREAD_EVENT_WORKER= 32, SYSTEM_THREAD_BINLOG_BACKGROUND= 64, - SYSTEM_THREAD_SLAVE_BACKGROUND= 128, + SYSTEM_THREAD_SLAVE_INIT= 128, }; inline char const * === modified file 'storage/heap/hp_write.c' --- storage/heap/hp_write.c 2014-06-03 08:31:11 +0000 +++ storage/heap/hp_write.c 2014-07-08 10:54:47 +0000 @@ -156,7 +156,7 @@ static uchar *next_free_record_pos(HP_SH ("record file full. records: %lu max_records: %lu " "data_length: %llu index_length: %llu " "max_table_size: %llu", - (unsigned long)info->records, info->max_records, + info->records, info->max_records, info->data_length, info->index_length, info->max_table_size)); my_errno=HA_ERR_RECORD_FILE_FULL; === modified file 'storage/innobase/handler/ha_innodb.cc' --- storage/innobase/handler/ha_innodb.cc 2014-06-10 08:13:15 +0000 +++ storage/innobase/handler/ha_innodb.cc 2014-07-08 10:54:47 +0000 @@ -4169,13 +4169,18 @@ innobase_kill_query( if (trx) { + THD *cur = current_thd; + THD *owner = trx->current_lock_mutex_owner; + /* Cancel a pending lock request. 
                /* Cancel a pending lock request. */
-               lock_mutex_enter();
+               if (owner != cur)
+                       lock_mutex_enter();
                trx_mutex_enter(trx);
                if (trx->lock.wait_lock)
                        lock_cancel_waiting_and_release(trx->lock.wait_lock);
                trx_mutex_exit(trx);
-               lock_mutex_exit();
+               if (owner != cur)
+                       lock_mutex_exit();
        }

        DBUG_VOID_RETURN;
=== modified file 'storage/innobase/include/trx0trx.h'
--- storage/innobase/include/trx0trx.h	2014-03-26 21:25:38 +0000
+++ storage/innobase/include/trx0trx.h	2014-07-08 10:54:47 +0000
@@ -993,6 +993,11 @@ struct trx_t{
                                        count of tables being flushed. */

        /*------------------------------*/
+       THD*            current_lock_mutex_owner;
+                                       /*!< If this is equal to current_thd,
+                                       then in innobase_kill_query() we know
+                                       we already hold the lock_sys->mutex. */
+       /*------------------------------*/
 #ifdef UNIV_DEBUG
        ulint           start_line;     /*!< Track where it was started from */
        const char*     start_file;     /*!< Filename where it was started */
=== modified file 'storage/innobase/lock/lock0lock.cc'
--- storage/innobase/lock/lock0lock.cc	2014-06-10 08:13:15 +0000
+++ storage/innobase/lock/lock0lock.cc	2014-07-08 10:54:47 +0000
@@ -374,6 +374,11 @@ struct lock_stack_t {
        ulint           heap_no;        /*!< heap number if rec lock */
 };

+extern "C" void thd_report_wait_for(const MYSQL_THD thd, MYSQL_THD other_thd);
+extern "C" int thd_need_wait_for(const MYSQL_THD thd);
+extern "C"
+int thd_need_ordering_with(const MYSQL_THD thd, const MYSQL_THD other_thd);
+
 /** Stack to use during DFS search. Currently only a single stack is required
 because there is no parallel deadlock check. This stack is protected by
 the lock_sys_t::mutex. */
@@ -392,6 +397,14 @@ UNIV_INTERN mysql_pfs_key_t lock_sys_wai
 #ifdef UNIV_DEBUG
 UNIV_INTERN ibool      lock_print_waits        = FALSE;

+/* Buffer to collect THDs to report waits for. */
+struct thd_wait_reports {
+       struct thd_wait_reports *next;  /* Next chunk, if this one overflows */
+       ulint used;                     /* Number of elements in waitees[] */
+       trx_t *waitees[64];             /* Trxs we are waiting for */
+};
+
+
 /*********************************************************************//**
 Validates the lock system.
 @return        TRUE if ok */
@@ -1037,8 +1050,12 @@ lock_rec_has_to_wait(
                the correct order so that statement-based replication
                will give the correct results. Since the right order
                was already determined on the master, we do not need
-               to enforce it again here (and doing so could lead to
-               occasional deadlocks). */
+               to enforce it again here.
+
+               Skipping the locks is not essential for correctness,
+               since in case of deadlock we will just kill the later
+               transaction and retry it. But it can save some
+               unnecessary rollbacks and retries. */

                return (FALSE);
        }
@@ -3814,7 +3831,8 @@ static
 trx_id_t
 lock_deadlock_search(
 /*=================*/
-       lock_deadlock_ctx_t*    ctx)    /*!< in/out: deadlock context */
+       lock_deadlock_ctx_t*    ctx,    /*!< in/out: deadlock context */
+       struct thd_wait_reports*waitee_ptr) /*!< in/out: list of waitees */
 {
        const lock_t*   lock;
        ulint           heap_no;
@@ -3893,10 +3911,24 @@ lock_deadlock_search(
                        /* We do not need to report autoinc locks to the upper
                        layer. These locks are released before commit, so they
                        can not cause deadlocks with binlog-fixed commit
                        order. */
-                       if (lock_get_type_low(lock) != LOCK_TABLE ||
-                           lock_get_mode(lock) != LOCK_AUTO_INC)
-                               thd_report_wait_for(ctx->start->mysql_thd,
-                                                   lock->trx->mysql_thd);
+                       if (waitee_ptr &&
+                           (lock_get_type_low(lock) != LOCK_TABLE ||
+                            lock_get_mode(lock) != LOCK_AUTO_INC)) {
+                               if (waitee_ptr->used ==
+                                   sizeof(waitee_ptr->waitees)/
+                                   sizeof(waitee_ptr->waitees[0])) {
+                                       waitee_ptr->next =
+                                               (struct thd_wait_reports *)
+                                               mem_alloc(sizeof(*waitee_ptr));
+                                       waitee_ptr = waitee_ptr->next;
+                                       if (!waitee_ptr) {
+                                               ctx->too_deep = TRUE;
+                                               return(ctx->start->id);
+                                       }
+                                       waitee_ptr->next = NULL;
+                                       waitee_ptr->used = 0;
+                               }
+                               waitee_ptr->waitees[waitee_ptr->used++] = lock->trx;
+                       }
+
                        if (lock->trx->lock.que_state == TRX_QUE_LOCK_WAIT) {

                                /* Another trx ahead has requested a lock in an
@@ -3989,6 +4021,41 @@ lock_deadlock_trx_rollback(
        trx_mutex_exit(trx);
 }

+static void
+mysql_report_waiters(struct thd_wait_reports *waitee_buf_ptr,
+                    THD *mysql_thd,
+                    trx_id_t victim_trx_id)
+{
+       struct thd_wait_reports *p = waitee_buf_ptr;
+       while (p) {
+               struct thd_wait_reports *q;
+               ulint i = 0;
+
+               while (i < p->used) {
+                       trx_t *w_trx = p->waitees[i];
+                       /* There is no need to report waits to a trx already
+                       selected as a victim. */
+                       if (w_trx->id != victim_trx_id)
+                       {
+                               /* If thd_report_wait_for() decides to kill the
+                               transaction, then we will get a call back into
+                               innobase_kill_query. We mark this by setting
+                               current_lock_mutex_owner, so we can avoid trying
+                               to recursively take lock_sys->mutex. */
+                               w_trx->current_lock_mutex_owner = mysql_thd;
+                               thd_report_wait_for(mysql_thd, w_trx->mysql_thd);
+                               w_trx->current_lock_mutex_owner = NULL;
+                       }
+                       ++i;
+               }
+               q = p->next;
+               if (p != waitee_buf_ptr)
+                       mem_free(p);
+               p = q;
+       }
+}
+
+
 /********************************************************************//**
 Checks if a joining lock request results in a deadlock. If a deadlock is
 found this function will resolve the dadlock by choosing a victim transaction
@@ -4005,12 +4072,20 @@ lock_deadlock_check_and_resolve(
        const trx_t*    trx)    /*!< in: transaction */
 {
        trx_id_t        victim_trx_id;
+       struct thd_wait_reports waitee_buf, *waitee_buf_ptr;
+       THD*            start_mysql_thd;

        ut_ad(trx != NULL);
        ut_ad(lock != NULL);
        ut_ad(lock_mutex_own());
        assert_trx_in_list(trx);

+       start_mysql_thd = trx->mysql_thd;
+       if (start_mysql_thd && thd_need_wait_for(start_mysql_thd))
+               waitee_buf_ptr = &waitee_buf;
+       else
+               waitee_buf_ptr = NULL;
+
        /* Try and resolve as many deadlocks as possible. */
        do {
                lock_deadlock_ctx_t     ctx;
@@ -4023,7 +4098,17 @@ lock_deadlock_check_and_resolve(
                ctx.wait_lock = lock;
                ctx.mark_start = lock_mark_counter;

-               victim_trx_id = lock_deadlock_search(&ctx);
+               if (waitee_buf_ptr) {
+                       waitee_buf_ptr->next = NULL;
+                       waitee_buf_ptr->used = 0;
+               }
+
+               victim_trx_id = lock_deadlock_search(&ctx, waitee_buf_ptr);
+
+               /* Report waits to upper layer, as needed. */
+               if (waitee_buf_ptr)
+                       mysql_report_waiters(waitee_buf_ptr, start_mysql_thd,
+                                            victim_trx_id);

                /* Search too deep, we rollback the joining transaction. */
                if (ctx.too_deep) {
=== modified file 'storage/innobase/trx/trx0trx.cc'
--- storage/innobase/trx/trx0trx.cc	2014-06-10 08:13:15 +0000
+++ storage/innobase/trx/trx0trx.cc	2014-07-08 10:54:47 +0000
@@ -50,6 +50,9 @@ Created 3/26/1996 Heikki Tuuri

 #include<set>

+extern "C"
+int thd_deadlock_victim_preference(const MYSQL_THD thd1, const MYSQL_THD thd2);
+
 /** Set of table_id */
 typedef std::set<table_id_t>   table_id_set;
=== modified file 'storage/xtradb/handler/ha_innodb.cc'
--- storage/xtradb/handler/ha_innodb.cc	2014-06-10 08:13:15 +0000
+++ storage/xtradb/handler/ha_innodb.cc	2014-07-08 10:54:47 +0000
@@ -4650,12 +4650,15 @@ innobase_kill_connection(
        DBUG_ENTER("innobase_kill_connection");
        DBUG_ASSERT(hton == innodb_hton_ptr);

-       lock_mutex_enter();
-
        trx = thd_to_trx(thd);

        if (trx) {
+               THD *cur = current_thd;
+               THD *owner = trx->current_lock_mutex_owner;
+
+               if (owner != cur)
+                       lock_mutex_enter();
                trx_mutex_enter(trx);

                /* Cancel a pending lock request. */
@@ -4663,10 +4666,10 @@ innobase_kill_connection(
                        lock_cancel_waiting_and_release(trx->lock.wait_lock);

                trx_mutex_exit(trx);
+               if (owner != cur)
+                       lock_mutex_exit();
        }

-       lock_mutex_exit();
-
        DBUG_VOID_RETURN;
 }
=== modified file 'storage/xtradb/include/trx0trx.h'
--- storage/xtradb/include/trx0trx.h	2014-02-26 18:21:23 +0000
+++ storage/xtradb/include/trx0trx.h	2014-07-08 10:54:47 +0000
@@ -1020,6 +1020,11 @@ struct trx_t{
                                        count of tables being flushed. */

        /*------------------------------*/
+       THD*            current_lock_mutex_owner;
+                                       /*!< If this is equal to current_thd,
+                                       then in innobase_kill_query() we know
+                                       we already hold the lock_sys->mutex. */
+       /*------------------------------*/
 #ifdef UNIV_DEBUG
        ulint           start_line;     /*!< Track where it was started from */
        const char*     start_file;     /*!< Filename where it was started */
=== modified file 'storage/xtradb/lock/lock0lock.cc'
--- storage/xtradb/lock/lock0lock.cc	2014-06-10 08:13:15 +0000
+++ storage/xtradb/lock/lock0lock.cc	2014-07-08 10:54:47 +0000
@@ -374,6 +374,11 @@ struct lock_stack_t {
        ulint           heap_no;        /*!< heap number if rec lock */
 };

+extern "C" void thd_report_wait_for(const MYSQL_THD thd, MYSQL_THD other_thd);
+extern "C" int thd_need_wait_for(const MYSQL_THD thd);
+extern "C"
+int thd_need_ordering_with(const MYSQL_THD thd, const MYSQL_THD other_thd);
+
 /** Stack to use during DFS search. Currently only a single stack is required
 because there is no parallel deadlock check. This stack is protected by
 the lock_sys_t::mutex. */
@@ -392,6 +397,14 @@ UNIV_INTERN mysql_pfs_key_t lock_sys_wai
 #ifdef UNIV_DEBUG
 UNIV_INTERN ibool      lock_print_waits        = FALSE;

+/* Buffer to collect THDs to report waits for. */
+struct thd_wait_reports {
+       struct thd_wait_reports *next;  /* Next chunk, if this one overflows */
+       ulint used;                     /* Number of elements in waitees[] */
+       trx_t *waitees[64];             /* Trxs we are waiting for */
+};
+
+
 /*********************************************************************//**
 Validates the lock system.
 @return        TRUE if ok */
@@ -1038,8 +1051,12 @@ lock_rec_has_to_wait(
                the correct order so that statement-based replication
                will give the correct results. Since the right order
                was already determined on the master, we do not need
-               to enforce it again here (and doing so could lead to
-               occasional deadlocks). */
+               to enforce it again here.
+
+               Skipping the locks is not essential for correctness,
+               since in case of deadlock we will just kill the later
+               transaction and retry it. But it can save some
+               unnecessary rollbacks and retries. */

                return (FALSE);
        }
@@ -3827,7 +3844,8 @@ static
 trx_id_t
 lock_deadlock_search(
 /*=================*/
-       lock_deadlock_ctx_t*    ctx)    /*!< in/out: deadlock context */
+       lock_deadlock_ctx_t*    ctx,    /*!< in/out: deadlock context */
+       struct thd_wait_reports*waitee_ptr) /*!< in/out: list of waitees */
 {
        const lock_t*   lock;
        ulint           heap_no;
@@ -3906,10 +3924,24 @@ lock_deadlock_search(
                        /* We do not need to report autoinc locks to the upper
                        layer. These locks are released before commit, so they
                        can not cause deadlocks with binlog-fixed commit
                        order. */
-                       if (lock_get_type_low(lock) != LOCK_TABLE ||
-                           lock_get_mode(lock) != LOCK_AUTO_INC)
-                               thd_report_wait_for(ctx->start->mysql_thd,
-                                                   lock->trx->mysql_thd);
+                       if (waitee_ptr &&
+                           (lock_get_type_low(lock) != LOCK_TABLE ||
+                            lock_get_mode(lock) != LOCK_AUTO_INC)) {
+                               if (waitee_ptr->used ==
+                                   sizeof(waitee_ptr->waitees)/
+                                   sizeof(waitee_ptr->waitees[0])) {
+                                       waitee_ptr->next =
+                                               (struct thd_wait_reports *)
+                                               mem_alloc(sizeof(*waitee_ptr));
+                                       waitee_ptr = waitee_ptr->next;
+                                       if (!waitee_ptr) {
+                                               ctx->too_deep = TRUE;
+                                               return(ctx->start->id);
+                                       }
+                                       waitee_ptr->next = NULL;
+                                       waitee_ptr->used = 0;
+                               }
+                               waitee_ptr->waitees[waitee_ptr->used++] = lock->trx;
+                       }
+
                        if (lock->trx->lock.que_state == TRX_QUE_LOCK_WAIT) {

                                /* Another trx ahead has requested a lock in an
@@ -4002,6 +4034,41 @@ lock_deadlock_trx_rollback(
        trx_mutex_exit(trx);
 }

+static void
+mysql_report_waiters(struct thd_wait_reports *waitee_buf_ptr,
+                    THD *mysql_thd,
+                    trx_id_t victim_trx_id)
+{
+       struct thd_wait_reports *p = waitee_buf_ptr;
+       while (p) {
+               struct thd_wait_reports *q;
+               ulint i = 0;
+
+               while (i < p->used) {
+                       trx_t *w_trx = p->waitees[i];
+                       /* There is no need to report waits to a trx already
+                       selected as a victim. */
+                       if (w_trx->id != victim_trx_id)
+                       {
+                               /* If thd_report_wait_for() decides to kill the
+                               transaction, then we will get a call back into
+                               innobase_kill_query. We mark this by setting
+                               current_lock_mutex_owner, so we can avoid trying
+                               to recursively take lock_sys->mutex. */
+                               w_trx->current_lock_mutex_owner = mysql_thd;
+                               thd_report_wait_for(mysql_thd, w_trx->mysql_thd);
+                               w_trx->current_lock_mutex_owner = NULL;
+                       }
+                       ++i;
+               }
+               q = p->next;
+               if (p != waitee_buf_ptr)
+                       mem_free(p);
+               p = q;
+       }
+}
+
+
 /********************************************************************//**
 Checks if a joining lock request results in a deadlock. If a deadlock is
 found this function will resolve the dadlock by choosing a victim transaction
@@ -4018,12 +4085,20 @@ lock_deadlock_check_and_resolve(
        const trx_t*    trx)    /*!< in: transaction */
 {
        trx_id_t        victim_trx_id;
+       struct thd_wait_reports waitee_buf, *waitee_buf_ptr;
+       THD*            start_mysql_thd;

        ut_ad(trx != NULL);
        ut_ad(lock != NULL);
        ut_ad(lock_mutex_own());
        assert_trx_in_list(trx);

+       start_mysql_thd = trx->mysql_thd;
+       if (start_mysql_thd && thd_need_wait_for(start_mysql_thd))
+               waitee_buf_ptr = &waitee_buf;
+       else
+               waitee_buf_ptr = NULL;
+
        /* Try and resolve as many deadlocks as possible. */
        do {
                lock_deadlock_ctx_t     ctx;
@@ -4036,7 +4111,17 @@ lock_deadlock_check_and_resolve(
                ctx.wait_lock = lock;
                ctx.mark_start = lock_mark_counter;

-               victim_trx_id = lock_deadlock_search(&ctx);
+               if (waitee_buf_ptr) {
+                       waitee_buf_ptr->next = NULL;
+                       waitee_buf_ptr->used = 0;
+               }
+
+               victim_trx_id = lock_deadlock_search(&ctx, waitee_buf_ptr);
+
+               /* Report waits to upper layer, as needed. */
+               if (waitee_buf_ptr)
+                       mysql_report_waiters(waitee_buf_ptr, start_mysql_thd,
+                                            victim_trx_id);

                /* Search too deep, we rollback the joining transaction. */
                if (ctx.too_deep) {
=== modified file 'storage/xtradb/trx/trx0trx.cc'
--- storage/xtradb/trx/trx0trx.cc	2014-06-10 08:13:15 +0000
+++ storage/xtradb/trx/trx0trx.cc	2014-07-08 10:54:47 +0000
@@ -51,6 +51,9 @@ Created 3/26/1996 Heikki Tuuri

 #include<set>

+extern "C"
+int thd_deadlock_victim_preference(const MYSQL_THD thd1, const MYSQL_THD thd2);
+
 /** Set of table_id */
 typedef std::set<table_id_t>   table_id_set;
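
For storage engine authors reading along, here is roughly how an engine is
expected to drive the two reporting calls. This is only an illustrative
sketch, not part of the patch: the engine_trx_t struct and the engine_*
functions are made-up names, and a real engine would hook this into its own
lock manager rather than use these helpers verbatim.

    /* Cache thd_need_wait_for() once per transaction, so the common
       non-replication case pays nothing per lock wait. */
    struct engine_trx_t {
      MYSQL_THD thd;          /* Owning connection */
      int need_wait_report;   /* Cached result of thd_need_wait_for() */
    };

    static void
    engine_trx_start(struct engine_trx_t *trx, MYSQL_THD thd)
    {
      trx->thd= thd;
      /* Safe to cache: the answer cannot change within one transaction. */
      trx->need_wait_report= thd_need_wait_for(thd);
    }

    static void
    engine_wait_for_lock(struct engine_trx_t *waiter,
                         struct engine_trx_t *holder)
    {
      /* Report only waits on locks held until the holder commits; locks
         released early (like InnoDB auto-increment locks) are skipped to
         avoid false positives. */
      if (waiter->need_wait_report)
        thd_report_wait_for(waiter->thd, holder->thd);
      /* ... then block until the lock is granted or times out ... */
    }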
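In the same spirit, the gap-lock optimisation enabled by
thd_need_ordering_with() would sit in the engine's conflict check. Again a
sketch under the same assumptions (engine_locks_conflict() and is_gap_lock
are hypothetical names):

    /* Decide whether a lock request must wait for a held lock. When the
       upper layer already enforces the commit order of the two
       transactions, gap-type conflicts can be skipped entirely. */
    static int
    engine_locks_conflict(struct engine_trx_t *requestor,
                          struct engine_trx_t *holder,
                          int is_gap_lock)
    {
      if (is_gap_lock &&
          !thd_need_ordering_with(requestor->thd, holder->thd))
        return 0;   /* Commit order is fixed above us; no gap wait needed */
      return 1;     /* Normal locking rules apply */
    }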
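And the victim-preference call slots into the engine's existing deadlock
resolution; engine_trx_weight() below is a stand-in for whatever cost
heuristic the engine already uses to break ties:

    static ulonglong engine_trx_weight(struct engine_trx_t *trx); /* hypothetical */

    /* Choose which of two deadlocked transactions to roll back. The upper
       layer gets the first say; the engine heuristic decides only when the
       upper layer has no preference. */
    static struct engine_trx_t *
    engine_choose_victim(struct engine_trx_t *a, struct engine_trx_t *b)
    {
      int pref= thd_deadlock_victim_preference(a->thd, b->thd);
      if (pref < 0)
        return a;               /* Upper layer prefers rolling back a */
      if (pref > 0)
        return b;               /* Upper layer prefers rolling back b */
      /* No preference: roll back the one that did less work. */
      return engine_trx_weight(a) <= engine_trx_weight(b) ? a : b;
    }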
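Finally, a note on the thd_wait_reports buffering in lock0lock.cc: the first
64-slot chunk lives on the caller's stack, so the common case (few waitees)
allocates nothing, and overflow chunks are heap-allocated and freed again
after the waits have been reported. The same pattern in isolation, with
generic names rather than the InnoDB ones:

    #include <stdlib.h>

    #define CHUNK_SLOTS 64

    struct chunk {
      struct chunk *next;
      size_t used;
      void *items[CHUNK_SLOTS];
    };

    /* Append an item, adding a heap-allocated overflow chunk when the tail
       fills up. Returns the (possibly new) tail, or NULL on out-of-memory
       (which the patch maps to ctx->too_deep). */
    static struct chunk *
    chunk_append(struct chunk *tail, void *item)
    {
      if (tail->used == CHUNK_SLOTS)
      {
        struct chunk *c= malloc(sizeof(*c));
        if (!c)
          return NULL;
        c->next= NULL;
        c->used= 0;
        tail->next= c;
        tail= c;
      }
      tail->items[tail->used++]= item;
      return tail;
    }

    /* Free only the overflow chunks; the head is the caller's stack buffer. */
    static void
    chunk_free_overflow(struct chunk *head)
    {
      struct chunk *p= head->next;
      while (p)
      {
        struct chunk *q= p->next;
        free(p);
        p= q;
      }
    }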