Re: [Maria-developers] [Commit] MDEV-12746 out-of-order retry
andrei.elkin@pp.inet.fi writes:
commit 3cebb54e6387a7eace1757c82ed0efd6e11590b9 Author: Andrei Elkin <andrei.elkin@mariadb.com> Date: Fri Feb 9 15:00:23 2018 +0200
MDEV-12746 rpl.rpl_parallel_optimistic_nobinlog fails committing out of order at retry
The 2nd issue was out of order commit by transactions that are ordered after an erroring out transaction. The latter was possible in the test thanks to the above retry failure. Out-of-order committing was done despite the slave was stopping.
Let me see if I understand this correctly: Normally, a transaction in optimistic parallel replication cannot commit out of order because it either waits for the prior transaction to commit first, or, if the prior transaction has stopped with an error earlier, then the following transaction will not even start. But in this case, during retry, there was missing the check that the prior transaction stopped with an error. Therefore, the following transaction's wait_for_prior_commit() does nothing, and it wrongly goes ahead and runs and even commits, even though an earlier transaction failed and replication is stopping. Correct? That's a nice catch of a subtle, but potentially serious bug. I wonder how this is possible though. As I recall, when a transaction fails, this failure is marked in the wait_for_prior_commit data structures. So the retrying transaction should have gotten an error in wait_for_prior_commit() and aborted the retries. So why does this not happen in this case?
The test failures were of two sorts. One is that the number of potential temporary errors actually exceeds the default value of the slave retry option. Increased @@global.slave_transaction_retries fixes that part.
Hm, this shouldn't be needed, I think. The retry waits for all prior transactions to commit first, so after the first retry no further retries should be caused by other parallel replication workers. So I would expect this problem to be a side-effect of the other, real issue, and this increase should not be needed (and might hide another problem later, if left in).
diff --git a/sql/rpl_parallel.cc b/sql/rpl_parallel.cc index 34f9113f7fe..9f1d23a1161 100644 --- a/sql/rpl_parallel.cc +++ b/sql/rpl_parallel.cc
@@ -745,7 +761,25 @@ retry_event_group(rpl_group_info *rgi, rpl_parallel_thread *rpt, for (;;) { mysql_mutex_lock(&entry->LOCK_parallel_entry); - register_wait_for_prior_event_group_commit(rgi, entry); + if (entry->stop_on_error_sub_id == (uint64) ULONGLONG_MAX || +#ifndef DBUG_OFF + (DBUG_EVALUATE_IF("simulate_mdev_12746", 1, 0)) || +#endif + rgi->gtid_sub_id < entry->stop_on_error_sub_id) + { + register_wait_for_prior_event_group_commit(rgi, entry); + } + else + { + /* + An earlier transaction by another worker may have not had not + have the current one in subsequent committers list and thus + the actual induced error status could not have been sumbitted. + So it should be set by ourselves now. + */
This comment is hard to understand - what do you mean here?
+ if (!rgi->worker_error) + rgi->worker_error= 1; + }
I would have expected an abort (goto err) here (or well just below, after unlocking the mutex). Is this not needed, and if so, why not?
@@ -716,12 +724,20 @@ retry_event_group(rpl_group_info *rgi, rpl_parallel_thread *rpt, unregistering (and later re-registering) the wait. */ if(thd->wait_for_commit_ptr) - thd->wait_for_commit_ptr->unregister_wait_for_prior_commit(); + thd->wait_for_commit_ptr->unregister_wait_for_prior_commit();
Indentation? - Kristian.
Kristian, hello. Thanks for looking at fixes so promptly!
andrei.elkin@pp.inet.fi writes:
commit 3cebb54e6387a7eace1757c82ed0efd6e11590b9 Author: Andrei Elkin <andrei.elkin@mariadb.com> Date: Fri Feb 9 15:00:23 2018 +0200
MDEV-12746 rpl.rpl_parallel_optimistic_nobinlog fails committing out of order at retry
The 2nd issue was out of order commit by transactions that are ordered after an erroring out transaction. The latter was possible in the test thanks to the above retry failure. Out-of-order committing was done despite the slave was stopping.
Let me see if I understand this correctly:
Normally, a transaction in optimistic parallel replication cannot commit out of order because it either waits for the prior transaction to commit first, or, if the prior transaction has stopped with an error earlier, then the following transaction will not even start.
But in this case, during retry, there was missing the check that the prior transaction stopped with an error.
Right.
Therefore, the following transaction's wait_for_prior_commit() does nothing, and it wrongly goes ahead and runs and even commits, even though an earlier transaction failed and replication is stopping.
Correct?
That's the case.
That's a nice catch of a subtle, but potentially serious bug.
I wonder how this is possible though. As I recall, when a transaction fails, this failure is marked in the wait_for_prior_commit data structures. So the retrying transaction should have gotten an error in wait_for_prior_commit() and aborted the retries.
So why does this not happen in this case?
One observed possibility is reflected by a simulation test. First the parent errros out goes to `finish_event_group()' but it's possible it does not have yet the child in its `subsequent_commits_list' list. And that's how the parent's `wakeup_error' never reaches the child. This is certainly the case when a child retries after a temp failure. In the test condition the child's `worker_error' is zero, but if a non-zero case `worker_error' gets cleared anyway through `unregister_wait_for_prior_commit()'. So I understood so far that the retrying worker needs to check `stop_on_error_sub_id' that effectively reflects the error status. We can't relay on `worker_error' that is. Perhaps `subsequent_commits_list' must contain the child as well so the retrying worker would see its status the most direct way. But I did not look into any possible improvement into that direction.
The test failures were of two sorts. One is that the number of potential temporary errors actually exceeds the default value of the slave retry option. Increased @@global.slave_transaction_retries fixes that part.
Hm, this shouldn't be needed, I think. The retry waits for all prior transactions to commit first, so after the first retry no further retries should be caused by other parallel replication workers.
You must mean the retrying worker stops doing it optimistically. I saw that. But I saw through logs that at times a worker could re-try about the number of worker pool size. Perhaps this unrestrained behaviour was caused by lack of a check of the current patch. I'm checking that closely.
So I would expect this problem to be a side-effect of the other, real issue, and this increase should not be needed (and might hide another problem later, if left in).
diff --git a/sql/rpl_parallel.cc b/sql/rpl_parallel.cc index 34f9113f7fe..9f1d23a1161 100644 --- a/sql/rpl_parallel.cc +++ b/sql/rpl_parallel.cc
@@ -745,7 +761,25 @@ retry_event_group(rpl_group_info *rgi, rpl_parallel_thread *rpt, for (;;) { mysql_mutex_lock(&entry->LOCK_parallel_entry); - register_wait_for_prior_event_group_commit(rgi, entry); + if (entry->stop_on_error_sub_id == (uint64) ULONGLONG_MAX || +#ifndef DBUG_OFF + (DBUG_EVALUATE_IF("simulate_mdev_12746", 1, 0)) || +#endif + rgi->gtid_sub_id < entry->stop_on_error_sub_id) + { + register_wait_for_prior_event_group_commit(rgi, entry); + } + else + { + /* + An earlier transaction by another worker may have not had not + have the current one in subsequent committers list and thus + the actual induced error status could not have been sumbitted. + So it should be set by ourselves now. + */
This comment is hard to understand - what do you mean here?
I put more correct in above. The comments should've read An earlier transaction by another worker may have not had the current one in its subsequent committers list and thus the actual (induced) error status to the current transaction could not be set by the earlier worker.
+ if (!rgi->worker_error) + rgi->worker_error= 1; + }
I would have expected an abort (goto err) here (or well just below, after unlocking the mutex). Is this not needed, and if so, why not?
I left it to upcoming (in 2 lines) THD::wait_for_prior_commit(). It takes care to check out `worker_error'. int wait_for_prior_commit(THD *thd) { /* Quick inline check, to avoid function call and locking in the common case where no wakeup is registered, or a registered wait was already signalled. */ if (waitee) return wait_for_prior_commit2(thd); else { if (wakeup_error) my_error(ER_PRIOR_COMMIT_FAILED, MYF(0)); return wakeup_error; } } No waitee is guaranteed as this worker skipped `register_wait_for_prior_event_group_commit()'.
@@ -716,12 +724,20 @@ retry_event_group(rpl_group_info *rgi, rpl_parallel_thread *rpt, unregistering (and later re-registering) the wait. */ if(thd->wait_for_commit_ptr) - thd->wait_for_commit_ptr->unregister_wait_for_prior_commit(); + thd->wait_for_commit_ptr->unregister_wait_for_prior_commit();
Indentation?
To be fixed.
- Kristian.
Cheers, Andrei
andrei.elkin@pp.inet.fi writes:
First the parent errros out goes to `finish_event_group()' but it's possible it does not have yet the child in its `subsequent_commits_list'
So I understood so far that the retrying worker needs to check `stop_on_error_sub_id' that effectively reflects the error status.
Agree, the basic fix looks correct, of checking stop_on_error_sub_id. And yes, there are probably cases where after stop_on_error_sub_id is set, earlier transactions are aborted early without setting up the subsequent-transaction-chain, so that wait_for_prior_commit cannot be relied upon, and stop_on_error_sub_id must be checked. So again, nice catch, this would be nasty to try and debug from errors seen only in a user's setup. I am only wondering about two points, if there is something that needs to be adjusted in the patch around the error exit after stop_on_error_sub_id, or perhaps something else not yet understood going on: Number one:
+ if (!rgi->worker_error) + rgi->worker_error= 1; + }
I would have expected an abort (goto err) here (or well just below, after unlocking the mutex). Is this not needed, and if so, why not?
I left it to upcoming (in 2 lines) THD::wait_for_prior_commit(). It takes care to check out `worker_error'.
int wait_for_prior_commit(THD *thd) { /* Quick inline check, to avoid function call and locking in the common case where no wakeup is registered, or a registered wait was already signalled. */ if (waitee) return wait_for_prior_commit2(thd); else { if (wakeup_error) my_error(ER_PRIOR_COMMIT_FAILED, MYF(0)); return wakeup_error; } }
No waitee is guaranteed as this worker skipped `register_wait_for_prior_event_group_commit()'.
Sorry, I don't get it. wait_for_prior_commit() checks thd->wait_for_commit_ptr->wakeup_error, not rgi->worker_error? wait_for_prior_commit() doesn't have access to rgi.
list. And that's how the parent's `wakeup_error' never reaches the child. This is certainly the case when a child retries after a temp failure. In the test condition the child's `worker_error' is zero, but if a non-zero case `worker_error' gets cleared anyway through `unregister_wait_for_prior_commit()'.
Isn't there some confusion here? Between "wakeup_error", which sits inside struct wait_for_commit, and indicates that a prior event group failed. And "worker_error" in rpl_group_info, which indicates that _this_ event group failed, and is used to remember to do proper error cleanup inside the worker thread? Number two:
You must mean the retrying worker stops doing it optimistically. I saw that. But I saw through logs that at times a worker could re-try about the number of worker pool size. Perhaps this unrestrained behaviour was caused by lack of a check of the current patch.
Yes, that is what I mean. And yes, the underlying bug with missing check on stop_on_error_sub_id might have caused this as a secondary effect. My point was just that users should not need to set a high retry count just from using parallel replication (remember, several thousand worker threads have been used with success on production data). So if this is necessary in a test case, it might indicate a problem with the code... - Kristian.
Hello. Kristian Nielsen <knielsen@knielsen-hq.org> writes:
andrei.elkin@pp.inet.fi writes:
First the parent errros out goes to `finish_event_group()' but it's possible it does not have yet the child in its `subsequent_commits_list'
So I understood so far that the retrying worker needs to check `stop_on_error_sub_id' that effectively reflects the error status.
Agree, the basic fix looks correct, of checking stop_on_error_sub_id.
And yes, there are probably cases where after stop_on_error_sub_id is set, earlier transactions are aborted early without setting up the subsequent-transaction-chain, so that wait_for_prior_commit cannot be relied upon, and stop_on_error_sub_id must be checked.
So again, nice catch, this would be nasty to try and debug from errors seen only in a user's setup.
Thanks!
I am only wondering about two points, if there is something that needs to be adjusted in the patch around the error exit after stop_on_error_sub_id, or perhaps something else not yet understood going on:
Number one:
+ if (!rgi->worker_error) + rgi->worker_error= 1; + }
I would have expected an abort (goto err) here (or well just below, after unlocking the mutex). Is this not needed, and if so, why not?
I left it to upcoming (in 2 lines) THD::wait_for_prior_commit(). It takes care to check out `worker_error'.
int wait_for_prior_commit(THD *thd) { /* Quick inline check, to avoid function call and locking in the common case where no wakeup is registered, or a registered wait was already signalled. */ if (waitee) return wait_for_prior_commit2(thd); else { if (wakeup_error) my_error(ER_PRIOR_COMMIT_FAILED, MYF(0)); return wakeup_error; } }
No waitee is guaranteed as this worker skipped `register_wait_for_prior_event_group_commit()'.
Sorry, I don't get it. wait_for_prior_commit() checks thd->wait_for_commit_ptr->wakeup_error, not rgi->worker_error? wait_for_prior_commit() doesn't have access to rgi.
Thanks for this check. Indeed, I did not fully recover from earlier confusion between the two (too much closely named (in some metrics) identifier they are!). thd->wait_for_commit_ptr->wakeup_error := 1 is meant even though the test passes (must be thanks to rgi->worker_error later checks).
list. And that's how the parent's `wakeup_error' never reaches the child. This is certainly the case when a child retries after a temp failure. In the test condition the child's `worker_error' is zero, but if a non-zero case `worker_error' gets cleared anyway through `unregister_wait_for_prior_commit()'.
Isn't there some confusion here? Between "wakeup_error", which sits inside struct wait_for_commit, and indicates that a prior event group failed. And "worker_error" in rpl_group_info, which indicates that _this_ event group failed, and is used to remember to do proper error cleanup inside the worker thread?
To fix.
Number two:
You must mean the retrying worker stops doing it optimistically. I saw that. But I saw through logs that at times a worker could re-try about the number of worker pool size. Perhaps this unrestrained behaviour was caused by lack of a check of the current patch.
Yes, that is what I mean. And yes, the underlying bug with missing check on stop_on_error_sub_id might have caused this as a secondary effect.
My point was just that users should not need to set a high retry count just from using parallel replication (remember, several thousand worker threads have been used with success on production data). So if this is necessary in a test case, it might indicate a problem with the code...
True. I am making changes to submit a newer version very soon. Thanks for your help! Andrei
- Kristian.
Kristian, The improved patch is here for more comments if you will have. Cheers, Andrei
Hello.
Kristian Nielsen <knielsen@knielsen-hq.org> writes:
andrei.elkin@pp.inet.fi writes:
First the parent errros out goes to `finish_event_group()' but it's possible it does not have yet the child in its `subsequent_commits_list'
So I understood so far that the retrying worker needs to check `stop_on_error_sub_id' that effectively reflects the error status.
Agree, the basic fix looks correct, of checking stop_on_error_sub_id.
And yes, there are probably cases where after stop_on_error_sub_id is set, earlier transactions are aborted early without setting up the subsequent-transaction-chain, so that wait_for_prior_commit cannot be relied upon, and stop_on_error_sub_id must be checked.
So again, nice catch, this would be nasty to try and debug from errors seen only in a user's setup.
Thanks!
I am only wondering about two points, if there is something that needs to be adjusted in the patch around the error exit after stop_on_error_sub_id, or perhaps something else not yet understood going on:
Number one:
+ if (!rgi->worker_error) + rgi->worker_error= 1; + }
I would have expected an abort (goto err) here (or well just below, after unlocking the mutex). Is this not needed, and if so, why not?
I left it to upcoming (in 2 lines) THD::wait_for_prior_commit(). It takes care to check out `worker_error'.
int wait_for_prior_commit(THD *thd) { /* Quick inline check, to avoid function call and locking in the common case where no wakeup is registered, or a registered wait was already signalled. */ if (waitee) return wait_for_prior_commit2(thd); else { if (wakeup_error) my_error(ER_PRIOR_COMMIT_FAILED, MYF(0)); return wakeup_error; } }
No waitee is guaranteed as this worker skipped `register_wait_for_prior_event_group_commit()'.
Sorry, I don't get it. wait_for_prior_commit() checks thd->wait_for_commit_ptr->wakeup_error, not rgi->worker_error? wait_for_prior_commit() doesn't have access to rgi.
Thanks for this check. Indeed, I did not fully recover from earlier confusion between the two (too much closely named (in some metrics) identifier they are!).
thd->wait_for_commit_ptr->wakeup_error := 1
is meant even though the test passes (must be thanks to rgi->worker_error later checks).
list. And that's how the parent's `wakeup_error' never reaches the child. This is certainly the case when a child retries after a temp failure. In the test condition the child's `worker_error' is zero, but if a non-zero case `worker_error' gets cleared anyway through `unregister_wait_for_prior_commit()'.
Isn't there some confusion here? Between "wakeup_error", which sits inside struct wait_for_commit, and indicates that a prior event group failed. And "worker_error" in rpl_group_info, which indicates that _this_ event group failed, and is used to remember to do proper error cleanup inside the worker thread?
To fix.
Number two:
You must mean the retrying worker stops doing it optimistically. I saw that. But I saw through logs that at times a worker could re-try about the number of worker pool size. Perhaps this unrestrained behaviour was caused by lack of a check of the current patch.
Yes, that is what I mean. And yes, the underlying bug with missing check on stop_on_error_sub_id might have caused this as a secondary effect.
My point was just that users should not need to set a high retry count just from using parallel replication (remember, several thousand worker threads have been used with success on production data). So if this is necessary in a test case, it might indicate a problem with the code...
True. I am making changes to submit a newer version very soon.
Thanks for your help!
Andrei
- Kristian.
_______________________________________________ Mailing list: https://launchpad.net/~maria-developers Post to : maria-developers@lists.launchpad.net Unsubscribe : https://launchpad.net/~maria-developers More help : https://help.launchpad.net/ListHelp
participants (2)
-
andrei.elkin@pp.inet.fi
-
Kristian Nielsen