Kristian, hello. Thanks for looking at fixes so promptly!
andrei.elkin@pp.inet.fi writes:
commit 3cebb54e6387a7eace1757c82ed0efd6e11590b9 Author: Andrei Elkin <andrei.elkin@mariadb.com> Date: Fri Feb 9 15:00:23 2018 +0200
MDEV-12746 rpl.rpl_parallel_optimistic_nobinlog fails committing out of order at retry
The 2nd issue was out of order commit by transactions that are ordered after an erroring out transaction. The latter was possible in the test thanks to the above retry failure. Out-of-order committing was done despite the slave was stopping.
Let me see if I understand this correctly:
Normally, a transaction in optimistic parallel replication cannot commit out of order because it either waits for the prior transaction to commit first, or, if the prior transaction has stopped with an error earlier, then the following transaction will not even start.
But in this case, during retry, there was missing the check that the prior transaction stopped with an error.
Right.
Therefore, the following transaction's wait_for_prior_commit() does nothing, and it wrongly goes ahead and runs and even commits, even though an earlier transaction failed and replication is stopping.
Correct?
That's the case.
That's a nice catch of a subtle, but potentially serious bug.
I wonder how this is possible though. As I recall, when a transaction fails, this failure is marked in the wait_for_prior_commit data structures. So the retrying transaction should have gotten an error in wait_for_prior_commit() and aborted the retries.
So why does this not happen in this case?
One observed possibility is reflected by a simulation test. First the parent errros out goes to `finish_event_group()' but it's possible it does not have yet the child in its `subsequent_commits_list' list. And that's how the parent's `wakeup_error' never reaches the child. This is certainly the case when a child retries after a temp failure. In the test condition the child's `worker_error' is zero, but if a non-zero case `worker_error' gets cleared anyway through `unregister_wait_for_prior_commit()'. So I understood so far that the retrying worker needs to check `stop_on_error_sub_id' that effectively reflects the error status. We can't relay on `worker_error' that is. Perhaps `subsequent_commits_list' must contain the child as well so the retrying worker would see its status the most direct way. But I did not look into any possible improvement into that direction.
The test failures were of two sorts. One is that the number of potential temporary errors actually exceeds the default value of the slave retry option. Increased @@global.slave_transaction_retries fixes that part.
Hm, this shouldn't be needed, I think. The retry waits for all prior transactions to commit first, so after the first retry no further retries should be caused by other parallel replication workers.
You must mean the retrying worker stops doing it optimistically. I saw that. But I saw through logs that at times a worker could re-try about the number of worker pool size. Perhaps this unrestrained behaviour was caused by lack of a check of the current patch. I'm checking that closely.
So I would expect this problem to be a side-effect of the other, real issue, and this increase should not be needed (and might hide another problem later, if left in).
diff --git a/sql/rpl_parallel.cc b/sql/rpl_parallel.cc index 34f9113f7fe..9f1d23a1161 100644 --- a/sql/rpl_parallel.cc +++ b/sql/rpl_parallel.cc
@@ -745,7 +761,25 @@ retry_event_group(rpl_group_info *rgi, rpl_parallel_thread *rpt, for (;;) { mysql_mutex_lock(&entry->LOCK_parallel_entry); - register_wait_for_prior_event_group_commit(rgi, entry); + if (entry->stop_on_error_sub_id == (uint64) ULONGLONG_MAX || +#ifndef DBUG_OFF + (DBUG_EVALUATE_IF("simulate_mdev_12746", 1, 0)) || +#endif + rgi->gtid_sub_id < entry->stop_on_error_sub_id) + { + register_wait_for_prior_event_group_commit(rgi, entry); + } + else + { + /* + An earlier transaction by another worker may have not had not + have the current one in subsequent committers list and thus + the actual induced error status could not have been sumbitted. + So it should be set by ourselves now. + */
This comment is hard to understand - what do you mean here?
I put more correct in above. The comments should've read An earlier transaction by another worker may have not had the current one in its subsequent committers list and thus the actual (induced) error status to the current transaction could not be set by the earlier worker.
+ if (!rgi->worker_error) + rgi->worker_error= 1; + }
I would have expected an abort (goto err) here (or well just below, after unlocking the mutex). Is this not needed, and if so, why not?
I left it to upcoming (in 2 lines) THD::wait_for_prior_commit(). It takes care to check out `worker_error'. int wait_for_prior_commit(THD *thd) { /* Quick inline check, to avoid function call and locking in the common case where no wakeup is registered, or a registered wait was already signalled. */ if (waitee) return wait_for_prior_commit2(thd); else { if (wakeup_error) my_error(ER_PRIOR_COMMIT_FAILED, MYF(0)); return wakeup_error; } } No waitee is guaranteed as this worker skipped `register_wait_for_prior_event_group_commit()'.
@@ -716,12 +724,20 @@ retry_event_group(rpl_group_info *rgi, rpl_parallel_thread *rpt, unregistering (and later re-registering) the wait. */ if(thd->wait_for_commit_ptr) - thd->wait_for_commit_ptr->unregister_wait_for_prior_commit(); + thd->wait_for_commit_ptr->unregister_wait_for_prior_commit();
Indentation?
To be fixed.
- Kristian.
Cheers, Andrei