Michael Widenius <monty@askmonty.org> writes:
What should happen if you kill a replication thread is that replication should stop for that master.
Kristian> This needs more thought, I think ... certainly something does not look right.
After looking at the full code, I think that the logical way things should work is:
'stop' is to be used when you want to nicely take down replication. This means that the current commit groups should be given time to finish.
thd->killed should mean that we should stop ASAP - all uncommitted things should abort.
This is needed in a 'panic shutdown' (like when we are soon out of power) or when trying to kill the replication thread because one notices that something went horribly wrong (like an ALTER TABLE stopping replication).
Ok, so I thought more about this. There are two kinds of threads involved.

One is the normal SQL thread (this is the only thread involved in non-parallel replication). There is a one-to-one correspondence between an SQL thread and the associated master connection. It seems to make sense that KILL CONNECTION on the SQL thread would stop the replication. I do not know whether this is how it works or not in current replication, but whatever it is, we should just leave the behaviour the same.

The other kind of thread is the parallel replication worker thread. We have a pool of those (the size of the pool is specified by --slave-parallel-threads). These worker threads are not associated with any particular master connection. They are assigned to execute one event group at a time for whichever master connection is in need of a new thread.

So a KILL QUERY or KILL CONNECTION on a worker thread should abort the currently executing event - and it will, I assume, using the existing code for killing a running query. This will fail the execution of the query, and thus eventually stop the associated SQL thread and master connection, once the error is propagated out of the worker thread up to the parent SQL thread.

But KILL of a worker thread should _not_ cause the thread to exit, I think. The pool of parallel threads is just a thread pool, and for simplicity in the first version it has a fixed size. A KILL will disconnect the thread from the replication connection it was servicing, but the thread must remain alive, ready to serve another connection if needed.

So the rpt->stop is only about maintaining the thread pool for parallel replication, not about aborting replication for any specific master connection. In rpl_parallel_change_thread_count() we set rpt->stop to ask all existing threads to exit, and afterwards spawn a new set of threads. This can only happen when all replication SQL/IO threads are stopped and no event execution is taking place.

So this code of mine is actually wrong, I think:

    while (!rpt->stop && !thd->killed)

It should just be while (!rpt->stop) { ... }. And thd->killed should not be used at all to control thread termination. Instead, it should be used (and is not currently, this needs to be fixed) in various places during event execution to allow aborting the currently executing event. This is needed around rpt_handle_event() I think, and also around queue_for_group_commit().

For normal stop, this is handled mostly in the SQL thread. It waits with a timeout for the current event group to finish normally, then does a hard kill if needed. This needs to be extended to handle running things in a worker thread, of course. The code is around sql_slave_killed(), I think.

Does that make sense?
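To illustrate the distinction, here is a tiny standalone sketch (plain C++ with std::thread primitives, not the actual server code) of what I mean: the stop flag controls the lifetime of the pool thread itself, while a per-task kill flag only aborts the event group being executed.

    #include <atomic>
    #include <condition_variable>
    #include <mutex>
    #include <queue>

    /* One queued event group; 'killed' plays the role of thd->killed. */
    struct task
    {
      std::atomic<bool> killed{false};
    };

    struct worker
    {
      std::mutex lock;
      std::condition_variable cond;
      std::queue<task *> queue;
      bool stop= false;     /* like rpt->stop: only set when the pool itself is
                               torn down (all SQL/IO threads already stopped);
                               whoever sets it also signals cond */

      void run()
      {
        std::unique_lock<std::mutex> lk(lock);
        while (!stop)       /* not: while (!stop && !killed) */
        {
          if (queue.empty())
          {
            cond.wait(lk);
            continue;
          }
          task *t= queue.front();
          queue.pop();
          lk.unlock();
          execute(t);       /* may abort the event group if t->killed is set,
                               but the worker thread itself stays alive */
          lk.lock();
        }
      }

      void execute(task *t)
      {
        if (t->killed.load())
          return;           /* the error would propagate up to the SQL thread */
        /* ... apply the event group (rpt_handle_event() in the real code) ... */
      }
    };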
+  mysql_mutex_lock(&LOCK_wait_commit);
+  waiting_for_commit= false;
+  mysql_cond_signal(&COND_wait_commit);
+  mysql_mutex_unlock(&LOCK_wait_commit);
+}
In this particular case it looks safe to move the cond signal out of the mutex. I don't see how anyone could miss the signal.
Kristian> Probably. Generally I prefer to write it as above, to not have to make sure
Kristian> that moving it out is safe.
I agree that this is the way to do it in general. However, for cases where we are waiting almost once per query, we need to make things faster, as this extra wakeup will take notable resources!
Ok, no problem, I've changed it. I should benchmark this; it keeps coming up. It's not clear to me which way would be the fastest; it seems it would depend on the implementation.
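For reference, the two variants we are talking about look like this (the same code as in the hunk above, just with the signal moved; whether the second form actually wins anything is exactly what would need benchmarking):

    /* Variant 1: signal while still holding the mutex (what I wrote originally). */
    mysql_mutex_lock(&LOCK_wait_commit);
    waiting_for_commit= false;
    mysql_cond_signal(&COND_wait_commit);
    mysql_mutex_unlock(&LOCK_wait_commit);

    /*
      Variant 2: signal after releasing the mutex (what is now in the patch).
      This is safe here because the waiter re-checks waiting_for_commit under
      the mutex before going to sleep, so the wakeup cannot be lost; it avoids
      waking a thread that would immediately block on the still-held mutex.
    */
    mysql_mutex_lock(&LOCK_wait_commit);
    waiting_for_commit= false;
    mysql_mutex_unlock(&LOCK_wait_commit);
    mysql_cond_signal(&COND_wait_commit);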
Hi!
Part 2 of review of parallel replication
=== added file 'sql/rpl_parallel.cc'
<cut>
+pthread_handler_t
+handle_rpl_parallel_thread(void *arg)
+{
<cut>
+      if (end_of_group)
+      {
+        in_event_group= false;
+
Add comment:
/*
  All events for this group have now been executed and logged (but not
  necessarily synced). Inform the other event groups that are waiting for
  this thread that we are done.
*/
+        rgi->commit_orderer.unregister_wait_for_prior_commit();
+        thd->wait_for_commit_ptr= NULL;
+
+        /*
+          Record that we have finished, so other event groups will no
+          longer attempt to wait for us to commit.
+
+          We can race here with the next transactions, but that is fine, as
+          long as we check that we do not decrease last_committed_sub_id. If
+          this commit is done, then any prior commits will also have been
+          done and also no longer need waiting for.
+        */
+        mysql_mutex_lock(&entry->LOCK_parallel_entry);
+        if (entry->last_committed_sub_id < event_gtid_sub_id)
+        {
+          entry->last_committed_sub_id= event_gtid_sub_id;
+          mysql_cond_broadcast(&entry->COND_parallel_entry);
+        }
+        mysql_mutex_unlock(&entry->LOCK_parallel_entry);
+
Add:
/*
  Tell storage engines that they can stop waiting. See comments in
  sql_class.cc::wakeup_subsequent_commits2() for why we can't do this part
  of the wakeup in unregister_wait_for_prior_commit().
*/
I think there is some misunderstanding here. I added some comments that will hopefully clarify, but the situation is a bit different from how I read this suggestion.

The situation is that we have for example three threads, A, B, and C. We are thread B at this point in the code. Thread B needed to wait for A to commit first. C needs to wait for B to commit.

The call to unregister_wait_for_prior_commit() relates to B waiting for A; it cannot be used in relation to C waiting for B, irrespective of the comments in the function wakeup_subsequent_commits2() ...

The unregister_wait_for_prior_commit() removes B from the list of threads waiting for A. This is usually redundant, because B already removed itself with wait_for_prior_commit() during the commit step. However, if there is for example an error, we might not reach commit, and thus we need to remove the dangling entry in A.

The wakeup_subsequent_commits() signals C that B is done. Again, this is often redundant, because we already signalled C during B's commit step. However, C could have registered to wait for B between B's commit and this place in the code, so it is needed for correctness.

Note that in the cases where this is redundant, no extra mutexes are taken. The unregister_wait_for_prior_commit() and wakeup_subsequent_commits() functions are inlines that do just a single test of one field in the struct.
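As a toy model of the A/B/C situation (standalone and heavily simplified; the real wait_for_commit struct keeps a list of waiters rather than a single pointer, and makes the redundant calls cheap inline tests):

    #include <condition_variable>
    #include <mutex>

    /* Toy stand-in for one event group's commit state (thread B, say). */
    struct commit_state
    {
      std::mutex lock;
      std::condition_variable cond;
      bool committed= false;
      commit_state *waitee= nullptr;   /* the prior event group (A) we wait on */

      /* B, during commit: wait until A has committed, then drop the registration. */
      void wait_for_prior_commit()
      {
        if (!waitee)
          return;
        std::unique_lock<std::mutex> lk(waitee->lock);
        while (!waitee->committed)
          waitee->cond.wait(lk);
        waitee= nullptr;
      }

      /* B, after the event group: normally a no-op; only needed if an error
         made us skip wait_for_prior_commit() and we are still registered on A. */
      void unregister_wait_for_prior_commit()
      {
        waitee= nullptr;
      }

      /* B, when done: mark ourselves committed and wake up C.  Usually C was
         already woken during B's commit, but C may have registered against B
         after that point, so this call is needed for correctness.  (The real
         code makes the redundant case a cheap flag test; this toy always
         takes the mutex.) */
      void wakeup_subsequent_commits()
      {
        std::lock_guard<std::mutex> lk(lock);
        committed= true;
        cond.notify_all();
      }
    };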
As far as I can see, rpt->free is always 0 here. - It's set to 0 at the top of this loop and only set to 1 here.
Can we remove the rpt->free variable?
Yes, it looks like it. I have removed it. It was supposed to mark whether the worker thread was free to be reused for another event group, but this is already marked by the thread being included in rpt->pool->free_list.
Add:
/*
  The following loops wait until each thread is done with its work for each
  event group, and then stop it when the thread has nothing to do.

  Note that this may take some time, as rpl_parallel::do_event() may request
  threads at the same time.
*/
I added this comment instead:

  /*
    Grab each old thread in turn, and signal it to stop.

    Note that since we require all replication threads to be stopped before
    changing the parallel replication worker thread pool, all the threads will
    be already idle and will terminate immediately.
  */

Hope that clarifies things.
+void
+rpl_parallel::wait_for_done()
+{
+  struct rpl_parallel_entry *e;
+  uint32 i;
+
+  for (i= 0; i < domain_hash.records; ++i)
+  {
+    e= (struct rpl_parallel_entry *)my_hash_element(&domain_hash, i);
+    mysql_mutex_lock(&e->LOCK_parallel_entry);
+    while (e->current_sub_id > e->last_committed_sub_id)
+      mysql_cond_wait(&e->COND_parallel_entry, &e->LOCK_parallel_entry);
+    mysql_mutex_unlock(&e->LOCK_parallel_entry);
+  }
+}
How should we handle errors in the above case? For example, when we get an error and have to abort things?
I think even if we have to abort event execution in a worker thread, we should still increment e->last_committed_sub_id. So the wait here is ok. But we do need code to be able to force all worker threads to stop.

An important piece of this puzzle is in sql_slave_killed(). I do not understand the full details of this code yet, but I believe the idea is to wait with some timeout for event groups to finish normally, then hard kill if the timeout triggers (I vaguely remember that for InnoDB we can just kill and rollback, while for MyISAM the timeout is necessary to avoid leaving half an update executed). We need to extend sql_slave_killed() to also handle waiting for each worker thread to stop, with the timeout. Once we have that, it is hopefully clear how to do wait_for_done(); maybe it can call sql_slave_killed(), or maybe it is no longer needed (wait_for_done() is just a temporary thing I put in to get things working for the benchmark).

We also need to extend the API around struct wait_for_commit to include error handling. I suppose wait_for_prior_commit() should check for thd->killed when it waits, and be extended to return an error if killed. And then wakeup_subsequent_commits() should also be extended to take an error argument, so that if B is waiting for A, and A gets an error, then A can call wakeup_subsequent_commits() with error=true, which will cause B to wake up but also get an error.

Then if we have a number of event groups processing in parallel, A->B->C->D, and B ends up with an error, then A can complete, but B will signal an error to C (and then to D), so B, C, and D will all fail. Then somewhere, perhaps around the end of apply_event_and_update_pos(), we must check an error flag set for B, C, and D, and signal the error back to the main SQL thread so that it can halt and print the error to the error log, as usual.

Something like that, I think. Some work to do, but should be possible at least.
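Signature-wise, I imagine something like this (only a sketch of the proposal above; none of it exists yet, and the wakeup_error member is just my guess at how the flag could be carried):

    class THD;

    struct wait_for_commit
    {
      /* ... existing members ... */

      /* Set by a prior commit that failed, checked by the waiter. */
      bool wakeup_error;

      /*
        Wait for the prior commit as now, but also watch thd->killed, and
        return non-zero if we are killed or if the transaction we waited on
        signalled an error.
      */
      int wait_for_prior_commit(THD *thd);

      /*
        Wake up all registered waiters as now; with error=true, also set
        wakeup_error in each of them so their wait fails and they roll back.
      */
      void wakeup_subsequent_commits(bool error);
    };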
The other way would be to, on error, set last_committed_sub_id back to its last value.
What do you suggest? I assume setting the value back would be the easiest solution...
Yes (if I understood correctly what you meant by "setting back")... I think we should update things the same way whether the transaction commits successfully or fails, so we get the same wakeup of later transactions, but with the extra error flag added to wait_for_commit so that later transactions can see the failure and thus fail and roll back themselves. And then the error can be propagated up the call stack, as normal. Of course, this becomes more complex due to the need to be able to re-try a transaction in case of certain temporary errors like deadlocks and so on. Hopefully something can be made to work.
+      /*
+        We are already executing something else in this domain. But the two
+        event groups were committed together in the same group commit on the
+        master, so we can still do them in parallel here on the slave.
+
+        However, the commit of this event must wait for the commit of the prior
+        event, to preserve binlog commit order and visibility across all
+        servers in the replication hierarchy.
+      */
+      rpl_parallel_thread *rpt= global_rpl_thread_pool.get_thread(e);
Here we could add a test for whether get_thread() returns 0, which could happen on shutdown or when killing a replication thread.
I do not think we need to do this. I made it so that the worker thread pool can never change size without stopping all replication threads first, so we avoid a lot of these issues. And as I wrote above, killing a worker thread should terminate the event group that it is currently servicing (and the associated SQL thread and master connection), but not remove the thread from the pool. (I can imagine that in a later version we want to implement a smarter worker thread manager, but for the first version I tried to keep things simple, there are plenty of other problems that we need to tackle...)
I think it's relatively safe to use get_thread() as an end-synchronize point, as we should never have anything half-done when we call get_thread(). As get_thread() can take a very long time and is not called in many places, it also sounds like a reasonable thing to do.
Sorry, I do not understand. "End-synchronize point" for what?
+    /*
+      Events like ROTATE and FORMAT_DESCRIPTION. Do not run in worker thread.
+      Same for events not preceeded by GTID (we should not see those normally,
+      but they might be from an old master).
+    */
+    qev->rgi= serial_rgi;
+    rpt_handle_event(qev, NULL);
+    delete_or_keep_event_post_apply(rli, typ, qev->ev);
Why is delete_or_keep_event_post_apply() not part of rpt_handle_event()? You always call these two together...
It is to share code with the non-parallel case. When parallel replication is disabled, everything is done in the SQL thread in slave.cc, like this:

    exec_res= apply_event_and_update_pos(ev, thd, serial_rgi, NULL);
    delete_or_keep_event_post_apply(serial_rgi, typ, ev);

Or did you mean that I could move the call of delete_or_keep_event_post_apply() into the function rpt_handle_event()? That might be possible, though it was not 100% clear to me, as there are a few differences in how it is done in the different cases.

But in any case, that part of the code needs more work, especially with regard to error handling and to updating the last executed event position. Probably it makes sense to revisit this as part of that work.

 - Kristian.