Hi Monty,

So as promised, I took a look at the existing code for STOP SLAVE, and came up with some ideas for how to extend this to handle parallel replication.

In existing code, STOP SLAVE ends up in terminate_slave_threads(). The interesting part here is the SQL thread; stopping the IO thread should not really be affected by the parallel replication feature. What happens is basically this:

    mi->rli.abort_slave=1;
    terminate_slave_thread(mi->rli.sql_thd, &mi->rli.run_lock,
                           &mi->rli.stop_cond, &mi->rli.slave_running);

So rli->abort_slave is the flag by which the main server can tell the SQL thread to stop.

What terminate_slave_thread() does is to repeatedly execute the following every 2 seconds until rli->slave_running becomes false:

    pthread_kill(thd->real_id, SIGALRM); // Or SIGUSR1
    thd->awake(NOT_KILLED);

Since it uses NOT_KILLED, I assume this means that any currently executing event/query is never terminated by STOP SLAVE. It seems to me the only thing this can wake up is the SQL thread when it is waiting for more events to arrive in the relay log, but maybe there are other things I did not think of.

From the SQL thread's side, the rli->abort_slave flag is checked in sql_slave_killed(). This function is called in a few places, basically when waiting for a new event in the relay log and before executing a new event. So normally, once STOP SLAVE sets rli->abort_slave, the SQL thread will complete execution of the current event, if any, and then stop.
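
To make the handshake above concrete, here is a minimal standalone sketch of how I read it. It is not the server code: SlaveState, sql_thread_main(), start_sql_thread() and stop_sql_thread() are hypothetical names, a condition variable stands in for the SIGALRM + thd->awake(NOT_KILLED) wakeup, and the retry interval is a parameter instead of the fixed 2 seconds.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical stand-in for the relevant Relay_log_info fields.
struct SlaveState {
  std::atomic<bool> abort_slave{false};  // set by STOP SLAVE
  bool running = false;                  // protected by run_lock
  int events_done = 0;                   // protected by run_lock
  std::mutex run_lock;
  std::condition_variable stop_cond;     // signalled when the SQL thread exits
  std::condition_variable wake_cond;     // stands in for SIGALRM + awake()
};

// SQL thread: the flag is checked only between events, so an event that is
// already executing always runs to completion (the NOT_KILLED behaviour).
void sql_thread_main(SlaveState &st, int pending_events) {
  std::unique_lock<std::mutex> lk(st.run_lock);
  while (!st.abort_slave.load()) {       // plays the role of sql_slave_killed()
    if (pending_events > 0) {
      --pending_events;
      lk.unlock();
      // "Execute" one event outside the lock.
      std::this_thread::sleep_for(std::chrono::milliseconds(5));
      lk.lock();
      ++st.events_done;
    } else {
      // Idle: waiting for more events to arrive in the relay log.
      st.wake_cond.wait(lk);
    }
  }
  st.running = false;
  st.stop_cond.notify_all();
}

// Mark the thread as running before it starts, then launch it.
std::thread start_sql_thread(SlaveState &st, int pending_events) {
  {
    std::lock_guard<std::mutex> guard(st.run_lock);
    st.running = true;
  }
  return std::thread([&st, pending_events] {
    sql_thread_main(st, pending_events);
  });
}

// STOP SLAVE side: set the flag, then kick the thread once per retry
// interval until it reports that it is no longer running.
// Returns the number of kicks that were needed.
int stop_sql_thread(SlaveState &st, std::chrono::milliseconds retry) {
  st.abort_slave.store(true);
  std::unique_lock<std::mutex> lk(st.run_lock);
  int kicks = 0;
  while (st.running) {
    st.wake_cond.notify_all();           // wakes the thread only if it is idle
    ++kicks;
    st.stop_cond.wait_for(lk, retry);    // wait for exit, else retry
  }
  assert(!st.running);
  return kicks;
}
```

A caller would do something like `std::thread t = start_sql_thread(st, 3); stop_sql_thread(st, std::chrono::milliseconds(2000)); t.join();`. The point of the sketch is only that the stopper never interrupts an event in progress; it just wakes an idle thread and waits for the flag to be noticed.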

However, if the SQL thread is in the middle of executing an event group that modifies non-transactional tables, then there are changes that cannot be rolled back, so it is not safe to just exit in the middle of the event group. In this case, more events are executed, until either the event group is completed, or a fixed timeout of 60 seconds has elapsed.

--

It seems fairly straightforward to extend this to work in the parallel case:

 - terminate_slave_thread() should be extended so that it also signals any active worker threads for that master connection. So we should introduce rgi->abort_slave, and set it during STOP SLAVE for all queued rgi entries.

 - Then all the worker threads should check for rgi->abort_slave in appropriate places, using a function similar to sql_slave_killed(). If the worker thread is in the middle of executing an event group with non-transactional updates, then it should try to finish that group with a timeout, else it should stop. Then when a worker thread ends the current event group, it should roll back any active transaction, unregister itself if it had a wait for a previous commit, and wake up any other transactions that might be waiting for it.

So overall, STOP SLAVE will set an abort_slave flag both for the main SQL thread and for any active worker threads. These threads will then stop once they have finished executing the current event (or possibly event group). And the rpl_parallel::wait_for_done() function can be used as it currently is to make sure that all the workers have time to complete or abort before the main SQL thread exits.

Seems simple enough. I suggest that we do this after we have implemented error handling (the case where a query fails in some worker and the slave has to abort). The normal stop case would probably integrate naturally into the same mechanisms for propagating errors between worker threads and the main SQL thread.

 - Kristian.