Hi!
"Kristian" == Kristian Nielsen <knielsen@knielsen-hq.org> writes:
Kristian> Michael Widenius <monty@askmonty.org> writes:
What should happen if you kill a replication thread is that replication should stop for that master.
<cut> Kristian> It seems to make sense that KILL CONNECTION on the SQL thread would stop the Kristian> replication. I do not know whether this is how it works or not in current Kristian> replication, but whatever it is, we should just leave the behaviour the same. This is how things are now (as far as I know). <cut> Kristian> But KILL of a worker thread should _not_ cause the thread to exit, I Kristian> think. The pool of parallel threads is just a thread pool, and for simplicity Kristian> in the first version, it has a fixed size. A KILL will disconnect the thread Kristian> from the replication connection it was servicing, but the thread must remain Kristian> alive, ready to serve another connection if needed. Correct. The KILL command only kills the query or the connection (ie, THD). The thread is always reused. Kristian> So the rpt->stop is only about maintaining the thread pool for parallel Kristian> replication, not about anything to do with aborting any specific master Kristian> connection replication. In rpl_parallel_change_thread_count() we set rpt->stop Kristian> to ask all existing threads to exit, and afterwards spawn a new set of Kristian> threads. This can only happen when all replication SQL/IO threads are stopped Kristian> and no event execution is taking place. Yes. The questions is when rpt->stop should take effect. My suggestion is that the replication threads should only examine rpt->stop after commits. This way we can ensure that when all threads are stopped, everything is committed to a certain point and we never have to do a rollback. Kristian> So this code of mine is actually wrong, I think: Kristian> while (!rpt->stop && !thd->killed) Kristian> It should just be while (!rpt->stop) { ... }. Kristian> And thd->killed should not be used at all to control thread Kristian> termination. Instead, it should be used (and is not currently, needs to be Kristian> fixed) in various places for event execution to allow aborting currently Kristian> executing event. This is needed around rpt_handle_event() I think, and also Kristian> around queue_for_group_commit(). Yes. I will be working on this today & tomorrow. Note that every loop where we wait needs to be stoppable, either with 'stop' or with KILL. Kristian> For normal stop, this is handled mostly in the SQL thread. It waits with a Kristian> timeout for the current event group to finish normally, then does a hard kill Kristian> if needed. This needs to be extended to handle running things in a worker Kristian> thread, of course. The code is around sql_slave_killed(), I think. Kristian> Does that make sense? Makes sence. I will know more tomorrow when I have dug into the code a bit more.
+ mysql_mutex_lock(&LOCK_wait_commit); + waiting_for_commit= false; + mysql_cond_signal(&COND_wait_commit); + mysql_mutex_unlock(&LOCK_wait_commit); +}
In this particular case it looks safe to move the cond signal out of the mutex. I don't see how anyone could miss the signal.
Kristian> Probably. Generally I prefer to write it as above, to not have to make sure Kristian> that moving it out is safe.
I agree that this is the way to do in general. However for cases where we are waiting almost once per query, we need to make things faster as this extra wakup will take notable resources!
Kristian> Ok, no problem, I've changed it. Kristian> I should benchmark this, it keeps coming up. It's not clear to me which way Kristian> would be the fastest, seems it would depend on the implementation. I did a benchmark of this a long time ago (+10 years) on Linux and then using the signal after unlock was notable faster. The gain is two thread switches + one thread wakeup per cond_signal.
Hi!
Part 2 of review of parallel replication
=== added file 'sql/rpl_parallel.cc'
<cut>
+pthread_handler_t +handle_rpl_parallel_thread(void *arg) +{
<cut>
+ if (end_of_group) + { + in_event_group= false; +
Add comment:
/* All events for this group has now been executed and logged (but not necessaerly synced). Inform the other event groups that are waiting for this thread that we are done. */
+ rgi->commit_orderer.unregister_wait_for_prior_commit(); + thd->wait_for_commit_ptr= NULL; + + /* + Record that we have finished, so other event groups will no + longer attempt to wait for us to commit. + + We can race here with the next transactions, but that is fine, as + long as we check that we do not decrease last_committed_sub_id. If + this commit is done, then any prior commits will also have been + done and also no longer need waiting for. + */ + mysql_mutex_lock(&entry->LOCK_parallel_entry); + if (entry->last_committed_sub_id < event_gtid_sub_id) + { + entry->last_committed_sub_id= event_gtid_sub_id; + mysql_cond_broadcast(&entry->COND_parallel_entry); + } + mysql_mutex_unlock(&entry->LOCK_parallel_entry); +
Add:
/* Tell storage engines that they can stop waiting. See comments in sql_class.cc::wakeup_subsequent_commits2() for why we can't do this part of the wakeup in unregister_wait_for_prior_commit(). */
Kristian> I think there is some misunderstanding here. I added some comments that will Kristian> hopefully clarify, but the situation is a bit different than how I read this Kristian> suggestion. I may have missunderstood some of the code. Then it's even more important to get a good comment! Kristian> The situation is that we have for example three threads, A, B, and C. We are Kristian> thread B at this point in the code. Thread B needed to wait for A to commit Kristian> first. C needs to wait for B to commit. Kristian> The call to unregister_wait_for_prior_commit() relates to B waiting for A, it Kristian> cannot be used in relation to C waiting for B, irrespectively of comments in Kristian> the function wakeup_subsequent_commits2() ... ok. I thought that when we called unregister_wait_for_prior_commit() we had already handled the event including logging. (Apparently rpt_handle_event() does less than what I expected). Kristian> The unregister_wait_for_prior_commit() removes B from the list of threads Kristian> waiting for A. This is usually redundant, because B already removed itself Kristian> with wait_for_prior_commit() during the commit step. However, if there is for Kristian> example an error we might not reach commit, and thus we need to remove the Kristian> dangling entry in A. ok, this was a bit unclear. I thought that B could not stop waiting for A until A had written everything to the binary log. I don't see how we B can stop waiting for A until A at least has done all commands and written all it's logs. Kristian> The wakeup_subsequent_commits() signals C that B is done. Again, this is often Kristian> redundant, because we already signalled C during B's commit step. However, C Kristian> could have registered to wait for B between B's commit and this place in the Kristian> code, so it is needed for correctness. This part is clear. <cut>
/* The following loops wait until each thread is done it's work for each event group and then stops it when the thread has nothing to do.
Note that this may take some time as rpl_paralell:do_event() may request threads at the same time. */
Kristian> I added this comment instead: Kristian> /* Kristian> Grab each old thread in turn, and signal it to stop. Kristian> Note that since we require all replication threads to be stopped before Kristian> changing the parallel replication worker thread pool, all the threads will Kristian> be already idle and will terminate immediately. Kristian> */ Kristian> Hope that clarifies things. Yes, it does. Thanks.
+void +rpl_parallel::wait_for_done() +{ + struct rpl_parallel_entry *e; + uint32 i; + + for (i= 0; i < domain_hash.records; ++i) + { + e= (struct rpl_parallel_entry *)my_hash_element(&domain_hash, i); + mysql_mutex_lock(&e->LOCK_parallel_entry); + while (e->current_sub_id > e->last_committed_sub_id) + mysql_cond_wait(&e->COND_parallel_entry, &e->LOCK_parallel_entry); + mysql_mutex_unlock(&e->LOCK_parallel_entry); + } +}
How to handle errors in the above case? For example when we get an error and we had to abort things?
Kristian> I think even if we have to abort event execution in a worker thread, we should Kristian> still increment e->last_committed_sub_id. So the wait here is ok. But we do Kristian> need code to be able to force all worker threads to stop. Kristian> An important piece of this puzzle is in sql_slave_killed(). I do not Kristian> understand the full details of this code yet, but I believe the idea is to Kristian> wait with some timeout for event groups to finish normally, then hard kill if Kristian> the timeout triggers (I vaguely remember that for InnoDB we can just kill and Kristian> rollback, while for MyISAM the timeout is necessary to avoid leaving half an Kristian> update executed). We could extend this to check if we are using transactional tables or not and wait longer if not transaction tables has been used for the query. Kristian> We need to extend sql_slave_killed() to also handle waiting for each worker Kristian> thread to stop, with the timeout. Once we have that, it is hopefully clear how Kristian> to do wait_for_done(); maybe it can call sql_slave_killed(), or maybe it is no Kristian> longer needed (wait_for_done() is just a temporary thing I put in to get Kristian> things working for the benchmark). Kristian> We also need to extend the API around struct wait_for_commit to include error Kristian> handling. I suppose wait_for_prior_commit() should check for thd->killed when Kristian> it waits, and be extended to return an error if killed. And then also Kristian> wakeup_subsequent_commits() should be extended to take an error argument, so Kristian> that if B is waiting for A, and A gets an error, then A can call Kristian> wakeup_subsequent_commits() with error=true, which will cause B to wake up but Kristian> also get an error. Yes, that sounds right. Kristian> Then if we have a number of event groups processing in parallel, A-> B->C->D. And B ends up with an error. Then A can complete, but B will Kristian> signal an error to C (and then to D), so both B, C, and D will fail. One problem we have is that if B, C ,D are using non transactional tables, it would be better to have them complete than abort. This issue we can leave for now. Kristian> Then somewhere, perhaps around the end of apply_event_and_update_pos(), we Kristian> must check an error flag set for B, C, and D, and signal the error back to the Kristian> main SQL thread so that it can halt and print the error to the error log, as Kristian> usual. Kristian> Something like that, I think. Some work to do, but should be possible at least. Yes. Not too hard, I think.
The other way would be to, on error, set last_committed_sub_id back to it's last value.
What do you suggest? Assuming setting the value back would be the easiest solution..
Kristian> Yes (if I understood correctly what you meant with "setting back")... Kristian> I think we should update things the same way whether the transaction commits Kristian> successfully or fails, so we get the same wakeup of later transactions, but Kristian> with the extra error flag added to wait_for_commit so that later transactions Kristian> can see the failure and thus fail and rollback themselves. And then the error Kristian> can be propagated up the call stack, as normal. Kristian> Of course, this becomes more complex due to the need to be able to re-try a Kristian> transaction in case of certain temporary errors like deadlocks and so. Kristian> Hopefully something can be made to work. As we should never get a deadlock or temporary error on a slave, this should not be too hard for a first version.
+ /* + We are already executing something else in this domain. But the two + event groups were committed together in the same group commit on the + master, so we can still do them in parallel here on the slave. + + However, the commit of this event must wait for the commit of the prior + event, to preserve binlog commit order and visibility across all + servers in the replication hierarchy. + */ + rpl_parallel_thread *rpt= global_rpl_thread_pool.get_thread(e);
Here we could add a test if get_thread() is 0, which could happen on shutdown or when killing a replication thread.
Kristian> I do not think we need to do this. I made it so that the worker thread pool Kristian> can never change size without stopping all replication threads first, so we Kristian> avoid a lot of these issues. And as I wrote above, killing a worker thread Kristian> should terminate the event group that it is currently servicing (and the Kristian> associated SQL thread and master connection), but not remove the thread from Kristian> the pool. I was thinking about the case where all worker threads are occupied for a long time. In this case we will get do_event() stop a long time until a new thread is free. I do see some small advantages in beeing able to free the sql-thread early. If get_thread() is not a stop point, then we have the problem at terminate, that do_event() will wait for get_thread() and then schedule a new query() even if the server should be stopping. We need to check for the sql-thread being killed before it tells a worker thread to start with new queries... Kristian> (I can imagine that in a later version we want to implement a smarter worker Kristian> thread manager, but for the first version I tried to keep things simple, there Kristian> are plenty of other problems that we need to tackle...)
I think it's relatively safe to use get_thread() as a end-synchronize point as we should never have anything half-done when we call get_thread()). As get_thread() can take a very long time and is not called in many places, it's also sounds like a resonable thing to do.
Kristian> Sorry, I do not understand. "End-synchronize point" for what? Sorry, wrong word. What I meant was a point where it's safe to detect that the server or replication thread has been killed.
+ /* + Events like ROTATE and FORMAT_DESCRIPTION. Do not run in worker thread. + Same for events not preceeded by GTID (we should not see those normally, + but they might be from an old master). + */ + qev->rgi= serial_rgi; + rpt_handle_event(qev, NULL); + delete_or_keep_event_post_apply(rli, typ, qev->ev);
Why is delete_or_keep_event_post_apply() not part of rpt_handle_event()? You always call these two together...
Kristian> It is to share code with the non-parallel case. Kristian> When parallel replication is disabled, everything is done in the SQL thread in Kristian> slave.cc. Like this: Kristian> exec_res= apply_event_and_update_pos(ev, thd, serial_rgi, NULL); Kristian> delete_or_keep_event_post_apply(serial_rgi, typ, ev); Kristian> Or did you mean that I could move the call of Kristian> delete_or_keep_event_post_apply() into the function rpt_handle_event()? Kristian> That might be possible, though it was not 100% clear to me as there are a few Kristian> differences in how it is done in the different cases. Yes, I meant moving the function into rpt_handle_event(). Kristian> But in any case that part of the code needs more work. Especially with regard Kristian> to error handling, and related to updating the last executed event Kristian> position. Probably it makes sense to revisit this as part of that work. ok. Regards, Monty