Monty,

Thanks for the comments so far. I have pushed fixes to 10.0-knielsen,
and some specific replies are below.

 - Kristian.

Michael Widenius <monty@askmonty.org> writes:
+  --slave-parallel-threads=#
+                      If non-zero, number of threads to spawn to apply in
+                      parallel events on the slave that were group-committed on
+                      the master or were logged with GTID in different
+                      replication domains.
What happens if --slave-parallel-threads is 0? Does this mean that there
will be no parallel threads? Is this the same as 1?

If it is the same as 1, why not have the option take values starting from 1?
I think it is logical that 0 means disabled and >0 means enabled with that
many threads. But I don't mind using 1 as disabled and requiring >=2 for
enabled, though the ability to use a pool with just one thread might be
useful for testing. It doesn't seem critical; maybe we can revisit this
when the basic stuff is done, if time allows.
=== modified file 'mysql-test/suite/perfschema/r/dml_setup_instruments.result'
--- mysql-test/suite/perfschema/r/dml_setup_instruments.result	2013-03-26 09:35:34 +0000
+++ mysql-test/suite/perfschema/r/dml_setup_instruments.result	2013-08-29 08:18:22 +0000
@@ -38,14 +38,14 @@ order by name limit 10;
 NAME ENABLED TIMED
 wait/synch/cond/sql/COND_flush_thread_cache YES YES
 wait/synch/cond/sql/COND_manager YES YES
+wait/synch/cond/sql/COND_parallel_entry YES YES
+wait/synch/cond/sql/COND_prepare_ordered YES YES
 wait/synch/cond/sql/COND_queue_state YES YES
 wait/synch/cond/sql/COND_rpl_status YES YES
+wait/synch/cond/sql/COND_rpl_thread YES YES
+wait/synch/cond/sql/COND_rpl_thread_pool YES YES
 wait/synch/cond/sql/COND_server_started YES YES
 wait/synch/cond/sql/COND_thread_cache YES YES
-wait/synch/cond/sql/COND_thread_count YES YES
-wait/synch/cond/sql/Delayed_insert::cond YES YES
-wait/synch/cond/sql/Delayed_insert::cond_client YES YES
-wait/synch/cond/sql/Event_scheduler::COND_state YES YES
Any idea why the above was deleted? Don't see anything in your patch that could affect that.
I believe it does a SELECT with a LIMIT, so when new rows are added before
the limit, rows below get pushed out. But if you ask why the test is
written that way, I cannot answer. The purpose of some of those perfschema
tests is somewhat of a mystery; they seem to serve little purpose other
than adding maintenance cost for result-file updates whenever something
changes ...
I didn't find any test results for rpl_parallel.test or rpl_parallel2 in
your tree. Did you forget to commit these?
There was no time to work on test cases yet. As we discussed, I can look
into that as the next task next week.
@@ -6541,44 +6543,201 @@ MYSQL_BIN_LOG::write_transaction_to_binl
 }
Could you please document in detail the arguments and return value for the following function.
I added the following comment:

    /*
      Put a transaction that is ready to commit in the group commit queue.
      The transaction is identified by the ENTRY object passed into this
      function.

      To facilitate group commit for the binlog, we first queue up ourselves
      in this function. Then later the first thread to enter the queue waits
      for the LOCK_log mutex, and commits for everyone in the queue once it
      gets the lock. Any other threads in the queue just wait for the first
      one to finish the commit and wake them up. This way, all transactions
      in the queue get committed in a single disk operation.

      The return value of this function is TRUE if queued as the first entry
      in the queue (meaning this is the leader), FALSE otherwise.

      The main work in this function is when the commit in one transaction
      has been marked to wait for the commit of another transaction to
      happen first. This is used to support in-order parallel replication,
      where transactions can execute out-of-order but need to be committed
      in-order with how they happened on the master. The waiting of one
      commit on another needs to be integrated with the group commit queue,
      to ensure that the waiting transaction can participate in the same
      group commit as the waited-for transaction.

      So when we put a transaction in the queue, we check if there were
      other transactions already prepared to commit but just waiting for the
      first one to commit. If so, we add those to the queue as well,
      transitively for all waiters.
    */
    bool
    MYSQL_BIN_LOG::queue_for_group_commit(group_commit_entry *entry)

I also moved some code from the caller into this function, which clarifies
things by keeping related code together. And I added more comments about
what is going on with the different list pointers, and who is doing what.
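For illustration, the queueing scheme this comment describes boils down to
very little code. The following is a hedged, single-threaded sketch; Entry
and GroupCommitQueue are made-up stand-ins, not the real group_commit_entry
or MYSQL_BIN_LOG members, and the real code does all of this while holding
LOCK_prepare_ordered:

```cpp
#include <cassert>
#include <cstddef>

/* Illustrative stand-in for a transaction waiting to be binlogged. */
struct Entry
{
  Entry *next= nullptr;   // link in the group commit queue
};

struct GroupCommitQueue
{
  Entry *head= nullptr;

  /*
    Prepend the entry; return true if it was the first one queued,
    i.e. this transaction becomes the group commit leader.
  */
  bool enqueue(Entry *e)
  {
    e->next= head;
    head= e;
    return e->next == nullptr;
  }
};
```

The first transaction to enqueue sees an empty queue and becomes the
leader; it later takes LOCK_log and commits the whole queue in one disk
operation, while the followers just wait to be woken up.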
/*
  To facilitate group commit for the binlog, we first queue up ourselves
  in the group commit queue. Then the first thread to enter the queue
  waits for the LOCK_log mutex, and commits for everyone in the queue
  once it gets the lock. Any other threads in the queue just wait for
  the first one to finish the commit and wake them up.
Do you still use LOCK_log for this? It's a bit confusing to mention LOCK_log here when you take the LOCK_prepare_ordered in the function.
Yeah, this comment got moved around and is no longer appropriate. I tried to clarify it (the taking of LOCK_log happens later, after this function).
I did find it confusing that here you treat next_subsequent_commit as a
way to avoid recursion, but you use this list for other things in
register_wait_for_prior_commit().
I added this comment:

    We keep a list of all the waiters that need to be processed in
    `list', linked through the next_subsequent_commit pointer. Initially
    this list contains only the entry passed into this function.

    We process entries in the list one by one. The element currently
    being processed is pointed to by `cur', and the element at the end of
    the list is pointed to by `last' (we do not use NULL to terminate the
    list).

    As we process an element, it is first added to the group_commit_queue.
    Then any waiters for that element are added at the end of the list, to
    be processed in subsequent iterations. This continues until the list
    is exhausted, with all elements ever added eventually processed.

    The end result is a breadth-first traversal of the tree of waiters,
    re-using the next_subsequent_commit pointers in place of extra stack
    space in a recursive traversal.
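As a self-contained sketch of this traversal (with simplified stand-in
types, and a NULL-terminated work list rather than the `last'-terminated
one the real code uses):

```cpp
#include <cassert>
#include <cstddef>

/* Simplified stand-in for wait_for_commit / group_commit_entry. */
struct Waiter
{
  Waiter *subsequent_commits_list= nullptr; // head of this node's waiters
  Waiter *next_subsequent_commit= nullptr;  // sibling link, re-used as work-list link
  bool queued= false;                       // stands in for adding to group_commit_queue
};

/*
  Queue `first' and, transitively, everything waiting on it, in
  breadth-first order and without recursion: the next_subsequent_commit
  pointers that linked siblings in a waiter list are re-linked into one
  flat work list as we go.
*/
void queue_with_waiters(Waiter *first)
{
  Waiter *cur= first;
  Waiter *last= first;
  first->next_subsequent_commit= nullptr;
  while (cur)
  {
    cur->queued= true;                      // "add to group_commit_queue"
    /* Append cur's own waiters at the end of the work list. */
    for (Waiter *w= cur->subsequent_commits_list; w; )
    {
      Waiter *next_sibling= w->next_subsequent_commit;
      w->next_subsequent_commit= nullptr;
      last->next_subsequent_commit= w;
      last= w;
      w= next_sibling;
    }
    cur->subsequent_commits_list= nullptr;
    cur= cur->next_subsequent_commit;
  }
}
```

Each node's waiter list is consumed exactly once, so the whole tree is
visited in O(n) with no extra allocation and no stack growth.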
+  /* Now we need to clear the wakeup_subsequent_commits_running flags. */
+  if (list)
+  {
+    for (;;)
+    {
+      if (list->wakeup_subsequent_commits_running)
+      {
+        mysql_mutex_lock(&list->LOCK_wait_commit);
+        list->wakeup_subsequent_commits_running= false;
+        mysql_mutex_unlock(&list->LOCK_wait_commit);
Why do we need the mutex above? The only place where we test this variable, as far as I can find, is in wait_for_commit::register_wait_for_prior_commit()
Yes, it looks strange. Unfortunately I cannot remember the reason 100% right now (I wrote this code 9 months ago). I remember that there _was_ a reason. I should have written a comment to explain this, shame on me! I think the code basically needs to do the same as wait_for_commit::wakeup_subsequent_commits2(), but integrated with the tree traversal. I agree this needs clarification and looks suspicious, but I wanted to get the rest of the comments sent to you today, I will try to re-visit this later (and the similar code in wakeup_subsequent_commits2()).
This thread doesn't know if the other thread is just before

@@ -6597,7 +6756,10 @@ MYSQL_BIN_LOG::write_transaction_to_binl
     if (next)
     {
-      next->thd->signal_wakeup_ready();
+      if (next->queued_by_other)
+        next->thd->wait_for_commit_ptr->wakeup();
What does the above code mean ? (add a comment).
I added this comment:

    /*
      Wake up the next thread in the group commit.

      The next thread can be waiting in two different ways, depending on
      whether it put itself in the queue, or if it was put in queue by us
      because it had to wait for us to commit first.

      So execute the appropriate wakeup, identified by the queued_by_other
      field.
    */
+ mysql_mutex_unlock(&LOCK_log);
Here you are going to take LOCK_log and LOCK_prepare_ordered in the wrong order. safe_mutex() should have noticed this.
In other code you are first taking LOCK_log and then LOCK_prepare_ordered. If someone else is calling trx_group_commit_leader() while you are in this code you would get a mutex deadlock.
Well spotted, thanks. I fixed it like this:

    /*
      We must not wait for LOCK_log while holding LOCK_prepare_ordered.
      LOCK_log can be held for long periods (eg. we do I/O under it), while
      LOCK_prepare_ordered must only be held for short periods.

      In addition, waiting for LOCK_log while holding LOCK_prepare_ordered
      would violate locking order of LOCK_log-before-LOCK_prepare_ordered.
      This could cause SAFEMUTEX warnings (even if it cannot actually
      deadlock with current code, as there can be at most one group commit
      leader thread at a time).

      So release and re-acquire LOCK_prepare_ordered if we need to wait for
      the LOCK_log.
    */
    if (mysql_mutex_trylock(&LOCK_log))
    {
      mysql_mutex_unlock(&LOCK_prepare_ordered);
      mysql_mutex_lock(&LOCK_log);
      mysql_mutex_lock(&LOCK_prepare_ordered);
    }
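The same release-and-reacquire pattern can be shown standalone with
std::mutex (note the opposite return sense: std::mutex::try_lock() returns
true on success, while mysql_mutex_trylock() follows
pthread_mutex_trylock() and returns 0 on success). The names log and
prepare_ordered are illustrative stand-ins for the server's mutexes:

```cpp
#include <cassert>
#include <mutex>

/*
  Sketch of the pattern: the caller holds `prepare_ordered' (the
  short-period lock) on entry and needs `log' (the long-period lock) as
  well, but must never block on `log' while holding `prepare_ordered'.
  On return, both locks are held.
*/
void acquire_both(std::mutex &log, std::mutex &prepare_ordered)
{
  if (!log.try_lock())
  {
    /* Contended: drop the short-period lock before blocking. */
    prepare_ordered.unlock();
    log.lock();
    prepare_ordered.lock();
  }
}
```

In the uncontended case the trylock succeeds and nothing is released; only
when `log' is busy do we pay the cost of dropping and re-taking the
short-period lock.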
=== modified file 'sql/log_event.h'
--- sql/log_event.h	2013-06-06 15:51:28 +0000
+++ sql/log_event.h	2013-08-29 08:18:22 +0000
@@ -1317,9 +1317,9 @@ class Log_event
     @see do_apply_event
   */
-  int apply_event(Relay_log_info const *rli)
+  int apply_event(struct rpl_group_info *rgi)
What is the simple rule to know when to use Relay_log_info and when to use rpl_group_info ?
Relay_log_info is for stuff that needs to be shared among all events executing in parallel (eg. current position in relay log). rpl_group_info is for things that should be private to each event group being replicated in parallel (such as the THD).
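A condensed way to picture that split is below; this is a hypothetical
illustration with example field names only, not the actual classes:

```cpp
#include <cassert>

/* Example stand-in for a THD. */
struct THD_stub { int id; };

/* One per master connection; state shared by all parallel workers. */
struct Relay_log_info_sketch
{
  unsigned long long relay_log_pos; // e.g. current position in the relay log
};

/* One per event group being replicated in parallel; private state. */
struct rpl_group_info_sketch
{
  Relay_log_info_sketch *rli;       // back-pointer to the shared state
  THD_stub *thd;                    // the worker THD applying this group
};
```

So the rule of thumb: if two groups executing in parallel could each need
their own copy, it belongs in rpl_group_info; if there is logically one
per replication stream, it stays in Relay_log_info.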
=== modified file 'sql/log_event_old.cc'
--- sql/log_event_old.cc	2013-02-19 10:45:29 +0000
+++ sql/log_event_old.cc	2013-08-29 08:18:22 +0000
@@ -67,7 +68,7 @@ Old_rows_log_event::do_apply_event(Old_r
     do_apply_event(). We still check here to prevent future coding errors.
   */
-  DBUG_ASSERT(rli->sql_thd == ev_thd);
+  DBUG_ASSERT(rgi->thd == ev_thd);
Should we change all tests of rli->sql_thd to rgi->thd? You have at least
this test in log_event.cc and slave.cc.
I assume that rli should not have a sql_thd member anymore ?
Yes, this is work in progress, to check all fields in the old Relay_log_info class and move away those that should use rpl_group_info instead. And yes, rli->sql_thd should go away. I suspect we might still need a THD somewhere in the Relay_log_info for the main SQL thread (not sure), but if we do it is probably best to rename it so that we do not later accidentally merge code without fixing any references to sql_thd (which will most likely be wrong).
+    - We should fail if we connect to the master with opt_slave_parallel_threads
+      greater than zero and master does not support GTID. Just to avoid a bunch
+      of potential problems, we won't be able to do any parallel replication
+      in this case anyway.
If the master doesn't support GTID, I think that instead of stopping we
should just issue a warning in the log and continue with normal
replication.
Agree, that sounds right.
A much easier way to achieve the wait is to have the caller lock
mysql_mutex_lock(&rpt->LOCK_rpl_thread) and then just release it when it's
time for this thread to continue.

- No need for signaling or checking if the thread got the signal
- No need to have a cond wait here
- No need to have a delay_start flag
I consider this method simpler and clearer. But it is not critical code, feel free to change it if you like.
Note that events may be undefined here if thd->killed was set above (for the first loop)
Ok, fixed.
It would be good to check for thd->killed or rpt->stop here and then abort the loop.
Another question: do we really need rpt->stop ? Isn't it enough with using the thd->killed for this ?
I think there is a need to generally go through and check for proper
handling of thd->killed, not just here but in all cases. I did not have
time to do this yet, and I am in any case a bit unsure about exactly how
kill semantics should work.

Now that I think about it, shouldn't they remain different? I think
thd->killed is set if the user does an explicit KILL of the thread. This
should then abort and fail the currently executing transaction, but it
should not cause the thread to terminate, should it? That could leave us
with no threads at all in the replication thread pool, which would cause
replication to hang. Or should the thread exit and be re-spawned?

This needs more thought, I think ... certainly something does not look
right.
 void Relay_log_info::stmt_done(my_off_t event_master_log_pos,
-                               time_t event_creation_time, THD *thd)
+                               time_t event_creation_time, THD *thd,
+                               struct rpl_group_info *rgi)
Do we still need thd, now that it is available as rgi->thd?
If they are not the same, please add a comment about this. It's not otherwise clear when to use rgi->thd and thd as this function uses both.
This is ongoing work; so far I've just added the rgi parameter to some
functions that needed it. It is probably true that we do not need the THD
argument (on the other hand, it was probably not needed before either, as
we have rli->sql_thd).
=== modified file 'sql/sql_class.cc'
--- sql/sql_class.cc	2013-06-06 15:51:28 +0000
+++ sql/sql_class.cc	2013-08-29 08:18:22 +0000
+void
+wait_for_commit::wakeup()
+{
+  /*
+    We signal each waiter on their own condition and mutex (rather than
+    using pthread_cond_broadcast() or something like that).
+
+    Otherwise we would need to somehow ensure that they were done
+    waking up before we could allow this THD to be destroyed, which would
+    be annoying and unnecessary.
+  */
+  mysql_mutex_lock(&LOCK_wait_commit);
+  waiting_for_commit= false;
+  mysql_cond_signal(&COND_wait_commit);
+  mysql_mutex_unlock(&LOCK_wait_commit);
+}
In this particular case it looks safe to move the cond signal out of the mutex. I don't see how anyone could miss the signal.
Probably. Generally I prefer to write it as above, to not have to make sure that moving it out is safe.
+  The waiter needs to lock the waitee to delete itself from the list in
+  unregister_wait_for_prior_commit(). Thus wakeup_subsequent_commits() can
+  not hold its own lock while locking waiters, lest we deadlock.
lest ?
https://en.wiktionary.org/wiki/lest But I've changed it to use a more common English wording "as this could lead to deadlock".
+  The reason for separate register and wait calls is that this allows to
+  register the wait early, at a point where the waited-for THD is known to
+  exist. And then the actual wait can be done much later, where the
+  waited-for THD may have been long gone. By registering early, the waitee
+  can signal before disappearing.
+*/
Instead of taking locks to guard against the THD disappearing, why not
keep all THD objects around until all transactions in a group have been
executed?
I think it is not just about the THD disappearing. IIRC, the more
important problem is that the THD may have moved on to do something
different. So even if the data in the other THD is not freed, it may still
contain completely unrelated data (and thus wrong data, for the accessing
thread).

But this is actually something I consider very important, and one that I
spent quite some effort on and was quite pleased to get into the current
state. If an arbitrary THD can have pointers into another arbitrary THD,
the life-time issues can quickly get very nasty, and it is important to
handle this well so that one does not need to understand the entire
codebase to avoid subtle bugs.
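The register-early / wait-late shape can be illustrated with a toy wait
slot owned by the waiter. The names here (WaitSlot, signal, wait) are
hypothetical; the real wait_for_commit also supports unregistering a wait
and waking up whole subtrees of waiters:

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>

/*
  Toy version: the slot is owned by the waiter and registered with the
  waitee while the waitee is known to exist. signal() records the fact
  of the commit in the slot, so a much later wait() returns immediately
  even if the waitee thread is long gone by then.
*/
struct WaitSlot
{
  std::mutex m;
  std::condition_variable cond;
  bool done= false;

  void signal()                     // called by the waitee at commit time
  {
    std::lock_guard<std::mutex> g(m);
    done= true;
    cond.notify_one();
  }

  void wait()                       // called by the waiter, possibly much later
  {
    std::unique_lock<std::mutex> g(m);
    cond.wait(g, [this]{ return done; });
  }
};
```

Because the state lives in the waiter's slot rather than in the waitee's
THD, the waiter never needs a pointer into another THD after registration,
which is what keeps the life-time rules manageable.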