[Maria-developers] Review of MDEV-4506, parallel replication, part 1
Hi! Here is the first part of the review. I have 1/2 of one file left (rpl_parallel.cc) to review but I wanted you to have a chance to read what I have done so far. I plan to finish the review tomorrow and start working on the code during the weekend.
"Kristian" == Kristian Nielsen <knielsen@knielsen-hq.org> writes:
Kristian> Hi Monty, Kristian> Here is a copy of the status mail I sent earlier. The status is mostly still Kristian> current. Kristian> The code is in this tree: Kristian> lp:~maria-captains/maria/10.0-knielsen Kristian> In that tree, you can extract the parallel replication code as a single patch Kristian> with a command like this: Kristian> bzr diff -rrevid:knielsen@knielsen-hq.org-20130608103621-g91eaielv4n22aha Kristian> In the forwarded mail, I describe getting a 4.5 times speedup on a quad-core Kristian> laptop, so that's quite promising. === modified file 'mysql-test/r/mysqld--help.result' --- mysql-test/r/mysqld--help.result 2013-05-28 11:28:31 +0000 +++ mysql-test/r/mysqld--help.result 2013-08-29 08:18:22 +0000 <cut> @@ -783,6 +794,11 @@ --slave-net-timeout=# Number of seconds to wait for more data from any master/slave connection before aborting the read + --slave-parallel-threads=# + If non-zero, number of threads to spawn to apply in + parallel events on the slave that were group-committed on + the master or were logged with GTID in different + replication domains. What happens if --slave-parallel-threads is 0 ? Does this mean that there will be no parallel threads? Is this the same as 1 ? If same as 1, why not have the option to take values starting from 1 ? IRC: K> yes, if --slave-parallel-threads is 0, then the feature is disabled. The old code is used, and the SQL thread applies the events itself K> if --slave-parallel-threads is > 0, then a pool of threads are spawned. The SQL threads do not apply events themselves, they just put it in queues for the parallel threads to apply M> do we ever need to do this with only one thread? K> (we might want to give an error if user tries to do this, as it does not make much sense) M> So why not have the value from 1 to N M> where 1 is old code and > 1 is using new code ? K> That might make sense === modified file 'mysql-test/suite/perfschema/r/dml_setup_instruments.result' --- mysql-test/suite/perfschema/r/dml_setup_instruments.result 2013-03-26 09:35:34 +0000 +++ mysql-test/suite/perfschema/r/dml_setup_instruments.result 2013-08-29 08:18:22 +0000 @@ -38,14 +38,14 @@ order by name limit 10; NAME ENABLED TIMED wait/synch/cond/sql/COND_flush_thread_cache YES YES wait/synch/cond/sql/COND_manager YES YES +wait/synch/cond/sql/COND_parallel_entry YES YES +wait/synch/cond/sql/COND_prepare_ordered YES YES wait/synch/cond/sql/COND_queue_state YES YES wait/synch/cond/sql/COND_rpl_status YES YES +wait/synch/cond/sql/COND_rpl_thread YES YES +wait/synch/cond/sql/COND_rpl_thread_pool YES YES wait/synch/cond/sql/COND_server_started YES YES wait/synch/cond/sql/COND_thread_cache YES YES -wait/synch/cond/sql/COND_thread_count YES YES -wait/synch/cond/sql/Delayed_insert::cond YES YES -wait/synch/cond/sql/Delayed_insert::cond_client YES YES -wait/synch/cond/sql/Event_scheduler::COND_state YES YES Any idea why the above was deleted? Don't see anything in your patch that could affect that. <cut> === added file 'mysql-test/suite/rpl/t/rpl_parallel.test' --- mysql-test/suite/rpl/t/rpl_parallel.test 1970-01-01 00:00:00 +0000 +++ mysql-test/suite/rpl/t/rpl_parallel.test 2013-08-29 08:18:22 +0000 I didn't find any test results for rpl_parallel.test or rpl_paralell2 in your tree. Did you forget to commit these? I was also missing some comment in these files what they was testing. === modified file 'sql/log.cc' --- sql/log.cc 2013-06-06 15:51:28 +0000 +++ sql/log.cc 2013-08-29 08:18:22 +0000 @@ -6541,44 +6543,201 @@ MYSQL_BIN_LOG::write_transaction_to_binl } Could you please document in detail the arguments and return value for the following function. - I spent 3 hours to try to understand this function in detail and having a full description of the arguments would have helped a bit. bool -MYSQL_BIN_LOG::write_transaction_to_binlog_events(group_commit_entry *entry) +MYSQL_BIN_LOG::queue_for_group_commit(group_commit_entry *entry, + wait_for_commit *wfc) { + group_commit_entry *orig_queue; + wait_for_commit *list, *cur, *last; + /* To facilitate group commit for the binlog, we first queue up ourselves in the group commit queue. Then the first thread to enter the queue waits for the LOCK_log mutex, and commits for everyone in the queue once it gets the lock. Any other threads in the queue just wait for the first one to finish the commit and wake them up. Do you still use LOCK_log for this? It's a bit confusing to mention LOCK_log here when you take the LOCK_prepare_ordered in the function. If LOCK_log is required, shouldn't there be an assert for it in this function? + + To support in-order parallel replication with group commit, after we add + some transaction to the queue, we check if there were other transactions + already prepared to commit but just waiting for the first one to commit. + If so, we add those to the queue as well, transitively for all waiters. */ entry->thd->clear_wakeup_ready(); mysql_mutex_lock(&LOCK_prepare_ordered); <cut> + orig_queue= group_commit_queue; + + /* + Iteratively process everything added to the queue, looking for waiters, + and their waiters, and so on. If a waiter is ready to commit, we + immediately add it to the queue; if not we just wake it up. + + This would be natural to do with recursion, but we want to avoid + potentially unbounded recursion blowing the C stack, so we use the list + approach instead. Add here: cur->next_subsequent_commit is at this point an empty list. 'last' will always point to the last element of the list. We use this list to simulate recursion. I did find it confusing that here you threat next_subsequent_commit as a way to avoid recursion, but you use this list for other things in register_wait_for_prior_commit(). Suggestion: - If you want to reuse memory for the list, please use an union over next_subsequent_commit and have two names for it, so that you don't in the code use the same name for two different list. + */ + list= wfc; + cur= list; + last= list; + for (;;) + { + /* Add the entry to the group commit queue. */ + entry->next= group_commit_queue; + group_commit_queue= entry; + + if (entry->cache_mngr->using_xa) + { + DEBUG_SYNC(entry->thd, "commit_before_prepare_ordered"); + run_prepare_ordered(entry->thd, entry->all); + DEBUG_SYNC(entry->thd, "commit_after_prepare_ordered"); + } + + if (!cur) + break; // Can happen if initial entry has no wait_for_commit + Add a comment what the following if is testing (As there is several lists involved, the code is not that easy to understand) + if (cur->subsequent_commits_list) + { + bool have_lock; + wait_for_commit *waiter; + + mysql_mutex_lock(&cur->LOCK_wait_commit); + have_lock= true; + waiter= cur->subsequent_commits_list; + /* Check again, now safely under lock. */ + if (waiter) + { + /* Grab the list of waiters and process it. */ + cur->subsequent_commits_list= NULL; + do + { + wait_for_commit *next= waiter->next_subsequent_commit; + group_commit_entry *entry2= + (group_commit_entry *)waiter->opaque_pointer; + if (entry2) + { + /* + This is another transaction ready to be written to the binary + log. We can put it into the queue directly, without needing a + separate context switch to the other thread. We just set a flag + so that the other thread will know when it wakes up that it was + already processed. + + So put it at the end of the list to be processed in a subsequent + iteration of the outer loop. + */ + entry2->queued_by_other= true; + last->next_subsequent_commit= waiter; + last= waiter; + /* + As a small optimisation, we do not actually need to set + waiter->next_subsequent_commit to NULL, as we can use the + pointer `last' to check for end-of-list. + */ + } + else + { + /* + Wake up the waiting transaction. + + For this, we need to set the "wakeup running" flag and release + the waitee lock to avoid a deadlock, see comments on + THD::wakeup_subsequent_commits2() for details. + */ + if (have_lock) + { + cur->wakeup_subsequent_commits_running= true; + mysql_mutex_unlock(&cur->LOCK_wait_commit); + have_lock= false; Move have_lock= false to just after the 'if (have_lock)' - Easier to see that you are really reseting the variable. - Faster as there is no need to reload the variable from stack after mysql_mutex_unlock() function. - I try to nowadays in my code to always reset flag variables just after testing them to achive the above. + } + waiter->wakeup(); + } + waiter= next; + } while (waiter); + } You can optimze away one row, one 'if level' and one 'if' away by doing this: Move: 'cur->subsequent_commits_list= NULL;' before the if. Change 'if (waiter)' to 'while (waiter)' Remove 'do' and the corresponding 'while (waiter);' + if (have_lock) + mysql_mutex_unlock(&cur->LOCK_wait_commit); + } + if (cur == last) + break; Add a comment here of what is in next_subsequent_commit list. + cur= cur->next_subsequent_commit; + entry= (group_commit_entry *)cur->opaque_pointer; + DBUG_ASSERT(entry != NULL); } + + /* Now we need to clear the wakeup_subsequent_commits_running flags. */ + if (list) + { + for (;;) + { + if (list->wakeup_subsequent_commits_running) + { + mysql_mutex_lock(&list->LOCK_wait_commit); + list->wakeup_subsequent_commits_running= false; + mysql_mutex_unlock(&list->LOCK_wait_commit); Why do we need the mutex above? The only place where we test this variable, as far as I can find, is in wait_for_commit::register_wait_for_prior_commit() This thread doesn't know if the other thread is just before mysql_mutex_lock(&waitee->LOCK_wait_commit), inside the lock or after the lock. So when we reset the variable should not matter. But somehow this sounds strange. What happens if the above variable is reset just before mysql_mutex_lock(&waitee->LOCK_wait_commit) in wait_for_commit::register_wait_for_prior_commit() ? Isn't the waitee added to the list when it shouldn't? + } + if (list == last) + break; + list= list->next_subsequent_commit; + } + } + + if (opt_binlog_commit_wait_count > 0) + mysql_cond_signal(&COND_prepare_ordered); mysql_mutex_unlock(&LOCK_prepare_ordered); DEBUG_SYNC(entry->thd, "commit_after_release_LOCK_prepare_ordered"); + return orig_queue == NULL; +} + +bool +MYSQL_BIN_LOG::write_transaction_to_binlog_events(group_commit_entry *entry) +{ + wait_for_commit *wfc; + bool is_leader; + + wfc= entry->thd->wait_for_commit_ptr; + entry->queued_by_other= false; + if (wfc && wfc->waiting_for_commit) + { Add comment, something like: /* We have to wait for the threads in wfc->waiting_for_commit */ + mysql_mutex_lock(&wfc->LOCK_wait_commit); + /* Do an extra check here, this time safely under lock. */ + if (wfc->waiting_for_commit) + { + wfc->opaque_pointer= entry; + do + { + mysql_cond_wait(&wfc->COND_wait_commit, &wfc->LOCK_wait_commit); + } while (wfc->waiting_for_commit); + wfc->opaque_pointer= NULL; + } + mysql_mutex_unlock(&wfc->LOCK_wait_commit); + } <cut> @@ -6597,7 +6756,10 @@ MYSQL_BIN_LOG::write_transaction_to_binl if (next) { - next->thd->signal_wakeup_ready(); + if (next->queued_by_other) + next->thd->wait_for_commit_ptr->wakeup(); What does the above code mean ? (add a comment). Why do we wakeup the thread we where waiting for instead of ourselves? + else + next->thd->signal_wakeup_ready(); } else { <cut> @@ -6968,6 +7140,48 @@ MYSQL_BIN_LOG::write_transaction_or_stmt return 0; } + +void +MYSQL_BIN_LOG::wait_for_sufficient_commits() +{ + size_t count; + group_commit_entry *e; + group_commit_entry *last_head; + struct timespec wait_until; + + mysql_mutex_assert_owner(&LOCK_log); + mysql_mutex_assert_owner(&LOCK_prepare_ordered); + + count= 0; + for (e= last_head= group_commit_queue; e; e= e->next) + ++count; + if (count >= opt_binlog_commit_wait_count) + return; Assuming that the queue is bigger may be better: for (e= last_head= group_commit_queue, count=0; e ; e= e->next) if (++count >= opt_binlog_commit_wait_count) return; + mysql_mutex_unlock(&LOCK_log); Here you are going to take LOCK_log and LOCK_prepare_ordered in the wrong order. safe_mutex() should have noticed this. In other code you are first taking LOCK_log and then LOCK_prepare_ordered. If someone else is calling trx_group_commit_leader() while you are in this code you would get a mutex deadlock. + set_timespec_nsec(wait_until, (ulonglong)1000*opt_binlog_commit_wait_usec); + + for (;;) + { + int err; + group_commit_entry *head; + + err= mysql_cond_timedwait(&COND_prepare_ordered, &LOCK_prepare_ordered, + &wait_until); + if (err == ETIMEDOUT) + break; + head= group_commit_queue; + for (e= head; e && e != last_head; e= e->next) + ++count; + if (count >= opt_binlog_commit_wait_count) + break; + last_head= head; + } + + mysql_mutex_lock(&LOCK_log); +} <cut> === modified file 'sql/log_event.cc' --- sql/log_event.cc 2013-06-06 15:51:28 +0000 +++ sql/log_event.cc 2013-08-29 08:18:22 +0000 <cut> @@ -6101,6 +6110,18 @@ Gtid_log_event::Gtid_log_event(const cha domain_id= uint4korr(buf); buf+= 4; flags2= *buf; + if (flags2 & FL_GROUP_COMMIT_ID) + { + if (event_len < (uint)header_size + GTID_HEADER_LEN + 2) + { + seq_no= 0; // So is_valid() returns false + return; + } + ++buf; + commit_id= uint8korr(buf); + } + else + commit_id= 0; } Setting commit_id=0 first instead in the if will generate smaller and faster code (setting a variable is faster than an a jmp). === modified file 'sql/log_event.h' --- sql/log_event.h 2013-06-06 15:51:28 +0000 +++ sql/log_event.h 2013-08-29 08:18:22 +0000 @@ -1317,9 +1317,9 @@ class Log_event @see do_apply_event */ - int apply_event(Relay_log_info const *rli) + int apply_event(struct rpl_group_info *rgi) What is the simple rule to know when to use Relay_log_info and when to use rpl_group_info ? If I understand things correct, relay_log_info should only be used for direct manipulation of data in the relay log. By the way, you should probably change 'struct rpl_group_info' to just rpl_group_info; Now you are mixing both versions in the code. === modified file 'sql/log_event_old.cc' --- sql/log_event_old.cc 2013-02-19 10:45:29 +0000 +++ sql/log_event_old.cc 2013-08-29 08:18:22 +0000 @@ -67,7 +68,7 @@ Old_rows_log_event::do_apply_event(Old_r do_apply_event(). We still check here to prevent future coding errors. */ - DBUG_ASSERT(rli->sql_thd == ev_thd); + DBUG_ASSERT(rgi->thd == ev_thd); Should we change all test for rli->sql_thd to rgi->thd ? You have at least this test in log_event.c and slave.cc. I assume that rli should not have a sql_thd member anymore ? <cut> === added file 'sql/rpl_parallel.cc' --- sql/rpl_parallel.cc 1970-01-01 00:00:00 +0000 +++ sql/rpl_parallel.cc 2013-08-29 08:18:22 +0000 @@ -0,0 +1,699 @@ +#include "my_global.h" +#include "rpl_parallel.h" +#include "slave.h" +#include "rpl_mi.h" + + +/* + Code for optional parallel execution of replicated events on the slave. + + ToDo list: + + - Review every field in Relay_log_info, and all code that accesses it. + Split out the necessary parts into rpl_group_info, to avoid conflicts + between parallel execution of events. (Such as deferred events ...) + + - Error handling. If we fail in one of multiple parallel executions, we + need to make a best effort to complete prior transactions and roll back + following transactions, so slave binlog position will be correct. + And all the retry logic for temporary errors like deadlock. + + - Stopping the slave needs to handle stopping all parallel executions. And + the logic in sql_slave_killed() that waits for current event group to + complete needs to be extended appropriately... + + - Audit the use of Relay_log_info::data_lock. Make sure it is held + correctly in all needed places also when using parallel replication. + + - We need some user-configurable limit on how far ahead the SQL thread will + fetch and queue events for parallel execution (otherwise if slave gets + behind we will fill up memory with pending malloc()'ed events). + + - Fix update of relay-log.info and master.info. In non-GTID replication, + they must be serialised to preserve correctness. In GTID replication, we + should not update them at all except at slave thread stop. + + - All the waits (eg. in struct wait_for_commit and in + rpl_parallel_thread_pool::get_thread()) need to be killable. And on kill, + everything needs to be correctly rolled back and stopped in all threads, + to ensure a consistent slave replication state. I can add kill checking in all functions that waits for a mutex. Will do it with adding enter_cond()/exit_cond() in all functions waiting for mysql_cond_wait(). + + - Handle the case of a partial event group. This occurs when the master + crashes in the middle of writing the event group to the binlog. The + slave rolls back the transaction; parallel execution needs to be able + to deal with this wrt. commit_orderer and such. + + - Relay_log_info::is_in_group(). This needs to be handled correctly in all + callers. I think it needs to be split into two, one version in + Relay_log_info to be used from next_event() in slave.cc, one to be used in + per-transaction stuff. + + - We should fail if we connect to the master with opt_slave_parallel_threads + greater than zero and master does not support GTID. Just to avoid a bunch + of potential problems, we won't be able to do any parallel replication + in this case anyway. If the master doesn't support GTID, I think we should instead of stopping, just issue a warning in the log and continue with normal replication. Otherwise we would have problems when using multi-source replication if a single slave doesn't support GTID. +*/ + +struct rpl_parallel_thread_pool global_rpl_thread_pool; + + +static void +rpt_handle_event(rpl_parallel_thread::queued_event *qev, + struct rpl_parallel_thread *rpt) +{ + int err; + struct rpl_group_info *rgi= qev->rgi; + Relay_log_info *rli= rgi->rli; + THD *thd= rgi->thd; + + thd->rli_slave= rli; + thd->rpl_filter = rli->mi->rpl_filter; + /* ToDo: Get rid of rli->group_info, it is not thread safe. */ + rli->group_info= rgi; + + /* ToDo: Access to thd, and what about rli, split out a parallel part? */ + mysql_mutex_lock(&rli->data_lock); + err= apply_event_and_update_pos(qev->ev, thd, rgi, rpt); We should also get rid of the thd argument and change the function to be a bool so that we can return an error. + /* ToDo: error handling. */ +} + + +pthread_handler_t +handle_rpl_parallel_thread(void *arg) +{ <cut> + mysql_mutex_lock(&rpt->LOCK_rpl_thread); + rpt->thd= thd; + + while (rpt->delay_start) + mysql_cond_wait(&rpt->COND_rpl_thread, &rpt->LOCK_rpl_thread); A much easier way to achive the wait is to have the caller lock mysql_mutex_lock(&rpt->LOCK_rpl_thread) and then just release it when it's time for this thread to continue. - No need for signaling or checking if the thread got the signal - No need to have a cond wait here - No need to have a delay_start flag + rpt->running= true; + + while (!rpt->stop && !thd->killed) + { + rpl_parallel_thread *list; + + old_msg= thd->proc_info; + thd->enter_cond(&rpt->COND_rpl_thread, &rpt->LOCK_rpl_thread, + "Waiting for work from SQL thread"); + while (!rpt->stop && !thd->killed && !(events= rpt->event_queue)) + mysql_cond_wait(&rpt->COND_rpl_thread, &rpt->LOCK_rpl_thread); Add comment: /* Mark that this thread is now executing */ + rpt->free= false; + rpt->event_queue= rpt->last_in_queue= NULL; + thd->exit_cond(old_msg); + Note that events may be undefined here if thd->killed was set above (for the first loop) It would be good to check for thd->killed or rpt->stop here and then abort the loop. Another question: do we really need rpt->stop ? Isn't it enough with using the thd->killed for this ? Otherwise we would need to check for both rpt->top and thd->killed in every loop that may be aborted. + more_events: + while (events) + { + struct rpl_parallel_thread::queued_event *next= events->next; + Log_event_type event_type= events->ev->get_type_code(); + rpl_group_info *rgi= events->rgi; + rpl_parallel_entry *entry= rgi->parallel_entry; + uint64 wait_for_sub_id; + uint64 wait_start_sub_id; + bool end_of_group; + Add a comment why this is handled here and not in rpt_handle_event() + if (event_type == GTID_EVENT) + { + in_event_group= true; Add a comment what the following variable stands for + group_standalone= + (0 != (static_cast<Gtid_log_event *>(events->ev)->flags2 & + Gtid_log_event::FL_STANDALONE)); + Rest of this file will be reviewed tomorrow. <cut> === modified file 'sql/rpl_rli.cc' --- sql/rpl_rli.cc 2013-06-06 15:51:28 +0000 +++ sql/rpl_rli.cc 2013-08-29 08:18:22 +0000 @@ -1194,13 +1194,15 @@ bool Relay_log_info::cached_charset_comp void Relay_log_info::stmt_done(my_off_t event_master_log_pos, - time_t event_creation_time, THD *thd) + time_t event_creation_time, THD *thd, + struct rpl_group_info *rgi) Do we need thd anymore as this is in rgi->thd? If they are not the same, please add a comment about this. It's not otherwise clear when to use rgi->thd and thd as this function uses both. @@ -1265,7 +1267,7 @@ void Relay_log_info::cleanup_context(THD { DBUG_ENTER("Relay_log_info::cleanup_context"); - DBUG_ASSERT(sql_thd == thd); + DBUG_ASSERT(opt_slave_parallel_threads > 0 || sql_thd == thd); Add a comment which THD is passed to the function. <cut> + +int +event_group_new_gtid(rpl_group_info *rgi, Gtid_log_event *gev) +{ + uint64 sub_id= rpl_global_gtid_slave_state.next_subid(gev->domain_id); + if (!sub_id) + { Add a comment that this is an end of memory error. <cut> + +void +delete_or_keep_event_post_apply(Relay_log_info *rli, + Log_event_type typ, Log_event *ev) +{ + /* + ToDo: This needs to work on rpl_group_info, not Relay_log_info, to be + thread-safe for parallel replication. + */ + switch (typ) { + case FORMAT_DESCRIPTION_EVENT: + /* + Format_description_log_event should not be deleted because it + will be used to read info about the relay log's format; + it will be deleted when the SQL thread does not need it, + i.e. when this thread terminates. + */ + break; + case ANNOTATE_ROWS_EVENT: + /* + Annotate_rows event should not be deleted because after it has + been applied, thd->query points to the string inside this event. + The thd->query will be used to generate new Annotate_rows event + during applying the subsequent Rows events. + */ + rli->set_annotate_event((Annotate_rows_log_event*) ev); I assume we should create relay_group_info->set_annotate_event instead of the above. + break; + case DELETE_ROWS_EVENT: + case UPDATE_ROWS_EVENT: + case WRITE_ROWS_EVENT: + /* + After the last Rows event has been applied, the saved Annotate_rows + event (if any) is not needed anymore and can be deleted. + */ + if (((Rows_log_event*)ev)->get_flags(Rows_log_event::STMT_END_F)) + rli->free_annotate_event(); + /* fall through */ + default: + DBUG_PRINT("info", ("Deleting the event after it has been executed")); + if (!rli->is_deferred_event(ev)) + delete ev; I assume that is_deferred_event should also be moved to relay_group_info. + break; + } +} + #endif === modified file 'sql/rpl_rli.h' --- sql/rpl_rli.h 2013-06-06 15:51:28 +0000 +++ sql/rpl_rli.h 2013-08-29 08:18:22 +0000 @@ -22,6 +22,7 @@ #include "log.h" /* LOG_INFO, MYSQL_BIN_LOG */ #include "sql_class.h" /* THD */ #include "log_event.h" +#include "rpl_parallel.h" struct RPL_TABLE_LIST; class Master_info; @@ -52,6 +53,8 @@ class Master_info; *****************************************************************************/ +struct rpl_group_info; + class Relay_log_info : public Slave_reporting_capability { public: @@ -311,13 +314,9 @@ class Relay_log_info : public Slave_repo char slave_patternload_file[FN_REFLEN]; size_t slave_patternload_file_size; - /* - Current GTID being processed. - The sub_id gives the binlog order within one domain_id. A zero sub_id - means that there is no active GTID. - */ - uint64 gtid_sub_id; - rpl_gtid current_gtid; + /* ToDo: We need to remove this, always use the per-transaction one to work with parallel replication. */ + struct rpl_group_info *group_info; + rpl_parallel parallel; Relay_log_info(bool is_slave_recovery); ~Relay_log_info(); @@ -459,7 +458,8 @@ class Relay_log_info : public Slave_repo the <code>Seconds_behind_master</code> field. */ void stmt_done(my_off_t event_log_pos, - time_t event_creation_time, THD *thd); + time_t event_creation_time, THD *thd, + struct rpl_group_info *rgi); /** @@ -594,6 +594,60 @@ class Relay_log_info : public Slave_repo }; +/* + This is data for various state needed to be kept for the processing of + one event group in the SQL thread. + + For single-threaded replication it is linked from the RLI, for parallel + replication it is linked into each event group being executed in parallel. +*/ Could you please rewrite the above to make it more clear. After discussing with you, I think the following could be added: /* In single-threaded replication, there will be one global rpl_group_info and one global RPLI per master connection. They will be linked together. In parallel replication, there will be one rpl_group_info object for each running thd. All rpl_group_info will share the same RLI. */ +struct rpl_group_info +{ + Relay_log_info *rli; + THD *thd; <cut> === modified file 'sql/sql_class.cc' --- sql/sql_class.cc 2013-06-06 15:51:28 +0000 +++ sql/sql_class.cc 2013-08-29 08:18:22 +0000 +void +wait_for_commit::wakeup() +{ + /* + We signal each waiter on their own condition and mutex (rather than using + pthread_cond_broadcast() or something like that). + + Otherwise we would need to somehow ensure that they were done + waking up before we could allow this THD to be destroyed, which would + be annoying and unnecessary. + */ + mysql_mutex_lock(&LOCK_wait_commit); + waiting_for_commit= false; + mysql_cond_signal(&COND_wait_commit); + mysql_mutex_unlock(&LOCK_wait_commit); +} In this particular case it looks safe to move the cond signal out of the mutex. I don't see how anyone could miss the signal. + + +/* + Register that the next commit of this THD should wait to complete until + commit in another THD (the waitee) has completed. + + The wait may occur explicitly, with the waiter sitting in + wait_for_prior_commit() until the waitee calls wakeup_subsequent_commits(). + + Alternatively, the TC (eg. binlog) may do the commits of both waitee and + waiter at once during group commit, resolving both of them in the right + order. + + Only one waitee can be registered for a waiter; it must be removed by + wait_for_prior_commit() or unregister_wait_for_prior_commit() before a new + one is registered. But it is ok for several waiters to register a wait for + the same waitee. It is also permissible for one THD to be both a waiter and + a waitee at the same time. +*/ +void +wait_for_commit::register_wait_for_prior_commit(wait_for_commit *waitee) +{ + waiting_for_commit= true; + DBUG_ASSERT(!this->waitee /* No prior registration allowed */); + this->waitee= waitee; + + mysql_mutex_lock(&waitee->LOCK_wait_commit); + /* + If waitee is in the middle of wakeup, then there is nothing to wait for, + so we need not register. This is necessary to avoid a race in unregister, + see comments on wakeup_subsequent_commits2() for details. + */ + if (waitee->wakeup_subsequent_commits_running) + waiting_for_commit= false; + else + { Please add a comment here what the following code is doing. As this is two different lists involved, it's not trivial code. You also seam to use the list next_subsequent_commit for something different in queue_for_group_commit() code. + this->next_subsequent_commit= waitee->subsequent_commits_list; + waitee->subsequent_commits_list= this; + } + mysql_mutex_unlock(&waitee->LOCK_wait_commit); +} <cut> +/* + Wakeup anyone waiting for us to have committed. + + Note about locking: + + We have a potential race or deadlock between wakeup_subsequent_commits() in + the waitee and unregister_wait_for_prior_commit() in the waiter. + + Both waiter and waitee needs to take their own lock before it is safe to take + a lock on the other party - else the other party might disappear and invalid + memory data could be accessed. But if we take the two locks in different + order, we may end up in a deadlock. + + The waiter needs to lock the waitee to delete itself from the list in + unregister_wait_for_prior_commit(). Thus wakeup_subsequent_commits() can not + hold its own lock while locking waiters, lest we deadlock. lest ? + So we need to prevent unregister_wait_for_prior_commit() running while wakeup + is in progress - otherwise the unregister could complete before the wakeup, + leading to incorrect spurious wakeup or accessing invalid memory. + <cut> + Then also register_wait_for_prior_commit() needs to check if + wakeup_subsequent_commits() is running, and skip the registration if + so. This is needed in case a new waiter manages to register itself and + immediately try to unregister while wakeup_subsequent_commits() is + running. Else the new waiter would also wait rather than unregister, but it + would not be woken up until next wakeup, which could be potentially much + later than necessary. +*/ Add new line here +void +wait_for_commit::wakeup_subsequent_commits2() +{ <cut> === modified file 'sql/sql_class.h' --- sql/sql_class.h 2013-06-06 15:51:28 +0000 +++ sql/sql_class.h 2013-08-29 08:18:22 +0000 @@ -1553,6 +1553,116 @@ class Global_read_lock }; +/* + Class to facilitate the commit of one transactions waiting for the commit of + another transaction to complete first. + + This is used during (parallel) replication, to allow different transactions + to be applied in parallel, but still commit in order. + + The transaction that wants to wait for a prior commit must first register + to wait with register_wait_for_prior_commit(waitee). Such registration + must be done holding the waitee->LOCK_wait_commit, to prevent the other + THD from disappearing during the registration. + + Then during commit, if a THD is registered to wait, it will call + wait_for_prior_commit() as part of ha_commit_trans(). If no wait is + registered, or if the waitee for has already completed commit, then + wait_for_prior_commit() returns immediately. + + And when a THD that may be waited for has completed commit (more precisely + commit_ordered()), then it must call wakeup_subsequent_commits() to wake + up any waiters. Note that this must be done at a point that is guaranteed + to be later than any waiters registering themselves. It is safe to call + wakeup_subsequent_commits() multiple times, as waiters are removed from + registration as part of the wakeup. + + The reason for separate register and wait calls is that this allows to + register the wait early, at a point where the waited-for THD is known to + exist. And then the actual wait can be done much later, where the + waited-for THD may have been long gone. By registering early, the waitee + can signal before disappearing. +*/ Instead of taking locks to ensure that THD will disapper, why not keep all THD objects around until all transactions in a group has been executed? +struct wait_for_commit +{ <cut> + bool waiting_for_commit; + /* + Flag set when wakeup_subsequent_commits_running() is active, see commonts commonts -> comments === modified file 'sql/sys_vars.cc' --- sql/sys_vars.cc 2013-06-07 07:31:11 +0000 +++ sql/sys_vars.cc 2013-08-29 08:18:22 +0000 <cut> +static bool +check_slave_parallel_threads(sys_var *self, THD *thd, set_var *var) +{ + bool running; + + mysql_mutex_lock(&LOCK_active_mi); + running= master_info_index->give_error_if_slave_running(); + mysql_mutex_unlock(&LOCK_active_mi); + if (running) + return true; + + return false; +} + +static bool +fix_slave_parallel_threads(sys_var *self, THD *thd, enum_var_type type) +{ + bool running; + bool err= false; + + mysql_mutex_unlock(&LOCK_global_system_variables); + mysql_mutex_lock(&LOCK_active_mi); + running= master_info_index->give_error_if_slave_running(); + mysql_mutex_unlock(&LOCK_active_mi); You can call check_slave_parallel_threads() above to reuse code. + if (running || rpl_parallel_change_thread_count(&global_rpl_thread_pool, + opt_slave_parallel_threads)) + err= true; + mysql_mutex_lock(&LOCK_global_system_variables); + + return err; +} + + +static Sys_var_ulong Sys_slave_parallel_threads( + "slave_parallel_threads", + "If non-zero, number of threads to spawn to apply in parallel events " + "on the slave that were group-committed on the master or were logged " + "with GTID in different replication domains.", + GLOBAL_VAR(opt_slave_parallel_threads), CMD_LINE(REQUIRED_ARG), + VALID_RANGE(0,16383), DEFAULT(0), BLOCK_SIZE(1), NO_MUTEX_GUARD, + NOT_IN_BINLOG, ON_CHECK(check_slave_parallel_threads), + ON_UPDATE(fix_slave_parallel_threads)); #endif We have talked about having 1 to be the minimum, which would give the old behavior. <cut> Overal things looks very good and it should not be too much work to add the things you have written on the todo part.a Regards Monty
Monty, Thanks for the comments so far. I have pushed fixes to 10.0-knielsen, and some specific replies are below. - Kristian. Michael Widenius <monty@askmonty.org> writes:
+ --slave-parallel-threads=# + If non-zero, number of threads to spawn to apply in + parallel events on the slave that were group-committed on + the master or were logged with GTID in different + replication domains.
What happens if --slave-parallel-threads is 0 ? Does this mean that there will be no parallel threads? Is this the same as 1 ?
If same as 1, why not have the option to take values starting from 1 ?
I think it is logical that 0 means disabled and >0 means enabled with that many threads. But I don't mind using 1 as disabled and needing >=2 for enabled. Though the ability to use a pool with just 1 thread might be useful for testing. It doesn't seem critical, maybe we can revisit this if time allows when the basic stuff is done.
=== modified file 'mysql-test/suite/perfschema/r/dml_setup_instruments.result' --- mysql-test/suite/perfschema/r/dml_setup_instruments.result 2013-03-26 09:35:34 +0000 +++ mysql-test/suite/perfschema/r/dml_setup_instruments.result 2013-08-29 08:18:22 +0000 @@ -38,14 +38,14 @@ order by name limit 10; NAME ENABLED TIMED wait/synch/cond/sql/COND_flush_thread_cache YES YES wait/synch/cond/sql/COND_manager YES YES +wait/synch/cond/sql/COND_parallel_entry YES YES +wait/synch/cond/sql/COND_prepare_ordered YES YES wait/synch/cond/sql/COND_queue_state YES YES wait/synch/cond/sql/COND_rpl_status YES YES +wait/synch/cond/sql/COND_rpl_thread YES YES +wait/synch/cond/sql/COND_rpl_thread_pool YES YES wait/synch/cond/sql/COND_server_started YES YES wait/synch/cond/sql/COND_thread_cache YES YES -wait/synch/cond/sql/COND_thread_count YES YES -wait/synch/cond/sql/Delayed_insert::cond YES YES -wait/synch/cond/sql/Delayed_insert::cond_client YES YES -wait/synch/cond/sql/Event_scheduler::COND_state YES YES
Any idea why the above was deleted? Don't see anything in your patch that could affect that.
I believe it does a SELECT with a LIMIT - so when new stuff is added before the limit, stuff below gets pushed out. But if you ask why the test is that way, I cannot answer. The purpose of some of those perfschema tests is somewhat of a mystery, they seem to serve little other purpose than add maintenance cost for result file update whenever something changes ...
I didn't find any test results for rpl_parallel.test or rpl_paralell2 in your tree. Did you forget to commit these?
There was no time to work on test cases yet. As we discussed I can look into that as the next thing next week.
@@ -6541,44 +6543,201 @@ MYSQL_BIN_LOG::write_transaction_to_binl }
Could you please document in detail the arguments and return value for the following function.
I added the following comment: /* Put a transaction that is ready to commit in the group commit queue. The transaction is identified by the ENTRY object passed into this function. To facilitate group commit for the binlog, we first queue up ourselves in this function. Then later the first thread to enter the queue waits for the LOCK_log mutex, and commits for everyone in the queue once it gets the lock. Any other threads in the queue just wait for the first one to finish the commit and wake them up. This way, all transactions in the queue get committed in a single disk operation. The return value of this function is TRUE if queued as the first entry in the queue (meaning this is the leader), FALSE otherwise. The main work in this function is when the commit in one transaction has been marked to wait for the commit of another transaction to happen first. This is used to support in-order parallel replication, where transactions can execute out-of-order but need to be committed in-order with how they happened on the master. The waiting of one commit on another needs to be integrated with the group commit queue, to ensure that the waiting transaction can participate in the same group commit as the waited-for transaction. So when we put a transaction in the queue, we check if there were other transactions already prepared to commit but just waiting for the first one to commit. If so, we add those to the queue as well, transitively for all waiters. */ bool MYSQL_BIN_LOG::queue_for_group_commit(group_commit_entry *entry) I also moved some code from the caller into this function, which clarifies things by keeping related code together. And I added more comments about what is going on with the different list pointers, and who is doing what.
/* To facilitate group commit for the binlog, we first queue up ourselves in the group commit queue. Then the first thread to enter the queue waits for the LOCK_log mutex, and commits for everyone in the queue once it gets the lock. Any other threads in the queue just wait for the first one to finish the commit and wake them up.
Do you still use LOCK_log for this? It's a bit confusing to mention LOCK_log here when you take the LOCK_prepare_ordered in the function.
Yeah, this comment got moved around and is no longer appropriate. I tried to clarify it (the taking of LOCK_log happens later, after this function).
I did find it confusing that here you threat next_subsequent_commit as a way to avoid recursion, but you use this list for other things in register_wait_for_prior_commit().
I added this comment: We keep a list of all the waiters that need to be processed in `list', linked through the next_subsequent_commit pointer. Initially this list contains only the entry passed into this function. We process entries in the list one by one. The element currently being processed is pointed to by `cur`, and the element at the end of the list is pointed to by `last` (we do not use NULL to terminate the list). As we process an element, it is first added to the group_commit_queue. Then any waiters for that element are added at the end of the list, to be processed in subsequent iterations. This continues until the list is exhausted, with all elements ever added eventually processed. The end result is a breath-first traversal of the tree of waiters, re-using the next_subsequent_commit pointers in place of extra stack space in a recursive traversal.
+ /* Now we need to clear the wakeup_subsequent_commits_running flags. */ + if (list) + { + for (;;) + { + if (list->wakeup_subsequent_commits_running) + { + mysql_mutex_lock(&list->LOCK_wait_commit); + list->wakeup_subsequent_commits_running= false; + mysql_mutex_unlock(&list->LOCK_wait_commit);
Why do we need the mutex above? The only place where we test this variable, as far as I can find, is in wait_for_commit::register_wait_for_prior_commit()
Yes, it looks strange. Unfortunately I cannot remember the reason 100% right now (I wrote this code 9 months ago). I remember that there _was_ a reason. I should have written a comment to explain this, shame on me! I think the code basically needs to do the same as wait_for_commit::wakeup_subsequent_commits2(), but integrated with the tree traversal. I agree this needs clarification and looks suspicious, but I wanted to get the rest of the comments sent to you today, I will try to re-visit this later (and the similar code in wakeup_subsequent_commits2()).
This thread doesn't know if the other thread is just before @@ -6597,7 +6756,10 @@ MYSQL_BIN_LOG::write_transaction_to_binl
if (next) { - next->thd->signal_wakeup_ready(); + if (next->queued_by_other) + next->thd->wait_for_commit_ptr->wakeup();
What does the above code mean ? (add a comment).
I added this comment: /* Wake up the next thread in the group commit. The next thread can be waiting in two different ways, depending on whether it put itself in the queue, or if it was put in queue by us because it had to wait for us to commit first. So execute the appropriate wakeup, identified by the queued_by_other field. */
+ mysql_mutex_unlock(&LOCK_log);
Here you are going to take LOCK_log and LOCK_prepare_ordered in the wrong order. safe_mutex() should have noticed this.
In other code you are first taking LOCK_log and then LOCK_prepare_ordered. If someone else is calling trx_group_commit_leader() while you are in this code you would get a mutex deadlock.
Well spottet, thanks. I fixed it like this: /* We must not wait for LOCK_log while holding LOCK_prepare_ordered. LOCK_log can be held for long periods (eg. we do I/O under it), while LOCK_prepare_ordered must only be held for short periods. In addition, waiting for LOCK_log while holding LOCK_prepare_ordered would violate locking order of LOCK_log-before-LOCK_prepare_ordered. This could cause SAFEMUTEX warnings (even if it cannot actually deadlock with current code, as there can be at most one group commit leader thread at a time). So release and re-acquire LOCK_prepare_ordered if we need to wait for the LOCK_log. */ if (!mysql_mutex_trylock(&LOCK_log)) { mysql_mutex_unlock(&LOCK_prepare_ordered); mysql_mutex_lock(&LOCK_log); mysql_mutex_lock(&LOCK_prepare_ordered); }
=== modified file 'sql/log_event.h' --- sql/log_event.h 2013-06-06 15:51:28 +0000 +++ sql/log_event.h 2013-08-29 08:18:22 +0000 @@ -1317,9 +1317,9 @@ class Log_event
@see do_apply_event */ - int apply_event(Relay_log_info const *rli) + int apply_event(struct rpl_group_info *rgi)
What is the simple rule to know when to use Relay_log_info and when to use rpl_group_info ?
Relay_log_info is for stuff that needs to be shared among all events executing in parallel (eg. current position in relay log). rpl_group_info is for things that should be private to each event group being replicated in parallel (such as the THD).
=== modified file 'sql/log_event_old.cc' --- sql/log_event_old.cc 2013-02-19 10:45:29 +0000 +++ sql/log_event_old.cc 2013-08-29 08:18:22 +0000 @@ -67,7 +68,7 @@ Old_rows_log_event::do_apply_event(Old_r do_apply_event(). We still check here to prevent future coding errors. */ - DBUG_ASSERT(rli->sql_thd == ev_thd); + DBUG_ASSERT(rgi->thd == ev_thd);
Should we change all test for rli->sql_thd to rgi->thd ? You have at least this test in log_event.c and slave.cc.
I assume that rli should not have a sql_thd member anymore ?
Yes, this is work in progress, to check all fields in the old Relay_log_info class and move away those that should use rpl_group_info instead. And yes, rli->sql_thd should go away. I suspect we might still need a THD somewhere in the Relay_log_info for the main SQL thread (not sure), but if we do it is probably best to rename it so that we do not later accidentally merge code without fixing any references to sql_thd (which will most likely be wrong).
+ - We should fail if we connect to the master with opt_slave_parallel_threads + greater than zero and master does not support GTID. Just to avoid a bunch + of potential problems, we won't be able to do any parallel replication + in this case anyway.
If the master doesn't support GTID, I think we should instead of stopping, just issue a warning in the log and continue with normal replication.
Agree, that sounds right.
A much easier way to achive the wait is to have the caller lock mysql_mutex_lock(&rpt->LOCK_rpl_thread) and then just release it when it's time for this thread to continue. - No need for signaling or checking if the thread got the signal - No need to have a cond wait here - No need to have a delay_start flag
I consider this method simpler and clearer. But it is not critical code, feel free to change it if you like.
Note that events may be undefined here if thd->killed was set above (for the first loop)
Ok, fixed.
It would be good to check for thd->killed or rpt->stop here and then abort the loop.
Another question: do we really need rpt->stop ? Isn't it enough with using the thd->killed for this ?
I think there is a need to generally go through and check for proper handling of thd->killed, not just here but in all cases. I did not have time to do this yet, and I am in any case a bit unsure about exactly how kill semantics should work. Now that I think about it, shouldn't they remain different? I think thd->killed is set if user does an explicit KILL of the thread? This should then abort and fail the currently executing transaction. But it should not cause the thread to terminate, should it? That could leave us with no threads at all in the replication thread pool, which would cause replication to hang. Or should the thread exit and be re-spawned? This needs more thought, I think ... certainly something looks not right.
void Relay_log_info::stmt_done(my_off_t event_master_log_pos, - time_t event_creation_time, THD *thd) + time_t event_creation_time, THD *thd, + struct rpl_group_info *rgi)
Do we need thd anymore as this is in rgi->thd?
If they are not the same, please add a comment about this. It's not otherwise clear when to use rgi->thd and thd as this function uses both.
This is ongoing work, so far I've just added the rgi parameter on some functions that needed it. It is probably true that we do not need to THD argument (on the other hand it was probably not needed before either, as we have rli->sql_thd).
=== modified file 'sql/sql_class.cc' --- sql/sql_class.cc 2013-06-06 15:51:28 +0000 +++ sql/sql_class.cc 2013-08-29 08:18:22 +0000
+void +wait_for_commit::wakeup() +{ + /* + We signal each waiter on their own condition and mutex (rather than using + pthread_cond_broadcast() or something like that). + + Otherwise we would need to somehow ensure that they were done + waking up before we could allow this THD to be destroyed, which would + be annoying and unnecessary. + */ + mysql_mutex_lock(&LOCK_wait_commit); + waiting_for_commit= false; + mysql_cond_signal(&COND_wait_commit); + mysql_mutex_unlock(&LOCK_wait_commit); +}
In this particular case it looks safe to move the cond signal out of the mutex. I don't see how anyone could miss the signal.
Probably. Generally I prefer to write it as above, to not have to make sure that moving it out is safe.
+ The waiter needs to lock the waitee to delete itself from the list in + unregister_wait_for_prior_commit(). Thus wakeup_subsequent_commits() can not + hold its own lock while locking waiters, lest we deadlock.
lest ?
https://en.wiktionary.org/wiki/lest But I've changed it to use a more common English wording "as this could lead to deadlock".
+ The reason for separate register and wait calls is that this allows to + register the wait early, at a point where the waited-for THD is known to + exist. And then the actual wait can be done much later, where the + waited-for THD may have been long gone. By registering early, the waitee + can signal before disappearing. +*/
Instead of taking locks to ensure that THD will disapper, why not keep all THD objects around until all transactions in a group has been executed?
I think it is not just about THD disappearing. IIRC, the more important problem is that the THD may have moved on to do something different. So even if the data in the other THD is not freed, it may still contain completely unrelated data (and thus wrong data, for the accessing thread). But this is actually something I consider a very important thing, and one that I spent quite some effort on and was quite pleased to get in the current state. If arbitrary THD can have pointers into arbitrary other THD, the life-time issues can quickly get very nasty, and it is important to handle this well to avoid having to understand the entire codebase to avoid subtle bugs.
Hi!
"Kristian" == Kristian Nielsen <knielsen@knielsen-hq.org> writes:
Kristian> Monty, Kristian> Thanks for the comments so far. Kristian> I have pushed fixes to 10.0-knielsen, and some specific replies are below. Thanks. I will review that today. Kristian> - Kristian. Kristian> Michael Widenius <monty@askmonty.org> writes:
+ --slave-parallel-threads=# + If non-zero, number of threads to spawn to apply in + parallel events on the slave that were group-committed on + the master or were logged with GTID in different + replication domains.
What happens if --slave-parallel-threads is 0 ? Does this mean that there will be no parallel threads? Is this the same as 1 ?
If same as 1, why not have the option to take values starting from 1 ?
Kristian> I think it is logical that 0 means disabled and >0 means enabled with that Kristian> many threads. But I don't mind using 1 as disabled and needing >=2 for Kristian> enabled. Though the ability to use a pool with just 1 thread might be useful Kristian> for testing. It doesn't seem critical, maybe we can revisit this if time Kristian> allows when the basic stuff is done. I come up with one case where 0 is better - In multi-source 0 would mean that we have one thread per connection. You can't get that behaviour if we change the meaning of the slave-parallel-thread variable.
=== modified file 'mysql-test/suite/perfschema/r/dml_setup_instruments.result' --- mysql-test/suite/perfschema/r/dml_setup_instruments.result 2013-03-26 09:35:34 +0000 +++ mysql-test/suite/perfschema/r/dml_setup_instruments.result 2013-08-29 08:18:22 +0000 @@ -38,14 +38,14 @@ order by name limit 10; NAME ENABLED TIMED wait/synch/cond/sql/COND_flush_thread_cache YES YES wait/synch/cond/sql/COND_manager YES YES +wait/synch/cond/sql/COND_parallel_entry YES YES +wait/synch/cond/sql/COND_prepare_ordered YES YES wait/synch/cond/sql/COND_queue_state YES YES wait/synch/cond/sql/COND_rpl_status YES YES +wait/synch/cond/sql/COND_rpl_thread YES YES +wait/synch/cond/sql/COND_rpl_thread_pool YES YES wait/synch/cond/sql/COND_server_started YES YES wait/synch/cond/sql/COND_thread_cache YES YES -wait/synch/cond/sql/COND_thread_count YES YES -wait/synch/cond/sql/Delayed_insert::cond YES YES -wait/synch/cond/sql/Delayed_insert::cond_client YES YES -wait/synch/cond/sql/Event_scheduler::COND_state YES YES
Any idea why the above was deleted? Don't see anything in your patch that could affect that.
Kristian> I believe it does a SELECT with a LIMIT - so when new stuff is added before Kristian> the limit, stuff below gets pushed out. Kristian> But if you ask why the test is that way, I cannot answer. The purpose of some Kristian> of those perfschema tests is somewhat of a mystery, they seem to serve little Kristian> other purpose than add maintenance cost for result file update whenever Kristian> something changes ... ok, I missed the 'limit part'. I can only say 'stupid test' about this... <cut>
@@ -6541,44 +6543,201 @@ MYSQL_BIN_LOG::write_transaction_to_binl }
Could you please document in detail the arguments and return value for the following function.
Kristian> I added the following comment: Kristian> /* Kristian> Put a transaction that is ready to commit in the group commit queue. Kristian> The transaction is identified by the ENTRY object passed into this function. Kristian> To facilitate group commit for the binlog, we first queue up ourselves in Kristian> this function. Then later the first thread to enter the queue waits for Kristian> the LOCK_log mutex, and commits for everyone in the queue once it gets the Kristian> lock. Any other threads in the queue just wait for the first one to finish Kristian> the commit and wake them up. This way, all transactions in the queue get Kristian> committed in a single disk operation. Kristian> The return value of this function is TRUE if queued as the first entry in Kristian> the queue (meaning this is the leader), FALSE otherwise. Kristian> The main work in this function is when the commit in one transaction has Kristian> been marked to wait for the commit of another transaction to happen Kristian> first. This is used to support in-order parallel replication, where Kristian> transactions can execute out-of-order but need to be committed in-order with Kristian> how they happened on the master. The waiting of one commit on another needs Kristian> to be integrated with the group commit queue, to ensure that the waiting Kristian> transaction can participate in the same group commit as the waited-for Kristian> transaction. Kristian> So when we put a transaction in the queue, we check if there were other Kristian> transactions already prepared to commit but just waiting for the first one Kristian> to commit. If so, we add those to the queue as well, transitively for all Kristian> waiters. Kristian> */ Kristian> bool Kristian> MYSQL_BIN_LOG::queue_for_group_commit(group_commit_entry *entry) Kristian> I also moved some code from the caller into this function, which clarifies Kristian> things by keeping related code together. And I added more comments about what Kristian> is going on with the different list pointers, and who is doing what. Thanks. That will help a lot in making thins easier to understand. <cut>
I did find it confusing that here you threat next_subsequent_commit as a way to avoid recursion, but you use this list for other things in register_wait_for_prior_commit().
Kristian> I added this comment: Kristian> We keep a list of all the waiters that need to be processed in `list', Kristian> linked through the next_subsequent_commit pointer. Initially this list Kristian> contains only the entry passed into this function. Kristian> We process entries in the list one by one. The element currently being Kristian> processed is pointed to by `cur`, and the element at the end of the list Kristian> is pointed to by `last` (we do not use NULL to terminate the list). Kristian> As we process an element, it is first added to the group_commit_queue. Kristian> Then any waiters for that element are added at the end of the list, to Kristian> be processed in subsequent iterations. This continues until the list Kristian> is exhausted, with all elements ever added eventually processed. Kristian> The end result is a breath-first traversal of the tree of waiters, Kristian> re-using the next_subsequent_commit pointers in place of extra stack Kristian> space in a recursive traversal. Great. The one thing I am missing was: The temporary list created in next_subsequent_commit is not used by the caller or any other function.
+ /* Now we need to clear the wakeup_subsequent_commits_running flags. */ + if (list) + { + for (;;) + { + if (list->wakeup_subsequent_commits_running) + { + mysql_mutex_lock(&list->LOCK_wait_commit); + list->wakeup_subsequent_commits_running= false; + mysql_mutex_unlock(&list->LOCK_wait_commit);
Why do we need the mutex above? The only place where we test this variable, as far as I can find, is in wait_for_commit::register_wait_for_prior_commit()
Kristian> Yes, it looks strange. Unfortunately I cannot remember the reason 100% right Kristian> now (I wrote this code 9 months ago). I remember that there _was_ a reason. I Kristian> should have written a comment to explain this, shame on me! Kristian> I think the code basically needs to do the same as Kristian> wait_for_commit::wakeup_subsequent_commits2(), but integrated with the tree Kristian> traversal. Kristian> I agree this needs clarification and looks suspicious, but I wanted to get the Kristian> rest of the comments sent to you today, I will try to re-visit this later (and Kristian> the similar code in wakeup_subsequent_commits2()). ok. Please take a look at this when you are ready with other things. (Add at least a 'todo tag to this code). <cut>
A much easier way to achive the wait is to have the caller lock mysql_mutex_lock(&rpt->LOCK_rpl_thread) and then just release it when it's time for this thread to continue. - No need for signaling or checking if the thread got the signal - No need to have a cond wait here - No need to have a delay_start flag
Kristian> I consider this method simpler and clearer. But it is not critical code, feel Kristian> free to change it if you like. I will do this change, probably today. It will make some of the code a lot simpler. Wait and see... <cut>
Another question: do we really need rpt->stop ? Isn't it enough with using the thd->killed for this ?
Kristian> I think there is a need to generally go through and check for proper handling Kristian> of thd->killed, not just here but in all cases. I did not have time to do this Kristian> yet, and I am in any case a bit unsure about exactly how kill semantics should Kristian> work. Kristian> Now that I think about it, shouldn't they remain different? Kristian> I think thd->killed is set if user does an explicit KILL of the thread? This Kristian> should then abort and fail the currently executing transaction. But it should Kristian> not cause the thread to terminate, should it? That could leave us with no Kristian> threads at all in the replication thread pool, which would cause replication Kristian> to hang. Or should the thread exit and be re-spawned? What should happen if you kill a replication thread is that replication should stop for that master. Kristian> This needs more thought, I think ... certainly something looks not right. After looking at the full code, I think that the logical way things should work is: 'stop' is to be used when you want to nicely take done replication. This means that the current commit groups should be given time to finish. thd->killed should mean that we should stop ASAP. - All not commited things should abort. This is needed in a 'panic shutdown' (like out soon-out-of- power) or when trying to kill the replication thread when one notices that something went horribly wrong (like ALTER TABLE stopping replication).
=== modified file 'sql/sql_class.cc' --- sql/sql_class.cc 2013-06-06 15:51:28 +0000 +++ sql/sql_class.cc 2013-08-29 08:18:22 +0000
+void +wait_for_commit::wakeup() +{ + /* + We signal each waiter on their own condition and mutex (rather than using + pthread_cond_broadcast() or something like that). + + Otherwise we would need to somehow ensure that they were done + waking up before we could allow this THD to be destroyed, which would + be annoying and unnecessary. + */ + mysql_mutex_lock(&LOCK_wait_commit); + waiting_for_commit= false; + mysql_cond_signal(&COND_wait_commit); + mysql_mutex_unlock(&LOCK_wait_commit); +}
In this particular case it looks safe to move the cond signal out of the mutex. I don't see how anyone could miss the signal.
Kristian> Probably. Generally I prefer to write it as above, to not have to make sure Kristian> that moving it out is safe. I agree that this is the way to do in general. However for cases where we are waiting almost once per query, we need to make things faster as this extra wakup will take notable resources!
+ The waiter needs to lock the waitee to delete itself from the list in + unregister_wait_for_prior_commit(). Thus wakeup_subsequent_commits() can not + hold its own lock while locking waiters, lest we deadlock.
lest ?
Kristian> https://en.wiktionary.org/wiki/lest Kristian> But I've changed it to use a more common English wording "as this could lead Kristian> to deadlock". Thanks. It's easier to read the code when you don't need to consult a dictionary...
+ The reason for separate register and wait calls is that this allows to + register the wait early, at a point where the waited-for THD is known to + exist. And then the actual wait can be done much later, where the + waited-for THD may have been long gone. By registering early, the waitee + can signal before disappearing. +*/
Instead of taking locks to ensure that THD will disapper, why not keep all THD objects around until all transactions in a group has been executed?
Kristian> I think it is not just about THD disappearing. IIRC, the more important Kristian> problem is that the THD may have moved on to do something different. So even Kristian> if the data in the other THD is not freed, it may still contain completely Kristian> unrelated data (and thus wrong data, for the accessing thread). Kristian> But this is actually something I consider a very important thing, and one that Kristian> I spent quite some effort on and was quite pleased to get in the current Kristian> state. If arbitrary THD can have pointers into arbitrary other THD, the Kristian> life-time issues can quickly get very nasty, and it is important to handle Kristian> this well to avoid having to understand the entire codebase to avoid subtle bugs. ok. Thanks for the explanation. Regards, Monty
Michael Widenius <monty@askmonty.org> writes:
What should happen if you kill a replication thread is that replication should stop for that master.
Kristian> This needs more thought, I think ... certainly something looks not right.
After looking at the full code, I think that the logical way things should work is:
'stop' is to be used when you want to nicely take done replication. This means that the current commit groups should be given time to finish.
thd->killed should mean that we should stop ASAP. - All not commited things should abort.
This is needed in a 'panic shutdown' (like out soon-out-of- power) or when trying to kill the replication thread when one notices that something went horribly wrong (like ALTER TABLE stopping replication).
Ok, so I thought more about this. There are two kinds of threads involved. One is the normal SQL thread (this is the only thread involved in non-parallel replication). There is a one-to-one correspondence between an SQL thread and the associated master connection. It seems to make sense that KILL CONNECTION on the SQL thread would stop the replication. I do not know whether this is how it works or not in current replication, but whatever it is, we should just leave the behaviour the same. The other kind of thread is the parallel replication worker thread. We have a pool of those (the size of the pool specified by --slave-parallel-threads). These worker threads are not associated with any particular master connection. They are assigned to execute an event group at a time for whichever master connection is in need for a new thread. So a KILL QUERY or KILL CONNECTION on a worker thread should abort the currently executing event - and it will, I assume, using the existing code for killing a running query operation. This will fail the execution of the query, and thus eventually stop the associated SQL thread and master connection, once the error is propagated out of the worker thread up to the parent SQL thread. But KILL of a worker thread should _not_ cause the thread to exit, I think. The pool of parallel threads is just a thread pool, and for simplicity in the first version, it has a fixed size. A KILL will disconnect the thread from the replication connection it was servicing, but the thread must remain alive, ready to serve another connection if needed. So the rpt->stop is only about maintaining the thread pool for parallel replication, not about anything to do with aborting any specific master connection replication. In rpl_parallel_change_thread_count() we set rpt->stop to ask all existing threads to exit, and afterwards spawn a new set of threads. This can only happen when all replication SQL/IO threads are stopped and no event execution is taking place. So this code of mine is actually wrong, I think: while (!rpt->stop && !thd->killed) It should just be while (!rpt->stop) { ... }. And thd->killed should not be used at all to control thread termination. Instead, it should be used (and is not currently, needs to be fixed) in various places for event execution to allow aborting currently executing event. This is needed around rpt_handle_event() I think, and also around queue_for_group_commit(). For normal stop, this is handled mostly in the SQL thread. It waits with a timeout for the current event group to finish normally, then does a hard kill if needed. This needs to be extended to handle running things in a worker thread, of course. The code is around sql_slave_killed(), I think. Does that make sense?
+ mysql_mutex_lock(&LOCK_wait_commit); + waiting_for_commit= false; + mysql_cond_signal(&COND_wait_commit); + mysql_mutex_unlock(&LOCK_wait_commit); +}
In this particular case it looks safe to move the cond signal out of the mutex. I don't see how anyone could miss the signal.
Kristian> Probably. Generally I prefer to write it as above, to not have to make sure Kristian> that moving it out is safe.
I agree that this is the way to do in general. However for cases where we are waiting almost once per query, we need to make things faster as this extra wakup will take notable resources!
Ok, no problem, I've changed it. I should benchmark this, it keeps coming up. It's not clear to me which way would be the fastest, seems it would depend on the implementation.
Hi!
Part 2 of review of parallel replication
=== added file 'sql/rpl_parallel.cc'
<cut>
+pthread_handler_t +handle_rpl_parallel_thread(void *arg) +{
<cut>
+ if (end_of_group) + { + in_event_group= false; +
Add comment:
/* All events for this group has now been executed and logged (but not necessaerly synced). Inform the other event groups that are waiting for this thread that we are done. */
+ rgi->commit_orderer.unregister_wait_for_prior_commit(); + thd->wait_for_commit_ptr= NULL; + + /* + Record that we have finished, so other event groups will no + longer attempt to wait for us to commit. + + We can race here with the next transactions, but that is fine, as + long as we check that we do not decrease last_committed_sub_id. If + this commit is done, then any prior commits will also have been + done and also no longer need waiting for. + */ + mysql_mutex_lock(&entry->LOCK_parallel_entry); + if (entry->last_committed_sub_id < event_gtid_sub_id) + { + entry->last_committed_sub_id= event_gtid_sub_id; + mysql_cond_broadcast(&entry->COND_parallel_entry); + } + mysql_mutex_unlock(&entry->LOCK_parallel_entry); +
Add:
/* Tell storage engines that they can stop waiting. See comments in sql_class.cc::wakeup_subsequent_commits2() for why we can't do this part of the wakeup in unregister_wait_for_prior_commit(). */
I think there is some misunderstanding here. I added some comments that will hopefully clarify, but the situation is a bit different than how I read this suggestion. The situation is that we have for example three threads, A, B, and C. We are thread B at this point in the code. Thread B needed to wait for A to commit first. C needs to wait for B to commit. The call to unregister_wait_for_prior_commit() relates to B waiting for A, it cannot be used in relation to C waiting for B, irrespectively of comments in the function wakeup_subsequent_commits2() ... The unregister_wait_for_prior_commit() removes B from the list of threads waiting for A. This is usually redundant, because B already removed itself with wait_for_prior_commit() during the commit step. However, if there is for example an error we might not reach commit, and thus we need to remove the dangling entry in A. The wakeup_subsequent_commits() signals C that B is done. Again, this is often redundant, because we already signalled C during B's commit step. However, C could have registered to wait for B between B's commit and this place in the code, so it is needed for correctness. Note that in the cases where this is redundant, no extra mutexes are taken. The unregister_wait_for_prior_commit() and wakeup_subsequent_commits() functions are inlines that do just a single test of one field in the struct.
As far as I can see, rpt->free is always 0 here. - It's set to 0 at the top of this loop and only set to 1 here.
Can we remove the rpt->free variable?
Yes, it looks like it. I have removed it. It was supposed to mark if the worker thread was free to be reused for another event group, but this is marked already by the thread being included in the rpt->pool->free_list.
Add:
/* The following loops wait until each thread is done it's work for each event group and then stops it when the thread has nothing to do.
Note that this may take some time as rpl_paralell:do_event() may request threads at the same time. */
I added this comment instead: /* Grab each old thread in turn, and signal it to stop. Note that since we require all replication threads to be stopped before changing the parallel replication worker thread pool, all the threads will be already idle and will terminate immediately. */ Hope that clarifies things.
+void +rpl_parallel::wait_for_done() +{ + struct rpl_parallel_entry *e; + uint32 i; + + for (i= 0; i < domain_hash.records; ++i) + { + e= (struct rpl_parallel_entry *)my_hash_element(&domain_hash, i); + mysql_mutex_lock(&e->LOCK_parallel_entry); + while (e->current_sub_id > e->last_committed_sub_id) + mysql_cond_wait(&e->COND_parallel_entry, &e->LOCK_parallel_entry); + mysql_mutex_unlock(&e->LOCK_parallel_entry); + } +}
How to handle errors in the above case? For example when we get an error and we had to abort things?
I think even if we have to abort event execution in a worker thread, we should still increment e->last_committed_sub_id. So the wait here is ok. But we do need code to be able to force all worker threads to stop. An important piece of this puzzle is in sql_slave_killed(). I do not understand the full details of this code yet, but I believe the idea is to wait with some timeout for event groups to finish normally, then hard kill if the timeout triggers (I vaguely remember that for InnoDB we can just kill and rollback, while for MyISAM the timeout is necessary to avoid leaving half an update executed). We need to extend sql_slave_killed() to also handle waiting for each worker thread to stop, with the timeout. Once we have that, it is hopefully clear how to do wait_for_done(); maybe it can call sql_slave_killed(), or maybe it is no longer needed (wait_for_done() is just a temporary thing I put in to get things working for the benchmark). We also need to extend the API around struct wait_for_commit to include error handling. I suppose wait_for_prior_commit() should check for thd->killed when it waits, and be extended to return an error if killed. And then also wakeup_subsequent_commits() should be extended to take an error argument, so that if B is waiting for A, and A gets an error, then A can call wakeup_subsequent_commits() with error=true, which will cause B to wake up but also get an error. Then if we have a number of event groups processing in parallel, A->B->C->D. And B ends up with an error. Then A can complete, but B will signal an error to C (and then to D), so both B, C, and D will fail. Then somewhere, perhaps around the end of apply_event_and_update_pos(), we must check an error flag set for B, C, and D, and signal the error back to the main SQL thread so that it can halt and print the error to the error log, as usual. Something like that, I think. Some work to do, but should be possible at least.
The other way would be to, on error, set last_committed_sub_id back to it's last value.
What do you suggest? Assuming setting the value back would be the easiest solution..
Yes (if I understood correctly what you meant with "setting back")... I think we should update things the same way whether the transaction commits successfully or fails, so we get the same wakeup of later transactions, but with the extra error flag added to wait_for_commit so that later transactions can see the failure and thus fail and rollback themselves. And then the error can be propagated up the call stack, as normal. Of course, this becomes more complex due to the need to be able to re-try a transaction in case of certain temporary errors like deadlocks and so. Hopefully something can be made to work.
+ /* + We are already executing something else in this domain. But the two + event groups were committed together in the same group commit on the + master, so we can still do them in parallel here on the slave. + + However, the commit of this event must wait for the commit of the prior + event, to preserve binlog commit order and visibility across all + servers in the replication hierarchy. + */ + rpl_parallel_thread *rpt= global_rpl_thread_pool.get_thread(e);
Here we could add a test if get_thread() is 0, which could happen on shutdown or when killing a replication thread.
I do not think we need to do this. I made it so that the worker thread pool can never change size without stopping all replication threads first, so we avoid a lot of these issues. And as I wrote above, killing a worker thread should terminate the event group that it is currently servicing (and the associated SQL thread and master connection), but not remove the thread from the pool. (I can imagine that in a later version we want to implement a smarter worker thread manager, but for the first version I tried to keep things simple, there are plenty of other problems that we need to tackle...)
I think it's relatively safe to use get_thread() as a end-synchronize point as we should never have anything half-done when we call get_thread()). As get_thread() can take a very long time and is not called in many places, it's also sounds like a resonable thing to do.
Sorry, I do not understand. "End-synchronize point" for what?
+ /* + Events like ROTATE and FORMAT_DESCRIPTION. Do not run in worker thread. + Same for events not preceeded by GTID (we should not see those normally, + but they might be from an old master). + */ + qev->rgi= serial_rgi; + rpt_handle_event(qev, NULL); + delete_or_keep_event_post_apply(rli, typ, qev->ev);
Why is delete_or_keep_event_post_apply() not part of rpt_handle_event()? You always call these two together...
It is to share code with the non-parallel case. When parallel replication is disabled, everything is done in the SQL thread in slave.cc. Like this: exec_res= apply_event_and_update_pos(ev, thd, serial_rgi, NULL); delete_or_keep_event_post_apply(serial_rgi, typ, ev); Or did you mean that I could move the call of delete_or_keep_event_post_apply() into the function rpt_handle_event()? That might be possible, though it was not 100% clear to me as there are a few differences in how it is done in the different cases. But in any case that part of the code needs more work. Especially with regard to error handling, and related to updating the last executed event position. Probably it makes sense to revisit this as part of that work. - Kristian.
Hi!
"Kristian" == Kristian Nielsen <knielsen@knielsen-hq.org> writes:
Kristian> Michael Widenius <monty@askmonty.org> writes:
What should happen if you kill a replication thread is that replication should stop for that master.
<cut> Kristian> It seems to make sense that KILL CONNECTION on the SQL thread would stop the Kristian> replication. I do not know whether this is how it works or not in current Kristian> replication, but whatever it is, we should just leave the behaviour the same. This is how things are now (as far as I know). <cut> Kristian> But KILL of a worker thread should _not_ cause the thread to exit, I Kristian> think. The pool of parallel threads is just a thread pool, and for simplicity Kristian> in the first version, it has a fixed size. A KILL will disconnect the thread Kristian> from the replication connection it was servicing, but the thread must remain Kristian> alive, ready to serve another connection if needed. Correct. The KILL command only kills the query or the connection (ie, THD). The thread is always reused. Kristian> So the rpt->stop is only about maintaining the thread pool for parallel Kristian> replication, not about anything to do with aborting any specific master Kristian> connection replication. In rpl_parallel_change_thread_count() we set rpt->stop Kristian> to ask all existing threads to exit, and afterwards spawn a new set of Kristian> threads. This can only happen when all replication SQL/IO threads are stopped Kristian> and no event execution is taking place. Yes. The questions is when rpt->stop should take effect. My suggestion is that the replication threads should only examine rpt->stop after commits. This way we can ensure that when all threads are stopped, everything is committed to a certain point and we never have to do a rollback. Kristian> So this code of mine is actually wrong, I think: Kristian> while (!rpt->stop && !thd->killed) Kristian> It should just be while (!rpt->stop) { ... }. Kristian> And thd->killed should not be used at all to control thread Kristian> termination. Instead, it should be used (and is not currently, needs to be Kristian> fixed) in various places for event execution to allow aborting currently Kristian> executing event. This is needed around rpt_handle_event() I think, and also Kristian> around queue_for_group_commit(). Yes. I will be working on this today & tomorrow. Note that every loop where we wait needs to be stoppable, either with 'stop' or with KILL. Kristian> For normal stop, this is handled mostly in the SQL thread. It waits with a Kristian> timeout for the current event group to finish normally, then does a hard kill Kristian> if needed. This needs to be extended to handle running things in a worker Kristian> thread, of course. The code is around sql_slave_killed(), I think. Kristian> Does that make sense? Makes sence. I will know more tomorrow when I have dug into the code a bit more.
+ mysql_mutex_lock(&LOCK_wait_commit); + waiting_for_commit= false; + mysql_cond_signal(&COND_wait_commit); + mysql_mutex_unlock(&LOCK_wait_commit); +}
In this particular case it looks safe to move the cond signal out of the mutex. I don't see how anyone could miss the signal.
Kristian> Probably. Generally I prefer to write it as above, to not have to make sure Kristian> that moving it out is safe.
I agree that this is the way to do in general. However for cases where we are waiting almost once per query, we need to make things faster as this extra wakup will take notable resources!
Kristian> Ok, no problem, I've changed it. Kristian> I should benchmark this, it keeps coming up. It's not clear to me which way Kristian> would be the fastest, seems it would depend on the implementation. I did a benchmark of this a long time ago (+10 years) on Linux and then using the signal after unlock was notable faster. The gain is two thread switches + one thread wakeup per cond_signal.
Hi!
Part 2 of review of parallel replication
=== added file 'sql/rpl_parallel.cc'
<cut>
+pthread_handler_t +handle_rpl_parallel_thread(void *arg) +{
<cut>
+ if (end_of_group) + { + in_event_group= false; +
Add comment:
/* All events for this group has now been executed and logged (but not necessaerly synced). Inform the other event groups that are waiting for this thread that we are done. */
+ rgi->commit_orderer.unregister_wait_for_prior_commit(); + thd->wait_for_commit_ptr= NULL; + + /* + Record that we have finished, so other event groups will no + longer attempt to wait for us to commit. + + We can race here with the next transactions, but that is fine, as + long as we check that we do not decrease last_committed_sub_id. If + this commit is done, then any prior commits will also have been + done and also no longer need waiting for. + */ + mysql_mutex_lock(&entry->LOCK_parallel_entry); + if (entry->last_committed_sub_id < event_gtid_sub_id) + { + entry->last_committed_sub_id= event_gtid_sub_id; + mysql_cond_broadcast(&entry->COND_parallel_entry); + } + mysql_mutex_unlock(&entry->LOCK_parallel_entry); +
Add:
/* Tell storage engines that they can stop waiting. See comments in sql_class.cc::wakeup_subsequent_commits2() for why we can't do this part of the wakeup in unregister_wait_for_prior_commit(). */
Kristian> I think there is some misunderstanding here. I added some comments that will Kristian> hopefully clarify, but the situation is a bit different than how I read this Kristian> suggestion. I may have missunderstood some of the code. Then it's even more important to get a good comment! Kristian> The situation is that we have for example three threads, A, B, and C. We are Kristian> thread B at this point in the code. Thread B needed to wait for A to commit Kristian> first. C needs to wait for B to commit. Kristian> The call to unregister_wait_for_prior_commit() relates to B waiting for A, it Kristian> cannot be used in relation to C waiting for B, irrespectively of comments in Kristian> the function wakeup_subsequent_commits2() ... ok. I thought that when we called unregister_wait_for_prior_commit() we had already handled the event including logging. (Apparently rpt_handle_event() does less than what I expected). Kristian> The unregister_wait_for_prior_commit() removes B from the list of threads Kristian> waiting for A. This is usually redundant, because B already removed itself Kristian> with wait_for_prior_commit() during the commit step. However, if there is for Kristian> example an error we might not reach commit, and thus we need to remove the Kristian> dangling entry in A. ok, this was a bit unclear. I thought that B could not stop waiting for A until A had written everything to the binary log. I don't see how we B can stop waiting for A until A at least has done all commands and written all it's logs. Kristian> The wakeup_subsequent_commits() signals C that B is done. Again, this is often Kristian> redundant, because we already signalled C during B's commit step. However, C Kristian> could have registered to wait for B between B's commit and this place in the Kristian> code, so it is needed for correctness. This part is clear. <cut>
/* The following loops wait until each thread is done it's work for each event group and then stops it when the thread has nothing to do.
Note that this may take some time as rpl_paralell:do_event() may request threads at the same time. */
Kristian> I added this comment instead: Kristian> /* Kristian> Grab each old thread in turn, and signal it to stop. Kristian> Note that since we require all replication threads to be stopped before Kristian> changing the parallel replication worker thread pool, all the threads will Kristian> be already idle and will terminate immediately. Kristian> */ Kristian> Hope that clarifies things. Yes, it does. Thanks.
+void +rpl_parallel::wait_for_done() +{ + struct rpl_parallel_entry *e; + uint32 i; + + for (i= 0; i < domain_hash.records; ++i) + { + e= (struct rpl_parallel_entry *)my_hash_element(&domain_hash, i); + mysql_mutex_lock(&e->LOCK_parallel_entry); + while (e->current_sub_id > e->last_committed_sub_id) + mysql_cond_wait(&e->COND_parallel_entry, &e->LOCK_parallel_entry); + mysql_mutex_unlock(&e->LOCK_parallel_entry); + } +}
How to handle errors in the above case? For example when we get an error and we had to abort things?
Kristian> I think even if we have to abort event execution in a worker thread, we should Kristian> still increment e->last_committed_sub_id. So the wait here is ok. But we do Kristian> need code to be able to force all worker threads to stop. Kristian> An important piece of this puzzle is in sql_slave_killed(). I do not Kristian> understand the full details of this code yet, but I believe the idea is to Kristian> wait with some timeout for event groups to finish normally, then hard kill if Kristian> the timeout triggers (I vaguely remember that for InnoDB we can just kill and Kristian> rollback, while for MyISAM the timeout is necessary to avoid leaving half an Kristian> update executed). We could extend this to check if we are using transactional tables or not and wait longer if not transaction tables has been used for the query. Kristian> We need to extend sql_slave_killed() to also handle waiting for each worker Kristian> thread to stop, with the timeout. Once we have that, it is hopefully clear how Kristian> to do wait_for_done(); maybe it can call sql_slave_killed(), or maybe it is no Kristian> longer needed (wait_for_done() is just a temporary thing I put in to get Kristian> things working for the benchmark). Kristian> We also need to extend the API around struct wait_for_commit to include error Kristian> handling. I suppose wait_for_prior_commit() should check for thd->killed when Kristian> it waits, and be extended to return an error if killed. And then also Kristian> wakeup_subsequent_commits() should be extended to take an error argument, so Kristian> that if B is waiting for A, and A gets an error, then A can call Kristian> wakeup_subsequent_commits() with error=true, which will cause B to wake up but Kristian> also get an error. Yes, that sounds right. Kristian> Then if we have a number of event groups processing in parallel, A-> B->C->D. And B ends up with an error. Then A can complete, but B will Kristian> signal an error to C (and then to D), so both B, C, and D will fail. One problem we have is that if B, C ,D are using non transactional tables, it would be better to have them complete than abort. This issue we can leave for now. Kristian> Then somewhere, perhaps around the end of apply_event_and_update_pos(), we Kristian> must check an error flag set for B, C, and D, and signal the error back to the Kristian> main SQL thread so that it can halt and print the error to the error log, as Kristian> usual. Kristian> Something like that, I think. Some work to do, but should be possible at least. Yes. Not too hard, I think.
The other way would be to, on error, set last_committed_sub_id back to it's last value.
What do you suggest? Assuming setting the value back would be the easiest solution..
Kristian> Yes (if I understood correctly what you meant with "setting back")... Kristian> I think we should update things the same way whether the transaction commits Kristian> successfully or fails, so we get the same wakeup of later transactions, but Kristian> with the extra error flag added to wait_for_commit so that later transactions Kristian> can see the failure and thus fail and rollback themselves. And then the error Kristian> can be propagated up the call stack, as normal. Kristian> Of course, this becomes more complex due to the need to be able to re-try a Kristian> transaction in case of certain temporary errors like deadlocks and so. Kristian> Hopefully something can be made to work. As we should never get a deadlock or temporary error on a slave, this should not be too hard for a first version.
+ /* + We are already executing something else in this domain. But the two + event groups were committed together in the same group commit on the + master, so we can still do them in parallel here on the slave. + + However, the commit of this event must wait for the commit of the prior + event, to preserve binlog commit order and visibility across all + servers in the replication hierarchy. + */ + rpl_parallel_thread *rpt= global_rpl_thread_pool.get_thread(e);
Here we could add a test if get_thread() is 0, which could happen on shutdown or when killing a replication thread.
Kristian> I do not think we need to do this. I made it so that the worker thread pool Kristian> can never change size without stopping all replication threads first, so we Kristian> avoid a lot of these issues. And as I wrote above, killing a worker thread Kristian> should terminate the event group that it is currently servicing (and the Kristian> associated SQL thread and master connection), but not remove the thread from Kristian> the pool. I was thinking about the case where all worker threads are occupied for a long time. In this case we will get do_event() stop a long time until a new thread is free. I do see some small advantages in beeing able to free the sql-thread early. If get_thread() is not a stop point, then we have the problem at terminate, that do_event() will wait for get_thread() and then schedule a new query() even if the server should be stopping. We need to check for the sql-thread being killed before it tells a worker thread to start with new queries... Kristian> (I can imagine that in a later version we want to implement a smarter worker Kristian> thread manager, but for the first version I tried to keep things simple, there Kristian> are plenty of other problems that we need to tackle...)
I think it's relatively safe to use get_thread() as a end-synchronize point as we should never have anything half-done when we call get_thread()). As get_thread() can take a very long time and is not called in many places, it's also sounds like a resonable thing to do.
Kristian> Sorry, I do not understand. "End-synchronize point" for what? Sorry, wrong word. What I meant was a point where it's safe to detect that the server or replication thread has been killed.
+ /* + Events like ROTATE and FORMAT_DESCRIPTION. Do not run in worker thread. + Same for events not preceeded by GTID (we should not see those normally, + but they might be from an old master). + */ + qev->rgi= serial_rgi; + rpt_handle_event(qev, NULL); + delete_or_keep_event_post_apply(rli, typ, qev->ev);
Why is delete_or_keep_event_post_apply() not part of rpt_handle_event()? You always call these two together...
Kristian> It is to share code with the non-parallel case. Kristian> When parallel replication is disabled, everything is done in the SQL thread in Kristian> slave.cc. Like this: Kristian> exec_res= apply_event_and_update_pos(ev, thd, serial_rgi, NULL); Kristian> delete_or_keep_event_post_apply(serial_rgi, typ, ev); Kristian> Or did you mean that I could move the call of Kristian> delete_or_keep_event_post_apply() into the function rpt_handle_event()? Kristian> That might be possible, though it was not 100% clear to me as there are a few Kristian> differences in how it is done in the different cases. Yes, I meant moving the function into rpt_handle_event(). Kristian> But in any case that part of the code needs more work. Especially with regard Kristian> to error handling, and related to updating the last executed event Kristian> position. Probably it makes sense to revisit this as part of that work. ok. Regards, Monty
Michael Widenius <monty@askmonty.org> writes:
+ if (list->wakeup_subsequent_commits_running) + { + mysql_mutex_lock(&list->LOCK_wait_commit); + list->wakeup_subsequent_commits_running= false; + mysql_mutex_unlock(&list->LOCK_wait_commit);
Why do we need the mutex above?
I checked this again. We _do_ need a full memory barrier here (memory barrier is implied in taking a mutex). Otherwise the compiler or CPU could re-order the setting of the wakeup_subsequent_commits_running flag with the reads and writes done in the list manipulations. This could cause the two threads to corrupt the list as they both manipulate it at the same time. But after carefully checking, it seems to me a barrier would be enough, we do not need to lock and unlock the mutex. Unfortunately, we do not have any good mechanisms for doing memory barriers in the MariaDB source currently, AFAIK. I will put a comment here that the lock/unlock is only needed for the memory barrier, and could be changed later. Then we can re-visit it when the more critical things have been fixed (and if we get memory barrier primitives in the source tree). What do you think of this suggestion? [I was wondering why I did it like this in the first place. Due to chaining between the waiter's mutex and the waitees mutex, the extra lock/unlock does prevent that wakeup_subsequent_commits2() can change the flag in the middle of unregister_wait_for_prior_commit2() doing its operation. Maybe that is why I did it this way. But I could not actually think of any problems from such a change in the middle. Or maybe I did it this way to emphasise that the code conceptually is locking both the waitee and the waiter - but to prevent deadlocks, it temporarily releases the waitee lock in the middle. So instead of LOCK a; LOCK b; UNLOCK b; UNLOCK a; we inject a temporary unlock of a: LOCK a; [UNLOCK a]; LOCK b; UNLOCK b; [LOCK a]; UNLOCK a; But it does not really make things any clearer.]
The only place where we test this variable, as far as I can find, is in wait_for_commit::register_wait_for_prior_commit()
The critical place is in wait_for_commit::unregister_wait_for_prior_commit2()
This thread doesn't know if the other thread is just before mysql_mutex_lock(&waitee->LOCK_wait_commit), inside the lock or after the lock. So when we reset the variable should not matter.
What we need to ensure is that - All the reads of the list in thread 1 see only writes from other threads that happened _before_ the variable was reset. - All changes to the list in thread 2 happen only _after_ the variable was reset in thread 1. So a memory barrier is needed, but as you say, lock/unlock should not be needed.
But somehow this sounds strange. What happens if the above variable is reset just before mysql_mutex_lock(&waitee->LOCK_wait_commit) in wait_for_commit::register_wait_for_prior_commit() ? Isn't the waitee added to the list when it shouldn't?
That should be ok. The caller ensures that wakeup_subsequent_commits2() will be called _after_ the last call to register_wait_for_prior_commit() has completed. This is done in this code in handle_rpl_parallel_thread(): mysql_mutex_lock(&entry->LOCK_parallel_entry); if (entry->last_committed_sub_id < event_gtid_sub_id) { entry->last_committed_sub_id= event_gtid_sub_id; mysql_cond_broadcast(&entry->COND_parallel_entry); } mysql_mutex_unlock(&entry->LOCK_parallel_entry); rgi->commit_orderer.wakeup_subsequent_commits(); and mysql_mutex_lock(&entry->LOCK_parallel_entry); if (wait_for_sub_id > entry->last_committed_sub_id) { wait_for_commit *waitee= &rgi->wait_commit_group_info->commit_orderer; rgi->commit_orderer.register_wait_for_prior_commit(waitee); } mysql_mutex_unlock(&entry->LOCK_parallel_entry); Thanks, - Kristian.
Hi!
"Kristian" == Kristian Nielsen <knielsen@knielsen-hq.org> writes:
Kristian> Michael Widenius <monty@askmonty.org> writes:
+ if (list->wakeup_subsequent_commits_running) + { + mysql_mutex_lock(&list->LOCK_wait_commit); + list->wakeup_subsequent_commits_running= false; + mysql_mutex_unlock(&list->LOCK_wait_commit);
Why do we need the mutex above?
Kristian> I checked this again. Kristian> We _do_ need a full memory barrier here (memory barrier is implied in taking a Kristian> mutex). Otherwise the compiler or CPU could re-order the setting of the Kristian> wakeup_subsequent_commits_running flag with the reads and writes done in the Kristian> list manipulations. This could cause the two threads to corrupt the list as Kristian> they both manipulate it at the same time. This I would like to understand better. (It always nice to know what the cpu/compiler is really doing).. In this case the code is looping over a list and just setting list->wakeup_subsequent_commits_running= false; I don't see how the compiler or CPU could reorder the code to things in different order (as we are not updating anything else than this flag in the loop). This is especially true as setting of cur->wakeup_subsequent_commits_running= true; is done within a mutex and that is the last write to this element of the list. So there is a barrier between we set it and potentially clear it. As the 'clear' may now happen 'any time' (from other threads point of view) I don't see why it needs to be protected. Kristian> But after carefully checking, it seems to me a barrier would be enough, we do Kristian> not need to lock and unlock the mutex. Kristian> Unfortunately, we do not have any good mechanisms for doing memory barriers in Kristian> the MariaDB source currently, AFAIK. I will put a comment here that the Kristian> lock/unlock is only needed for the memory barrier, and could be changed Kristian> later. Then we can re-visit it when the more critical things have been fixed Kristian> (and if we get memory barrier primitives in the source tree). What do you Kristian> think of this suggestion? That is fine with me. Kristian> [I was wondering why I did it like this in the first place. Kristian> Due to chaining between the waiter's mutex and the waitees mutex, the extra Kristian> lock/unlock does prevent that wakeup_subsequent_commits2() can change the flag Kristian> in the middle of unregister_wait_for_prior_commit2() doing its Kristian> operation. Maybe that is why I did it this way. But I could not actually think Kristian> of any problems from such a change in the middle. Kristian> Or maybe I did it this way to emphasise that the code conceptually is locking Kristian> both the waitee and the waiter - but to prevent deadlocks, it temporarily Kristian> releases the waitee lock in the middle. So instead of Kristian> LOCK a; LOCK b; UNLOCK b; UNLOCK a; Kristian> we inject a temporary unlock of a: Kristian> LOCK a; [UNLOCK a]; LOCK b; UNLOCK b; [LOCK a]; UNLOCK a; Kristian> But it does not really make things any clearer.]
The only place where we test this variable, as far as I can find, is in wait_for_commit::register_wait_for_prior_commit()
Kristian> The critical place is in wait_for_commit::unregister_wait_for_prior_commit2()
This thread doesn't know if the other thread is just before mysql_mutex_lock(&waitee->LOCK_wait_commit), inside the lock or after the lock. So when we reset the variable should not matter.
The same is true for unregister_wait_for_prior_commit2. Kristian> What we need to ensure is that Kristian> - All the reads of the list in thread 1 see only writes from other threads Kristian> that happened _before_ the variable was reset. Kristian> - All changes to the list in thread 2 happen only _after_ the variable was Kristian> reset in thread 1. Kristian> So a memory barrier is needed, but as you say, lock/unlock should not be Kristian> needed. I still don't understand the current code fully. The issue is that unregister_wait_for_prior_commit2() can be called by thread 2 just before or just after thread 1 is resetting wakeup_subsequent_commits_running. There is not other variables involved. This means that in unregister_wait_for_prior_commit2() if thread 2 is there just before list->wakeup_subsequent_commits_running is set to false, we will go into this code: if (loc_waitee->wakeup_subsequent_commits_running) { /* When a wakeup is running, we cannot safely remove ourselves from the list without corrupting it. Instead we can just wait, as wakeup is already in progress and will thus be immediate. See comments on wakeup_subsequent_commits2() for more details. */ mysql_mutex_unlock(&loc_waitee->LOCK_wait_commit); while (waiting_for_commit) mysql_cond_wait(&COND_wait_commit, &LOCK_wait_commit); } I don't however see any code in :queue_for_group_commit() that will signal COND_wait_commit. What code do we have that will wake up the above code ? Looking at wait_for_commit::wakeup() that makes this: mysql_mutex_lock(&LOCK_wait_commit); waiting_for_commit= false; mysql_cond_signal(&COND_wait_commit); mysql_mutex_unlock(&LOCK_wait_commit); However, in wait_for_commit::register_wait_for_prior_commit() We set wait_for_commit() without a mutex protection ? Is this right? I assume that the code works because things are called in a specific order and some threads can't call other functions until at some certain point when things are safe. Like that wakeup() for a thread can never be called when that thread is just before the first mutex in wait_for_commit::register_wait_for_prior_commit() as in this case 'waiting_for_cond' variable may be set to false (in wakeup()) even if it should be true. Regards, Monty
Michael Widenius <monty@askmonty.org> writes:
Kristian> We _do_ need a full memory barrier here (memory barrier is implied in taking a Kristian> mutex). Otherwise the compiler or CPU could re-order the setting of the Kristian> wakeup_subsequent_commits_running flag with the reads and writes done in the Kristian> list manipulations. This could cause the two threads to corrupt the list as Kristian> they both manipulate it at the same time.
This I would like to understand better. (It always nice to know what the cpu/compiler is really doing)..
In this case the code is looping over a list and just setting list->wakeup_subsequent_commits_running= false;
I don't see how the compiler or CPU could reorder the code to things in different order (as we are not updating anything else than this flag in the loop).
This is especially true as setting of cur->wakeup_subsequent_commits_running= true; is done within a mutex and that is the last write to this element of the list.
So there is a barrier between we set it and potentially clear it. As the 'clear' may now happen 'any time' (from other threads point of view) I don't see why it needs to be protected.
Right. I looked into this again. So we do need a memory barrier, I will try to explain this more below. However, it turns out we already have this memory barrier. Because in all relevant cases we call wait_for_commit::wakeup() before we set the wakeup_subsequent_commits_running flag back to false. And wakeup() takes a mutex, which implies a full memory barrier. So I now removed the extra redundant locking (after adding comments explaining why this is ok), and pushed that to 10.0-knielsen. Let me see if I can explain why the memory barrier is needed. The potential race (or one of them at least) is as follows: Thread A is doing wakeup_subsequent_commits2(). Thread B is doing unregister_wait_for_prior_commit2(). We assume B is on A's list of waiters. A will do a read of B's next pointer, and a write of B's waiting_for_commit flag. After that A will clear the wakeup_subsequent_commits_running flag. B will read the wakeup_subsequent_commits_running flag, and if it is not set, then it will remove itself from the list and later possibly put itself on the list of another thread or whatever. Without a memory barrier, the clearing of the flag might become visible to B early. B could then remove itself from the list, and perhaps add itself to the list of another thread. And then the write from A to B->waiting_for_commit could become visible (which would cause a wrong spurious wakeup of B), or B's writing of new next pointer could become visible to A's list traversal, causing A to walk into the wrong list. (The latter problem sounds rather unlikely, but at least theoretically it is allowed behaviour by compiler/CPU). But due to the wakeup() call in A we do have a memory barrier between the list manipulations and the clearing of the flag. And we have another memory barrier in B (a full mutex lock actually) between checking the flag and manipulating the list. This prevents the race.
Kristian> What we need to ensure is that
Kristian> - All the reads of the list in thread 1 see only writes from other threads Kristian> that happened _before_ the variable was reset.
Kristian> - All changes to the list in thread 2 happen only _after_ the variable was Kristian> reset in thread 1.
Kristian> So a memory barrier is needed, but as you say, lock/unlock should not be Kristian> needed.
I still don't understand the current code fully.
The issue is that unregister_wait_for_prior_commit2() can be called by thread 2 just before or just after thread 1 is resetting wakeup_subsequent_commits_running. There is not other variables involved.
This means that in unregister_wait_for_prior_commit2() if thread 2 is there just before list->wakeup_subsequent_commits_running is set to false, we will go into this code:
if (loc_waitee->wakeup_subsequent_commits_running) { /* When a wakeup is running, we cannot safely remove ourselves from the list without corrupting it. Instead we can just wait, as wakeup is already in progress and will thus be immediate.
See comments on wakeup_subsequent_commits2() for more details. */ mysql_mutex_unlock(&loc_waitee->LOCK_wait_commit); while (waiting_for_commit) mysql_cond_wait(&COND_wait_commit, &LOCK_wait_commit); }
I don't however see any code in :queue_for_group_commit() that will signal COND_wait_commit. What code do we have that will wake up the above code ?
In queue_for_group_commit() we call waiter->wakeup(), which signals COND_wait_commit.
Looking at wait_for_commit::wakeup() that makes this:
mysql_mutex_lock(&LOCK_wait_commit); waiting_for_commit= false; mysql_cond_signal(&COND_wait_commit); mysql_mutex_unlock(&LOCK_wait_commit);
However, in wait_for_commit::register_wait_for_prior_commit()
We set wait_for_commit() without a mutex protection ? Is this right?
It should be ok. When register_wait_for_prior_commit() is called, the thread should not be registered to wait for anything, so no other thread can access the variable. However, it would be fine to move setting waiting_for_commit= true under the mutex protection, if that would make the code easier to understand.
I assume that the code works because things are called in a specific order and some threads can't call other functions until at some certain point when things are safe.
Yes, waiting_for_commit is initialised to true, and only after do we lock the waitee and put ourself in the list. Only once we are in the wait list of another thread can the variable be accessed from that thread.
Like that wakeup() for a thread can never be called when that thread is just before the first mutex in
wait_for_commit::register_wait_for_prior_commit()
Yes. Because thread B can not be woken up by thread A until B has put itself into A's waiter list, which is done under the first mutex in register_wait_for_prior_commit(). Hope this helps, - Kristian.
Hi!
"Kristian" == Kristian Nielsen <knielsen@knielsen-hq.org> writes:
<cut>
So there is a barrier between we set it and potentially clear it. As the 'clear' may now happen 'any time' (from other threads point of view) I don't see why it needs to be protected.
Kristian> Right. Kristian> I looked into this again. So we do need a memory barrier, I will try to Kristian> explain this more below. However, it turns out we already have this memory Kristian> barrier. Because in all relevant cases we call wait_for_commit::wakeup() Kristian> before we set the wakeup_subsequent_commits_running flag back to false. And Kristian> wakeup() takes a mutex, which implies a full memory barrier. Kristian> So I now removed the extra redundant locking (after adding comments explaining Kristian> why this is ok), and pushed that to 10.0-knielsen. Kristian> Let me see if I can explain why the memory barrier is needed. The potential Kristian> race (or one of them at least) is as follows: Kristian> Thread A is doing wakeup_subsequent_commits2(). Thread B is doing Kristian> unregister_wait_for_prior_commit2(). We assume B is on A's list of waiters. Thanks for spending time explaning this! This one is clear. What was not clear how the code that only goes and resets wakeup_subsequent_commits_running for all threads could cause a problem. However, things are good enough so we can leave the code according to your latest version. <cut> Kristian> Hope this helps, yes, it helped a lot. Thanks! Regards, Monty
Hi Monty, So as promised, I took a look at the existing code for STOP SLAVE, and came up with some ideas for how to extend this to handle parallel replication. In existing code, STOP SLAVE ends up in terminate_slave_threads(). The interesting part here is the SQL thread; stopping the IO thread should not really be affected by the parallel replication feature. What happens is basically this: mi->rli.abort_slave=1; terminate_slave_thread(mi->rli.sql_thd, &mi->rli.run_lock, &mi->rli.stop_cond, &mi->rli.slave_running) So the rli->abort_slave is the flag by which main server can tell the SQL thread to stop. What terminate_slave_thread() does is to repeatedly execute the following every 2 seconds until rli->slave_running becomes false: pthread_kill(thd->real_id, SIGALRM); // Or SIGUSR1 thd->awake(NOT_KILLED); Since it uses NOT_KILLED, I assume this means that any currrently executing event/query is never terminated by STOP SLAVE. It seems to me the only thing this can wake up is if the SQL thread is waiting for more events to arrive in the relay log, but maybe there are other things I did not think of.
From the SQL thread's side, the rli->abort_slave flag is checked in sql_slave_killed(). This function is checked in a few places, basically when waiting for a new event in the relay log and before executing a new event. So normally, once STOP SLAVE sets rli->abort_slave, the SQL thread will complete execution of the current event, if any, and then stop.
However, if the SQL thread is in the middle of executing an event group that modifies non-transactional tables, then there are changes that cannot be rolled back, so it is not safe to just exit in the middle of the event group. In this case, more events are executed, until either the event group is completed, or a fixed timeout of 60 seconds has elapsed. -- It seems fairly straight-forward to extend this to work in the parallel case: - terminate_slave_thread() should be extended so that it also signals any active worker threads for that master connection. So we should introduce rgi->abort_slave, and set it during STOP SLAVE for all queued rgi entries. - Then all the worker threads should check for rgi->abort_slave in appropriate places using a similar function to sql_slave_killed(). If the worker thread is in the middle of executing an event group with non-transactional updates, then it should try to finish that group with a timeout, else it should stop. Then when a worker thread ends the current event group, it should roll back any active transaction, unregister itself if it had a wait for a previous commit, and wakeup any other transactions that might be waiting for it. So overall, STOP SLAVE will set an abort_slave flag both for the main SQL thread and for any active worker threads. These threads will then stop once they have finished executing the current event (or possibly event group). And the rpl_parallel::wait_for_done() function can be used as it currently is to make sure that all the workers have time to complete or abort before the main SQL thread exits. Seems simple enough. I suggest that we do this after we have implemented error handling (the case where a query fails in some worker and the slave has to abort). The normal stop case would probably integrate naturally into the same mechanisms for propagating errors between worker threads and the main SQL thread. - Kristian.
Kristian Nielsen <knielsen@knielsen-hq.org> writes:
So we should introduce rgi->abort_slave, and set it during STOP SLAVE for all queued rgi entries.
Actually, we do not need to introduce another flag. We can just check rgi->rli->abort_slave. - Kristian.
participants (2)
-
Kristian Nielsen
-
Michael Widenius