Hi, Kristian! Please, find the review below! I had only few minor comments. On Sep 14, Kristian Nielsen wrote:
At http://bazaar.launchpad.net/~maria-captains/maria/10.0
------------------------------------------------------------ revno: 3435 revision-id: knielsen@knielsen-hq.org-20120914124453-zsap6hjclq3vrb6n parent: knielsen@knielsen-hq.org-20120913123129-kaujy4cw0jc9o08k committer: knielsen@knielsen-hq.org branch nick: work-10.0-mdev225-181-232 timestamp: Fri 2012-09-14 14:44:53 +0200 message: MDEV-532: Async InnoDB commit checkpoint.
Make the commit checkpoint inside InnoDB be asynchroneous. Implement a background thread in binlog to do the writing and flushing of binlog checkpoint events to disk.
=== modified file 'sql/log.cc' --- a/sql/log.cc 2012-09-13 12:31:29 +0000 +++ b/sql/log.cc 2012-09-14 12:44:53 +0000 @@ -106,6 +107,14 @@ static SHOW_VAR binlog_status_vars_detai {NullS, NullS, SHOW_LONG} };
+/* Variables for the binlog background thread. */
Please add to the comment: ", protected by the LOCK_binlog_thread mutex"
+static bool binlog_thread_started= false; +static bool binlog_background_thread_stop= false;
I'd prefer if both variables would start from "binlog_thread" or both from "binlog_background_thread". Either way, but the same for both variables. Also, the mutex is LOCK_binlog_thread. The function is start_binlog_background_thread(). The condition is COND_binlog_thread. The queue is binlog_background_thread_queue. Could you somehow decide whether the official name for it is a "binlog background thread" or simply a "binlog thread" and use it consistently everywhere? It's also important for the configuration or status variables, if any, for the warnings, status or error messages, for the documentation pages, etc.
+static MYSQL_BIN_LOG::xid_count_per_binlog * + binlog_background_thread_queue= NULL; + +static bool start_binlog_background_thread(); +
/** purge logs, master and slave sides both, related error code @@ -8116,9 +8145,128 @@ int TC_LOG_BINLOG::unlog(ulong cookie, m void TC_LOG_BINLOG::commit_checkpoint_notify(void *cookie) { - mark_xid_done(((xid_count_per_binlog *)cookie)->binlog_id, true); + xid_count_per_binlog *entry= static_cast<xid_count_per_binlog *>(cookie); + mysql_mutex_lock(&LOCK_binlog_thread); + entry->next_in_queue= binlog_background_thread_queue; + binlog_background_thread_queue= entry; + mysql_cond_signal(&COND_binlog_thread); + mysql_mutex_unlock(&LOCK_binlog_thread); }
+/* + Binlog service thread. + + This thread is used to log binlog checkpoints in the background, rather than + in the context of random storage engine threads that happen to call + commit_checkpoint_notify_ha() and may not like the delays while syncing + binlog to disk or may not be setup with all my_thread_init() and other + necessary stuff. + + In the future, this thread could also be used to do log rotation in the + background, which could elimiate all stalls around binlog rotations. +*/ +pthread_handler_t +binlog_background_thread(void *arg __attribute__((unused))) +{ + bool stop; + MYSQL_BIN_LOG::xid_count_per_binlog *queue, *next; + THD *thd; + + my_thread_init(); + thd= new THD; + thd->system_thread= SYSTEM_THREAD_BINLOG_BACKGROUND; + my_pthread_setspecific_ptr(THR_THD, thd);
^^^ the normal way of setting up a THD, would be to call thd->store_globals(). Why did that not work for you and you had to resort to the manual my_pthread_setspecific_ptr() ?
+ mysql_mutex_lock(&LOCK_thread_count); + thd->thread_id= thread_id++; + mysql_mutex_unlock(&LOCK_thread_count); + + for (;;) + { + /* + Wait until there is something in the queue to process, or we are asked + to shut down. + */ + thd_proc_info(thd, "Waiting for background binlog tasks");
In 10.0 you need to use THD_STAGE_INFO() (here and below - everywhere instead of thd_proc_info)
+ mysql_mutex_lock(&mysql_bin_log.LOCK_binlog_thread); + for (;;) + { + stop= binlog_background_thread_stop; + queue= binlog_background_thread_queue; + if (stop || queue) + break; + mysql_cond_wait(&mysql_bin_log.COND_binlog_thread, + &mysql_bin_log.LOCK_binlog_thread); + } + /* Grab the queue, if any. */ + binlog_background_thread_queue= NULL; + mysql_mutex_unlock(&mysql_bin_log.LOCK_binlog_thread); ... === modified file 'storage/innobase/handler/ha_innodb.cc' --- a/storage/innobase/handler/ha_innodb.cc 2012-09-13 12:31:29 +0000 +++ b/storage/innobase/handler/ha_innodb.cc 2012-09-14 12:44:53 +0000 @@ -3008,6 +3015,16 @@ innobase_rollback_trx( DBUG_RETURN(convert_error_code_to_mysql(error, 0, NULL)); }
+ +struct pending_checkpoint { + struct pending_checkpoint *next; + handlerton *hton; + void *cookie; + ib_uint64_t lsn; +}; +static struct pending_checkpoint *pending_checkpoint_list; +static struct pending_checkpoint *pending_checkpoint_list_end; + /*****************************************************************//** Handle a commit checkpoint request from server layer. We simply flush the redo log immediately and do the notify call.*/
The comment seems to be wrong now, doesn't it?
@@ -3017,8 +3034,113 @@ innobase_checkpoint_request( handlerton *hton, void *cookie) { - log_buffer_flush_to_disk(); - commit_checkpoint_notify_ha(hton, cookie); + ib_uint64_t lsn; + ib_uint64_t flush_lsn; + struct pending_checkpoint * entry; + + /* Do the allocation outside of lock to reduce contention. The normal + case is that not everything is flushed, so we will need to enqueue. */ + entry = static_cast<struct pending_checkpoint *> + (my_malloc(sizeof(*entry), MYF(MY_WME))); + if (!entry) { + sql_print_error("Failed to allocate %u bytes." + " Commit checkpoint will be skipped.", + static_cast<unsigned>(sizeof(*entry))); + return; + } + + entry->next = NULL; + entry->hton = hton;
why do you need to remember the handlerton here? It's always innodb_hton_ptr, isn't it? Why do you even pass hton down as an argument, to be able to reuse one checkpoint request function for many engines?
+ entry->cookie = cookie; + + mysql_mutex_lock(&pending_checkpoint_mutex); + lsn = log_get_lsn(); + flush_lsn = log_get_flush_lsn(); + if (lsn > flush_lsn) { + /* Put the request in queue. + When the log gets flushed past the lsn, we will remove the + entry from the queue and notify the upper layer. */ + entry->lsn = lsn; + if (pending_checkpoint_list_end) { + pending_checkpoint_list_end->next = entry;
please add an assert here that pending_checkpoint_list_end->lsn < entry->lsn
+ } else { + pending_checkpoint_list = entry; + } + pending_checkpoint_list_end = entry; + entry = NULL; + } + mysql_mutex_unlock(&pending_checkpoint_mutex); + + if (entry) { + /* We are already flushed. Notify the checkpoint immediately. */ + commit_checkpoint_notify_ha(entry->hton, entry->cookie); + my_free(entry);
I'm a bit worried, whether you're going to do a lot of small allocations. Perhaps it'd be better to use an allocator of a fixed-size objects, with a pool. To avoid going to the heap too often. The same applies to the new code log.cc. Lock-free allocator can do that. May be we have other fixed-size allocators with a pool too.
+ } +}
Regards, Sergei