Hi! This is a review of the latest code pushed to your tree: bzr diff -r 3662.. Note that I will apply some of the trivial changes myself to the tree to speed up things (this is mostly about comments). === added file 'mysql-test/suite/rpl/r/rpl_parallel.result' --- mysql-test/suite/rpl/r/rpl_parallel.result 1970-01-01 00:00:00 +0000 +++ mysql-test/suite/rpl/r/rpl_parallel.result 2013-09-22 20:25:20 +0000 Great that you got these done! Will make my life much easier when I work the parallel replication code! <cut> === modified file 'sql/log.cc' --- sql/log.cc 2013-07-04 22:26:15 +0000 +++ sql/log.cc 2013-09-22 20:25:20 +0000 @@ -6542,27 +6542,93 @@ MYSQL_BIN_LOG::write_transaction_to_binl } } + +/* + Put a transaction that is ready to commit in the group commit queue. + The transaction is identified by the ENTRY object passed into this function. + + To facilitate group commit for the binlog, we first queue up ourselves in + this function. Then later the first thread to enter the queue waits for + the LOCK_log mutex, and commits for everyone in the queue once it gets the + lock. Any other threads in the queue just wait for the first one to finish + the commit and wake them up. This way, all transactions in the queue get + committed in a single disk operation. + + The return value of this function is TRUE if queued as the first entry in + the queue (meaning this is the leader), FALSE otherwise. I moved this last in the comment as: @retval TRUE If queued as the first entry in the queue (meaning this is the leader) @retval FALSE Otherwise + + The main work in this function is when the commit in one transaction has + been marked to wait for the commit of another transaction to happen + first. This is used to support in-order parallel replication, where + transactions can execute out-of-order but need to be committed in-order with + how they happened on the master. The waiting of one commit on another needs + to be integrated with the group commit queue, to ensure that the waiting + transaction can participate in the same group commit as the waited-for + transaction. + + So when we put a transaction in the queue, we check if there were other + transactions already prepared to commit but just waiting for the first one + to commit. If so, we add those to the queue as well, transitively for all + waiters. +*/ + <cut> @@ -6574,10 +6640,28 @@ MYSQL_BIN_LOG::queue_for_group_commit(gr This would be natural to do with recursion, but we want to avoid potentially unbounded recursion blowing the C stack, so we use the list approach instead. + + We keep a list of all the waiters that need to be processed in `list', + linked through the next_subsequent_commit pointer. Initially this list + contains only the entry passed into this function. + + We process entries in the list one by one. The element currently being + processed is pointed to by `cur`, and the element at the end of the list + is pointed to by `last` (we do not use NULL to terminate the list). + + As we process an element, it is first added to the group_commit_queue. + Then any waiters for that element are added at the end of the list, to + be processed in subsequent iterations. This continues until the list + is exhausted, with all elements ever added eventually processed. + + The end result is a breath-first traversal of the tree of waiters, + re-using the next_subsequent_commit pointers in place of extra stack + space in a recursive traversal. I added: The temporary list created in next_subsequent_commit is not used by the caller or any other function. @@ -6670,6 +6763,18 @@ MYSQL_BIN_LOG::queue_for_group_commit(gr { if (list->wakeup_subsequent_commits_running) { + /* + ToDo: We should not need a full lock/unlock of LOCK_wait_commit + here. All we need is a single (full) memory barrier, to ensure that + the reads of the list above are not reordered with the write of + wakeup_subsequent_commits_running, or with the writes to the list + from other threads that is allowed to happen after + wakeup_subsequent_commits_running has been set to false. + + We do not currently have explicit memory barrier primitives in the + source tree, but if we get them the below mysql_mutex_lock() could + be replaced with a full memory barrier just before the loop. + */ ok (for now). Another way would of course to only do a mutex lock for the first entry we find... No more comments. Everything looked fine! I will start coordinate with you tomorrow about things to do. Regards, Monty