-----------------------------------------------------------------------
                              WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: Efficient group commit for binary log
CREATION DATE..: Mon, 26 Apr 2010, 13:28
SUPERVISOR.....: Knielsen
IMPLEMENTOR....:
COPIES TO......: Serg
CATEGORY.......: Server-RawIdeaBin
TASK ID........: 116 (http://askmonty.org/worklog/?tid=116)
VERSION........: Server-9.x
STATUS.........: Un-Assigned
PRIORITY.......: 60
WORKED HOURS...: 49
ESTIMATE.......: 0 (hours remain)
ORIG. ESTIMATE.: 0

PROGRESS NOTES:

-=-=(Knielsen - Tue, 25 May 2010, 13:18)=-=-
High Level Description modified.

--- /tmp/wklog.116.old.14234 2010-05-25 13:18:07.000000000 +0000
+++ /tmp/wklog.116.new.14234 2010-05-25 13:18:07.000000000 +0000
@@ -21,3 +21,69 @@
 http://kristiannielsen.livejournal.com/12408.html
 http://kristiannielsen.livejournal.com/12553.html

+----
+
+Implementing group commit in MySQL faces some challenges from the handler
+plugin architecture:
+
+1. Because storage engine handlers have transaction logs separate from the
+MySQL binlog (and from each other), there are multiple fsync() calls per
+commit that need the group commit optimisation (2 per participating storage
+engine + 1 for binlog).
+
+2. The code handling commit is split in several places, in main server code
+and in storage engine code. With pluggable binlog it will be split even
+more. This requires a good abstract yet powerful API to be able to implement
+group commit simply and efficiently in plugins without the different parts
+having to rely on internals of the others.
+
+3. We want the order of commits to be the same in all engines participating in
+multiple transactions. This requirement is the reason that InnoDB currently
+breaks group commit with the infamous prepare_commit_mutex.
+
+While currently there is no server guarantee to get the same commit order in
+engines and binlog (except for the InnoDB prepare_commit_mutex hack), there
+are several reasons why this could be desirable:
+
+ - InnoDB hot backup needs to be able to extract a binlog position that is
+   consistent with the hot backup to be able to provision a new slave, and
+   this is impossible without imposing at least partial consistent ordering
+   between InnoDB and binlog.
+
+ - Other backup methods could have similar needs, e.g. XtraBackup or
+   `mysqldump --single-transaction`, to have consistent commit order between
+   binlog and storage engines without having to do FLUSH TABLES WITH READ
+   LOCK or similar expensive blocking operation. (Other backup methods, like
+   LVM snapshot, don't need consistent commit order, as they can restore
+   out-of-order commits during crash recovery using XA.)
+
+ - If we have consistent commit order, we can think about optimising commit
+   to need only one fsync (for binlog); lost commits in storage engines can
+   then be recovered from the binlog at crash recovery by re-playing against
+   the engine from a particular point in the binlog.
+
+ - With consistent commit order, we can get better semantics for START
+   TRANSACTION WITH CONSISTENT SNAPSHOT with multi-engine transactions (and
+   we could even get it to return also a matching binlog position).
+   Currently, this "CONSISTENT SNAPSHOT" can be inconsistent among multiple
+   storage engines.
+
+ - In InnoDB, the performance in the presence of hotspots can be improved if
+   we can release row locks early in the commit phase, but this requires
+   that we release them in the same order as commits in the binlog to ensure
+   consistency between master and slaves.
+
+ - There were some discussions around Galera [1] synchronous replication and
+   global transaction ID that it needed consistent commit order among
+   participating engines.
+
+ - I believe there could be other applications for guaranteed consistent
+   commit order, and that the architecture described in this worklog can
+   implement such guarantee with reasonable overhead.
+
+
+References:
+
+[1] Galera: http://www.codership.com/products/galera_replication
+

-=-=(Knielsen - Tue, 25 May 2010, 08:28)=-=-
More thoughts on and changes to the architecture. Got to something now that I
am satisfied with and that seems to be able to handle all issues.

Implement new prepare_ordered and commit_ordered handler methods and the
logic in ha_commit_trans().
Implement TC_LOG::group_log_xid() method and logic in ha_commit_trans().
Implement XtraDB part, using commit_ordered() rather than
prepare_commit_mutex.
Fix test suite failures.
Proof-of-concept patch series complete now.

Do initial benchmark, getting good results. With 64 threads, see 26x
improvement in queries-per-sec.

Next step: write up the architecture description.

Worked 21 hours and estimate 0 hours remain (original estimate increased by
21 hours).

-=-=(Knielsen - Wed, 12 May 2010, 06:41)=-=-
Started work on a Quilt patch series, refactoring the binlog code to prepare
for implementing the group commit, and working on the design of group commit
in parallel.

Found and fixed several problems in error handling when writing to binlog.

Removed redundant table map version locking.

Split binlog writing into two parts in preparation for group commit. When
ready to write to the binlog, threads enter a queue, and the first thread in
the queue handles the binlog writing for everyone. When it obtains the
LOCK_log, it first loops over all threads, executing the first part of binlog
writing (the write(2) syscall essentially). It then runs the second part
(fsync(2) essentially) only once, and then wakes up the remaining threads in
the queue.
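
The queueing scheme in the note above can be sketched roughly as follows.
This is a toy Python model under stated assumptions, not the server code:
GroupCommitLog, commit(), written and sync_count are hypothetical names, a
list append stands in for write(2), and a counter stands in for fsync(2).
The only name borrowed from the notes is LOCK_log, which _log_lock imitates.

```python
import threading


class GroupCommitLog:
    """Toy model: the first thread to enter an empty queue becomes leader."""

    def __init__(self):
        self._queue_lock = threading.Lock()  # protects the pending queue
        self._log_lock = threading.Lock()    # stands in for LOCK_log
        self._pending = []                   # entries waiting to be written
        self.written = []                    # simulated log file contents
        self.sync_count = 0                  # number of simulated fsync(2) calls

    def commit(self, payload):
        entry = {"payload": payload, "done": threading.Event()}
        with self._queue_lock:
            self._pending.append(entry)
            # Whoever finds the queue empty leads the next batch.
            is_leader = len(self._pending) == 1

        if not is_leader:
            # Followers sleep until a leader has written and synced for them.
            entry["done"].wait()
            return

        with self._log_lock:
            # Take everything queued so far, including the leader's own entry.
            with self._queue_lock:
                batch, self._pending = self._pending, []
            for queued in batch:
                self.written.append(queued["payload"])  # part one: write(2)
            self.sync_count += 1                        # part two: one fsync(2)

        # Wake up the rest of the batch.
        for queued in batch:
            queued["done"].set()


if __name__ == "__main__":
    log = GroupCommitLog()
    workers = [
        threading.Thread(target=log.commit, args=(f"trx-{i}",))
        for i in range(16)
    ]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    print(len(log.written), "commits,", log.sync_count, "simulated fsync calls")
```

How many commits share one simulated fsync depends on thread timing: the
longer the leader waits for the log lock, the larger the batch it picks up.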

Still to be done:

Finish the proof-of-concept group commit patch, by 1) implementing the
prepare_fast() and commit_fast() callbacks in handler.cc, 2) moving the
binlog thread enqueue from log_xid() to binlog_prepare_fast(), 3) moving the
fast part of InnoDB commit to innobase_commit_fast(), removing the
prepare_commit_mutex().

Write up the final design in this worklog.

Evaluate the design to see if we can do better/different.

Think about possible next steps, such as releasing InnoDB row locks early (in
innobase_prepare_fast), and doing crash recovery by replaying transactions
from the binlog (removing the need for engine durability and 2 of 3 fsync()
in commit).

Worked 28 hours and estimate 0 hours remain (original estimate increased by
28 hours).

-=-=(Serg - Mon, 26 Apr 2010, 14:10)=-=-
Observers changed: Serg


DESCRIPTION:

Currently, in order to ensure that the server can recover after a crash to a
state in which storage engines and binary log are consistent with each other,
it is necessary to use XA with durable commits for both storage engines
(innodb_flush_log_at_trx_commit=1) and binary log (sync_binlog=1).

This is _very_ expensive, since the server needs to do three fsync()
operations for every commit, as there is no working group commit when the
binary log is enabled.

The idea is to

 - Implement/fix group commit to work properly with the binary log enabled.

 - (Optionally) avoid the need to fsync() in the engine, and instead rely on
   replaying any lost transactions from the binary log against the engine
   during crash recovery.

For background see these articles:

  http://kristiannielsen.livejournal.com/12254.html
  http://kristiannielsen.livejournal.com/12408.html
  http://kristiannielsen.livejournal.com/12553.html

----

Implementing group commit in MySQL faces some challenges from the handler
plugin architecture:

1. Because storage engine handlers have transaction logs separate from the
MySQL binlog (and from each other), there are multiple fsync() calls per
commit that need the group commit optimisation (2 per participating storage
engine + 1 for binlog).

2. The code handling commit is split in several places, in main server code
and in storage engine code. With pluggable binlog it will be split even
more. This requires a good abstract yet powerful API to be able to implement
group commit simply and efficiently in plugins without the different parts
having to rely on internals of the others.

3. We want the order of commits to be the same in all engines participating in
multiple transactions. This requirement is the reason that InnoDB currently
breaks group commit with the infamous prepare_commit_mutex.

While currently there is no server guarantee to get the same commit order in
engines and binlog (except for the InnoDB prepare_commit_mutex hack), there
are several reasons why this could be desirable:

 - InnoDB hot backup needs to be able to extract a binlog position that is
   consistent with the hot backup to be able to provision a new slave, and
   this is impossible without imposing at least partial consistent ordering
   between InnoDB and binlog.

 - Other backup methods could have similar needs, e.g. XtraBackup or
   `mysqldump --single-transaction`, to have consistent commit order between
   binlog and storage engines without having to do FLUSH TABLES WITH READ
   LOCK or similar expensive blocking operation. (Other backup methods, like
   LVM snapshot, don't need consistent commit order, as they can restore
   out-of-order commits during crash recovery using XA.)

 - If we have consistent commit order, we can think about optimising commit
   to need only one fsync (for binlog); lost commits in storage engines can
   then be recovered from the binlog at crash recovery by re-playing against
   the engine from a particular point in the binlog.

 - With consistent commit order, we can get better semantics for START
   TRANSACTION WITH CONSISTENT SNAPSHOT with multi-engine transactions (and
   we could even get it to return also a matching binlog position).
   Currently, this "CONSISTENT SNAPSHOT" can be inconsistent among multiple
   storage engines.

 - In InnoDB, the performance in the presence of hotspots can be improved if
   we can release row locks early in the commit phase, but this requires
   that we release them in the same order as commits in the binlog to ensure
   consistency between master and slaves.

 - There were some discussions around Galera [1] synchronous replication and
   global transaction ID that it needed consistent commit order among
   participating engines.

 - I believe there could be other applications for guaranteed consistent
   commit order, and that the architecture described in this worklog can
   implement such guarantee with reasonable overhead.


References:

[1] Galera: http://www.codership.com/products/galera_replication


ESTIMATED WORK TIME

ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v3.5.9)