InnoDB fixed group commit in the InnoDB plugin. This performs as expected when the binlog is disabled. This does not perform as I expect when the binlog is enabled. Is this a problem for PBXT?

The problems for InnoDB are:
1) commit is serialized on the binlog write/fsync
2) row locks are not released until the commit step of XA prepare/commit
3) per-table auto-inc locks are not released until the commit step of XA

I think that 2) and 3) can be fixed without significant changes. They cause a lot of convoys today for high-throughput OLTP: too many connections needlessly wait on row locks and the per-table auto-inc lock. Doing the binlog fsync one connection at a time also causes a lot of convoys. This makes MySQL much slower than it should be for some workloads even with battery backed RAID write caches.

Problem 1) occurs because:
* there is no group commit for the binlog fsync
* InnoDB locks prepare_commit_mutex in the prepare step

Even if there were group commit for the binlog fsync, it would be useless for InnoDB because prepare_commit_mutex is locked in the prepare step and not unlocked until the commit step, and the binlog write/fsync is done between these two steps.

There is a MySQL worklog for this (4007) that:
* doesn't intend to add group commit for the binlog fsync
* doesn't mention the problem of prepare_commit_mutex

I have started to work on this, but don't have any code to share yet.
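
To make the cost of a one-at-a-time binlog fsync concrete, here is a toy model. It is not MySQL code. It assumes every fsync costs the same fixed time and that a single fsync makes durable every transaction whose binlog write had finished before that fsync started. The function name count_fsyncs and all timings are made up for illustration.

```python
def count_fsyncs(arrival_times_ms, fsync_cost_ms, group_commit):
    """Toy model of binlog fsync scheduling (hypothetical, not MySQL code).

    arrival_times_ms: when each transaction finishes its binlog write.
    Returns (number of fsyncs issued, time the last transaction is durable).
    """
    pending = sorted(arrival_times_ms)
    now = 0
    fsyncs = 0
    i = 0
    while i < len(pending):
        # The device sits idle until the next transaction has written its events.
        now = max(now, pending[i])
        if group_commit:
            # Assumption: one fsync covers everything written up to 'now'.
            j = i
            while j < len(pending) and pending[j] <= now:
                j += 1
        else:
            # Serialized: each transaction pays for its own fsync.
            j = i + 1
        now += fsync_cost_ms
        fsyncs += 1
        i = j
    return fsyncs, now


if __name__ == "__main__":
    arrivals = [0, 100, 200, 300, 400, 500, 600, 700]
    print(count_fsyncs(arrivals, 1000, group_commit=False))
    print(count_fsyncs(arrivals, 1000, group_commit=True))
```

With eight transactions arriving 100 ms apart and a deliberately exaggerated 1000 ms fsync, the serialized path issues 8 fsyncs and finishes at 8000 ms, while the grouped path issues 2 and finishes at 2000 ms. The model only covers the binlog fsync. It ignores prepare_commit_mutex, which, as described above, would keep InnoDB from benefiting even if the binlog did group its fsyncs.
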
Pseudo-code for commit with the InnoDB plugin when the binlog is enabled:

ha_commit_trans()
  * ht->prepare() == innobase_xa_prepare()
      o trx_prepare_for_mysql(trx)
          + force to disk the trx log buffer for all changes from this trx
          + fsync done here, group prepare may amortize that
      o lock prepare_commit_mutex
  * tc_log->log_xid(thd, xid)
      o writes SQL to binlog, XID to binlog, optionally fsync binlog
  * ha_commit_one_phase()
      o ht->commit() == innobase_commit()
          + innobase_commit_low()
              # write commit record to trx log buffer, release locks from this trx
              # for auto-commit statements, the per-table auto-inc lock is released here
          + unlock prepare_commit_mutex
          + trx_commit_complete_for_mysql()
              # force to disk the trx log buffer including commit record for this trx
              # fsync done here, group commit may amortize that

--
Mark Callaghan
mdcallag@gmail.com