Hi Mark,

On Dec 26, 2009, at 3:40 AM, MARK CALLAGHAN wrote:
InnoDB fixed group commit in the InnoDB plugin. This performs as expected when the binlog is disabled. This does not perform as I expect when the binlog is enabled.
Is this a problem for PBXT?
PBXT is also affected by the lack of group commit on the binlog. As Sergei mentioned, most of the other problems come from the need to support statement-based replication, which is not supported by PBXT.
The problems for InnoDB are:
1) commit is serialized on the binlog write/fsync
2) row locks are not released until the commit step of XA prepare/commit
3) per-table auto-inc locks are not released until the commit step of XA
I think that 2) and 3) can be fixed without significant changes. They cause a lot of convoys today for high-throughput OLTP -- too many connections needlessly wait on row locks and the per-table auto-inc lock. Doing the binlog fsync one connection at a time also causes a lot of convoys. This makes MySQL much slower than it should be for some workloads even with battery backed RAID write caches.
Problem 1) occurs because:
* there is no group commit for the binlog fsync
Yes, and this will remain so, as long as the transactions are not interleaved in the binlog. With RBR this should be possible.
* InnoDB locks prepare_commit_mutex in the prepare step
What is the purpose of this lock?
Even if there were group commit for the binlog fsync, it would be useless for InnoDB because prepare_commit_mutex is locked in the prepare step and not unlocked until the commit step and the binlog write/fsync is done between these two steps.
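To put rough numbers on this point, here is a minimal sketch in Python of a hypothetical cost model. It is not MySQL code. It assumes that a group commit implementation would flush every transaction waiting at the binlog stage with a single fsync, and the parameter `max_in_binlog_stage` (a name made up for this example) caps how many transactions can wait there at once. While prepare_commit_mutex is held from prepare to commit, that cap is 1.

```python
import math


def binlog_fsyncs(num_trx, max_in_binlog_stage):
    """Number of binlog fsyncs needed to commit num_trx transactions.

    Assumption: one fsync flushes every transaction currently waiting in the
    binlog stage, and at most max_in_binlog_stage transactions can wait there.
    """
    if num_trx < 0 or max_in_binlog_stage < 1:
        raise ValueError("num_trx must be >= 0 and max_in_binlog_stage >= 1")
    return math.ceil(num_trx / max_in_binlog_stage)


# prepare_commit_mutex held across the binlog write: only one transaction
# can be in the binlog stage, so group commit cannot form a group.
with_mutex = binlog_fsyncs(100, 1)

# Hypothetical: mutex not held across the binlog write, up to 20 waiters.
without_mutex = binlog_fsyncs(100, 20)
```

Under these assumed numbers the model gives 100 fsyncs with the mutex held and 5 without it. The figure of 20 waiters is only an illustration, not a measurement.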
There is a MySQL worklog for this (4007) that:
* doesn't intend to add group commit for the binlog fsync
* doesn't mention the problem of prepare_commit_mutex
I have started to work on this, but don't have any code to share yet.
Pseudo-code for commit with the InnoDB plugin when the binlog is enabled:
ha_commit_trans()
  * ht->prepare() == innobase_xa_prepare()
      o trx_prepare_for_mysql(trx)
          + force to disk the trx log buffer for all changes from this trx
          + fsync done here, group prepare may amortize that
      o lock prepare_commit_mutex
  * tc_log->log_xid(thd, xid)
      o writes SQL to binlog, XID to binlog, optionally fsync binlog
  * ha_commit_one_phase()
      o ht->commit() == innobase_commit()
          + innobase_commit_low()
              # write commit record to trx log buffer, release locks from this trx
              # for auto-commit statements, the per-table auto-inc lock is released here
          + unlock prepare_commit_mutex
          + trx_commit_complete_for_mysql()
              # force to disk the trx log buffer including commit record for this trx
              # fsync done here, group commit may amortize that
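The sequence above can be sketched as a toy model. This is a hedged illustration in Python, not InnoDB code: the function names and event labels are hypothetical stand-ins for the steps in the pseudo-code, and a `threading.Lock` plays the role of prepare_commit_mutex. It only shows that, with the mutex taken in the prepare step and released in the commit step, one transaction at a time can be in the binlog write.

```python
import threading

# Hypothetical stand-in for InnoDB's prepare_commit_mutex.
prepare_commit_mutex = threading.Lock()

# Shared event trace; guarded by its own lock so appends stay ordered.
trace = []
trace_lock = threading.Lock()


def record(trx_id, step):
    with trace_lock:
        trace.append((trx_id, step))


def commit_transaction(trx_id):
    """Toy model of ha_commit_trans() with the binlog enabled."""
    # ht->prepare(): flush the prepare record (group prepare may amortize it).
    record(trx_id, "prepare_fsync")
    prepare_commit_mutex.acquire()
    record(trx_id, "lock")
    try:
        # tc_log->log_xid(): binlog write and optional fsync, under the mutex.
        record(trx_id, "binlog_write")
        # innobase_commit_low(): commit record written, locks released.
        record(trx_id, "commit_low")
        record(trx_id, "unlock")
    finally:
        prepare_commit_mutex.release()
    # trx_commit_complete_for_mysql(): flush the commit record.
    record(trx_id, "commit_fsync")


def run(n):
    """Run n concurrent toy transactions and return the event trace."""
    trace.clear()
    threads = [threading.Thread(target=commit_transaction, args=(i,))
               for i in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return list(trace)


def binlog_is_serialized(events):
    """True if no transaction's events fall between another's lock and unlock."""
    holder = None
    for trx_id, step in events:
        if step in ("prepare_fsync", "commit_fsync"):
            continue  # these steps run outside the mutex
        if step == "lock":
            if holder is not None:
                return False
            holder = trx_id
        elif trx_id != holder:
            return False
        if step == "unlock":
            holder = None
    return True
```

Calling `binlog_is_serialized(run(8))` checks the trace of eight concurrent toy transactions. The prepare and commit fsyncs may interleave freely, but the binlog writes never overlap, which is the serialization described in problem 1).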
-- Mark Callaghan mdcallag@gmail.com
--
Paul McCullagh
PrimeBase Technologies
www.primebase.org
www.blobstreaming.org
pbxt.blogspot.com