[Maria-developers] commit performance when the binlog is enabled
InnoDB fixed group commit in the InnoDB plugin. This performs as expected when the binlog is disabled. This does not perform as I expect when the binlog is enabled. Is this a problem for PBXT?

The problems for InnoDB are:
1) commit is serialized on the binlog write/fsync
2) row locks are not released until the commit step of XA prepare/commit
3) per-table auto inc locks not released until the commit step of XA

I think that 2) and 3) can be fixed without significant changes. They cause a lot of convoys today for high-throughput OLTP -- too many connections needlessly wait on row locks and the per-table auto-inc lock. Doing the binlog fsync one connection at a time also causes a lot of convoys. This makes MySQL much slower than it should be for some workloads even with battery-backed RAID write caches.

Problem 1) occurs because:
* there is no group commit for the binlog fsync
* InnoDB locks prepare_commit_mutex in the prepare step

Even if there were group commit for the binlog fsync, it would be useless for InnoDB because prepare_commit_mutex is locked in the prepare step and not unlocked until the commit step and the binlog write/fsync is done between these two steps.

There is a MySQL worklog for this (4007) that:
* doesn't intend to add group commit for the binlog fsync
* doesn't mention the problem of prepare_commit_mutex

I have started to work on this, but don't have any code to share yet.
Pseudo-code for commit with the InnoDB plugin when the binlog is enabled:

ha_commit_trans()
  * ht->prepare() == innobase_xa_prepare()
      o trx_prepare_for_mysql(trx)
          + force to disk the trx log buffer for all changes from this trx
          + fsync done here, group prepare may amortize that
      o lock prepare_commit_mutex
  * tc_log->log_xid(thd, xid)
      o writes SQL to binlog, XID to binlog, optionally fsync binlog
  * ha_commit_one_phase()
      o ht->commit() == innobase_commit()
          + innobase_commit_low()
              # write commit record to trx log buffer, release locks from this trx
              # for auto-commit statements, the per-table auto-inc lock is released here
          + unlock prepare_commit_mutex
          + trx_commit_complete_for_mysql()
              # force to disk the trx log buffer including commit record for this trx
              # fsync done here, group commit may amortize that

--
Mark Callaghan
mdcallag@gmail.com
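The effect of the pseudo-code above can be illustrated with a small model. This is only a sketch, not InnoDB code: a Python lock stands in for prepare_commit_mutex, two lists stand in for the binlog and the InnoDB transaction log, and the function name run_commits is made up for this example. It shows that because the mutex is held from the end of prepare until the commit record is written, the binlog write happens one transaction at a time and both logs end up in the same order.

```python
import threading

def run_commits(n):
    """Commit n transactions concurrently; return (binlog order, commit-record order)."""
    prepare_commit_mutex = threading.Lock()  # stand-in for InnoDB's prepare_commit_mutex
    binlog = []        # XIDs in the order they are written to the binlog
    commit_log = []    # XIDs in the order commit records enter the trx log buffer

    def commit_transaction(xid):
        # innobase_xa_prepare(): the prepare fsync would happen here, before the mutex
        with prepare_commit_mutex:
            binlog.append(xid)      # tc_log->log_xid(): binlog write/fsync, one trx at a time
            commit_log.append(xid)  # innobase_commit_low(): commit record, locks released
        # trx_commit_complete_for_mysql(): the commit fsync would happen here, after unlock

    threads = [threading.Thread(target=commit_transaction, args=(xid,)) for xid in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return binlog, commit_log

binlog, commit_log = run_commits(8)
print(binlog == commit_log)
```

The arrival order of the threads varies from run to run, but the two lists always agree with each other, which is the ordering property discussed later in this thread.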
Hi, MARK! On Dec 25, MARK CALLAGHAN wrote:
InnoDB fixed group commit in the InnoDB plugin. This performs as expected when the binlog is disabled. This does not perform as I expect when the binlog is enabled.
The problems for InnoDB are: 1) commit is serialized on the binlog write/fsync 2) row locks are not released until the commit step of XA prepare/commit 3) per-table auto inc locks not released until the commit step of XA
I think that 2) and 3) can be fixed without significant changes.
It's not that easy, I think.

What does InnoDB need locks for? Not for protecting uncommitted changes - it uses versioning for that. For serializability (when innodb_locks_unsafe_for_binlog=true or on SERIALIZABLE level) and for explicit SELECT ... IN SHARE MODE or FOR UPDATE. Explicit locks are typically used when one reads the data and later modifies them in the same transaction based on the read values, right?

After xa_prepare no data can be modified anymore, so it's safe to release these explicit locks.

If InnoDB locks were protecting uncommitted data from being seen by another transaction, they would have to stay until commit - but InnoDB doesn't use locks for this. Safe too.

But locks that help to maintain serializability still have to be released on commit, I'm afraid. Otherwise you'll have

trn1> start transaction; insert t1 select * from t2;
trn1> commit;
trn1>> ... xa_prepare() ...
trn2> start transaction; insert t2 values (1); commit;
trn2>> xa_prepare(); binlog.write(); xa_commit();
trn1> ... binlog.write(); xa_commit();

and you have incorrect transaction order in binlog.

To summarize - you can release InnoDB locks on prepare only if innodb_locks_unsafe_for_binlog=false or RBR, and not SERIALIZABLE. Which could be the only case you care about anyway :)

Regards / Mit vielen Grüßen,
Sergei

--
Sergei Golubchik <serg@sun.com>
Principal Software Engineer/Server Architect
Sun Microsystems GmbH, HRB München 161028
Sonnenallee 1, 85551 Kirchheim-Heimstetten
Geschäftsführer: Thomas Schroeder, Wolfgang Engels, Wolf Frenkel
Vorsitzender des Aufsichtsrates: Martin Häring
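The trn1/trn2 interleaving above can be replayed as a toy model. This is a sketch under simplifying assumptions: tables are plain Python lists, statement-based replication is modelled as re-running the same statements in binlog order, and the helper names (apply_statements, trn1, trn2) are invented for the illustration.

```python
def apply_statements(statements):
    """Run statements in the given order against empty tables t1 and t2."""
    tables = {"t1": [], "t2": []}
    for stmt in statements:
        stmt(tables)
    return tables

def trn1(tables):
    # insert t1 select * from t2
    tables["t1"].extend(tables["t2"])

def trn2(tables):
    # insert t2 values (1)
    tables["t2"].append(1)

# On the master trn1 read t2 before trn2's row existed, so the
# effective serial order there is trn1, then trn2.
master = apply_statements([trn1, trn2])

# The binlog in the scenario above has trn2 first, so a replica
# using statement-based replication replays trn2, then trn1.
replica = apply_statements([trn2, trn1])

print(master)
print(replica)
```

The master ends with an empty t1 while the replica ends with one row in t1, which is the divergence that the misordered binlog would cause.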
On Mon, Dec 28, 2009 at 9:20 AM, Sergei Golubchik <sergii@pisem.net> wrote:
Hi, MARK!
On Dec 25, MARK CALLAGHAN wrote:
InnoDB fixed group commit in the InnoDB plugin. This performs as expected when the binlog is disabled. This does not perform as I expect when the binlog is enabled.
The problems for InnoDB are: 1) commit is serialized on the binlog write/fsync 2) row locks are not released until the commit step of XA prepare/commit 3) per-table auto inc locks not released until the commit step of XA
I think that 2) and 3) can be fixed without significant changes.
It's not that easy, I think.
What does InnoDB need locks for? Not for protecting uncommitted changes - it uses versioning for that. For serializability (when innodb_locks_unsafe_for_binlog=true or on SERIALIZABLE level) and for explicit SELECT ... IN SHARE MODE or FOR UPDATE. Explicit locks are typically used when one reads the data and later modifies them in the same transaction based on the read values, right?
After xa_prepare no data can be modified anymore, it's safe to release these explicit locks.
If InnoDB locks were protecting uncommitted data from being seen by another transaction, they would have to stay until commit - but InnoDB doesn't use locks for this. Safe too.
But locks that help to maintain serializability still have to be released on commit, I'm afraid. Otherwise you'll have
trn1> start transaction; insert t1 select * from t2; trn1> commit; trn1>> ... xa_prepare() ...
trn2> start transaction; insert t2 values (1); commit; trn2>> xa_prepare(); binlog.write(); xa_commit();
trn1> ... binlog.write(); xa_commit();
and you have incorrect transaction order in binlog.
There are several issues here:

* for SBR, trn1 cannot release row locks until it is guaranteed that it writes the binlog ahead of any dependent transactions. This is guaranteed by locking prepare_commit_mutex at the end of innobase_xa_prepare and not unlocking until row locks are released during the call to innobase_commit.

* at least for the plugin the order in which InnoDB prepare is done might not match the order in which transactions are written to the binlog. InnoDB locks prepare_commit_mutex in innobase_xa_prepare after doing a prepare (the call to trx_prepare_for_mysql). It is unlocked after the commit record is written to the InnoDB transaction buffer and before that buffer is flushed to disk. What does match today is the order of transactions in the binlog and the commit records in the InnoDB transaction log.

* Traditional implementations of group commit require releasing locks earlier in the commit cycle. Group commit works by pausing commit processing in the hope that other commits will be done so they can share 1 fsync. It is a bad idea to hold locks during this pause.

I don't know whether InnoDB requires:
1) that transactions in the binlog and commit records in the innodb transaction log record things in the same order, or
2) all of 1) above and the binlog is at most one trx ahead of the innodb transaction log

prepare_commit_mutex provides 2) today and that makes group commit for the binlog unlikely or impossible. I am trying to determine myself whether 2) is required and get an answer from the InnoDB team. If 1) is required instead of 2) then group commit on the binlog is possible for InnoDB.

Group commit with SBR is possible as long as the per-transaction lock release order determines the order in which the binlog is written.

--
Mark Callaghan
mdcallag@gmail.com
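The "pause and share one fsync" idea described above can be sketched as a simple batching rule. This is an illustration only, not how any MySQL or InnoDB release implements it: the function count_fsyncs, the arrival times and the `window` parameter are all hypothetical. The first commit to arrive becomes a group leader and waits `window` time units; every commit that arrives within that window shares the leader's fsync.

```python
def count_fsyncs(arrival_times, window):
    """Count fsyncs when each group leader waits `window` time units for followers."""
    fsyncs = 0
    group_deadline = None
    for t in sorted(arrival_times):
        if group_deadline is None or t > group_deadline:
            fsyncs += 1                 # this commit starts a new group and pays for an fsync
            group_deadline = t + window
        # otherwise the commit joins the current group and shares its fsync
    return fsyncs

arrivals = [0, 1, 2, 10, 11, 30]        # made-up commit arrival times
print(count_fsyncs(arrivals, window=0))
print(count_fsyncs(arrivals, window=5))
```

With no pause every commit pays for its own fsync; with a pause of 5 the six commits above form three groups. The cost is that each leader sleeps for the window, which is why holding row locks across that sleep is a concern.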
Hi, MARK! On Dec 29, MARK CALLAGHAN wrote:
On Mon, Dec 28, 2009 at 9:20 AM, Sergei Golubchik <sergii@pisem.net> wrote:
trn1> start transaction; insert t1 select * from t2; trn1> commit; trn1>> ... xa_prepare() ...
trn2> start transaction; insert t2 values (1); commit; trn2>> xa_prepare(); binlog.write(); xa_commit();
trn1> ... binlog.write(); xa_commit();
and you have incorrect transaction order in binlog.
There are several issues here: * for SBR, trn1 cannot release row locks until it is guaranteed that it writes the binlog ahead of any dependent transactions. This is guaranteed by locking prepare_commit_mutex at the end of innobase_xa_prepare and not unlocking until row locks are released during the call to innobase_commit.
I don't see what prepare_commit_mutex has to do with it. It is guaranteed by row locks released at commit time, no matter whether prepare_commit_mutex exists or not.
* at least for the plugin the order in which InnoDB prepare is done might not match the order in which transactions are written to the binlog. InnoDB locks prepare_commit_mutex in innobase_xa_prepare after doing a prepare (the call to trx_prepare_for_mysql). It is unlocked after the commit record is written to the InnoDB transaction buffer and before that buffer is flushed to disk. What does match today is the order of transactions in the binlog and the commit records in the InnoDB transaction log.
Yes, and this is what prepare_commit_mutex is for.

Regards / Mit vielen Grüßen,
Sergei
On Tue, Dec 29, 2009 at 11:07 AM, Sergei Golubchik <sergii@pisem.net> wrote:
Hi, MARK!
On Dec 29, MARK CALLAGHAN wrote:
On Mon, Dec 28, 2009 at 9:20 AM, Sergei Golubchik <sergii@pisem.net> wrote:
trn1> start transaction; insert t1 select * from t2; trn1> commit; trn1>> ... xa_prepare() ...
trn2> start transaction; insert t2 values (1); commit; trn2>> xa_prepare(); binlog.write(); xa_commit();
trn1> ... binlog.write(); xa_commit();
and you have incorrect transaction order in binlog.
There are several issues here: * for SBR, trn1 cannot release row locks until it is guaranteed that it writes the binlog ahead of any dependent transactions. This is guaranteed by locking prepare_commit_mutex at the end of innobase_xa_prepare and not unlocking until row locks are released during the call to innobase_commit.
I don't see what prepare_commit_mutex has to do with it. It is guaranteed by row locks released at commit time, no matter whether prepare_commit_mutex exists or not.
Yes, prepare_commit_mutex isn't the issue here. I want to release row locks during the call to innobase_xa_prepare after trx_prepare_for_mysql() has been called. I expect the mythical group commit for the binlog to potentially pause (make a committing connection sleep) and I don't want the pause to occur when the transaction holds locks that may be blocking other transactions. If group commit for the binlog doesn't introduce a pause there isn't much chance of forming a group of transactions doing a binlog write/fsync concurrently.

If the row locks continue to be released during the call to innobase_commit (after the binlog write/fsync) as they are today, then convoys will form on the locks held by those transactions. These performance problems are limited to high-throughput workloads, but those are the workloads for which group commit is needed.

Synchronization will be needed to guarantee that the order of XID events in the binlog matches the order of commit records in InnoDB despite the changes mentioned above.

--
Mark Callaghan
mdcallag@gmail.com
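A back-of-the-envelope model shows why the point at which locks are released matters for a hot row. Everything here is an assumption made for illustration: the function names and the microsecond figures are invented, and real workloads will differ. The model only says that if every transaction touching one row must hold its lock for `hold` microseconds, at most 1,000,000 // hold such transactions can complete per second.

```python
def lock_hold_us(exec_us, prepare_fsync_us, binlog_wait_us, release_at_prepare):
    """Microseconds a row lock is held by one transaction (toy model)."""
    held = exec_us + prepare_fsync_us       # locks are held through trx_prepare_for_mysql()
    if not release_at_prepare:
        held += binlog_wait_us              # today: also held across the binlog pause/write/fsync
    return held

def hot_row_commits_per_second(hold_us):
    """Upper bound on commits per second that all need the same row lock."""
    return 1_000_000 // hold_us

# Hypothetical timings: 200us of work, 100us prepare fsync,
# 1700us for the group-commit pause plus the binlog fsync.
today = lock_hold_us(200, 100, 1700, release_at_prepare=False)
early = lock_hold_us(200, 100, 1700, release_at_prepare=True)
print(today, hot_row_commits_per_second(today))
print(early, hot_row_commits_per_second(early))
```

Under these made-up numbers the lock is held for 2000us when released at commit and 300us when released at the end of prepare, so the bound on the contended row rises from 500 to 3333 commits per second.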
Hi Mark, On Dec 26, 2009, at 3:40 AM, MARK CALLAGHAN wrote:
InnoDB fixed group commit in the InnoDB plugin. This performs as expected when the binlog is disabled. This does not perform as I expect when the binlog is enabled.
Is this a problem for PBXT?
PBXT is also affected by the lack of group commit on the binlog. As Sergei mentioned, most other problems come from the need to support statement-based replication, which is not supported by PBXT.
The problems for InnoDB are: 1) commit is serialized on the binlog write/fsync 2) row locks are not released until the commit step of XA prepare/commit 3) per-table auto inc locks not released until the commit step of XA
I think that 2) and 3) can be fixed without significant changes. They cause a lot of convoys today for high-throughput OLTP -- too many connections needlessly wait on row locks and the per-table auto-inc lock. Doing the binlog fsync one connection at a time also causes a lot of convoys. This makes MySQL much slower than it should be for some workloads even with battery backed RAID write caches.
Problem 1) occurs because: * there is no group commit for the binlog fsync
Yes, and this will remain so, as long as the transactions are not interleaved in the binlog. With RBR this should be possible.
* InnoDB locks prepare_commit_mutex in the prepare step
What is the purpose of this lock?
Even if there were group commit for the binlog fsync, it would be useless for InnoDB because prepare_commit_mutex is locked in the prepare step and not unlocked until the commit step and the binlog write/fsync is done between these two steps.
There is a MySQL worklog for this (4007) that: * doesn't intend to add group commit for the binlog fsync * doesn't mention the problem of prepare_commit_mutex
I have started to work on this, but don't have any code to share yet.
Pseudo-code for commit with the InnoDB plugin when the binlog is enabled:
ha_commit_trans()
  * ht->prepare() == innobase_xa_prepare()
      o trx_prepare_for_mysql(trx)
          + force to disk the trx log buffer for all changes from this trx
          + fsync done here, group prepare may amortize that
      o lock prepare_commit_mutex
  * tc_log->log_xid(thd, xid)
      o writes SQL to binlog, XID to binlog, optionally fsync binlog
  * ha_commit_one_phase()
      o ht->commit() == innobase_commit()
          + innobase_commit_low()
              # write commit record to trx log buffer, release locks from this trx
              # for auto-commit statements, the per-table auto-inc lock is released here
          + unlock prepare_commit_mutex
          + trx_commit_complete_for_mysql()
              # force to disk the trx log buffer including commit record for this trx
              # fsync done here, group commit may amortize that
-- Mark Callaghan mdcallag@gmail.com
-- Paul McCullagh PrimeBase Technologies www.primebase.org www.blobstreaming.org pbxt.blogspot.com
Hi, Paul! On Dec 29, Paul McCullagh wrote:
On Dec 26, 2009, at 3:40 AM, MARK CALLAGHAN wrote:
* InnoDB locks prepare_commit_mutex in the prepare step
What is the purpose of this lock?
As far as I understand (disclaimer!), its purpose is to ensure that commit records in the InnoDB transactional log are written in the same order as Xid events in the binlog.

And the only reason for enforcing this order - as far as I understand - is innodb hotbackup. It reads InnoDB logs (as files) and grabs a copy of the binlog. And after recovery all data must be consistent. If the binlog contains more transactions than the innodb logs, it's no problem - the binlog can be truncated. But at no point can the binlog have *fewer* transactions.

If prepare_commit_mutex is removed, I can create an ordering of commits where the innodb log *always* has committed transactions that are not in the binlog.

Regards / Mit vielen Grüßen,
Sergei

P.S. Disclaimer: besides the last statement everything else is just my speculation about how innodb hot backup works.
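The consistency rule described above can be written down as a small check. This is a sketch of the rule as stated in this message, not of how InnoDB Hot Backup is actually implemented (which the message itself flags as speculation); the function name reconcile and the list-of-XIDs representation are assumptions for the example. A binlog copy that is ahead of the InnoDB log can be truncated to match; a binlog copy that is behind cannot be repaired.

```python
def reconcile(binlog_xids, innodb_committed_xids):
    """Return the binlog truncated to match the InnoDB log, or raise if that is impossible.

    Both arguments are lists of XIDs in log order. The check relies on the
    two logs recording commits in the same order.
    """
    n = len(innodb_committed_xids)
    if binlog_xids[:n] != innodb_committed_xids:
        raise ValueError("InnoDB log has commits that the binlog copy lacks or orders differently")
    return binlog_xids[:n]   # extra trailing binlog transactions are dropped

# Binlog ahead of the InnoDB log: fine, truncate the binlog.
print(reconcile([1, 2, 3, 4], [1, 2]))

# Binlog behind the InnoDB log: cannot be made consistent.
try:
    reconcile([1, 2], [1, 2, 3])
except ValueError as exc:
    print("inconsistent backup:", exc)
```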
On Tue, Dec 29, 2009 at 11:23 AM, Sergei Golubchik <sergii@pisem.net> wrote:
Hi, Paul!
On Dec 29, Paul McCullagh wrote:
On Dec 26, 2009, at 3:40 AM, MARK CALLAGHAN wrote:
* InnoDB locks prepare_commit_mutex in the prepare step
What is the purpose of this lock?
As far as I understand (disclaimer!), its purpose is to ensure that commit records in the InnoDB transactional log are written in the same order as Xid events in the binlog.
And the only reason for enforcing this order - as far as I understand - is innodb hotbackup. It reads InnoDB logs (as files) and grabs a copy of the binlog. And after recovery all data must be consistent. If the binlog contains more transactions than the innodb logs, it's no problem - the binlog can be truncated. But at no point can the binlog have *fewer* transactions.
If prepare_commit_mutex is removed, I can create an ordering of commits where innodb log *always* has committed transactions that are not in a binlog.
Is this a potential problem?
* order of transactions in the binlog doesn't match commit record order for InnoDB in the transaction log
* binlog rotation occurs
* last binlog has XIDs 1, 3, 5
* current binlog has XIDs 2, 4
* server crashes
* XID 5 is in state PREPARED (not committed) before the crash

If crash recovery uses the latest binlog then it won't know to roll back XID 5 during crash recovery.

I thought someone explained to me the constraints on binlog rotation that might be related to this, but I don't remember the details.

--
Mark Callaghan
mdcallag@gmail.com
Hi, MARK! On Dec 29, MARK CALLAGHAN wrote:
On Tue, Dec 29, 2009 at 11:23 AM, Sergei Golubchik <sergii@pisem.net> wrote:
Is this a potential problem? * order of transactions in the binlog doesn't match commit record order for InnoDB in the transaction log * binlog rotation occurs * last binlog has XIDs 1, 3, 5 * current binlog has XIDs 2, 4 * server crashes * XID 5 is in state PREPARED (not committed) before the crash
No, it's not a problem. Because on recovery MySQL only reads the *last* binlog, it needs to ensure somehow that the last binlog has all the information that recovery needs. Currently a simple solution is used - binlog rotation waits for all prepared transactions to commit. That is, you can be sure that the last binlog has only XIDs for committed transactions.
I thought someone explained to me the constraints on binlog rotation that might be related to this, but I don't remember the details.
That could've been me :)

Regards / Mit vielen Grüßen,
Sergei
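The rotation rule described in this reply ("binlog rotation waits for all prepared transactions to commit") can be sketched as a small bookkeeping class. This is only a model of the rule as stated here, not the server's code: the class BinlogRotationGate and its method names are invented, and a real implementation would block the rotating thread rather than return a flag.

```python
class BinlogRotationGate:
    """Track XIDs written to the current binlog whose engine commit is still pending."""

    def __init__(self):
        self.prepared = set()

    def log_xid(self, xid):
        # XID event written to the binlog; the engine commit has not happened yet
        self.prepared.add(xid)

    def commit(self, xid):
        # engine commit finished for this XID
        self.prepared.discard(xid)

    def can_rotate(self):
        # rotation is allowed only when no logged XID is still in PREPARED state
        return not self.prepared

gate = BinlogRotationGate()
for xid in (1, 3, 5):
    gate.log_xid(xid)
gate.commit(1)
gate.commit(3)
print(gate.can_rotate())   # XID 5 is still prepared
gate.commit(5)
print(gate.can_rotate())
```

In the scenario asked about above, XID 5 being PREPARED would hold back the rotation, so the old binlog could not be left behind with an uncommitted XID in it.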
participants (3)
- MARK CALLAGHAN
- Paul McCullagh
- Sergei Golubchik