[Maria-developers] A problem with implementing Group Commit with Binlog with MyRocks
Hello,

This is about https://jira.mariadb.org/browse/MDEV-11934. I've encountered an
interesting issue here, so I thought I would consult both the MyRocks and
MariaDB lists.

== Some background ==

The "group commit with binlog" feature needs to accomplish two goals:

1. Keep the binlog and the storage engine in sync. This is done by employing
XA between the binlog and the storage engine. It works by making these calls:

  /*
    Make the transaction's changes ready to be committed (no conflicts with
    other transactions, etc.) but do not commit them yet. The effects of the
    prepare operation must be synced to disk, as the storage engine needs to
    be able to recover (i.e. commit) the prepared transaction after a crash.
  */
  storage_engine->prepare(sync=true);

  /*
    After this call, the transaction is considered committed. In case of a
    crash, the recovery process will use the contents of the binlog to
    determine which of the prepared transactions are to be committed and
    which are to be rolled back.
  */
  binlog->write(sync=true);

  /*
    Commit the transaction in the storage engine. This makes its changes
    visible to other transactions (and also releases its locks and so forth).
    Note that most of the time(*) we don't need to sync here. In case of a
    crash we will be able to recover using the binlog.
  */
  storage_engine->commit();

2. The second goal is to make the operation performant. We need two
coordinated disk flushes per transaction, so the idea is to do "Group
Commit", where multiple transactions share disk flushes.

So, we need to do group commit and keep the storage engine and the binlog in
sync while doing that.

== Group Commit with Binlog in MySQL ==

MySQL (and fb/mysql-5.6 in particular) does it in the following phases:

Phase #1: Call storage_engine->prepare() for all transactions in the group. The call itself is not persistent.
Phase #2: Call storage_engine->flush_logs(). This makes the effect of all Prepare operations from Phase #1 persistent.
Phase #3: Write and sync the binary log.
Phase #4: Call storage_engine->commit(). This does not need to be persistent.

MyRocks implements these phases.

== Group Commit with Binlog in MariaDB ==

MariaDB does not have these phases described above:
Phase #1: Call storage_engine->prepare() for all transactions in the group. The call itself is not persistent.
Phase #2: Call storage_engine->flush_logs(). This makes the effect of all Prepare operations from Phase #1 persistent.
A quote from Kristian's description at https://lists.launchpad.net/maria-developers/msg10832.html
So the idea is to do group prepare with the same group of transactions that will later group commit to the binlog. In MariaDB, this concept does not exist. Storage engine prepares are allowed to run in parallel and in any order compared to binlog commit.
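To make the contrast concrete, here is a pseudo-C++ sketch of the two call sequences (a schematic only; flush_logs() and the phase structure follow the text above, while the surrounding function and variable names are illustrative, not the real server API):

  // MySQL-style (fb/mysql-5.6): the group commit leader drives all phases.
  void mysql_style_group_commit(std::vector<THD*> &group) {
    for (THD *thd : group)
      engine->prepare(thd, /* sync= */ false);  // Phase #1: not persistent
    engine->flush_logs();                       // Phase #2: one fsync covers
                                                // every prepare in the group
    binlog->write_and_sync(group);              // Phase #3: one binlog fsync
    for (THD *thd : group)
      engine->commit(thd, /* sync= */ false);   // Phase #4: no fsync needed
  }

  // MariaDB-style: each connection thread makes its own prepare persistent
  // (the engine is expected to group these fsyncs internally); only the
  // commit step is serialized, to fix the commit order.
  void mariadb_style_commit(THD *thd) {
    engine->prepare(thd, /* sync= */ true);     // engine does its own group prepare
    binlog->group_write_and_sync(thd);          // joins the binlog commit group
    engine->commit_ordered(thd);                // called sequentially; must not
                                                // wait for any disk operation
  }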
Initially this looked like it could work for MyRocks. MyRocks has a group
commit implementation; both Prepare() and Commit() operations participate in
groups. However, when I implemented group commit this way, I found its
performance to be close to what one would expect if there were no commit
grouping and the commit() call flushed to disk.
https://jira.mariadb.org/browse/MDEV-11934 has the details.

== The issue ==

(I'm 95% certain about this. It's not 100% yet, but it is very likely.)

RocksDB's Group Write (see rocksdb/rocksdb/db/db_impl_write.cc,
DBImpl::WriteImpl function) handles both Prepare() and Commit() commands and
does the following:

1. Controls writing the committed data into the MemTable
2. Writes transactions to the WAL
3. Syncs the WAL

All three steps are done for the whole group. This has a consequence: a
Commit() operation that does not need to sync the WAL will still be delayed
if another operation in the group needs the WAL to be synced.

This delay has a disastrous effect, because the SQL layer tries to have the
same order of transactions in the storage engine and in the binlog. In order
to do that, it calls rocksdb_commit_ordered() for each transaction
sequentially. Delaying one transaction causes a delay of the entire
SQL-level commit group.

== Possible solutions ==

I am not sure what to do.

- Make the SQL layer's Group Commit implementation invoke hton->flush_logs()
explicitly, like MySQL does?

- Modify RocksDB so that Transaction::Commit(sync=false) does not use Group
Write? I am not sure if this is possible: Group Write is not only about
performance, it's about preventing concurrent MemTable writes. AFAIU one
cannot just tell a certain DBImpl::WriteImpl() call to not participate in
write groups and work as if there were no other activity.

- Modify RocksDB so that Transaction::Commit(sync=false) does not wait until
its write group finishes the WAL sync? This could be doable but is
potentially complex.

BR
 Sergei
--
Sergei Petrunia, Software Developer
MariaDB Corporation | Skype: sergefp | Blog: http://s.petrunia.net/blog
Sergey Petrunia <sergey@mariadb.com> writes:
== Some background ==
"group commit with binlog" feature needs to accomplish two goals:
1. Keep the binlog and the storage engine in sync. storage_engine->prepare(sync=true); binlog->write(sync=true); storage_engine->commit();
2. The second goal is to make operation performant. We need two coordinated disk
== Group Commit with Binlog in MySQL ==
MySQL (and fb/mysql-5.6 in particular) does in the following phases:
Phase #1: Call storage_engine->prepare() for all transactions in the group. The call itself is not persistent.
Phase #2: Call storage_engine->flush_logs(). This makes the effect of all Prepare operations from Phase #1 persistent.
Phase #3: Write and sync the binary log.
Phase #4: Call storage_engine->commit(). This does not need to be persistent.
Interesting. Phase #2 is a MySQL 5.7 feature; it is not in 5.6. Did Facebook backport this to their 5.6 tree? Or did MySQL 5.7 get this from Facebook's work?
MariaDB does not have these phases described above:
Phase #1: Call storage_engine->prepare() for all transactions in the group. The call itself is not persistent.
Phase #2: Call storage_engine->flush_logs(). This makes the effect of all Prepare operations from Phase #1 persistent.
Right, it combines them in a single "phase": storage_engine->prepare() is expected to be persistent and do its own group prepare, while MySQL 5.7 builds a list of transactions to group prepare.
RocksDB's Group Write (see rocksdb/rocksdb/db/db_impl_write.cc, DBImpl::WriteImpl function) handles both Prepare() and Commit() commands and does the following:
1. Controls writing the committed data into the MemTable
2. Writes transactions to the WAL
3. Syncs the WAL
Can you explain, at a high level, how RocksDB transaction visibility, lock release, and persistency work?
Is it like - once a transaction is written to the MemTable, it is visible to other transactions and its commit order is determined wrt. other transactions?
And persistency is guaranteed after write+sync of the WAL?
When are locks released in this sequence?
All three steps are done for the whole group. This has a consequence: a Commit() operation that does not need to sync the WAL will still be delayed if another operation in the group needs the WAL to be synced.
So do I understand correctly: DBImpl::WriteImpl() does both group commit
(in-memory) to the MemTable, as well as group commit (on disk) to the WAL?
And it uses the _same_ grouping of transactions for these two operations?

And so, the first commit_ordered() joins some prepare() that wants to sync
the WAL. And only once that sync is done can the next commit_ordered() start
- and it might easily end up joining a thread that recently completed its
delayed commit_ordered() and is now doing prepare for a new transaction.

Indeed, this is not acceptable performance-wise for commit_ordered(). It
must not wait for disk operations.

So why is this not a problem in the MySQL case? MySQL runs the
handlerton->commit() calls under LOCK_commit just like MariaDB does
commit_ordered(). My guess is that this is because in your tests,
WriteImpl() was _always_ called with WAL sync disabled. I wonder what would
happen if you were to run a mix of binlogged and non-binlogged (SET
sql_log_bin=0) transactions, where the latter would end up in
WriteImpl(sync=true); maybe a similar problem would occur.

So I think there is something interesting to look at here. If I understand
correctly, WriteImpl() tries to reduce contention between threads doing
commit (even in-memory commit) by making them group up and having a single
thread commit for multiple transactions, rather than jumping from one thread
to another for each commit.

This is a valid technique, but it seems to fit badly with what MySQL and
MariaDB are doing. In MariaDB, commits are _already_ grouped and done from a
single thread. In MySQL, _both_ prepares and commits are so grouped from a
single thread (though I think one thread can do group prepare in parallel
with another doing group commit).

So there seems to be an opportunity to simplify WriteImpl() for the MySQL
and MariaDB binlog case. If my understanding is correct, there will not be a
lot of writer grouping. (Of course, for the non-binlog case, the situation
is different.)

Maybe this comment is relevant here?

  // Requesting sync with concurrent_prepare_ is expected to be very rare. We
  // hance provide a simple implementation that is not necessarily efficient.
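To restate the suspected coupling as pseudo-code (a schematic of the observable behaviour only, not the actual DBImpl::WriteImpl() source; all helper names are made up):

  Status WriteImpl_schematic(WriteBatch *batch, bool sync) {
    Writer w(batch, sync);
    JoinBatchGroup(&w);               // become the group leader, or a follower
    if (w.is_leader) {
      WriteGroupToWAL();              // all participants' log records
      if (GroupWantsSync())           // true if ANY participant passed sync=true
        SyncWAL();                    // one fsync - but every participant,
                                      // including sync=false commits, waits for it
      InsertGroupIntoMemTable();
      WakeUpFollowers();
    } else {
      AwaitGroupCompletion(&w);       // a sync=false commit_ordered() blocks
                                      // here behind someone else's fsync
    }
    return Status::OK();
  }

If this matches the real code, the serialized rocksdb_commit_ordered() calls inherit the latency of whichever fsync their write group happens to contain.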
== Possible solutions ==
I am not sure what to do.
- Make the SQL layer's Group Commit implementation invoke hton->flush_logs() explicitly, like MySQL does?
That is an option, though I hope you will not do it like that. What does
flush_logs() have to do with making prepare() persistent?

You could instead add a new handlerton method group_prepare() or something.
If non-NULL, the storage engine may omit persistency in prepare(), but must
then in group_prepare() ensure that all prepares that have completed
prepare_ordered() are persistent. And if the method is non-NULL, the SQL
layer will call group_prepare() just before the binlog write (under a new
mutex LOCK_group_prepare that is chained before LOCK_log).

This way, the extra lock can be avoided for storage engines that do not need
group_prepare(). And storage engines have the freedom to implement
group_prepare() in a way that suits them.

Of course, RocksDB can just implement group_prepare() as flush_logs() to
make _all_ prepares persistent, just like in MySQL. So for RocksDB the
functionality is identical to MySQL's, while flexibility is preserved for
other storage engines.

However, it still seems to me that there is an opportunity to do better
here. For example, the upper layer could present to RocksDB the actual list
of transactions that need to be group prepared / group committed. Then
RocksDB could do them in a single writer without having to coordinate the
threads manually in WriteImpl().
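A hedged sketch of the group_prepare() hook proposed above (the hook, LOCK_group_prepare, and the iteration over engines are proposals and illustrations, not existing server symbols):

  struct handlerton {
    // ... existing methods ...
    int  (*prepare)(handlerton *hton, THD *thd, bool all); // may skip its fsync
                                                           // when group_prepare is set
    void (*group_prepare)(handlerton *hton);  // NULL, or: make persistent every
                                              // prepare that has completed
                                              // prepare_ordered()
  };

  // In the binlog group commit leader, just before the binlog write:
  mysql_mutex_lock(&LOCK_group_prepare);         // chained before LOCK_log
  for (handlerton *hton : engines_in_this_group) // illustrative iteration
    if (hton->group_prepare)
      hton->group_prepare(hton);                 // RocksDB could sync its WAL here
  mysql_mutex_unlock(&LOCK_group_prepare);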
- Modify RocksDB so that Transaction::Commit(sync=false) does not use Group Write? I am not sure if this is possible: Group Write is not only about performance, it's about preventing concurrent MemTable writes. AFAIU one cannot just tell a certain DBImpl::WriteImpl() call to not participate in write groups and work as if there were no other activity.
What about if rocksdb got the list of transactions to commit (to memtable, sync=false) explicitly, rather than as individual commit() or commit_ordered() calls? Then it could commit them all in a single writer, which should be more efficient. And similar for prepare? In the current MySQL (facebook patch) code, isn't it the case that each commit() has to create a new writer and write a single commit to the memtable individually, while all of these calls are in fact made from a single thread, from an explicit, known list? This does not seem optimal.
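As pseudo-code, the list-based variant might look like this (CommitGroup() is hypothetical; no such RocksDB API exists today):

  // One writer, one WAL append (sync=false), one MemTable insertion pass,
  // committing the transactions in the given (binlog) order:
  Status TransactionDB::CommitGroup(const std::vector<Transaction*> &txns);

  // The upper layer's per-transaction commit_ordered() loop would then
  // collapse into a single call from the group commit leader:
  //   txn_db->CommitGroup(transactions_in_binlog_commit_order);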
- Modify RocksDB so that Transaction::Commit(sync=false) does not wait until its write group finishes WAL sync? This could be doable but is potentially complex.
That should probably be done so that a write group would only write to the
memtable (and the in-memory WAL buffer?). After that, it would release all
non-syncing participants, and the remaining syncing participants could form
a new write group to do the sync independently.

Of course, if a write group rarely syncs, this is of little benefit. From my
limited understanding of the code, flush_logs(), which ends up in SyncWAL(),
does not use a write group.

So this already ended up as a huge email, but I thought some background on
commit_ordered() could also help here. Note that in MySQL >=5.6, their
commit() is very similar to commit_ordered().

commit_ordered() exists for three main purposes, if I recall correctly:

1. To synchronise the commit order in the binlog and the storage engine.
This ensures that if a physical backup is taken of the storage engine and
used to provision a slave, the storage engine state corresponds to a unique
point in the binlog (MySQL has this).

2. To make START TRANSACTION WITH CONSISTENT SNAPSHOT actually correctly
synchronise snapshots between multiple storage engines (MySQL does not have
this, I think).

3. To avoid having to do an extra fsync() for every commit, on top of the
one for prepare and the one for binlog write (MySQL has this).

I think those are the main reasons for commit_ordered() (I might have
forgotten some). For this problem, I suppose (3) is the main interest?

MySQL handles (3) by stopping all transactions around binlog rotate and
doing a flush_logs(). It needs this because after binlog rotation, binlog
crash recovery has only an empty binlog, so _all_ transactions must be
durably committed at this point.

MariaDB avoids this stall around binlog rotate. Instead it extends binlog
crash recovery to be able to look into multiple binlog files. So there is no
need to force commits to disk around binlog rotate. To eventually be able to
drop binlog files, there is the binlog_checkpoint_request() mechanism. This
allows the storage engine to inform the upper layer when all the
transactions in a binlog have ended up durably committed, in the normal
course of action of the storage engine.

So to get just (3), RocksDB could implement no commit_ordered(), or perhaps
an empty commit_ordered(). And then also implement
binlog_checkpoint_request() to record the latest prepared transaction at
that point - and when the WAL is later synced, reply back to the upper layer
to allow it to release the old binlog file. This seems doable without
support for quickly committing a transaction to memory, which the current
RocksDB WriteImpl() seems poor at doing simultaneously with persistent
prepare().

Though I think it would be good if the full functionality of
commit_ordered() were implemented in RocksDB. Passing down into RocksDB
explicitly the list of transactions to group-prepare or group-commit sounds
like an interesting idea that could potentially benefit performance.

Thoughts?

Hope this helps. I wanted to present some background on this feature. Please
let me know of any details you want explained or discussed, and I will try
to answer them briefly and to the point.

 - Kristian.
Kristian Nielsen <knielsen@knielsen-hq.org> writes:
single thread. In MySQL, _both_ prepares and commits are so grouped from a single thread (though I think one thread can do group prepare in parallel with another doing group commit).
Ehm, this is not true, of course. The prepare() calls are from multiple threads in parallel. Just the flush_logs(hton, true) call is from a single thread for a whole group of transactions.
This way, the extra lock can be avoided for storage engines that do not need group_prepare(). And storage engines have freedom to implement
And I do not think this will work either: all binlog commits must use the
same lock sequence, so that a later one not taking the new lock cannot race
ahead of another. It is important to use a separate lock though, so that one
storage engine prepare fsync can happen in parallel with one binlog write
fsync.

It still seems useful if the upper layer could pass down a list of the
entire group of transactions being group committed (or prepared). I think
prepare_ordered() can just be removed; it ended up never being useful. And
maybe a group_commit_ordered(list_of_transactions) can be added as an
alternative to commit_ordered().

A new group_prepare_ordered(list_of_transactions) might help the performance
issue for RocksDB. It really should be made async though, like
group_prepare_ordered_start(cookie, list) and
group_prepare_ordered_complete(cookie) or whatever. With the MySQL "API", it
seems it is impossible for two participating storage engines to persist
their prepares in parallel, which isn't great for performance.

The MySQL flush_logs() during prepare really feels like a gross hack. It
doesn't seem right to run fsync()'s single-threaded under a lock...

 - Kristian.
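To pin down the async shape sketched in the message above (all names hypothetical, as stated there):

  // The leader starts the engine's group prepare without blocking on the fsync:
  void *cookie = hton->group_prepare_ordered_start(hton, txn_list);

  // ... the binlog write of the same group, and other engines' prepare
  // fsyncs, can proceed in parallel here ...

  // The prepares must be durable before the binlog sync makes the group
  // "committed", so the leader waits here before the binlog sync completes:
  hton->group_prepare_ordered_complete(hton, cookie);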
Kristian, Sergey, hello. While I'm still studying the MariaDB BGC (binlog group commit), let me jump in with a few comments.
Sergey Petrunia <sergey@mariadb.com> writes:
== Some background ==
"group commit with binlog" feature needs to accomplish two goals:
1. Keep the binlog and the storage engine in sync. storage_engine->prepare(sync=true); binlog->write(sync=true); storage_engine->commit();
2. The second goal is to make operation performant. We need two coordinated disk
== Group Commit with Binlog in MySQL ==
MySQL (and fb/mysql-5.6 in particular) does in the following phases:
Phase #1: Call storage_engine->prepare() for all transactions in the group. The call itself is not persistent.
Phase #2: Call storage_engine->flush_logs(). This makes the effect of all Prepare operations from Phase #1 persistent.
Phase #3: Write and sync the binary log.
Phase #4: Call storage_engine->commit(). This does not need to be persistent.
Interesting. Phase #2 is a MySQL 5.7 feature; it is not in 5.6. Did Facebook backport this to their 5.6 tree? Or did MySQL 5.7 get this from Facebook's work?
Facebook contributed, indeed:

  Bug #73202  write/sync redo log before flush thread cache to binlog
  Submitted:  5 Jul 2014 0:31    Modified: 24 Nov 2014 2:01
  Reporter:   zhai weixiang (OCA)
MariaDB does not have these phases described above:
Phase #1: Call storage_engine->prepare() for all transactions in the group. The call itself is not persistent.
Phase #2: Call storage_engine->flush_logs(). This makes the effect of all Prepare operations from Phase #1 persistent.
Right, it combines them in a single "phase": storage_engine->prepare() is expected to be persistent and do its own group prepare, while MySQL 5.7 builds a list of transactions to group prepare.
RocksDB's Group Write (see rocksdb/rocksdb/db/db_impl_write.cc, DBImpl::WriteImpl function) handles both Prepare() and Commit() commands and does the following:
1. Controls writing the committed data into the MemTable
2. Writes transactions to the WAL
3. Syncs the WAL
Can you explain, at a high level, how RocksDB transaction visibility, lock release, and persistency works?
Is it like - once a transaction is written to the MemTable, it is visible to other transactions and its commit order is determined wrt. other transactions?
And persistency is guaranteed after write+sync of the WAL?
When are locks released in this sequence?
All three steps are done for the whole group. This has a consequence: a Commit() operation that does not need to sync the WAL will still be delayed if another operation in the group needs the WAL to be synced.
So do I understand correctly: DBImpl::WriteImpl() does both group commit
Sergey definitely needed to spell this out. One may even think of a SELECT that suffers a delay... A fair guess?
(in-memory) to the MemTable, as well as group commit (on disk) to the WAL? And it uses the _same_ grouping of transactions for these two operations?
And so, the first commit_ordered() joins some prepare() that wants to sync the WAL. And only once that sync is done can the next commit_ordered() start - and it might easily end up joining a thread that recently completed its delayed commit_ordered() and is now doing prepare for a new transaction.
Indeed, this is not acceptable performance-wise for commit_ordered(). It must not wait for disk operations.
So why is this not a problem in the MySQL case? MySQL runs the handlerton->commit() calls under LOCK_commit just like MariaDB does commit_ordered().
Sounds like the 5.7 (5.6-fb) flush_logs() call by the BGC leader triggers the necessary group-prepare coordination. Specifically, I would expect (I have yet to look at Sergey's branch) one fsync() for the whole group, done by the leader.
My guess is that this is because in your tests, WriteImpl() was _always_ called with WAL sync disabled. I wonder what would happen if you were to run a mix of binlogged and non-binlogged (SET sql_log_bin=0) transactions, where the latter would end up in WriteImpl(sync=true); maybe a similar problem would occur.
So I think there is something interesting to look at here. If I understand correctly, WriteImpl() tries to reduce contention between threads doing commit (even in-memory commit) by making them group up and having a single thread commit for multiple transactions, rather than jumping from one thread to another for each commit.
This is a valid technique, but it seems to fit badly with what MySQL and MariaDB are doing. In MariaDB, commits are _already_ grouped and done from a single thread. In MySQL, _both_ prepares and commits are so grouped from a single thread (though I think one thread can do group prepare in parallel with another doing group commit).
So there seems to be an opportunity to simplify WriteImpl() for the MySQL and MariaDB binlog case. If my understanding is correct, there will not be a lot of writer grouping.
(Of course for the non-binlog case, the situation is different).
Maybe this comment is relevant here?
// Requesting sync with concurrent_prepare_ is expected to be very rare. We
// hance provide a simple implementation that is not necessarily efficient.
== Possible solutions ==
I am not sure what to do.
- Make the SQL layer's Group Commit implementation invoke hton->flush_logs() explicitly, like MySQL does?
That is an option, though I hope you will not do it like that. What does flush_logs() have to do with making prepare() persistent?
You could instead add a new handlerton method group_prepare() or something. If non-NULL, the storage engine may omit persistency in prepare(), but must then in group_prepare() ensure that all prepares that have completed prepare_ordered() are persistent. And if the method is non-NULL, the SQL layer will call group_prepare() just before the binlog write (under a new mutex LOCK_group_prepare that is chained before LOCK_log).
This way, the extra lock can be avoided for storage engines that do not need group_prepare(). And storage engines have the freedom to implement group_prepare() in a way that suits them.
Of course, RocksDB can just implement group_prepare() as flush_logs() to make _all_ prepares persistent, just like in MySQL. So for RocksDB the functionality is identical to MySQL's, while flexibility is preserved for other storage engines.
Indeed.
However, it still seems to me that there is an opportunity to do better here. For example, the upper layer could present to RocksDB the actual list of transactions that need to be group prepared / group committed. Then RocksDB could do them in a single writer without having to coordinate the threads manually in WriteImpl().
- Modify RocksDB so that Transaction::Commit(sync=false) does not use Group Write? I am not sure if this is possible: Group Write is not only about performance, it's about preventing concurrent MemTable writes. AFAIU one cannot just tell a certain DBImpl::WriteImpl() call to not participate in write groups and work as if there were no other activity.
What about if rocksdb got the list of transactions to commit (to memtable, sync=false) explicitly, rather than as individual commit() or commit_ordered() calls? Then it could commit them all in a single writer, which should be more efficient. And similar for prepare?
In the current MySQL (facebook patch) code, isn't it the case that each commit() has to create a new writer and write a single commit to the memtable individually, while all of these calls are in fact made from a single thread, from an explicit, known list? This does not seem optimal.
- Modify RocksDB so that Transaction::Commit(sync=false) does not wait until its write group finishes WAL sync? This could be doable but is potentially complex.
That should probably be done so that a write group would only write to the memtable (and the in-memory WAL buffer?). After that, it would release all non-syncing participants, and the remaining syncing participants could form a new write group to do the sync independently.
Of course, if a write group rarely syncs, this is of little benefit. From my limited understanding of the code, flush_logs() which ends up in SyncWAL() does not use a write group.
So this already ended up as a huge email, but I thought some background on commit_ordered() could also help here. Note that in MySQL >=5.6, their commit() is very similar to commit_ordered().
commit_ordered exists for three main purposes, if I recall correctly:
1. To synchronise commit order in binlog and storage engine. This ensures that if a physical backup is taken of the storage engine and used to provision a slave, the storage engine state corresponds to a unique point in the binlog (MySQL has this).
2. To make START TRANSACTION WITH CONSISTENT SNAPSHOT actually correctly synchronise snapshots between multiple storage engines (MySQL does not have this, I think).
(Off-topic, but what is it, anyway? A multi-engine transaction with this specific ISOLATION level?)
3. To avoid having to do an extra fsync() for every commit, on top of the one for prepare and the one for binlog write (MySQL has this).
I think those are the main reasons for commit_ordered() (I might have forgotten some).
For this problem, I suppose (3) is the main interest?
MySQL handles (3) by stopping all transactions around binlog rotate and doing a flush_logs().
Maybe I am missing something, but why binlog rotation? It is not a common case. Indeed, MySQL BGC reduces the number of fsync() calls to two, done by the (flush stage) leader. As to rotation, it's a specific branch of MYSQL_BIN_LOG::ordered_commit() where a rotator thread contends for the flush stage mutex, eventually gains it (which may let a few more groups be binlogged into the old file being rotated), and performs the rotation.
It needs this because after binlog rotation, binlog crash recovery has only an empty binlog, so _all_ transactions must be durably committed at this point.
MariaDB avoids this stall
You must have a use case in mind which I can't see...
around binlog rotate. Instead it extends binlog crash recovery to be able to look into multiple binlog files. So there is no need to force commits to disk around binlog rotate.
To eventually be able to drop binlog files, there is the binlog_checkpoint_request() mechanism. This allows the storage engine to inform the upper layer when all the transactions in a binlog have ended up durably committed, in the normal course of action of the storage engine.
So to get just (3), RocksDB could implement no commit_ordered(), or perhaps an empty commit_ordered(). And then also implement binlog_checkpoint_request() to record the latest prepared transaction at that point - and when the WAL is later synced, reply back to the upper layer to allow it to release the old binlog file. This seems doable without support for quickly committing a transaction to memory, which the current RocksDB WriteImpl() seems poor at doing simultaneously with persistent prepare().
Though I think it would be good if the full functionality of commit_ordered() were implemented in RocksDB. Passing down into RocksDB explicitly the list of transactions to group-prepare or group-commit sounds like an interesting idea that could potentially benefit performance.
Thoughts?
Let me reserve my night for more :-).
Hope this helps. I wanted to present some background on this feature. Please let me know of any details you want explained or discussed, and I will try to answer them briefly and to the point.
- Kristian.
Cheers, Andrei
andrei.elkin@pp.inet.fi writes:
2. To make START TRANSACTION WITH CONSISTENT SNAPSHOT actually correctly synchronise snapshots between multiple storage engines (MySQL does not have this, I think).
(Off-topic, but what is it, anyway? A multi-engine transaction with this specific ISOLATION level?)
https://mariadb.com/kb/en/the-mariadb-library/enhancements-for-start-transac...

So it is like a REPEATABLE READ across engines; applications can get a consistent view of cross-engine transactions. It also allows doing a non-blocking mysqldump without FLUSH TABLES WITH READ LOCK.
3. To avoid having to do an extra fsync() for every commit, on top of the one for prepare and the one for binlog write (MySQL has this).
MySQL handles (3) by stopping all transactions around binlog rotate and doing a flush_logs().
Maybe I am missing something, but why binlog rotation? It is not a common case. Indeed, MySQL BGC reduces the number of fsync() calls to two, done by the (flush stage) leader. As to rotation, it's a specific branch of MYSQL_BIN_LOG::ordered_commit() where a rotator thread contends for the flush stage mutex, eventually gains it (which may let a few more groups be binlogged into the old file being rotated), and performs the rotation.
It used to be that there were _three_ fsyncs for every commit. The _only_
reason the fsync in commit was needed was to ensure that binlog crash
recovery would still work after a binlog rotation. Which was kind of silly.

So _something_ needed to be done around binlog rotation, to ensure that all
transactions are durably committed in storage engines before they are no
longer available to binlog crash recovery.

If I understand correctly, MySQL ensures this by temporarily stopping binlog
writes, calling flush_logs() in all (?) engines (with the semantics that
flush_logs() must make all prior commit()'s durable), and only then allowing
new writes to the new binlog. I am not sure how MySQL ensures that all
commit() calls complete before the flush_logs() call; maybe it takes both
the LOCK_commit and LOCK_log mutexes around binlog rotation. The result is
that binlog crash recovery is always possible from only one binlog file.

MariaDB instead extends binlog crash recovery to consider multiple binlog
files, if necessary. Then nothing special is needed during binlog rotation.
But some "garbage collection" is needed to eventually release old binlog
files.
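A simplified sketch of that "garbage collection" handshake, following the binlog_checkpoint_request() mechanism described earlier in the thread (schematic only; the engine-side bookkeeping below is illustrative, not actual server code):

  // Upper layer, after rotating from binlog file N to N+1: ask each engine
  // to report when everything binlogged up to now is durably committed.
  for (handlerton *hton : transactional_engines)
    hton->checkpoint_request(hton, cookie_for_binlog_N);

  // Engine side (illustrative): remember the current WAL end position; once
  // the WAL is later synced past it in the normal course of events, answer:
  void on_wal_synced(uint64_t synced_pos) {
    for (PendingCheckpoint &req : pending_checkpoints)
      if (req.wal_pos <= synced_pos)
        checkpoint_notify(req.cookie);  // binlog file N may now be purged
  }

No extra fsync is forced at rotation time; the old binlog file simply becomes purgeable whenever the engine's WAL sync catches up.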
MariaDB avoids this stall
You must have a use case in mind which I can't see...
I am not sure one is better than the other. MariaDB avoids flush_logs(), though in current storage engines it may not matter much. The MySQL approach is arguably simpler code, though it seems quite an abuse of flush_logs(). The MySQL approach was not public when the MariaDB approach was implemented. - Kristian.
Kristian,
andrei.elkin@pp.inet.fi writes:
2. To make START TRANSACTION WITH CONSISTENT SNAPSHOT actually correctly synchronise snapshots between multiple storage engines (MySQL does not have this, I think).
(Off-topic, but what is it, anyway? A multi-engine transaction with this specific ISOLATION level?)
https://mariadb.com/kb/en/the-mariadb-library/enhancements-for-start-transac...
So it is like a REPEATABLE READ across engines; applications can get a consistent view of cross-engine transactions. It also allows doing a non-blocking mysqldump without FLUSH TABLES WITH READ LOCK.
3. To avoid having to do an extra fsync() for every commit, on top of the one for prepare and the one for binlog write (MySQL has this).
MySQL handles (3) by stopping all transactions around binlog rotate and doing a flush_logs().
Maybe I am missing something, but why binlog rotation? It is not a common case. Indeed, MySQL BGC reduces the number of fsync() calls to two, done by the (flush stage) leader. As to rotation, it's a specific branch of MYSQL_BIN_LOG::ordered_commit() where a rotator thread contends for the flush stage mutex, eventually gains it (which may let a few more groups be binlogged into the old file being rotated), and performs the rotation.
It used to be that there were _three_ fsyncs for every commit. The _only_ reason the fsync in commit was needed was to ensure that binlog crash recovery would still work after a binlog rotation. Which was kind of silly.
This one must be https://github.com/mysql/mysql-server/commit/35adf21bb63a336c76efdad6c461016...
So _something_ needed to be done around binlog rotation, to ensure that all transactions are durably committed in storage engines before they are no longer available to binlog crash recovery.
If I understand correctly, MySQL ensures this by temporarily stopping binlog writes, calling flush_logs() in all (?) engines (with the semantics that flush_logs() must make all prior commit()'s durable), and only then allowing new writes to the new binlog. I am not sure how MySQL ensures that all commit() calls complete before the flush_logs() call; maybe it takes both the LOCK_commit and LOCK_log mutexes around binlog rotation.
Almost: LOCK_log and LOCK_xids - I've checked it out, 'cos I had really forgotten it. MySQL employs a sort of xid "unlogging" technique that predates BGC. The rotator first (pthread_cond-)waits for all xids flushed to the binlog to get committed, and then, holding the two mutexes, ha_flush_logs() is issued right before the new log file is set up.
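In pseudo-code, the rotation path as described might look like this (a schematic reconstruction of the description above, not the actual MySQL source):

  mysql_mutex_lock(&LOCK_log);
  mysql_mutex_lock(&LOCK_xids);
  while (prep_xids > 0)                  // xids flushed to the binlog but
    mysql_cond_wait(&prep_xids_cond,     // not yet committed in the engine
                    &LOCK_xids);
  ha_flush_logs();                       // make all prior commits durable
  open_new_binlog_file();                // recovery now needs only one file
  mysql_mutex_unlock(&LOCK_xids);
  mysql_mutex_unlock(&LOCK_log);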
The result is that binlog crash recovery is always possible from only one binlog file.
MariaDB instead extends binlog crash recovery to consider multiple binlog files, if necessary. Then nothing special is needed during binlog rotation. But some "garbage collection" is needed to eventually release old binlog files.
MariaDB avoids this stall
You must have a use case in mind which I can't see...
I am not sure one is better than the other. MariaDB avoids flush_logs(), though in current storage engines it may not matter much. The MySQL approach is arguably simpler code, though it seems quite an abuse of flush_logs().
It can only be good to have two well-explored methods around.
The MySQL approach was not public when the MariaDB approach was implemented.
True. Thanks a lot for talking and explaining these fine bits!

Andrei