Kristian, Sergey, hello. While I'm still studying the MariaDB BGC (binlog group commit), let me jump in with a few comments.
Sergey Petrunia <sergey@mariadb.com> writes:
== Some background ==
"group commit with binlog" feature needs to accomplish two goals:
1. Keep the binlog and the storage engine in sync:

    storage_engine->prepare(sync=true); binlog->write(sync=true); storage_engine->commit();

2. Make the operation performant. We need the two coordinated disk syncs (one for the engine prepare, one for the binlog) to be done once per group, not once per transaction.
== Group Commit with Binlog in MySQL ==
MySQL (and fb/mysql-5.6 in particular) does this in the following phases:
Phase #1: Call storage_engine->prepare() for all transactions in the group. The call itself is not persistent.
Phase #2: Call storage_engine->flush_logs(). This makes the effect of all Prepare operations from Phase #1 persistent.
Phase #3: Write and sync the binary log.
Phase #4: Call storage_engine->commit(). This does not need to be persistent.
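To make the four phases concrete, here is a minimal pseudo-C++ sketch of what the group commit leader does. All type and function names below are illustrative stand-ins, not the actual MySQL code:

    #include <vector>

    // Minimal stand-ins for the server objects; names are illustrative only.
    struct Txn {};
    struct StorageEngine {
      void prepare(Txn *, bool /*sync*/) {} // Phase #1: may skip its own fsync
      void flush_logs() {}                  // Phase #2: one fsync persists all prepares
      void commit(Txn *, bool /*sync*/) {}  // Phase #4: no fsync needed
    };
    struct Binlog {
      void write(Txn *) {}                  // Phase #3a: buffered binlog write
      void sync() {}                        // Phase #3b: one binlog fsync per group
    };

    // What the leader does for a whole group: two fsyncs in total (one inside
    // flush_logs(), one inside binlog.sync()), regardless of group size.
    void group_commit_leader(StorageEngine &engine, Binlog &binlog,
                             std::vector<Txn *> &group) {
      for (Txn *t : group) engine.prepare(t, /*sync=*/false);  // Phase #1
      engine.flush_logs();                                     // Phase #2
      for (Txn *t : group) binlog.write(t);                    // Phase #3
      binlog.sync();
      for (Txn *t : group) engine.commit(t, /*sync=*/false);   // Phase #4
    }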
Interesting. Phase #2 is a mysql 5.7 feature, it is not in 5.6. Did Facebook backport this to their 5.6 tree? Or did MySQL 5.7 get this from Facebook work?
Facebook contributed, indeed:

    Bug #73202: write/sync redo log before flush thread cache to binlog
    Submitted: 5 Jul 2014, Modified: 24 Nov 2014
    Reporter: zhai weixiang (OCA)
MariaDB does not have these phases described above:
Phase #1: Call storage_engine->prepare() for all transactions in the group. The call itself is not persistent.
Phase #2: Call storage_engine->flush_logs(). This makes the effect of all Prepare operations from Phase #1 persistent.
Right, MariaDB combines them into a single "phase": storage_engine->prepare() is expected to be persistent and to do its own group prepare, whereas MySQL 5.7 builds an explicit list of transactions to group-prepare.
RocksDB's Group Write (see rocksdb/rocksdb/db/db_impl_write.cc, DBImpl::WriteImpl function) handles both Prepare() and Commit() commands and does the following:
1. Controls writing the committed data into the MemTable
2. Writes the transactions to the WAL
3. Syncs the WAL
Can you explain, at a high level, how RocksDB transaction visibility, lock release, and persistency works?
Is it like - once a transaction is written to the MemTable, it is visible to other transactions and its commit order is determined wrt. other transactions?
And persistency is guaranteed after write+sync of the WAL?
When are locks released in this sequence?
All three steps are done for the whole group. This has a consequence: a Commit() operation that does not need to sync the WAL will still be delayed if another operation in the group needs the WAL to be synced.
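As a rough illustration of those write-group mechanics, and of why a non-syncing commit can be held up, here is a heavily simplified sketch. The real code coordinates threads via WriteThread::JoinBatchGroup and friends; every name below is a stand-in, not the actual RocksDB API:

    #include <vector>

    struct Writer { bool sync = false; bool done = false; };

    static bool JoinAsLeader(Writer *) { return true; }               // stub
    static std::vector<Writer *> CollectFollowers(Writer *w) { return {w}; }
    static void WriteGroupToWAL(const std::vector<Writer *> &) {}     // stub
    static void SyncWAL() {}                                          // stub
    static void InsertIntoMemTable(const std::vector<Writer *> &) {}  // stub

    void WriteImplSketch(Writer *w) {
      if (!JoinAsLeader(w))                   // followers block until the
        return;                               // leader marks them done

      std::vector<Writer *> group = CollectFollowers(w);

      WriteGroupToWAL(group);                 // one WAL append for the group

      bool need_sync = false;                 // sync if ANY member asked for
      for (Writer *m : group)                 // it: this is why a sync=false
        need_sync |= m->sync;                 // commit can still wait on
      if (need_sync)                          // someone else's fsync
        SyncWAL();

      InsertIntoMemTable(group);              // group becomes visible in memory
      for (Writer *m : group) m->done = true; // release all participants
    }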
So do I understand correctly: DBImpl::WriteImpl() does both group commit (in-memory) to the MemTable, as well as group commit (on disk) to the WAL? And it uses the _same_ grouping of transactions for these two operations?

Sergey definitely needed to spell it out. One may even think of a SELECT that suffers a delay.. A fair guess?
And so, the first commit_ordered() joins some prepare() that wants to sync the WAL. And only once that sync is done can the next commit_ordered() start - and it might easily end up joining a thread that recently completed its delayed commit_ordered() and is now doing prepare for a new transaction.
Indeed, this is not acceptable performance-wise for commit_ordered(). It must not wait for disk operations.
So why is this not a problem in the MySQL case? MySQL runs the handlerton->commit() calls under LOCK_commit just like MariaDB does commit_ordered().
Sounds like the 5.7 (5.6-fb) flush_logs() call by the BGC leader triggers the necessary group-prepare coordination. Specifically, I would expect (I have yet to look at Sergey's branch) one fsync() for the whole group, done by the leader.
My guess is that this is because in your tests, WriteImpl() was _always_ called with WAL sync disabled. I wonder what would happen if you were to run a mix of binlogged and non-binlogged (SET sql_log_bin=0) transactions, where the latter would end up in WriteImpl(sync=true); maybe a similar problem would occur.
So I think there is something interesting to look at here. If I understand correctly, WriteImpl() tries to reduce contention between threads doing commit (even in-memory commit) by making them group up and having a single thread commit for multiple transactions, rather than jumping from one thread to another for each commit.
This is a valid technique, but it seems to fit badly with what MySQL and MariaDB are doing. In MariaDB, commits are _already_ grouped and done from a single thread. In MySQL, _both_ prepares and commits are grouped like that and done from a single thread (though I think one thread can do group prepare in parallel with another doing group commit).
So there seems to be an opportunity to simplify WriteImpl() for the MySQL and MariaDB binlog case. If my understanding is correct, there will not be a lot of writer grouping.
(Of course for the non-binlog case, the situation is different).
Maybe this comment is relevant here?
    // Requesting sync with concurrent_prepare_ is expected to be very rare. We
    // hence provide a simple implementation that is not necessarily efficient.
== Possible solutions ==
I am not sure what to do.
- Make the SQL layer's Group Commit implementation invoke hton->flush_logs() explicitly, like MySQL does?
That is an option, though I hope you will not do it like that. What does flush_logs() have to do with making prepare() persistent?
You could instead add a new handlerton method, group_prepare() or something. If non-NULL, the storage engine may omit persistency in prepare(), but must then in group_prepare() ensure that all prepares that have completed prepare_ordered() are persistent. And if the method is non-NULL, the SQL layer will call group_prepare() just before the binlog write (under a new mutex LOCK_group_prepare that is chained before LOCK_log).
This way, the extra lock can be avoided for storage engines that do not need group_prepare(). And storage engines have the freedom to implement group_prepare() in a way that suits them.
Of course, Rocksdb can just implement group_prepare() as flush_logs() to make _all_ prepares persistent, just like in MySQL. So for rocksdb the functionality is identical to MySQL's, while flexibility is preserved for other storage engines.
Indeed.
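For concreteness, a minimal sketch of how that could look. The hook and mutex names are the ones proposed above; all surrounding declarations are simplified assumptions, not real MariaDB code:

    // Hypothetical group_prepare() handlerton hook and its call site.
    struct handlerton_sketch {
      // If set, prepare() may skip persistency; group_prepare() must then make
      // persistent every prepare that has completed prepare_ordered().
      void (*group_prepare)(handlerton_sketch *hton) = nullptr;
    };

    void binlog_write_sketch(handlerton_sketch *hton) {
      // lock(LOCK_group_prepare);     // new mutex, chained before LOCK_log
      if (hton->group_prepare)         // skipped entirely for engines whose
        hton->group_prepare(hton);     // prepare() is already persistent
      // unlock(LOCK_group_prepare);
      // lock(LOCK_log); write + sync the binlog; unlock(LOCK_log); ...
    }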
However, it still seems to me that there is opportunity to do better here. For example, the upper layer could present to Rocksdb the actual list of transactions that need to be group prepared / group committed. Then Rocksdb could do them in a single writer without having to coordinate the threads manually in WriteImpl().
- Modify RocksDB so that Transaction::Commit(sync=false) does not use Group Write? I am not sure if this is possible: Group Write is not only about performance, it's about preventing concurrent MemTable writes. AFAIU one cannot just tell a certain DBImpl::WriteImpl() call not to participate in write groups and work as if there was no other activity.
What about if rocksdb got the list of transactions to commit (to memtable, sync=false) explicitly, rather than as individual commit() or commit_ordered() calls? Then it could commit them all in a single writer, which should be more efficient. And similar for prepare?
In the current MySQL (facebook patch) code, isn't it the case that each commit() has to create a new writer and write each single commit to the memtable individually? While all of these calls in fact come from a single thread, from an explicit, known list. This does not seem optimal.
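A sketch of what such a batched interface might look like. RocksDB has no such API today as far as I know, so everything below is hypothetical:

    #include <vector>

    // The server hands RocksDB the whole already-ordered group instead of N
    // separate Commit() calls; one writer, one pass, no per-transaction
    // write-group negotiation between threads.
    struct TxnHandle {};

    static void AppendCommitMarkersToWAL(const std::vector<TxnHandle *> &) {} // stub
    static void ApplyToMemTable(const std::vector<TxnHandle *> &) {}          // stub
    static void SyncWAL() {}                                                  // stub

    void CommitGroup(const std::vector<TxnHandle *> &group, bool sync) {
      AppendCommitMarkersToWAL(group); // one WAL append for all commit markers
      ApplyToMemTable(group);          // one MemTable insertion pass
      if (sync)                        // for the binlog case this is sync=false:
        SyncWAL();                     // persistency was already paid at prepare
    }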
- Modify RocksDB so that Transaction::Commit(sync=false) does not wait until its write group finishes WAL sync? This could be doable but is potentially complex.
That should probably be done so that a write group would only write to the memtable (and the in-memory WAL buffer?). After that, it would release all non-syncing participants, and the remaining syncing participants could form a new write group to do the sync independently.
Of course, if a write group rarely syncs, this is of little benefit. From my limited understanding of the code, flush_logs() which ends up in SyncWAL() does not use a write group.
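A sketch of that split, reusing the illustrative Writer/SyncWAL stand-ins from the earlier sketch:

    #include <vector>

    struct Writer { bool sync = false; bool done = false; };
    static void SyncWAL() {}  // stub

    // Called after the whole group has been written to the memtable and the
    // in-memory WAL buffer: release sync=false members immediately, and let
    // only the sync=true members pay for the fsync.
    void FinishWriteGroupSketch(std::vector<Writer *> &group) {
      std::vector<Writer *> syncers;
      for (Writer *m : group) {
        if (m->sync)
          syncers.push_back(m); // must wait for the WAL fsync
        else
          m->done = true;       // released right away: no disk wait for a
      }                         // sync=false commit_ordered()
      if (!syncers.empty()) {
        SyncWAL();              // one fsync for all remaining members
        for (Writer *m : syncers) m->done = true;
      }
    }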
So this already ended up as a huge email, but I thought some background on commit_ordered() could also help here. Note that in MySQL >=5.6, their commit() is very similar to commit_ordered().
commit_ordered exists for three main purposes, if I recall correctly:
1. To synchronise commit order in binlog and storage engine. This ensures that if a physical backup is taken of the storage engine and used to provision a slave, the storage engine state corresponds to a unique point in the binlog (MySQL has this).
2. To make START TRANSACTION WITH CONSISTENT SNAPSHOT actually correctly synchronise snapshots between multiple storage engines (MySQL does not have this, I think).
(Off-topic, but what is it, anyway? A multi-engine transaction with this specific isolation level?)
3. To avoid having to do an extra fsync() for every commit, on top of the one for prepare and the one for binlog write (MySQL has this).
I think those are the main reasons for commit_ordered() (I might have forgotten some).
For this problem, I suppose (3) is the main interest?
MySQL handles (3) by stopping all transactions around binlog rotate and doing a flush_logs().
Maybe I am missing something, but why binlog rotation? It is not a common case. Indeed, MySQL BGC reduces the number of fsync() calls to two, done by the (flush stage) leader. As to rotation, it's a specific branch of MYSQL_BIN_LOG::ordered_commit() where a rotator thread contends for the flush stage mutex, eventually gains it (which may let a few more groups be binlogged into the old file being rotated), and performs the rotation.
It needs this because after binlog rotation, binlog crash recovery has only an empty binlog, so _all_ transactions must be durably committed at this point.
MariaDB avoids this stall around binlog rotate. Instead it extends binlog crash recovery to be able to look into multiple binlog files, so there is no need to force commits to disk around binlog rotate.

You must have a use case in mind which I can't see..
To eventually be able to drop binlog files, there is the binlog_checkpoint_request() mechanism. This allows the storage engine to inform the upper layer when all the transactions in a binlog have ended up durably committed, in the normal course of action of the storage engine.
So to get just (3), RocksDB could simply not implement commit_ordered(), or perhaps implement an empty one. And then also implement binlog_checkpoint_request() to record the latest prepared transaction at that point, and when the WAL is later synced, reply back to the upper layer to allow it to release the old binlog file. This seems doable without support for quickly committing a transaction to memory, which the current RocksDB WriteImpl() seems poor at doing simultaneously with persistent prepare().
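A sketch of how that could look on the RocksDB side, assuming a binlog_checkpoint_request()-style hook as described above. All names are illustrative stand-ins:

    #include <cstdint>

    // No forced fsync at binlog rotate; instead remember the WAL position
    // that covers the retired binlog, and answer the request once the WAL
    // has been synced past it in the normal course of events.
    struct CheckpointState {
      uint64_t wanted_wal_pos = 0;  // WAL end as of the checkpoint request
      bool pending = false;
    };

    static uint64_t CurrentWALPosition() { return 0; }  // stub

    void binlog_checkpoint_request_sketch(CheckpointState &st) {
      st.wanted_wal_pos = CurrentWALPosition(); // covers all prepares so far
      st.pending = true;                        // note: no fsync forced here
    }

    // Called from wherever the engine learns the WAL was synced up to
    // synced_pos (e.g. after a write group that happened to sync).
    void on_wal_synced_sketch(CheckpointState &st, uint64_t synced_pos) {
      if (st.pending && synced_pos >= st.wanted_wal_pos) {
        st.pending = false;
        // ...now notify the upper layer that the old binlog file can be
        // released (MariaDB's commit_checkpoint_notify_ha(), not shown).
      }
    }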
Though I think it would be good if the full functionality of commit_ordered() was implemented in RocksDB. Passing down into RocksDB explicitly the list of transactions to group-prepare or group-commit sounds like an interesting idea that could potentially benefit performance.
Thoughts?
Let me reserve my night for more :-).
Hope this helps, I wanted to present some background on this feature. Please let me know of any details you want explained or discussed, and I will try to answer them briefly and to the point.
- Kristian.
Cheers, Andrei