Hello,

This is about https://jira.mariadb.org/browse/MDEV-11934. I've encountered an
interesting issue here, so I thought I would consult both the MyRocks and
MariaDB lists.

== Some background ==

The "group commit with binlog" feature needs to accomplish two goals:

1. Keep the binlog and the storage engine in sync. This is done by employing
XA between the binlog and the storage engine. It works by making these calls:

  /*
    Make the transaction's changes ready to be committed (no conflicts with
    other transactions, etc.) but do not commit them yet.
    The effects of the prepare operation must be synced to disk, as the
    storage engine needs to be able to recover (i.e. commit) the prepared
    transaction after a crash.
  */
  storage_engine->prepare(sync=true);

  /*
    After this call, the transaction is considered committed. In case of a
    crash, the recovery process will use the contents of the binlog to
    determine which of the prepared transactions are to be committed and
    which are to be rolled back.
  */
  binlog->write(sync=true);

  /*
    Commit the transaction in the storage engine. This makes its changes
    visible to other transactions (and also releases its locks and so forth).
    Note that most of the time(*) we don't need to sync here. In case of a
    crash we will be able to recover using the binlog.
  */
  storage_engine->commit();

2. Make the operation performant. We need two coordinated disk flushes per
transaction; the idea is to do "Group Commit", where multiple transactions
share disk flushes.

So, we need to do group commit and keep the storage engine and the binlog in
sync while doing that.

== Group Commit with Binlog in MySQL ==

MySQL (and fb/mysql-5.6 in particular) does this in the following phases:

Phase #1:
  Call storage_engine->prepare() for all transactions in the group.
  The call itself is not persistent.

Phase #2:
  Call storage_engine->flush_logs().
  This makes the effect of all Prepare operations from Phase #1 persistent.

Phase #3:
  Write and sync the binary log.

Phase #4:
  Call storage_engine->commit(). This does not need to be persistent.

MyRocks implements these phases.

== Group Commit with Binlog in MariaDB ==

MariaDB does not have the following phases described above:
Phase #1: Call storage_engine->prepare() for all transactions in the group. The call itself is not persistent.
Phase #2: Call storage_engine->flush_logs(). This makes the effect of all Prepare operations from Phase #1 persistent.
A quote from Kristian's description at https://lists.launchpad.net/maria-developers/msg10832.html
So the idea is to do group prepare with the same group of transactions that will later group commit to the binlog. In MariaDB, this concept does not exist. Storage engine prepares are allowed to run in parallel and in any order compared to binlog commit.
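To show why the separate "prepare, then flush_logs()" phases matter for the number of disk syncs, here is a minimal sketch of a toy engine that only counts syncs. Everything in it (ToyEngine, the two helper functions) is hypothetical and made up for illustration; it is not MySQL, MariaDB or MyRocks code. The "independent" case is a worst case that assumes the engine does no grouping of its own.

```cpp
#include <cstddef>

// Hypothetical toy engine (not MariaDB, MySQL or MyRocks code): it only
// counts how many times its log is synced to disk.
struct ToyEngine {
    std::size_t log_syncs = 0;          // number of simulated disk syncs
    std::size_t unsynced_prepares = 0;  // prepares written but not yet durable

    // Prepare one transaction; with sync=true it is made durable right away.
    void prepare(bool sync) {
        ++unsynced_prepares;
        if (sync) flush_logs();
    }

    // Make all prepares written so far durable with a single sync.
    void flush_logs() {
        if (unsynced_prepares == 0) return;  // nothing to flush
        ++log_syncs;
        unsynced_prepares = 0;
    }
};

// MySQL-style Phases #1 and #2: non-persistent prepares, then one flush.
std::size_t syncs_with_group_prepare(std::size_t group_size) {
    ToyEngine engine;
    for (std::size_t i = 0; i < group_size; ++i) engine.prepare(false);
    engine.flush_logs();
    return engine.log_syncs;
}

// Worst case without a group prepare step: every prepare syncs by itself.
std::size_t syncs_with_independent_prepares(std::size_t group_size) {
    ToyEngine engine;
    for (std::size_t i = 0; i < group_size; ++i) engine.prepare(true);
    return engine.log_syncs;
}
```

In this model a group of 8 transactions costs one sync with the group prepare and 8 syncs when each prepare syncs on its own. A real engine can do better than the worst case by grouping concurrent prepares internally.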
Initially this looked like it could work for MyRocks. MyRocks has a group
commit implementation; both Prepare() and Commit() operations participate in
groups.

However, when I implemented group commit this way, I found its performance to
be close to what one would expect if there were no commit grouping and the
commit() call flushed to disk.
https://jira.mariadb.org/browse/MDEV-11934 has the details.

== The issue ==

(I'm 95% certain about this. It's not 100% yet, but it is very likely.)

RocksDB's Group Write (see rocksdb/rocksdb/db/db_impl_write.cc,
DBImpl::WriteImpl function) handles both Prepare() and Commit() commands and
does the following:

1. Controls writing the committed data into the MemTable
2. Writes transactions to WAL
3. Syncs the WAL.

All three steps are done for the whole group. This has a consequence: a
Commit() operation that does not need to sync the WAL will still be delayed if
another operation in the group needs the WAL to be synced.

This delay has a disastrous effect, because the SQL layer tries to have the
same order of transactions in the storage engine and in the binlog. In order
to do that, it calls rocksdb_commit_ordered() for each transaction
sequentially. Delaying one transaction causes a delay of the entire SQL-level
commit group.

== Possible solutions ==

I am not sure what to do.

- Make the SQL layer's Group Commit implementation invoke hton->flush_logs()
  explicitly, like MySQL does?

- Modify RocksDB so that Transaction::Commit(sync=false) does not use Group
  Write? I am not sure if this is possible: Group Write is not only about
  performance, it's also about preventing concurrent MemTable writes.
  AFAIU one cannot just tell a certain DBImpl::WriteImpl() call to not
  participate in write groups and work as if there was no other activity.

- Modify RocksDB so that Transaction::Commit(sync=false) does not wait until
  its write group finishes WAL sync? This could be doable but is potentially
  complex.

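To make the delay described under "The issue" concrete, here is a toy cost model of a write group. It is only a sketch under stated assumptions: WriteOp, the two latency constants and both functions are hypothetical, the microsecond values are made up, and none of this is RocksDB code. It assumes that every member of a group returns only when the whole group is done, and that the SQL layer commits its transactions one after another.

```cpp
#include <vector>

// One operation handed to a hypothetical write group.
struct WriteOp {
    bool needs_sync;  // true for Prepare(sync=true), false for Commit(sync=false)
};

// Made-up costs in microseconds, chosen only to show the effect.
constexpr long kWalWriteUs = 10;
constexpr long kWalSyncUs = 1000;

// Time until the members of one write group return. All members wait for
// the whole group, so a single syncing member makes everyone pay for the sync.
long group_latency_us(const std::vector<WriteOp>& group) {
    if (group.empty()) return 0;
    bool any_sync = false;
    for (const WriteOp& op : group) {
        if (op.needs_sync) any_sync = true;
    }
    return kWalWriteUs + (any_sync ? kWalSyncUs : 0);
}

// The SQL layer commits the transactions of its group one after another, so
// the latencies of the write groups they land in add up.
long ordered_commit_latency_us(const std::vector<std::vector<WriteOp>>& groups) {
    long total = 0;
    for (const auto& group : groups) total += group_latency_us(group);
    return total;
}
```

With these numbers, three sequential non-syncing commits cost 30 us when each runs in a group of its own, and 3030 us when each one happens to share its write group with a syncing Prepare().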
BR
 Sergei
--
Sergei Petrunia, Software Developer
MariaDB Corporation | Skype: sergefp | Blog: http://s.petrunia.net/blog