Monty asked me to fix MDEV-11937. This particular one is a performance regression in InnoDB commit. But there is a wider problem that I thought I should explain, so that it can perhaps be avoided in the future.

MariaDB and MySQL use different mechanisms for storage engines to avoid having to fsync during commit when the binlog is enabled. In MariaDB, storage engines implement the commit_checkpoint_request() handlerton method. In MySQL, storage engines call thd_get_durability_property() to check if they can avoid the fsync.

The problem is that the thd_get_durability_property() function was somehow introduced into MariaDB code, but it is completely non-functional there. So now there is code in InnoDB, TokuDB and RocksDB that calls this function and does not work correctly. This led to a performance regression due to extra fsync() calls.

This seems to me a serious problem. New code can now be merged and compile fine while in reality being wrong. There really should not be two separate and different mechanisms for the same thing, and certainly not with one of them non-functional.

The "expected" approach would be to remove thd_get_durability_property() and update storage engines to use the corresponding MariaDB APIs (commit_ordered() and commit_checkpoint_request()). This should not be hard. A simple commit_checkpoint_request() implementation can just fsync all transactions immediately (similar to what MySQL does). A more elaborate implementation can avoid any extra fsyncs, and just asynchronously notify the upper layer with commit_checkpoint_notify_ha() later, when such an fsync happens in the normal course of things (this is what InnoDB does). See the comments in handler.h for details.

Or was the intention to eventually replace the whole MariaDB binlog group commit implementation with the MySQL one, to make MariaDB less divergent? This would require a number of changes to MariaDB binlog and replication.
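
To make the two commit_checkpoint_request() options above concrete, here is a toy model in C++. It is only a sketch: ToyEngine, its fields and both method names are invented for illustration, fsync is simulated with a counter, and the notify callback merely stands in for commit_checkpoint_notify_ha(). The real interface is the one documented in handler.h.

```cpp
// Toy model only: none of these types or names are the real MariaDB ones.
#include <cassert>
#include <functional>
#include <vector>

// Stand-in for the upper layer's commit_checkpoint_notify_ha() callback.
using NotifyFn = std::function<void(int cookie)>;

// Hypothetical storage engine with a simulated redo log.
struct ToyEngine {
    unsigned long written_lsn = 0;  // last position appended to the log
    unsigned long flushed_lsn = 0;  // last position known to be durable
    int fsync_calls = 0;            // number of simulated fsync() calls
    std::vector<int> pending;       // cookies waiting for the next sync
    NotifyFn notify = [](int) {};   // no-op until the caller installs one

    // Commit without syncing: the log grows, durability lags behind.
    void commit() { ++written_lsn; }

    // Simulated log sync. Everything written so far becomes durable, so
    // every pending checkpoint request can be reported now.
    void fsync_log() {
        ++fsync_calls;
        flushed_lsn = written_lsn;
        for (int cookie : pending) notify(cookie);
        pending.clear();
    }

    // Option 1 (simple): sync immediately, then report the request.
    void checkpoint_request_simple(int cookie) {
        fsync_log();
        notify(cookie);
    }

    // Option 2 (lazy): never sync here. Report at once if nothing is
    // outstanding, otherwise remember the cookie until the next sync.
    void checkpoint_request_lazy(int cookie) {
        if (flushed_lsn >= written_lsn) {
            notify(cookie);
            return;
        }
        pending.push_back(cookie);
    }
};
```

With the lazy variant, a request made while commits are still unsynced costs no fsync of its own; the cookie is reported when the log is next synced for some other reason.
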
For such a switch, the binlog recovery code would have to be replaced (MySQL does not have the ability to recover from more than one binlog). The binlog group commit code would have to be replaced as well, and the commit_ordered() mechanism removed. I think this would also require a re-design of the MariaDB in-order parallel replication. MySQL has some optional in-order mechanism, but my understanding is that it is not sufficient to do optimistic parallel replication. I am not intimately familiar with the MySQL code, though.

I hope this helps. The wider problem behind MDEV-11937 is one of policy more than one of code bugs, so there is not much else I can do to address it.

-----------------------------------------------------------------------

Incidentally, I noticed some code in InnoDB trx0trx.cc:

    /* We set the HA_IGNORE_DURABILITY during prepare phase of binlog
       group commit to not flush redo log for every transaction here. So that
       we can flush prepared records of transactions to redo log in a group
       right before writing them to binary log during flush stage of binlog
       group commit. */

So the idea is to do group prepare with the same group of transactions that will later group commit to the binlog.

In MariaDB, this concept does not exist. Storage engine prepares are allowed to run in parallel and in any order compared to binlog commit. So the InnoDB group prepare can include more transactions than participate in the binlog commit (but also fewer, of course). IIRC, the important thing is to ensure that all transactions are durably prepared in storage engines before being written to the binlog.

In MariaDB, there is MYSQL_BIN_LOG::group_commit_queue, which holds the list of transactions to be group committed to the binlog. A similar mechanism in MariaDB might use the group_commit_queue as the set of transactions to send to storage engines for group prepare.
But if we wait for a prepare fsync() after building this list, some transactions that could prepare during this fsync may be unnecessarily delayed to the next binlog commit.

The MySQL 5.7 code grabs the list of transactions to binlog group commit, and then flushes the log in _all_ storage engines, unconditionally. That just seems horribly wrong, but I can't see any other way to read the code (from MYSQL_BIN_LOG::process_flush_stage_queue()):

    ha_flush_logs(NULL, true);

It even does so while holding LOCK_log :-(

I guess the MySQL idea is that there is only one storage engine anyway, InnoDB.

 - Kristian.