Kristian,
andrei.elkin@pp.inet.fi writes:
2. To make START TRANSACTION WITH CONSISTENT SNAPSHOT actually correctly synchronise snapshots between multiple storage engines (MySQL does not have this, I think).
(Offtopic, but anyway what it is? Multi-engine transaction with this specific ISOLATION level?)
https://mariadb.com/kb/en/the-mariadb-library/enhancements-for-start-transac...
So it is like a REPEATABLE READ across engines, applications can get a consistent view of cross-engine transactions. It also allows to do a non-blocking mysqldump without FLUSH TABLES WITH READ LOCK.
3. To avoid having to do an extra fsync() for every commit, on top of the one for prepare and the one for binlog write (MySQL has this).
MySQL handles (3) by stopping all transactions around binlog rotate and doing a flush_logs().
Maybe I am missing something, but why binlog rotation? It is not a common case. Indeed MySQL BGC reduces the number of fsync() to two by the (flush stage) leader. As to rotation, it's a specific branch of MYSQL_BIN_LOG::ordered_commit() where a rotator thread contends for the flush stage mutex, eventually gains it (which may lead to few more groups binlogged into the old being rotated file) and performs.
It used to be that there was _three_ fsyncs for every commit. The _only_ reason the fsync in commit was needed was to ensure that binlog crash recovery would still work after a binlog rotation. Which was kind of silly.
This one must be https://github.com/mysql/mysql-server/commit/35adf21bb63a336c76efdad6c461016...
So _something_ needed to be done aroung binlog rotation. To ensure that all transactions are durably committed in storage engines before they are no longer available to binlog crash recovery.
If I understand correctly, MySQL ensures this by temporarily stopping binlog writes, calling flush_logs() in all (?) engines (with semantics that flush_logs() must make all prior commit()'s durable), and only then allowing new writes to the new binlog. I am not sure how MySQL ensures that all commit() calls complete before the flush_logs() call, maybe it takes both the LOCK_commit and LOCK_log mutexes around binlog rotation.
Almost: LOCK_log and LOCK_xids - I've checked it out, 'cos really forgot it. MySQL employs a sort of unlogging xid former (before BGC) technique. The rotator first (phtread_cond-)waits for all flushed-to-binlog-xids got committed, and then having the two mutex grant ha_flush_logs() is issued right before the new log file is set.
The result is that binlog crash recovery is always possible from only one binlog file.
MariaDB instead extends binlog crash recovery to consider multiple binlog files, if necessary. Then nothing special is needed during binlog rotation. But some "garbage collection" is needed to eventually release old binlog files.
MariaDB avoids this stall
You must be having a use case which I can't see..
I am not sure one is better than the other. MariaDB avoids flush_logs(), though in current storage engines it may not matter much. The MySQL approach is arguably simpler code, though it seems quite an abuse of flush_logs().
It must be only for good to have two well explored methods around.
The MySQL approach was not public when the MariaDB approach was implemented.
True. Thanks a lot for talking and explaining these fine bits! Andrei