Re: [Maria-developers] A problem with implementing Group Commit with Binlog with MyRocks

12 Sep 2017

      Kristian,
...
andrei.elkin@pp.inet.fi writes:
...
...
2. To make START TRANSACTION WITH CONSISTENT SNAPSHOT actually correctly
synchronise snapshots between multiple storage engines (MySQL does not have
this, I think).
(Offtopic, but anyway what it is? Multi-engine transaction with this
specific ISOLATION level?)
https://mariadb.com/kb/en/the-mariadb-library/enhancements-for-start-transac...
So it is like a REPEATABLE READ across engines, applications can get a
consistent view of cross-engine transactions.
It also allows to do a non-blocking mysqldump without FLUSH TABLES WITH READ
LOCK.
...
...
3. To avoid having to do an extra fsync() for every commit, on top of the
one for prepare and the one for binlog write (MySQL has this).
...
...
MySQL handles (3) by stopping all transactions around binlog rotate and
doing a flush_logs().
Maybe I am missing something, but why binlog rotation? It is not a
common case.  Indeed MySQL BGC reduces the number of fsync() to two by
the (flush stage) leader.  As to rotation, it's a specific branch of
 MYSQL_BIN_LOG::ordered_commit()
where a rotator thread contends for the flush stage mutex, eventually gains it
(which may lead to few more groups binlogged into the old being rotated
file) and performs.
It used to be that there was _three_ fsyncs for every commit. The _only_
reason the fsync in commit was needed was to ensure that binlog crash
recovery would still work after a binlog rotation. Which was kind of
silly.
This one must be

https://github.com/mysql/mysql-server/commit/35adf21bb63a336c76efdad6c461016...
...
So _something_ needed to be done aroung binlog rotation. To ensure that all
transactions are durably committed in storage engines before they are no
longer available to binlog crash recovery.
If I understand correctly, MySQL ensures this by temporarily stopping binlog
writes, calling flush_logs() in all (?) engines (with semantics that
flush_logs() must make all prior commit()'s durable), and only then allowing
new writes to the new binlog. I am not sure how MySQL ensures that all
commit() calls complete before the flush_logs() call, maybe it takes both
the LOCK_commit and LOCK_log mutexes around binlog rotation.
Almost: LOCK_log and LOCK_xids -  I've checked it out, 'cos really forgot it.
MySQL employs a sort of unlogging xid former (before BGC) technique.
The rotator first (phtread_cond-)waits for all flushed-to-binlog-xids got committed,
and then having the two mutex grant ha_flush_logs() is issued right
before the new log file is set.
...
The result is
that binlog crash recovery is always possible from only one binlog file.
MariaDB instead extends binlog crash recovery to consider multiple binlog
files, if necessary. Then nothing special is needed during binlog rotation.
But some "garbage collection" is needed to eventually release old binlog
files.
...
...
MariaDB avoids this stall
You must be having a use case which I can't see..
I am not sure one is better than the other. MariaDB avoids flush_logs(),
though in current storage engines it may not matter much. The MySQL approach
is arguably simpler code, though it seems quite an abuse of flush_logs().
It must be only for good to have two well explored methods around.
...
The MySQL approach was not public when the MariaDB approach was
implemented.
True.

Thanks a lot for talking and explaining these fine bits!

Andrei

Re: [Maria-developers] A problem with implementing Group Commit with Binlog with MyRocks

andrei.elkin＠pp.inet.fi