Marko Mäkelä <marko.makela@mariadb.com> writes:
> later. I understand that you were an early contributor to MariaDB and implemented some durability I/O optimization between the "distributed transaction" of binlog and the InnoDB write-ahead log (redo log, ib_logfile0).
Indeed, that was the group commit work:

  https://m.facebook.com/notes/mark-callaghan/group-commit-in-mariadb-is-fast/...
  https://knielsen-hq.org/w/fixing-mysql-group-commit-part-1/
  https://knielsen-hq.org/w/fixing-mysql-group-commit-part-2/
  https://knielsen-hq.org/w/fixing-mysql-group-commit-part-3/
  https://knielsen-hq.org/w/fixing-mysql-group-commit-part-4-of-3/
> I think that some improvement in this area is long overdue.

Agree. I think it would be fantastic to combine your extensive knowledge of InnoDB with my binlog/replication expertise to improve this! There are a lot of very interesting points in your mail.

> I find it hard to read the high-level descriptions. Nowhere in https://jira.mariadb.org/browse/MDEV-232 or https://jira.mariadb.org/browse/MDEV-532 do I see any mention of any key terminology:

Indeed, my understanding of InnoDB internals is far from complete (though I do understand the concept of LSN at a high level, of course).

From the binlog side, there are some basic design goals that influence how the binlog group commit is implemented (a rough sketch of the commit path they imply follows the list):

1. The APIs (handlerton methods and server calls) are designed to be storage-engine independent. So this is one reason why we would not see an InnoDB LSN anywhere in the server part.

2. Transactions (the replication term is "event group") are always written into the binlog as a single, atomic (kind of) block. So there is no interleaving of transactions in the binlog, and no transactions that span multiple binlog files.

3. Transactions in the binlogs are strictly ordered, even across different servers in the same replication topology (this is used for GTID and for in-order optimistic parallel replication).

4. Identical commit order is enforced between the binlog and the storage engine. I think this is related to getting a consistent replication slave position from an innobackup or LVM/BTRFS snapshot backup.
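Spelled out, that commit path looks roughly like this. This is a sketch only, not actual server code; the engine_*() and binlog_*() helpers are stand-ins for the real handlerton methods (prepare(), commit_ordered(), commit()) and the binlog writing code:

  /* Rough sketch of the commit path that points 1-4 impose on one
     transaction when the binlog is enabled. */

  #include <mutex>

  struct Transaction { /* the event group and engine state */ };

  static std::mutex LOCK_commit_ordered;   // models the real LOCK_commit_ordered

  static void engine_prepare(Transaction &) {}        // make the trx recoverable
  static void binlog_write_and_sync(Transaction &) {} // event group as one block
  static void engine_commit_ordered(Transaction &) {} // commit becomes visible
  static void engine_commit(Transaction &) {}         // remaining (lazy) commit work

  void commit_one_transaction(Transaction &trx)
  {
    engine_prepare(trx);           // InnoDB: XA PREPARE record in the redo log

    binlog_write_and_sync(trx);    // whole event group into the binlog, in commit
                                   // order and with no interleaving (points 2, 3)

    {
      // Same commit order in the engine as in the binlog (point 4).  In the
      // real code this step and the binlog write above are driven by a group
      // commit leader thread for a whole batch of transactions, which is what
      // guarantees the identical ordering; see the second sketch further down.
      std::lock_guard<std::mutex> guard(LOCK_commit_ordered);
      engine_commit_ordered(trx);
    }

    engine_commit(trx);            // the engine may make this durable later
  }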
> and trx_t::must_flush_log_later. There is also the confusingly named handlerton::commit_checkpoint_request that has nothing to do with log checkpoints, but with some kind of a durability request that fails to identify the minimum logical time up to which everything is desired to be durable. (My talk ...)

This was an optimisation to remove one fsync from commits when the binlog is enabled:

  https://knielsen-hq.org/w/even-faster-group-commit/

If the server crashes, then on restart the binlog transaction coordinator needs to recover the XA state between the binlog and the engines. For this it scans the storage engines and the binlog files for any pending transactions. Any binlog file that contains pending XA transactions needs to be scanned; once all transactions in a binlog file are durably committed in the engines, it is preferable not to scan it, for efficiency.

commit_checkpoint_request() is called when the binlog wants to mark a point before which binlog files no longer need to be scanned (a "binlog checkpoint"). We can see in innobase_checkpoint_request() that this is mapped to an InnoDB LSN (using log_get_lsn()). This follows point (1) above, where the server-level API (binlog) is kept separate from storage-engine internals (the InnoDB LSN).

I wonder if this is a real mismatch between the API and InnoDB, or a misunderstanding about what is intended from commit_checkpoint_request(). The binlog checkpoint mechanism is completely asynchronous, and it is not a request for the storage engine to take any action to make things durable. It is merely a request for InnoDB to conservatively pick an LSN that is known to contain any transaction that has completed commit_ordered(); it may be any later LSN that is cheaper to obtain. Then later, when that LSN is eventually made durable on the engine's own initiative, the engine calls commit_checkpoint_notify_ha() to complete the binlog checkpoint.

The intention is that this should be very easy for an engine to implement, without any changes to how it manages its internal logs and durability. It is just a way for the engine to periodically communicate that some LSN is now durable, in a way that lets the server match an LSN to a binlog position without assuming a particular storage engine or LSN format. But if this intention fails in practice, we should look into it.
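To spell out the intended contract, here is a minimal sketch. It is not the actual InnoDB code; current_lsn(), durable_lsn(), engine_checkpoint_request() and the pending list are stand-ins for engine internals, commit_checkpoint_notify_ha() is the real server call but with its signature simplified here, and locking is omitted:

  #include <cstdint>
  #include <utility>
  #include <vector>

  typedef std::uint64_t lsn_t;

  static lsn_t current_lsn() { return 0; }  // stand-in for InnoDB's log_get_lsn()
  static lsn_t durable_lsn() { return 0; }  // stand-in for "LSN flushed to disk"
  static void commit_checkpoint_notify_ha(void *) {}  // real server call, stubbed

  static std::vector<std::pair<lsn_t, void *> > pending_checkpoints;

  /* Called from the binlog.  Asynchronous: must not flush or block, just
     remember "notify the binlog once this LSN is durable". */
  void engine_checkpoint_request(void *cookie)
  {
    /* Any LSN at or above the LSN of the last completed commit_ordered() is
       fine; the current end-of-log LSN is conservative and cheap to get. */
    pending_checkpoints.push_back(std::make_pair(current_lsn(), cookie));
  }

  /* Hooked into the engine's normal log-flush path: whenever some LSN has
     become durable for the engine's own reasons, complete any binlog
     checkpoints that were waiting for it. */
  void engine_after_log_flush()
  {
    lsn_t flushed= durable_lsn();
    for (auto it= pending_checkpoints.begin(); it != pending_checkpoints.end(); )
    {
      if (it->first <= flushed)
      {
        commit_checkpoint_notify_ha(it->second);
        it= pending_checkpoints.erase(it);
      }
      else
        ++it;
    }
  }

The point is that the engine never has to flush anything on the binlog's behalf; it only reports, whenever it flushes anyway, that a previously recorded LSN has become durable.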
> Some years ago, in MDEV-21534 Vladislav Vaintroub introduced write_lock and flush_lock in InnoDB. This works pretty well: a [...]
> If everything up to "my" time has been written or durably written, I do not care to wait.

So if I understand correctly, records from parallel transactions are written concurrently into the InnoDB log in an interleaved fashion. One thread can just put log records into a buffer as they are generated and continue query execution, and later they may be written or synced asynchronously by another thread.

This is not how the binlog works, because of points (2) and (3) above. The full set of log records ("binlog events") is only available during the commit step, and it needs to be written into the binlog at a specific point relative to all other transactions. So it sounds to me as if a similar approach with write_lock and flush_lock cannot be used for the binlog (but correct me if I'm wrong). Instead the binlog uses a different optimisation:

  https://knielsen-hq.org/w/benchmarking-thread-scheduling-in-group-commit/
  https://knielsen-hq.org/w/benchmarking-thread-scheduling-in-group-commit-par...

Since binlog writing is essentially sequential in nature, it is more efficient to do it in a single thread for all waiting threads, and this is how the MariaDB binlog group commit is done.
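In rough pseudo-code, the scheduling idea is something like the following. This is a simplified model, not the actual sql/log.cc implementation; the queue, LOCK_group, COND_group, LOCK_log and the binlog/engine helpers are stand-ins, and error handling is left out:

  #include <condition_variable>
  #include <mutex>
  #include <vector>

  struct Transaction { bool done= false; /* binlog events, engine state */ };

  static void binlog_write(Transaction *) {}          // sequential append
  static void binlog_sync() {}                        // one fsync() per group
  static void engine_commit_ordered(Transaction *) {} // handlerton commit_ordered()

  static std::mutex LOCK_group;               // protects the queue below
  static std::condition_variable COND_group;
  static std::vector<Transaction *> queue;    // transactions waiting to commit
  static std::mutex LOCK_log;                 // one group writes the binlog at a time

  void group_commit(Transaction *trx)
  {
    std::unique_lock<std::mutex> lk(LOCK_group);
    bool leader= queue.empty();               // first in the group becomes leader
    queue.push_back(trx);

    if (!leader)
    {
      // Followers just wait; the leader does all the work for them.
      COND_group.wait(lk, [trx] { return trx->done; });
      return;
    }

    // Leader: wait for my turn on the binlog, then take the whole group that
    // queued up behind me and process it sequentially.
    std::unique_lock<std::mutex> log_lk(LOCK_log);
    std::vector<Transaction *> group;
    group.swap(queue);
    lk.unlock();

    for (Transaction *t : group)
      binlog_write(t);                        // strict ordering, no interleaving
    binlog_sync();                            // a single fsync() for the group

    for (Transaction *t : group)
      engine_commit_ordered(t);               // same order as the binlog writes
    log_lk.unlock();

    lk.lock();
    for (Transaction *t : group)
      t->done= true;
    COND_group.notify_all();
  }

The sequential binlog write and the single fsync() are done once per group by one thread, instead of every committing thread contending to do its own write and sync.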
> code. A colleague recently claimed to me that LOCK_commit_ordered is a huge bottleneck. As you can see, the write_lock and flush_lock are [...]

But is it really LOCK_commit_ordered that is the bottleneck? Or is it just the lock that shows up in profiling reports, because it is where the binlog sequencing happens according to points (2), (3), and (4) above? I'm very interested in learning ways of improving this, but I can easily imagine that LOCK_commit_ordered could be misidentified as the bottleneck when the real cause is a much deeper issue in how the binlog operates in the current implementation.
> It seems to me that the missing tracking of logical time in the data structures is causing hangs like MDEV-25611.

This is a hang with RESET MASTER. I vaguely recall some very tricky locking semantics around RESET MASTER. My guess would be that complex and inconsistent locking around RESET MASTER is the cause, rather than the commit_checkpoint_request() API, which is completely asynchronous and not expected to cause hangs. But I will need to look into the code in detail to be sure.

> I feel that the design is based on a wrong assumption that the binlog-covered transactions are the only source of InnoDB log writes. That has never been the case:

Hm, there should be no such assumption. As explained above, InnoDB is free to write whatever and however it wants to its logs. The binlog just needs to know at _some_ point in time that all the transactions in a specific binlog file are now durably committed in the engine. That point in time can be delayed for a long time (even hours); the only penalty is a slightly longer recovery in case of a crash. Or am I missing something?
> The task MDEV-18959 (ensuring that one fdatasync() per durable transaction suffices) is also related to this.

Yes, solving the problem of multiple syncs being required per commit is a long-standing goal; this is a huge bottleneck. At the root of it is having multiple transactional/redo logs (one in each engine and one in the binlog), as opposed to a single shared log.

> I think that the simplest solution (assuming that moving the binlog to a bunch of transactional tables is out of the question) is to ensure that one of the logs (binlog and other write-ahead logs) is always ahead.

Moving the binlog into transactional tables would be one way to effectively use a single log - the updates to the tables would happen through the InnoDB log. My concerns would include the unnecessary overhead of duplicating writes (to both the log and the tablespaces) for what is an append-only dataset, and how to make this work for multiple storage engines and multi-engine transactions.

I've always had some concerns about making the binlog the "master" log on which recovery is based. One problem is that the current binlog implementation is very naive; it is not even block-structured and pre-allocated, it is just a simple file being constantly appended to. This already means double overhead on fdatasync(), since the filesystem must sync both the data and the metadata (file length). Another problem is that the binlog is a logical log (SQL statements), which is a fragile thing to base a recovery on when that recovery _must_ be 100% reliable. If we don't want to recover InnoDB transactions by replaying binlog replication event groups, then we anyway need an fdatasync() inside InnoDB in the XA prepare step before we can be sure the transaction is durable, don't we?

Another idea would be to go in the opposite direction, where the storage engine could provide a logging service to the server (just like it provides tables). This way the binlog code could call into InnoDB to write the binlog records into the InnoDB log. But this requires changes on the InnoDB side (log archiving, for one), and reading binlogs for replication would be slower, as they would be interleaved with unrelated log records.

> If we choose that the binlog must be ahead, then in InnoDB the write_lock acquisition would have to be preceded by a binlog_write_lock acquisition that will ensure that any relevant changes of persistent state (COMMIT, XA PREPARE, XA ROLLBACK, XA COMMIT) that could be part of the InnoDB log write must have been durably written to the binlog.

Yes, it is an interesting idea. What do you see as the mechanism to recover a transaction inside InnoDB in case of a crash just after writing to the binlog?

Very interesting discussion and points, Marko. Thanks for your writeup, and I'm looking forward to future discussions with you.

 - Kristian.