[Maria-developers] Understanding binlog group commit (MDEV-232, MDEV-532, MDEV-25611, MDEV-18959)
Hi Kristian,

I am happy to see you being active on the maria-developers list. I don't think we have collaborated in the past. I have been working on the InnoDB storage engine under MySQL 4.0 through 5.7 (and some 8.0) since 2003, and since 2016 under MariaDB Server, mostly 10.1 and later. I understand that you were an early contributor to MariaDB and implemented some durability I/O optimization between the "distributed transaction" of the binlog and the InnoDB write-ahead log (redo log, ib_logfile0).

I think that some improvement in this area is long overdue. I find it hard to read the high-level descriptions. Nowhere in https://jira.mariadb.org/browse/MDEV-232 or https://jira.mariadb.org/browse/MDEV-532 do I see any mention of the key terminology:

* InnoDB log sequence number (LSN): a logical point of time for any persistent InnoDB changes. Measured in bytes. Mini-transactions (atomic changes of InnoDB pages) start and end at some LSN. Corresponds to the Oracle system change number (SCN).

* Binlog offset: a logical point of time for any changes at the replication level. Measured in bytes. Replication events such as statements or row-level events start and end at some binlog offset.

* The LSN or binlog offset of a COMMIT, XA PREPARE, XA ROLLBACK, or XA COMMIT.

There is quite a bit of code in InnoDB that I would like to get rid of. Some of it is related to the data members trx_t::flush_log_later and trx_t::must_flush_log_later. There is also the confusingly named handlerton::commit_checkpoint_request, which has nothing to do with log checkpoints, but with some kind of durability request that fails to identify the minimum logical time up to which everything is desired to be durable. (My talk https://fosdem.org/2022/schedule/event/mariadb_innodb/ explains what a log checkpoint is and how recovery works.)

Some years ago, in MDEV-21534, Vladislav Vaintroub introduced write_lock and flush_lock in InnoDB.
This works pretty well: a Sysbench workload with innodb_flush_log_at_trx_commit=1 will linearly improve throughput as you add more client connection threads. I think that something similar would be beneficial for the binlog, to keep track of the latest written offset and the latest offset up to which fdatasync() has completed. I am not familiar with the binlog code. A colleague recently claimed to me that LOCK_commit_ordered is a huge bottleneck. As you can see, write_lock and flush_lock are better because the lock requests include a notion of logical time: if everything up to "my" time has been written or durably written, I do not care to wait.

In MDEV-27774 (MariaDB Server 10.8) this was improved further: multiple mini-transactions can copy their local log buffer to the shared log_sys.buf concurrently.

It seems to me that the missing tracking of logical time in the data structures is causing hangs like MDEV-25611. I feel that the design is based on the wrong assumption that binlog-covered transactions are the only source of InnoDB log writes. That has never been the case: for example, there have always been background threads or tasks that purge the history of committed transactions.

The task MDEV-18959 (ensuring that one fdatasync() per durable transaction suffices) is also related to this. I think that the simplest solution (assuming that moving the binlog to a bunch of transactional tables is out of the question) is to ensure that one of the logs (the binlog or the other write-ahead logs) is always ahead. If we choose that the binlog must be ahead, then in InnoDB the write_lock acquisition would have to be preceded by a binlog_write_lock acquisition that ensures that any relevant changes of persistent state (COMMIT, XA PREPARE, XA ROLLBACK, XA COMMIT) that could be part of the InnoDB log write have been durably written to the binlog.

With best regards,

Marko

--
Marko Mäkelä, Lead Developer InnoDB
MariaDB plc
Marko Mäkelä <marko.makela@mariadb.com> writes:
later. I understand that you were an early contributor to MariaDB and implemented some durability I/O optimization between the "distributed transaction" of binlog and the InnoDB write-ahead log (redo log, ib_logfile0).
Indeed, that was the group commit work: https://m.facebook.com/notes/mark-callaghan/group-commit-in-mariadb-is-fast/... https://knielsen-hq.org/w/fixing-mysql-group-commit-part-1/ https://knielsen-hq.org/w/fixing-mysql-group-commit-part-2/ https://knielsen-hq.org/w/fixing-mysql-group-commit-part-3/ https://knielsen-hq.org/w/fixing-mysql-group-commit-part-4-of-3/
I think that some improvement in this area is long overdue. I find it
Agree. I think it would be fantastic to combine your extensive knowledge of InnoDB with my binlog/replication expertise to improve this! There's a lot of very interesting points in your mail:
hard to read the high-level descriptions. Nowhere in https://jira.mariadb.org/browse/MDEV-232 or https://jira.mariadb.org/browse/MDEV-532 do I see any mention of the key terminology:
Indeed, my understanding of InnoDB internals is far from complete (though I do understand the concept of LSN at a high level, of course).

From the binlog side, there are some basic design goals that influence how the binlog group commit is implemented:

1. The APIs (handlerton methods and server calls) are designed to be storage-engine independent. This is one reason why we would not see an InnoDB LSN anywhere in the server part.

2. Transactions (the replication term is "event group") are always written into the binlog as a single, (kind of) atomic block. So there is no interleaving of transactions in the binlog, and no transactions that span multiple binlog files.

3. Transactions in the binlogs are strictly ordered, even across different servers in the same replication topology (this is used for GTID and in-order optimistic parallel replication).

4. Identical commit order is enforced between the binlog and the storage engine. I think this is related to getting a consistent replication slave position from an innobackup or LVM/BTRFS snapshot backup.
and trx_t::must_flush_log_later. There is also the confusingly named handlerton::commit_checkpoint_request that has nothing to do with log checkpoints, but with some kind of a durability request that
This was an optimisation to remove one fsync from commits when the binlog is enabled: https://knielsen-hq.org/w/even-faster-group-commit/

If the server crashes, then on restart the binlog transaction coordinator needs to recover the XA state between the binlog and the engines. For this it scans the storage engines and the binlog files for any pending transactions. Any binlog file that contains pending XA transactions needs to be scanned. Once all transactions in a binlog file are durably committed in the engines, it is preferred not to scan it, for efficiency reasons.

commit_checkpoint_request() is called when the binlog wants to mark a point before which binlog files no longer need to be scanned (a "binlog checkpoint"). We can see in innobase_checkpoint_request() that this is mapped to an InnoDB LSN (using log_get_lsn()). This follows point (1) above, where the server-level API (binlog) is kept separate from storage engine internals (InnoDB LSN).
fails to identify the minimum logical time up to which everything is desired to be durable. (My talk
I wonder if this is a real mismatch between the API and InnoDB, or a misunderstanding about what is intended from commit_checkpoint_request(). The binlog checkpoint mechanism is completely asynchronous, and it is not a request for the storage engine to take any action to make things durable. It is merely a request for InnoDB to conservatively pick an LSN that is known to contain any transaction that has completed commit_ordered(); it may be any later LSN that is cheaper to obtain. Then later, when that LSN is eventually made durable on the engine's own initiative, the engine calls commit_checkpoint_notify_ha() to complete the binlog checkpoint.

The intention is that this should be very easy for an engine to implement, without any changes to how it manages internal logs and durability. It is just a way for the engine to periodically communicate that some LSN is now durable, in a way that can match the LSN to a binlog position without assuming a particular storage engine or LSN format. But if this intention fails, we should look into it.
Some years ago, in MDEV-21534 Vladislav Vaintroub introduced write_lock and flush_lock in InnoDB. This works pretty well: a
If everything up to "my" time has been written or durably written, I do not care to wait.
So if I understand correctly, records from parallel transactions are written concurrently into the InnoDB log in an interleaved fashion. So one thread can just put log records into a buffer as they are generated and continue query execution, and later they may be written or synced asynchronously by another thread.

This is not how the binlog works, because of points (2) and (3) above. The full set of log records ("binlog events") is only available during the commit step, and it needs to be written into the binlog at a specific point relative to all other transactions. So it sounds to me that a similar approach with write_lock and flush_lock cannot be used for the binlog (but correct me if I'm wrong). Instead, the binlog uses a different optimisation:

https://knielsen-hq.org/w/benchmarking-thread-scheduling-in-group-commit/ https://knielsen-hq.org/w/benchmarking-thread-scheduling-in-group-commit-par...

Since the binlog writing is essentially sequential in nature, it is more efficient to do it in a single thread for all waiting threads, and this is how the MariaDB binlog group commit is done.
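To make the leader/follower scheme concrete, here is a simplified, single-threaded sketch (class and function names are invented for illustration, not the actual server code): the first committer to find the queue empty becomes the group leader and later writes all queued event groups sequentially, in commit order, followed by a single fdatasync().

```cpp
#include <string>
#include <vector>

// Simplified sketch of binlog group commit: one leader writes the whole
// queue of pending event groups as a single sequential batch, preserving
// commit order (points 2 and 3 above). Followers would just wait.
class group_commit_queue {
public:
  // Returns true if the caller became the group leader.
  bool enqueue(std::string event_group) {
    bool leader = queue.empty();
    queue.push_back(std::move(event_group));
    return leader;
  }

  // The leader appends every queued event group as one atomic block each,
  // in order, then clears the queue. A single fdatasync() would follow.
  size_t leader_write(std::string &binlog) {
    size_t n = queue.size();
    for (const auto &eg : queue)
      binlog += eg;
    queue.clear();
    return n;
  }

private:
  std::vector<std::string> queue;
};
```

The point of the single-writer design is that one sequential write plus one fdatasync() serves the whole group, instead of each committer paying for its own.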
code. A colleague recently claimed to me that LOCK_commit_ordered is a huge bottleneck. As you can see, the write_lock and flush_lock are
But is it really LOCK_commit_ordered that is the bottleneck? Or is it just the lock that appears in profiling reports, because it is where the binlog sequencing happens according to points (2), (3), and (4) above? I'm very interested to learn ways of improving this, but I can easily imagine that LOCK_commit_ordered could be misidentified as the bottleneck of what is in reality a much deeper issue in how the binlog operates in the current implementation.
It seems to me that the missing tracking of logical time in the data structures is causing hangs like MDEV-25611. I feel that the design is
This is a hang with RESET MASTER. I vaguely recall some very tricky locking semantics around RESET MASTER. My guess would be that complex and inconsistent locking around RESET MASTER is the cause, rather than the commit_checkpoint_request() API, which is completely asynchronous and not expected to cause hangs. But I'll need to look into the code in detail to be sure.
based on a wrong assumption that the binlog-covered transactions are the only source of InnoDB log writes. That has never been the case:
Hm, there should be no such assumption. As explained above, InnoDB is free to write whatever and however it wants to its logs. The binlog just needs to know at _some_ point in time that all the transactions in a specific binlog file are now durably committed in the engine. That point in time can be delayed for a long time (even hours); the only penalty is a slightly longer recovery in case of a crash. Or am I missing something?
The task MDEV-18959 (ensuring that one fdatasync() per durable transaction suffices) is also related to this. I think that the
Yes, solving the problem of multiple syncs being required is a long-standing goal; this is a huge bottleneck. At the root of it is having multiple transactional/redo logs (one in each engine and one in the binlog), as opposed to a single shared log.
simplest solution (assuming that moving the binlog to a bunch of transactional tables is out of the question) is to ensure that one of
This would be one way to effectively use a single log - the updates to the tables would happen through the InnoDB log. My concerns would include the unnecessary overhead of duplicating writes (to both log and tablespaces) for what is an append-only dataset, and how to make this work for multiple storage engines and multi-engine transactions.
the logs (binlog and other write-ahead logs) is always ahead. If we choose that the binlog must be ahead, then in InnoDB the write_lock
I've always had some concerns about making the binlog the "master" log on which recovery is based.

One problem is that the current binlog implementation is very naive. It is not even block-structured or pre-allocated; it is just a simple file being constantly appended to. This already doubles the overhead of fdatasync(), since the filesystem must sync both data and metadata (file length).

Another problem is that the binlog is a logical log (SQL statements), which is fragile to base recovery on, and recovery _must_ be 100% reliable. If we don't want to recover InnoDB transactions by replaying binlog replication event groups, then we anyway need an fdatasync() inside InnoDB in the XA prepare step before we can be sure the transaction is durable, don't we?

Another idea would be to go in the opposite direction, where the storage engine could provide a logging service to the server (just like it provides tables). This way the binlog code could call into InnoDB to write the binlog records into the InnoDB log. But this requires changes on the InnoDB side (log archiving, for one), and reading binlogs for replication will be slower, as they are interleaved with unrelated log records.
acquisition would have to be preceded by a binlog_write_lock acquisition that will ensure that any relevant changes of persistent state (COMMIT, XA PREPARE, XA ROLLBACK, XA COMMIT) that could be part of the InnoDB log write must have been durably written to the binlog.
Yes, it is an interesting idea. What do you see as the mechanism to recover a transaction inside InnoDB in case of a crash just after writing to the binlog?

Very interesting discussion and points, Marko. Thanks for your writeup, and looking forward to future discussions with you.

 - Kristian.
Hi Kristian, Thank you for your detailed reply. Before I read the linked blog posts, here are a few quick comments.
1. The APIs (handlerton methods and server calls) are designed to be storage-engine independent. So this is one reason why we would not see an InnoDB LSN anywhere in the server part.
Every transactional storage engine under MariaDB should have some concept of logical time. It is not necessarily totally ordered like the InnoDB or Aria LSN. The Aria LSN consists of a 32-bit log file number and a 32-bit offset within a log file. The InnoDB LSN is a 64-bit offset, but the log file is written as a ring buffer, so the LSN will be mapped to a file offset using some modulo arithmetic.

Do you know how MyRocks implements ACID? There is a function rocksdb_flush_wal(), but it is commented out in MariaDB. In LevelDB, I see some SST and log files, but they may be specific to a single LSM tree, and I would suppose that there are multiple LSM trees in MyRocks, instead of the data from all tables being stored in a single LSM tree.
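For illustration, the two LSN representations mentioned above can be sketched as follows (the constants are illustrative only, not the exact on-disk layouts):

```cpp
#include <cstdint>

// Aria: an LSN is a 32-bit log file number plus a 32-bit offset within
// that file, packed into 64 bits.
constexpr uint64_t aria_lsn(uint32_t file_no, uint32_t offset) {
  return (uint64_t{file_no} << 32) | offset;
}
constexpr uint32_t aria_file_no(uint64_t lsn) { return uint32_t(lsn >> 32); }
constexpr uint32_t aria_offset(uint64_t lsn) { return uint32_t(lsn); }

// InnoDB: an LSN is a monotonic 64-bit byte count; the circular log file
// maps it to a physical offset with modulo arithmetic (ignoring the log
// file header here, for simplicity).
constexpr uint64_t innodb_file_offset(uint64_t lsn, uint64_t capacity) {
  return lsn % capacity;
}
```

The Aria form is only partially ordered across file rotations in the obvious byte sense, while the InnoDB form is a single totally ordered counter that wraps physically but never logically.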
commit_checkpoint_request() is called when the binlog wants to mark a point before which binlog files no longer need to be scanned (a "binlog checkpoint"). We can see in innobase_checkpoint_request() that this is mapped to an InnoDB LSN (using log_get_lsn()). This follows point (1) above, where the server-level API (binlog) is kept separate from storage engine internals (InnoDB LSN).
That function does not currently take any parameter (such as THD) to identify the transaction of interest, and it cannot indicate that the most recent state change of the transaction has already been durably written.
I wonder if this is a real mismatch between the API and InnoDB, or a misunderstanding about what is intended from commit_checkpoint_request(). The binlog checkpoint mechanism is completely asynchronous, and it is not a request for the storage engine to take any action to make things durable. It is merely a request for InnoDB to conservatively pick an LSN that is known to contain any transaction that has completed commit_ordered(), it may be any later LSN that is cheaper to obtain. Then later, when that LSN is eventually made durable on the engine's own initiative, the engine will call commit_checkpoint_notify_ha() to complete the binlog checkpoint.
Where should a notification be initiated if all changes have already been written at the time handlerton::commit_checkpoint_request is called?
So if I understand correctly, records from parallel transactions are written concurrently into the InnoDB log in an interleaved fashion. So one thread can just put log records into a buffer as they are generated and continue query execution, and later they may be written or synced asynchronously by another thread.
Yes. Typically, durable writes only matter when there is a tangible change of persistent state, such as changing the state of a user transaction, or advancing the InnoDB log checkpoint. If a transaction is rolled back, redo log will be written, but nothing in that transaction should wait for write_lock or flush_lock. The rollback could have a side effect of making the commits of other transactions durable.
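To make the write_lock/flush_lock idea from MDEV-21534 concrete, here is a rough single-threaded model (all names invented, not actual InnoDB code): every durability request carries a target LSN, so a caller whose logical time is already covered by an earlier flush never has to wait or flush at all.

```cpp
#include <atomic>
#include <cstdint>

// Sketch of logical-time-aware durability: written_lsn tracks how much
// log has been generated, flushed_lsn how much is durable.
class log_state {
public:
  // Append `bytes` of log; returns the end LSN of this write.
  uint64_t append(uint64_t bytes) {
    return written_lsn.fetch_add(bytes, std::memory_order_relaxed) + bytes;
  }

  // In InnoDB, one thread holding flush_lock would fdatasync() on behalf
  // of all waiters; here we just mark everything written as durable.
  void flush() {
    flushed_lsn.store(written_lsn.load(std::memory_order_relaxed),
                      std::memory_order_release);
  }

  // The key point: a request for durability up to `target_lsn` needs no
  // waiting (and no flushing) if a flush already covered that time.
  bool needs_flush(uint64_t target_lsn) const {
    return flushed_lsn.load(std::memory_order_acquire) < target_lsn;
  }

private:
  std::atomic<uint64_t> written_lsn{0};
  std::atomic<uint64_t> flushed_lsn{0};
};
```

A transaction commit would call needs_flush() with its own commit LSN; a rollback would append log records but never consult needs_flush() at all, matching the behaviour described above.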
Since the binlog writing is essentially sequential in nature, it is more efficient to do it in a single thread for all waiting threads, and this is how the MariaDB binlog group commit is done.
Yes, optimizing that would require a file format change, to replace the append-only-to-last-file with something else, such as one or multiple preallocated files, possibly arranged as a ring buffer.
Another problem is that the binlog is a logical log (SQL statements), and fragile to base recovery on that _must_ be 100% reliable. If we don't want to recover InnoDB transactions by replaying binlog replication event groups, then we anyway need an fdatasync() inside InnoDB in the XA prepare step before we can be sure the transaction is durable, don't we?
Yes, it is a balance between writing more upfront, or potentially executing more recovery.
Another idea would be to go in the opposite direction, where the storage engine could provide a logging service to the server (just like it provides tables). This way the binlog code could call into InnoDB to write the binlog records into the InnoDB log. But this requires changes on the InnoDB side (log archiving for one), and reading binlogs for replication will be slower as it's interleaved with unrelated log records.
That would work too, and in fact I had proposed that years ago. But as you point out, there are obvious drawbacks with that.

One related idea that has been floating around is to use the InnoDB log as a "doublewrite buffer" for short enough binlog event groups. The maximum mini-transaction size (dictated by the InnoDB recovery code) is something like 4 megabytes. On an InnoDB log checkpoint, any buffered binlog event groups would have to be durably written to the binlog. Likewise, if a binlog event group is too large to be buffered, it would have to be written and fdatasync()ed in advance.
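A toy model of that "doublewrite buffer" idea might look like this (names and the exact size bound are invented; the ~4 MiB constant stands in for the mini-transaction size limit of the InnoDB recovery code):

```cpp
#include <string>
#include <vector>

// Sketch: small binlog event groups piggy-back on the InnoDB redo log
// write; large ones are durably written to the binlog up front.
class binlog_buffer {
public:
  static constexpr size_t MAX_BUFFERED = 4u << 20;  // ~4 MiB

  // Returns true if the event group could ride along with the redo log;
  // false means it was written (and fdatasync()ed) to the binlog first.
  bool write_event_group(const std::string &eg) {
    if (eg.size() <= MAX_BUFFERED) {
      buffered.push_back(eg);   // recoverable from the redo log on crash
      return true;
    }
    binlog += eg;               // too large: durable binlog write first
    return false;
  }

  // On an InnoDB log checkpoint, buffered event groups must be durably
  // written to the binlog before the checkpoint can complete.
  void on_checkpoint() {
    for (const auto &eg : buffered)
      binlog += eg;
    buffered.clear();           // an fdatasync() of the binlog follows
  }

  std::string binlog;           // models the durable binlog contents
private:
  std::vector<std::string> buffered;
};
```

The trade-off this models: the common (small) case avoids one fdatasync() per commit, at the cost of having to drain the buffer on every log checkpoint.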
acquisition would have to be preceded by a binlog_write_lock acquisition that will ensure that any relevant changes of persistent state (COMMIT, XA PREPARE, XA ROLLBACK, XA COMMIT) that could be part of the InnoDB log write must have been durably written to the binlog.
Yes, it is an interesting idea. What do you see as the mechanism to recover a transaction inside InnoDB in case of a crash just after writing to the binlog?
It is simple but potentially costly: those transactions that were recovered in some state other than COMMIT or XA PREPARE will be rolled back. Then everything will be replayed from the binlog. InnoDB does store the latest committed binlog file name and offset; I do not know where exactly that information is being used.

Marko

--
Marko Mäkelä, Lead Developer InnoDB
MariaDB plc
Marko Mäkelä <marko.makela@mariadb.com> writes:
Do you know how MyRocks implements ACID? There is a function
Sorry, I do not. Maybe Sergey Petrunya does.
That function does not currently take any parameter (such as THD) to identify the transaction of interest, and it cannot indicate that the most recent state change of the transaction has already been durably written.
The transaction of interest is the last one that called commit_ordered() prior to this call of commit_checkpoint_request(). I guess the binlog code could save it (that is trivial, as commit_ordered() calls are serialised), or InnoDB could do it itself.

But let's take it from what we actually want to achieve. We have created a new binlog file, and now crash recovery will have to scan the two most recent binlog files. We want to know when scanning the old binlog file is no longer necessary; then we will log a CHECKPOINT_EVENT recording this fact. But we don't want to stall waiting for everything to be flushed to disk immediately; we can just log the CHECKPOINT_EVENT later, when appropriate.

We call commit_checkpoint_request() into all storage engines to receive a notification when all currently (binlog) committed transactions have become durable. Once the engine has synced past its corresponding LSN, it replies with commit_checkpoint_notify_ha(), and when all engines have done so, the CHECKPOINT_EVENT is binlogged.

In 10.4, innobase_checkpoint_request() saves the current LSN: lsn = log_get_lsn(). An extra check is made whether already lsn <= log_get_flush_lsn(); if so, commit_checkpoint_notify_ha() is called immediately. Otherwise, we check again whenever the log is flushed, until the recorded LSN has been flushed.

The code in 11.1 is different, though it seems to be doing something similar. There seem to be some lock-free operations; I wonder if that makes sense, as this is not timing-critical code (it is called once per binlog rotation). There is also some complex logic to avoid missing a commit_checkpoint_notify_ha(). I can see how that is important and could occur in a completely idle server. But I think it can be done more simply, by scheduling a new check every second or so as long as there are pending checkpoint requests. This is all asynchronous; there is no point in trying to report the checkpoint immediately.
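The 10.4 flow described above can be modelled in a few lines (a sketch with invented names; the real logic lives in innobase_checkpoint_request() and the log flush path):

```cpp
#include <cstdint>
#include <vector>

// Model of the asynchronous binlog checkpoint handshake: a request
// records the current LSN and is answered either immediately (if that
// LSN is already flushed) or on a later log flush.
struct checkpoint_tracker {
  uint64_t current_lsn = 0;        // models log_get_lsn()
  uint64_t flushed_lsn = 0;        // models log_get_flush_lsn()
  std::vector<uint64_t> pending;   // requests not yet answered
  int notifications = 0;           // commit_checkpoint_notify_ha() calls

  void request() {                 // commit_checkpoint_request()
    uint64_t lsn = current_lsn;    // conservative pick: current LSN
    if (lsn <= flushed_lsn)
      ++notifications;             // already durable: notify at once
    else
      pending.push_back(lsn);      // re-checked on every log flush
  }

  void on_flush(uint64_t new_flushed_lsn) {  // log flush completion
    flushed_lsn = new_flushed_lsn;
    auto it = pending.begin();
    while (it != pending.end()) {
      if (*it <= flushed_lsn) { ++notifications; it = pending.erase(it); }
      else ++it;
    }
  }
};
```

The model also shows the failure mode discussed later in the thread: if on_flush() is never called again (an idle server), a pending request is never answered, which is why a periodic re-check or "rescue" path matters.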
If InnoDB is running non-durably, I think there is no point in checking the log at all. It doesn't help that we scan the required log files if the transactions to be recovered from InnoDB were not prepared durably and are unrecoverable anyway. So in non-durable mode InnoDB could just call commit_checkpoint_notify_ha() immediately?

I notice that RESET MASTER also does a commit_checkpoint_request() and waits for the notification. I can understand why I did it this way, but in hindsight it doesn't seem like a good idea. Binlog checkpoints are asynchronous, so we should not wait on them. RESET MASTER surely cannot be time critical, so probably it would be better to just ask InnoDB to flush everything to its log, skipping the checkpoint mechanism. Or just do nothing; does it make any sense to recover the binlog into a consistent state with InnoDB when we just deleted all of the binlog?!?

Anyway, these are comments on the current code. I still didn't fully understand what the problem is with the current API wrt. InnoDB. But I don't see a big problem with changing it either, if InnoDB needs something different. All that is needed from the binlog side is to get a notification at some point that an old binlog file is no longer needed for recovery. If we can find a simpler way to do this, then that's good. Removing the checkpointing from RESET MASTER completely might be a good start.
Where should a notification be initiated if all changes have already been written at the time handlerton::commit_checkpoint_request is called?
Then commit_checkpoint_notify_ha() could be called immediately before returning from commit_checkpoint_request().
Since the binlog writing is essentially sequential in nature, it is more efficient to do it in a single thread for all waiting threads, and this is how the MariaDB binlog group commit is done.
Yes, optimizing that would require a file format change, to replace the append-only-to-last-file with something else, such as one or multiple preallocated files, possibly arranged as a ring buffer.
Agree, pre-allocated would be more efficient.
One related idea that has been floating around is to use the InnoDB log as a "doublewrite buffer" for short enough binlog event groups. The maximum mini-transaction size (dictated by the InnoDB recovery code) is something like 4 megabytes. On an InnoDB log checkpoint, any buffered binlog event groups would have to be durably written to the binlog. Likewise, if a binlog event group is too large to be buffered, it would have to be written and fdatasync()ed in advance.
Ah, yes, this is an interesting idea actually.
Yes, it is an interesting idea. What do you see as the mechanism to recover a transaction inside InnoDB in case of a crash just after writing to the binlog?
It is simple but potentially costly: Those transactions that were recovered in some other state than COMMIT or XA PREPARE will be rolled back. Then everything will be replayed from the binlog. InnoDB does
I have an old worklog where this idea was explored in some detail. It is not trivial; there are many corner cases where replaying binlog events will not reliably recover the state (eg. statement-based binlogging). People have learned to live with this for replication, but now that reliance would be extended to crash recovery. One "interesting" corner case is a multi-engine transaction that ended up durable in one engine but needs replay in another engine. Maybe all this is tolerable, and we can require row-mode binlogging or something; the gain would be huge for some workloads.
store the latest committed binlog file name and offset. I do not know where exactly that information is being used.
(I think the information with binlog file name and offset is used to provision a new slave from an innobackup or LVM/BTRFS snapshot?). - Kristian.
On Fri, May 12, 2023 at 8:32 PM Kristian Nielsen <knielsen@knielsen-hq.org> wrote:
The transaction of interest is the last one that called commit_ordered() prior to this call of commit_checkpoint_request(). I guess binlog code could save it (it's trivial as commit_ordered() calls are serialised), or InnoDB could do it itself.
I think that for InnoDB to guess the last binlog-committed transaction would require adding a thread-local variable, which could cause trouble with the connection thread pool. An added function parameter would be much simpler.
But let's take it from what we actually want to achive. We have created a new binlog file, and now crash recovery will have to scan the two most recent binlog files. We want to know when scanning the old binlog file is no longer necessary; then we will log a CHECKPOINT_EVENT recording this fact. But we don't want to stall waiting for everything to be flushed to disk immediately; we can just log the CHECKPOINT_EVENT later when appropriate.
Is this the original purpose and motivation of the handlerton::commit_checkpoint_request?
There seems to be some lock-free operations, I wonder if that makes sense, this is not timing-critical code (it's called once per binlog rotation).
It was a long time ago that I worked on this. The InnoDB log_sys.mutex certainly was a bottleneck before some InnoDB log related data structures were refactored to use C++11 std::atomic. I might have originally misunderstood the source code comment in sql/handler.h that was added in MDEV-232.

The 10.4 innobase_mysql_log_notify() follows the anti-pattern https://github.com/google/sanitizers/wiki/ThreadSanitizerPopularDataRaces#do... The comment that claims it is safe might not hold for architectures that use a weak memory model, such as typical implementations of ARM, POWER, and RISC-V (RVWMO). I believe that the https://en.cppreference.com/w/cpp/atomic/memory_order#Release-Acquire_orderi... that is employed by the 10.5 innodb_log_flush_request() and log_flush_notify() should be safer.
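For reference, the release-acquire publication pattern used by the 10.5 code pairs a release store of a flag with an acquire load, which is what guarantees on weakly ordered hardware (ARM, POWER, RISC-V) that data written before the publish is visible once the flag is observed. A minimal sketch (invented names, not the actual InnoDB code):

```cpp
#include <atomic>
#include <cstdint>

// Release-acquire publication: the payload is written before the release
// store of `published`; any thread that observes `published == true` via
// the acquire load is guaranteed to also see the payload.
struct publisher {
  uint64_t payload = 0;                 // plain data, no atomics needed
  std::atomic<bool> published{false};

  void publish(uint64_t v) {
    payload = v;                                      // 1: write the data
    published.store(true, std::memory_order_release); // 2: release store
  }

  // Returns true (and copies the payload) only if publication is visible.
  bool try_consume(uint64_t &out) const {
    if (!published.load(std::memory_order_acquire))   // pairs with the
      return false;                                   // release above
    out = payload;   // safe: happens-after the write to payload
    return true;
  }
};
```

The 10.4 anti-pattern linked above is essentially this same shape but with plain (non-atomic, unordered) reads on the checking side, which is exactly what a weak memory model is allowed to reorder.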
There's also some complex logic to avoid missing a commit_checkpoint_notify_ha(). I can see how that is important and could occur in a completely idle server. But I think it can be done simpler, by simply scheduling a new check of this every second or so as long as there are pending checkpoint requests. This is all asynchronous, no point in trying to report the checkpoint immediately.
According to a comment in https://jira.mariadb.org/browse/MDEV-25611, there is an https://rr-project.org/ trace of a hang available to Andrei Elkin. It is also pretty easy to reproduce the hang by using some debug instrumentation that makes InnoDB try harder to avoid issuing log checkpoints (a checkpoint would trigger another call to log_flush_notify() and thereby "rescue" the hang). It would be very interesting to know what happened concurrently with or after the last call to innodb_log_flush_request().
If InnoDB is running non-durably, I think there is no point in checking the log at all. It doesn't help that we scan the required log files if the transactions to be recoved from InnoDB were not prepared durably and are unrecoverable anyway. So in non-durable mode InnoDB could just call commit_checkpoint_notify_ha() immediately?
I do not think that we should spend too much effort to optimize an unsafe, corruption-prone mode of operation. We should be safe by default (https://jira.mariadb.org/browse/MDEV-16589) and optimize that case.
I notice that RESET MASTER also does a commit_checkpoint_request() and waits for the notification. I can understand why I did it this way, but in hindsight it doesn't seem like a good idea. Binlog checkpoint is async, so we should not wait on it. RESET MASTER surely cannot be time critical, so probably it would be better to just ask InnoDB to flush everything to its log, skipping the checkpoint mechanism. Or just do nothing, does it make any sense to recover the binlog into a consistent state with InnoDB when we just deleted all of the binlog?!?
If RESET MASTER can rewind the binlog position, then it might make sense to ensure that the new binlog position is durably written to the persistent InnoDB transaction metadata storage.
Anyway, these are comments on the current code. I still didn't fully understand what the problem is with the current API wrt. InnoDB?
Based on your description, it does not feel like a big design problem. I would very much like to see an analysis and fix of the MDEV-25611 hang. It could be something fairly trivial, or a sign of a more serious problem in the design.
I have an old worklog where this idea was explored in some detail. It is not trivial, there are many corner cases where replaying binlog events will not reliably recover the state (eg. statement-based binlogging). People have learned to live with this for replication, but now that will be extended to the binlog. One "interesting" corner case is a multi-engine transaction that ended up durable in one engine but needs replay in another engine. Maybe all this is tolerable and we can require row-mode binlogging or something, the gain will be huge for some workloads.
Right, I did not think of multi-engine transactions. Perhaps there could be a special single-engine transaction mode? If the binlog is guaranteed to be durably written ahead of the engine log, then we should be able to skip the MySQLXid mechanism as well. However, I do not know how much that might end up saving.
store the latest committed binlog file name and offset. I do not know where exactly that information is being used.
(I think the information with binlog file name and offset is used to provision a new slave from an innobackup or LVM/BTRFS snapshot?).
Yes, something like that. See https://jira.mariadb.org/browse/MDEV-22351 and https://jira.mariadb.org/browse/MDEV-21611.

One more thing: the post https://knielsen-hq.org/w/fixing-mysql-group-commit-part-3/ is missing the fact that when the "fast part" of an InnoDB transaction state change has finished, the LSN of the transaction COMMIT (or XA PREPARE or XA ROLLBACK or XA COMMIT) has been determined. It is not possible to reorder transactions after that point. In fact, after the completion of the "fast part", other transactions in the system will observe the state change of the transaction, even before it becomes durable.

Another thing: could replication make use of the log write completion mechanism https://jira.mariadb.org/browse/MDEV-24341 that was added to MariaDB Server 10.6.0?

--
Marko Mäkelä, Lead Developer InnoDB
MariaDB plc
Marko Mäkelä <marko.makela@mariadb.com> writes:
After a binlog rotation, crash recovery may still need to scan recent binlog files. We want to know when scanning the old binlog file is no longer necessary; then we will log a CHECKPOINT_EVENT recording this fact. But we don't want to stall waiting for everything to be flushed to disk immediately; we can just log the CHECKPOINT_EVENT later when appropriate.
Is this the original purpose and motivation of the handlerton::commit_checkpoint_request?
Yes.
I think that for InnoDB to guess the last binlog-committed transaction would require adding a thread-local variable, which could cause trouble with the connection thread pool. An added function parameter would be much simpler.
I agree that a function parameter seems simpler. We're requesting to be notified when a specific commit is durable in the engine, so better to specify _which_ commit than for InnoDB to guess, as you say :-). And an engine is free to ignore the parameter and just make sure all transactions existing at that point are committed (e.g. up to the current LSN), as my original InnoDB implementation did. We need to be careful about lifetime. The transaction may no longer exist as a THD or trx_t (?) in memory. But innobase_commit_ordered() could return the corresponding LSN, as was suggested in another mail, and then commit_checkpoint_request() could pass that value back. Something like that seems indeed a better API.
There seem to be some lock-free operations; I wonder if that makes sense, since this is not timing-critical code (it is called once per binlog rotation).
It was a long time ago when I worked on this. The InnoDB log_sys.mutex certainly was a bottleneck before some InnoDB log related data structures were refactored to use C++11 std::atomic.
I was thinking that the log_flush_request list does not need anything fancy and could just use a normal mutex; it is so rarely accessed. Any more critical code (e.g. code holding log_sys.mutex) could just do an unsafe check for a NULL log_flush_request list, at worst skipping a binlog checkpoint notification until the next flush. But all that becomes more complicated because RESET MASTER will hang until the checkpoint notification arrives; then a lost checkpoint notification becomes a problem on an idle server. That's why I'm unhappy about the way RESET MASTER is done now (even though I implemented it, IIRC). It really was my intention that skipping a binlog checkpoint from time to time should be allowed, to simplify things in the engine.
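To make the idea concrete, here is a minimal sketch (illustrative names only, not the actual InnoDB code) of a request list protected by a plain mutex, where the hot log-flush path only peeks at an atomic "non-empty" hint and may harmlessly miss a freshly added request; the notification then just happens on the next flush:

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <mutex>
#include <vector>

struct flush_request_list {
    std::mutex m;                       // plain mutex; list is rarely touched
    std::vector<uint64_t> requests;     // pending checkpoint-request LSNs
    std::atomic<bool> nonempty{false};  // cheap hint for the hot path

    // Rare path: called once per binlog rotation.
    void add(uint64_t lsn) {
        std::lock_guard<std::mutex> g(m);
        requests.push_back(lsn);
        nonempty.store(true, std::memory_order_release);
    }

    // Hot path: called after every log flush; returns how many
    // requests were satisfied (and would trigger notifications).
    size_t notify_up_to(uint64_t flushed_lsn) {
        if (!nonempty.load(std::memory_order_acquire))
            return 0;  // unsafe fast check; a benign race only delays us
        std::lock_guard<std::mutex> g(m);
        size_t n = 0;
        for (auto it = requests.begin(); it != requests.end();)
            if (*it <= flushed_lsn) { it = requests.erase(it); ++n; }
            else ++it;
        nonempty.store(!requests.empty(), std::memory_order_release);
        return n;
    }
};
```

As noted above, this simplification is only safe if a delayed notification is tolerable, which the current RESET MASTER behaviour does not allow.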
The 10.4 innobase_mysql_log_notify() follows the anti-pattern https://github.com/google/sanitizers/wiki/ThreadSanitizerPopularDataRaces#do...
The comment that claims that it is safe might not hold for architectures that use a weak memory model, such as typical implementations of ARM, POWER, and RISC-V (RVWMO). I believe that the https://en.cppreference.com/w/cpp/atomic/memory_order#Release-Acquire_orderi... that is employed by the 10.5 innodb_log_flush_request() and log_flush_notify() should be safer.
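A minimal illustration of the release-acquire publication pattern referred to above (generic example, not the actual server code): the writer publishes a payload with a release store, and a reader that observes the flag with an acquire load is guaranteed to also see the payload, even on weakly ordered CPUs such as ARM or POWER, which a plain non-atomic flag check does not guarantee:

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

uint64_t flushed_lsn = 0;                // payload, written before publishing
std::atomic<bool> flush_pending{false};  // publication flag

void publisher(uint64_t lsn) {
    flushed_lsn = lsn;                                     // 1: write payload
    flush_pending.store(true, std::memory_order_release);  // 2: publish
}

bool consumer(uint64_t *out) {
    if (!flush_pending.load(std::memory_order_acquire))    // 3: observe flag
        return false;
    // 4: the acquire load synchronizes with the release store, so the
    // write from step 1 is guaranteed to be visible here.
    *out = flushed_lsn;
    return true;
}
```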
Right. As long as RESET MASTER doesn't tolerate delayed notifications, all of this becomes a concern :-(
If the last write to the log was an InnoDB internal one, it could be done by log_write_up_to(lsn, false). The log_sys.flushed_to_disk_lsn would only be advanced (to somewhere closer to log_sys.lsn) if there was a subsequent call to log_write_up_to(lsn, true).
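The two-level contract described above can be modelled like this (an illustrative sketch, not the actual log_write_up_to() implementation): with durable=false the log is only written out to the file, advancing the write position, while log_sys.flushed_to_disk_lsn advances only on a durable=true call, which implies an fsync:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

struct redo_log {
    uint64_t lsn = 0;                  // end of the log buffer
    uint64_t write_lsn = 0;            // written to file, not yet fsync'ed
    uint64_t flushed_to_disk_lsn = 0;  // durable prefix of the log

    void write_up_to(uint64_t target, bool durable) {
        // pwrite() up to the target (capped at the end of the buffer)
        write_lsn = std::max(write_lsn, std::min(target, lsn));
        if (durable)  // fsync() happens only on a durable request
            flushed_to_disk_lsn = std::max(flushed_to_disk_lsn, write_lsn);
    }
};
```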
What I found in the test case I added to MDEV-25611 is that InnoDB flushes the redo log every second (assuming I read the code right). This is the function srv_sync_log_buffer_in_background(). Hm, but actually this can be controlled by the option srv_flush_log_at_timeout, so my testcase could be improved to set this to a high value instead of disabling srv_sync_log_buffer_in_background() in the code.
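The test-case improvement suggested above could look like the following fragment (assuming the standard innodb_flush_log_at_timeout variable, whose maximum is 2700 seconds), instead of disabling srv_sync_log_buffer_in_background() in the code:

```sql
-- Stretch the background redo log flush interval to its maximum so the
-- once-per-second flush cannot interfere with the test.
SET GLOBAL innodb_flush_log_at_timeout = 2700;
```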
Right, I did not think of multi-engine transactions. Perhaps there could be a special single-engine transaction mode? If the binlog is
Yes, a single-engine transaction is surely the important use case to optimise. It's nice if multi-engine transactions still work, but if they still require multiple fsyncs, I think that's perfectly fine, not something to allocate a lot of resources to optimise for.
One more thing: The post https://knielsen-hq.org/w/fixing-mysql-group-commit-part-3/ is missing the fact that when the "fast part" of an InnoDB transaction state change has finished, the LSN of the transaction COMMIT (or XA PREPARE or XA ROLLBACK or XA COMMIT) will be determined. It is not possible to reorder transactions after that point. In fact, after the completion of the "fast part", other transactions in the system will observe the state change of the transaction, even before it becomes durable.
Thanks. Yes, this sounds like what I would expect.
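A minimal model of the point quoted above (illustrative only, not actual InnoDB code): the "fast part" of commit appends the commit record to the log buffer under the log mutex, which both fixes the commit LSN (so the ordering of transactions can never change afterwards) and makes the new transaction state observable by other threads, before anything has been fsync'ed:

```cpp
#include <cassert>
#include <cstdint>
#include <mutex>

enum class trx_state { ACTIVE, COMMITTED };

struct trx_model {
    trx_state state = trx_state::ACTIVE;
    uint64_t commit_lsn = 0;
};

struct log_buffer {
    std::mutex m;      // stands in for log_sys.mutex
    uint64_t lsn = 0;  // end of log buffer; nothing here is durable yet

    // "Fast part": once this returns, other transactions observe the
    // commit, and its LSN (hence its ordering) is final.
    void commit_fast_part(trx_model &trx, uint64_t record_len) {
        std::lock_guard<std::mutex> g(m);
        lsn += record_len;
        trx.commit_lsn = lsn;
        trx.state = trx_state::COMMITTED;
    }
};
```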
Another thing: Could replication make use of the log write completion mechanism https://jira.mariadb.org/browse/MDEV-24341 that was added to MariaDB Server 10.6.0?
Ah, this is interesting, thanks for the pointer. I will need to look into it more to fully understand it, but it does seem like something that might be of use. Not just for user threads writing to the binlog, but maybe even more so for parallel replication worker threads. Because of the wait_for_prior_commit() requirements of in-order parallel replication, a _lot_ of THDs can be waiting for binlog commit, and being able to suspend these THDs and let a worker thread do something else in between could potentially be very beneficial, I think. I also had the idea to use fibers/coroutines as is mentioned in the MDEV description, but if that can be avoided, so much the better.

 - Kristian.
On Sat, May 20, 2023 at 11:07 PM Kristian Nielsen <knielsen@knielsen-hq.org> wrote:
I agree that a function parameter seems simpler. We're requesting to be notified when a specific commit is durable in the engine, better specify _which_ commit than for InnoDB to guess, as you say :-).
Given that this code is only executed when switching binlog files (every 1 gigabyte or so of written binlog), or during RESET MASTER, the slight performance degradation due to the "unnecessary but sufficient" condition on InnoDB should not matter much.
We need to be careful about lifetime. The transaction may no longer exist as a THD or trx_t (?) in memory. But innobase_commit_ordered() could return the corresponding LSN, as was suggested in another mail, and then commit_checkpoint_request() could pass that value.
Right. The trx_t objects are allocated from a special memory pool that facilitates fast reuse. The object of an old transaction could have been reused for something else. So, it would be better to let the storage engine somehow return its logical time. 64 bits should be sufficient for all storage engines, and the special value 0 could imply that the current logic (ask the storage engine to write everything) is used. [snip]
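A hypothetical sketch of the API shape discussed above (the names are illustrative, not the real handlerton interface): commit_ordered() hands the server an opaque 64-bit logical time instead of a pointer into engine memory (a trx_t may already have been reused), and the value 0 in a checkpoint request falls back to "make everything durable":

```cpp
#include <cassert>
#include <cstdint>

struct engine_clock {
    uint64_t lsn = 1;  // engine's logical time; 0 is reserved as a sentinel

    // Returns the logical time of this commit; the server stores only
    // this value, never a pointer to the transaction object.
    uint64_t commit_ordered(uint64_t commit_size) {
        lsn += commit_size;
        return lsn;
    }

    // 0 means "no specific commit known, flush up to the current time".
    uint64_t checkpoint_target(uint64_t requested) const {
        return requested ? requested : lsn;
    }
};
```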
Yes, a single-engine transaction is surely the important use case to optimise. It's nice if multi-engine transactions still work, but if they still require multiple fsyncs, I think that's perfectly fine, not something to allocate a lot of resources to optimise for.
It's nice that we agree here.
I also had the idea to use fibers/coroutines as is mentioned in the MDEV description, but if that can be avoided, so much the better.
I too like https://quoteinvestigator.com/2011/05/13/einstein-simple/ or the KISS principle.

Example: If the buf_pool.mutex or fil_system.mutex is a bottleneck, fix the bottlenecks (MDEV-15053, MDEV-23855) instead of introducing complex things such as multiple buffer pool instances and multiple page cleaner threads (removed in MDEV-15058), or introducing a Fil_shard (MySQL 8.0). Or if the log_sys.mutex is a bottleneck, do not introduce a "jungle of threads" that will write to multiple log files, but just reduce the bottlenecks with increased use of std::atomic or with a file format change that allows more clever locking (MDEV-27774).

Thread context switches can be expensive when system calls such as mutex waits are involved, and when they are not, race conditions in lock-free algorithms are hard to diagnose (often invisible to tools like https://rr-project.org). Even when no system calls are involved in inter-thread communication, "cache line ping-pong" can quickly become expensive, especially on NUMA systems.

Marko

--
Marko Mäkelä, Lead Developer InnoDB
MariaDB plc
participants (3)
- Kristian Nielsen
- Marko Mäkelä
- Vladislav Vaintroub