Hi Kristian, I am happy to see you being active on the maria-developers list. I don’t think we have collaborated in the past. I have been working on the InnoDB storage engine under MySQL 4.0 through 5.7 (and some 8.0) since 2003, and since 2016 under MariaDB Server, mostly 10.1 and later. I understand that you were an early contributor to MariaDB and implemented some durability I/O optimization between the "distributed transaction" of binlog and the InnoDB write-ahead log (redo log, ib_logfile0). I think that some improvement in this area is long overdue. I find it hard to read the high-level descriptions. Nowhere in https://jira.mariadb.org/browse/MDEV-232 or https://jira.mariadb.org/browse/MDEV-532 I see any mention of any key terminology: * InnoDB log sequence number (LSN): a logical point of time for any persistent InnoDB changes. Measured in bytes. Mini-transactions (atomic changes of InnoDB pages) start and end at some LSN. Corresponds to the Oracle system change number (SCN). * Binlog offset: a logical point of time for any changes at the replication level. Measured in bytes. Replication events such as statements or row-level events start and end at some binlog offset. * The LSN or binlog offset of a COMMIT, XA PREPARE, XA ROLLBACK, or XA COMMIT. There is quite a bit of code in InnoDB that I would like to get rid of. Some of it is related to the data members trx_t::flush_log_later and trx_t::must_flush_log_later. There is also the confusingly named handlerton::commit_checkpoint_request that has nothing to do with log checkpoints, but with some kind of a durability request that fails to identify the minimum logical time up to which everything is desired to be durable. (My talk https://fosdem.org/2022/schedule/event/mariadb_innodb/ explains what a log checkpoint is and how recovery works.) Some years ago, in MDEV-21534 Vladislav Vaintroub introduced write_lock and flush_lock in InnoDB. This works pretty well: a Sysbench workload with innodb_flush_log_at_trx_commit=1 will linearly improve throughput as you add more client connection threads. I think that something similar would be beneficial for the binlog, to keep track of the latest written offset and the latest write offset after which fdatasync() has completed. I am not familiar with the binlog code. A colleague recently claimed to me that LOCK_commit_ordered is a huge bottleneck. As you can see, the write_lock and flush_lock are better because the lock requests include a notion of logical time: If everything up to "my" time has been written or durably written, I do not care to wait. In MDEV-27774 (MariaDB Server 10.8) this was improved further: multiple mini-transactions can copy their local log buffer to the shared log_sys.buf concurrently. It seems to me that the missing tracking of logical time in the data structures is causing hangs like MDEV-25611. I feel that the design is based on a wrong assumption that the binlog-covered transactions are the only source of InnoDB log writes. That has never been the case: For example, there have always been background threads or tasks that purge the history of committed transactions. The task MDEV-18959 (ensuring that one fdatasync() per durable transaction suffices) is also related to this. I think that the simplest solution (assuming that moving the binlog to a bunch of transactional tables is out of the question) is to ensure that one of the logs (binlog and other write-ahead logs) is always ahead. If we choose that the binlog must be ahead, then in InnoDB the write_lock acquisition would have to be preceded by a binlog_write_lock acquisition that will ensure that any relevant changes of persistent state (COMMIT, XA PREPARE, XA ROLLBACK, XA COMMIT) that could be part of the InnoDB log write must have been durably written to the binlog. With best regards, Marko -- Marko Mäkelä, Lead Developer InnoDB MariaDB plc