Hi Kristian,

Thank you for your detailed reply. Before I read the linked blog posts, here are a few quick comments.
1. The APIs (handlerton methods and server calls) are designed to be storage-engine independent. So this is one reason why we would not see an InnoDB LSN anywhere in the server part.
Every transactional storage engine under MariaDB should have some concept of logical time. It is not necessarily totally ordered like the InnoDB or Aria LSN. The Aria LSN consists of a 32-bit log file number and a 32-bit offset within a log file. The InnoDB LSN is a 64-bit offset, but the log file is written as a ring buffer, so the LSN will be mapped to a file offset using some modulo arithmetic. Do you know how MyRocks implements ACID? There is a function rocksdb_flush_wal(), but it is commented out in MariaDB. In LevelDB, I see some SST and log files, but they may be specific to a single LSM tree, and I would suppose that there are multiple LSM trees in MyRocks, instead of the data from all tables being stored in a single LSM tree.
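To illustrate that ring-buffer mapping, a rough sketch (the function name and the handling of the file header are made up; the actual InnoDB code is more involved):

    // Hypothetical sketch of mapping a monotonically increasing 64-bit
    // LSN to an offset in a fixed-size circular log file; the real
    // InnoDB code also accounts for the file header and block framing.
    #include <stdint.h>

    uint64_t lsn_to_file_offset(uint64_t lsn, uint64_t capacity)
    {
      // capacity: usable payload bytes in the ring buffer file
      return lsn % capacity;
    }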
commit_checkpoint_request() is called when the binlog wants to mark a point before which binlog files no longer need to be scanned (a "binlog checkpoint"). We can see in innobase_checkpoint_request() that this is mapped to an InnoDB LSN (using log_get_lsn()). This follows point (1) above, where the server-level API (binlog) is kept separate from storage engine internals (InnoDB LSN).
That function does not currently take any parameter (such as THD) to identify the transaction of interest, and it cannot indicate that the most recent state change of the transaction has already been durably written.
I wonder if this is a real mismatch between the API and InnoDB, or a misunderstanding about what is intended from commit_checkpoint_request(). The binlog checkpoint mechanism is completely asynchronous, and it is not a request for the storage engine to take any action to make things durable. It is merely a request for InnoDB to conservatively pick an LSN that is known to contain any transaction that has completed commit_ordered(); it may be any later LSN that is cheaper to obtain. Then later, when that LSN is eventually made durable on the engine's own initiative, the engine will call commit_checkpoint_notify_ha() to complete the binlog checkpoint.
Where should a notification be initiated if all changes have already been written at the time handlerton::commit_checkpoint_request is called?
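(Presumably the answer is to call commit_checkpoint_notify_ha() immediately in that case.) A rough sketch of both paths, with toy stand-ins for the engine state; the exact signature of commit_checkpoint_notify_ha() has varied between MariaDB versions, so treat it as an assumption:

    // Hedged sketch, not the actual innobase_checkpoint_request().
    #include <stdint.h>
    #include <utility>
    #include <vector>

    static uint64_t log_end_lsn;     // current end of the redo log
    static uint64_t log_durable_lsn; // durably written prefix of the log
    static std::vector<std::pair<uint64_t, void*>> pending_checkpoints;

    void commit_checkpoint_notify_ha(void *cookie); // server callback

    void engine_checkpoint_request(void *cookie)
    {
      // Conservative choice: the current log end covers every
      // transaction that has completed commit_ordered() by now.
      const uint64_t target_lsn = log_end_lsn;

      if (log_durable_lsn >= target_lsn) {
        // Everything relevant is already durable: notify immediately.
        commit_checkpoint_notify_ha(cookie);
      } else {
        // Defer: the log writer calls commit_checkpoint_notify_ha()
        // after it has durably written the log up to target_lsn.
        pending_checkpoints.emplace_back(target_lsn, cookie);
      }
    }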
So if I understand correctly, records from parallel transactions are written concurrently into the InnoDB log in an interleaved fashion. One thread can just put log records into a buffer as they are generated and continue query execution, and later they may be written or synced asynchronously by another thread.
Yes. Typically, durable writes only matter when there is a tangible change of persistent state, such as changing the state of a user transaction, or advancing the InnoDB log checkpoint. If a transaction is rolled back, redo log will be written, but nothing in that transaction should wait for write_lock or flush_lock. The rollback could have a side effect of making the commits of other transactions durable.
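As a toy model of that pattern (nothing below is actual InnoDB code or naming):

    // Transaction threads append records to a shared buffer and continue;
    // a dedicated writer thread persists the buffer later.
    #include <stdint.h>
    #include <cstddef>
    #include <mutex>
    #include <vector>

    static std::mutex log_mutex;
    static std::vector<uint8_t> log_buffer; // interleaved records
    static uint64_t end_lsn = 0;            // logical end of the log

    // Called by any transaction thread: buffer the record and go on
    // executing the query; no disk I/O happens here.
    uint64_t append_log_record(const uint8_t *rec, size_t len)
    {
      std::lock_guard<std::mutex> guard(log_mutex);
      log_buffer.insert(log_buffer.end(), rec, rec + len);
      end_lsn += len;
      return end_lsn;
    }

    // One iteration of the log writer thread: grab whatever has
    // accumulated, then persist it outside the mutex.
    void log_writer_iteration()
    {
      std::vector<uint8_t> chunk;
      {
        std::lock_guard<std::mutex> guard(log_mutex);
        chunk.swap(log_buffer);
      }
      // write 'chunk' to the log file and fdatasync(); then wake any
      // transactions waiting for their commit LSN to become durable
    }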
Since the binlog writing is essentially sequential in nature, it is more efficient to do it in a single thread for all waiting threads, and this is how the MariaDB binlog group commit is done.
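A sketch of that leader/follower pattern (illustrative only; the actual MariaDB group commit code is more involved):

    // Toy group commit: the first thread to find no active leader
    // writes and syncs the binlog for every queued transaction; the
    // rest wait until the leader marks them done.
    #include <condition_variable>
    #include <mutex>
    #include <vector>

    struct PendingCommit { bool done = false; /* + the event group */ };

    static std::mutex mtx;
    static std::condition_variable cv;
    static std::vector<PendingCommit*> queue;
    static bool leader_active = false;

    void group_commit(PendingCommit *my)
    {
      std::unique_lock<std::mutex> lk(mtx);
      queue.push_back(my);
      while (!my->done) {
        if (!leader_active) {
          // Become the leader and serve everyone queued so far.
          leader_active = true;
          std::vector<PendingCommit*> batch;
          batch.swap(queue);
          lk.unlock();
          // ... sequentially write all event groups in 'batch' to the
          // binlog, then one fdatasync() covering the whole batch ...
          lk.lock();
          for (PendingCommit *p : batch) p->done = true;
          leader_active = false;
          cv.notify_all();
        } else {
          cv.wait(lk); // a leader will write and sync on our behalf
        }
      }
    }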
Yes, optimizing that would require a file format change, to replace the append-only-to-last-file with something else, such as one or multiple preallocated files, possibly arranged as a ring buffer.
Another problem is that the binlog is a logical log (SQL statements), and it seems fragile to base recovery that _must_ be 100% reliable on it. If we don't want to recover InnoDB transactions by replaying binlog replication event groups, then we still need an fdatasync() inside InnoDB in the XA prepare step before we can be sure the transaction is durable, don't we?
Yes, it is a balance between writing more upfront and potentially executing more recovery.
Another idea would be to go in the opposite direction, where the storage engine could provide a logging service to the server (just like it provides tables). This way the binlog code could call into InnoDB to write the binlog records into the InnoDB log. But this requires changes on the InnoDB side (log archiving for one), and reading binlogs for replication will be slower as it's interleaved with unrelated log records.
That would work too, and in fact, I had proposed that years ago. But like you point out, there are obvious drawbacks with that. One related idea that has been floating around is to use the InnoDB log as a "doublewrite buffer" for short enough binlog event groups. The maximum mini-transaction size (dictated by the InnoDB recovery code) is something like 4 megabytes. On an InnoDB log checkpoint, any buffered binlog event groups would have to be durably written to the binlog. Likewise, if a binlog event group is too large to be buffered, it would have to be written and fdatasync()ed in advance.
Any write_lock acquisition would have to be preceded by a binlog_write_lock acquisition, ensuring that any relevant changes of persistent state (COMMIT, XA PREPARE, XA ROLLBACK, XA COMMIT) that could be part of the InnoDB log write have been durably written to the binlog.
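To make the idea concrete, a hedged sketch of the buffering decision; the 4 MiB constant and every function name here are hypothetical, not existing MariaDB interfaces:

    #include <stdint.h>
    #include <cstddef>

    constexpr size_t MAX_MTR_SIZE = 4 << 20; // rough recovery-imposed cap

    void mtr_append_binlog_copy(const uint8_t*, size_t); // hypothetical
    void binlog_append(const uint8_t*, size_t);          // hypothetical
    void binlog_fdatasync();                             // hypothetical

    void write_binlog_event_group(const uint8_t *buf, size_t len)
    {
      if (len <= MAX_MTR_SIZE) {
        // Fits in one mini-transaction: buffer it in the InnoDB log.
        // On the next InnoDB log checkpoint, all such buffered groups
        // must be durably written to the real binlog first.
        mtr_append_binlog_copy(buf, len);
      } else {
        // Too large to buffer: write and sync the binlog in advance.
        binlog_append(buf, len);
        binlog_fdatasync();
      }
    }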
Yes, it is an interesting idea. What do you see as the mechanism to recover a transaction inside InnoDB in case of a crash just after writing to the binlog?
It is simple but potentially costly: Those transactions that were recovered in some other state than COMMIT or XA PREPARE will be rolled back. Then everything will be replayed from the binlog. InnoDB does store the latest committed binlog file name and offset. I do not know where exactly that information is being used.

Marko

--
Marko Mäkelä, Lead Developer InnoDB
MariaDB plc