Hi Kristian,

Thank you for your detailed reply. Before I read the linked blog posts, here are a few quick comments.
1. The APIs (handlerton methods and server calls) are designed to be storage-engine independent. So this is one reason why we would not see an InnoDB LSN anywhere in the server part.
Every transactional storage engine under MariaDB should have some concept of logical time. It is not necessarily totally ordered like the InnoDB or Aria LSN. The Aria LSN consists of a 32-bit log file number and a 32-bit offset within a log file. The InnoDB LSN is a 64-bit offset, but the log file is written as a ring buffer, so the LSN will be mapped to a file offset using some modulo arithmetic. Do you know how MyRocks implements ACID? There is a function rocksdb_flush_wal(), but it is commented out in MariaDB. In LevelDB, I see some SST and log files, but they may be specific to a single LSM tree, and I would suppose that there are multiple LSM trees in MyRocks, instead of the data from all tables being stored in a single LSM tree.
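To illustrate that ring-buffer mapping, a rough sketch (the function name and the handling of the file header are made up; the actual InnoDB code is more involved):

    // Hypothetical sketch of mapping a monotonically increasing 64-bit
    // LSN to an offset in a fixed-size circular log file; the real
    // InnoDB code also accounts for the file header and block framing.
    #include <stdint.h>

    uint64_t lsn_to_file_offset(uint64_t lsn, uint64_t capacity)
    {
      // capacity: usable payload bytes in the ring buffer file
      return lsn % capacity;
    }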
commit_checkpoint_request() is called when the binlog wants to mark a point before which binlog files no longer need to be scanned (a "binlog checkpoint"). We can see in innobase_checkpoint_request() that this is mapped to an InnoDB LSN (using log_get_lsn()). This follows point (1) above, where the server-level API (binlog) is kept separate from storage engine internals (InnoDB LSN).
That function does not currently take any parameter (such as THD) to identify the transaction of interest, and it cannot indicate that the most recent state change of the transaction has already been durably written.
I wonder if this is a real mismatch between the API and InnoDB, or a misunderstanding about what is intended from commit_checkpoint_request(). The binlog checkpoint mechanism is completely asynchronous, and it is not a request for the storage engine to take any action to make things durable. It is merely a request for InnoDB to conservatively pick an LSN that is known to contain any transaction that has completed commit_ordered(); it may be any later LSN that is cheaper to obtain. Then later, when that LSN is eventually made durable on the engine's own initiative, the engine will call commit_checkpoint_notify_ha() to complete the binlog checkpoint.
Where should a notification be initiated if all changes have already been written at the time handlerton::commit_checkpoint_request is called?
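(Presumably the answer is to call commit_checkpoint_notify_ha() immediately in that case.) A rough sketch of both paths, with toy stand-ins for the engine state; the exact signature of commit_checkpoint_notify_ha() has varied between MariaDB versions, so treat it as an assumption:

    // Hedged sketch, not the actual innobase_checkpoint_request().
    #include <stdint.h>
    #include <utility>
    #include <vector>

    static uint64_t log_end_lsn;     // current end of the redo log
    static uint64_t log_durable_lsn; // durably written prefix of the log
    static std::vector<std::pair<uint64_t, void*>> pending_checkpoints;

    void commit_checkpoint_notify_ha(void *cookie); // server callback

    void engine_checkpoint_request(void *cookie)
    {
      // Conservative choice: the current log end covers every
      // transaction that has completed commit_ordered() by now.
      const uint64_t target_lsn = log_end_lsn;

      if (log_durable_lsn >= target_lsn) {
        // Everything relevant is already durable: notify immediately.
        commit_checkpoint_notify_ha(cookie);
      } else {
        // Defer: the log writer calls commit_checkpoint_notify_ha()
        // after it has durably written the log up to target_lsn.
        pending_checkpoints.emplace_back(target_lsn, cookie);
      }
    }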
So if I understand correctly, records from parallel transactions are written concurrently into the InnoDB log in an interleaved fashion. One thread can just put log records into a buffer as they are generated and continue query execution, and later they may be written or synced asynchronously by another thread.
Yes. Typically, durable writes only matter when there is a tangible change of persistent state, such as changing the state of a user transaction, or advancing the InnoDB log checkpoint. If a transaction is rolled back, redo log will be written, but nothing in that transaction should wait for write_lock or flush_lock. The rollback could have a side effect of making the commits of other transactions durable.
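As a toy model of that pattern (nothing below is actual InnoDB code or naming):

    // Transaction threads append records to a shared buffer and continue;
    // a dedicated writer thread persists the buffer later.
    #include <stdint.h>
    #include <cstddef>
    #include <mutex>
    #include <vector>

    static std::mutex log_mutex;
    static std::vector<uint8_t> log_buffer; // interleaved records
    static uint64_t end_lsn = 0;            // logical end of the log

    // Called by any transaction thread: buffer the record and go on
    // executing the query; no disk I/O happens here.
    uint64_t append_log_record(const uint8_t *rec, size_t len)
    {
      std::lock_guard<std::mutex> guard(log_mutex);
      log_buffer.insert(log_buffer.end(), rec, rec + len);
      end_lsn += len;
      return end_lsn;
    }

    // One iteration of the log writer thread: grab whatever has
    // accumulated, then persist it outside the mutex.
    void log_writer_iteration()
    {
      std::vector<uint8_t> chunk;
      {
        std::lock_guard<std::mutex> guard(log_mutex);
        chunk.swap(log_buffer);
      }
      // write 'chunk' to the log file and fdatasync(); then wake any
      // transactions waiting for their commit LSN to become durable
    }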
Since the binlog writing is essentially sequential in nature, it is more efficient to do it in a single thread for all waiting threads, and this is how the MariaDB binlog group commit is done.
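A sketch of that leader/follower pattern (illustrative only; the actual MariaDB group commit code is more involved):

    // Toy group commit: the first thread to find no active leader
    // writes and syncs the binlog for every queued transaction; the
    // rest wait until the leader marks them done.
    #include <condition_variable>
    #include <mutex>
    #include <vector>

    struct PendingCommit { bool done = false; /* + the event group */ };

    static std::mutex mtx;
    static std::condition_variable cv;
    static std::vector<PendingCommit*> queue;
    static bool leader_active = false;

    void group_commit(PendingCommit *my)
    {
      std::unique_lock<std::mutex> lk(mtx);
      queue.push_back(my);
      while (!my->done) {
        if (!leader_active) {
          // Become the leader and serve everyone queued so far.
          leader_active = true;
          std::vector<PendingCommit*> batch;
          batch.swap(queue);
          lk.unlock();
          // ... sequentially write all event groups in 'batch' to the
          // binlog, then one fdatasync() covering the whole batch ...
          lk.lock();
          for (PendingCommit *p : batch) p->done = true;
          leader_active = false;
          cv.notify_all();
        } else {
          cv.wait(lk); // a leader will write and sync on our behalf
        }
      }
    }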
Yes, optimizing that would require a file format change, to replace the append-only-to-last-file with something else, such as one or multiple preallocated files, possibly arranged as a ring buffer.
Another problem is that the binlog is a logical log (SQL statements), and it seems fragile to base recovery that _must_ be 100% reliable on it. If we don't want to recover InnoDB transactions by replaying binlog replication event groups, then we still need an fdatasync() inside InnoDB in the XA prepare step before we can be sure the transaction is durable, don't we?
Yes, it is a balance between writing more upfront and potentially executing more recovery.
Another idea would be to go in the opposite direction, where the storage engine could provide a logging service to the server (just like it provides tables). This way the binlog code could call into InnoDB to write the binlog records into the InnoDB log. But this requires changes on the InnoDB side (log archiving for one), and reading binlogs for replication will be slower as it's interleaved with unrelated log records.
That would work too, and in fact, I had proposed that years ago. But like you point out, there are obvious drawbacks with that. One related idea that has been floating around is to use the InnoDB log as a "doublewrite buffer" for short enough binlog event groups. The maximum mini-transaction size (dictated by the InnoDB recovery code) is something like 4 megabytes. On an InnoDB log checkpoint, any buffered binlog event groups would have to be durably written to the binlog. Likewise, if a binlog event group is too large to be buffered, it would have to be written and fdatasync()ed in advance.
Any write_lock acquisition would have to be preceded by a binlog_write_lock acquisition, ensuring that any relevant changes of persistent state (COMMIT, XA PREPARE, XA ROLLBACK, XA COMMIT) that could be part of the InnoDB log write have been durably written to the binlog.
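To make the idea concrete, a hedged sketch of the buffering decision; the 4 MiB constant and every function name here are hypothetical, not existing MariaDB interfaces:

    #include <stdint.h>
    #include <cstddef>

    constexpr size_t MAX_MTR_SIZE = 4 << 20; // rough recovery-imposed cap

    void mtr_append_binlog_copy(const uint8_t*, size_t); // hypothetical
    void binlog_append(const uint8_t*, size_t);          // hypothetical
    void binlog_fdatasync();                             // hypothetical

    void write_binlog_event_group(const uint8_t *buf, size_t len)
    {
      if (len <= MAX_MTR_SIZE) {
        // Fits in one mini-transaction: buffer it in the InnoDB log.
        // On the next InnoDB log checkpoint, all such buffered groups
        // must be durably written to the real binlog first.
        mtr_append_binlog_copy(buf, len);
      } else {
        // Too large to buffer: write and sync the binlog in advance.
        binlog_append(buf, len);
        binlog_fdatasync();
      }
    }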
Yes, it is an interesting idea. What do you see as the mechanism to recover a transaction inside InnoDB in case of a crash just after writing to the binlog?
It is simple but potentially costly: Those transactions that were recovered in some other state than COMMIT or XA PREPARE will be rolled back. Then everything will be replayed from the binlog. InnoDB does store the latest committed binlog file name and offset. I do not know where exactly that information is being used.

Marko

--
Marko Mäkelä, Lead Developer InnoDB
MariaDB plc