[MariaDB developers] Update on MDEV-34705 implementing binlog in InnoDB

4 Dec 2024

I wanted to give an update on the progress of my work on MDEV-34705, which
is a task to implement a new binlog format that is stored as an InnoDB
tablespace (or in another engine that chooses to implement it).

To recap, the motivation includes removing the costly 2-phase commit between
binlog and InnoDB; making replication crash-safe even with
--innodb-flush-log-at-trx-commit=0 (or 2) and --sync-binlog=0; removing
unnecessary complexity in the legacy binlog implementation; and removing
limitations in the legacy binlog to facilitate future developments in
replication. The design is described in
https://jira.mariadb.org/browse/MDEV-34705 , and the code is developed in
https://github.com/MariaDB/server/commits/knielsen_binlog_in_engine.

A few weeks ago I reached a major milestone with the first working
replication from an InnoDB-implemented binlog on the master to a slave. I'm
currently halfway through the last big piece of the puzzle, which is to be
able to split event groups into multiple pieces interleaved with other event
groups in the binlog. After that there will still be many details to be
implemented, as the binlog implementation is visible in many user-facing
places (which is one part of the problem with the legacy binlog). So good
progress, but also lots of work left still.

I want to point out some design decisions that significantly change how the
new binlog works compared to the legacy one, to facilitate the discussion.
Remember that an important goal is to remove some of the unnecessary
complexity of the legacy binlog, so support for some things will be dropped,
and some of that will be controversial. The design is still open to
suggestions backed by solid technical arguments.

1. I intend to remove the option to set the base name of binlog files. File
names will be set by the storage engine (I'm using "binlog-NNNNNN.ibb" in
the current code), and identified only by their (64-bit) number. This avoids
the need for the master-bin.index file. The need to keep track of different
base file names for different binlog files creates a _lot_ of complexity in
the legacy binlog (and there are still a number of bugs due to this). It
must still be possible to set the directory containing the binlog files (but
it will not be possible to split the binlog across multiple directories).
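
As an illustration only (not taken from the actual code), here is a minimal
Python sketch of how files could be identified purely by their number, with
no index file. The six-digit zero padding is an assumption read off the
"binlog-NNNNNN.ibb" pattern, and all function names are hypothetical.

```python
import re
from typing import Iterable, List, Optional, Tuple

# Assumption: at least six zero-padded decimal digits, as "NNNNNN" suggests.
_BINLOG_NAME = re.compile(r"binlog-(\d{6,})\.ibb")

def binlog_file_name(file_no: int) -> str:
    """Build the file name for a (64-bit) binlog file number."""
    if not 0 <= file_no < 2**64:
        raise ValueError("file number must fit in an unsigned 64-bit integer")
    return f"binlog-{file_no:06d}.ibb"

def binlog_file_number(name: str) -> Optional[int]:
    """Return the file number encoded in a name, or None if it is not a binlog file."""
    m = _BINLOG_NAME.fullmatch(name)
    return int(m.group(1)) if m else None

def list_binlog_files(dir_entries: Iterable[str]) -> List[Tuple[int, str]]:
    """Recover the ordered list of binlog files from a directory listing alone."""
    found = []
    for name in dir_entries:
        file_no = binlog_file_number(name)
        if file_no is not None:
            found.append((file_no, name))
    return sorted(found)
```

Because the number is the only identity a file has, a plain directory scan is
enough to reconstruct the file list that master-bin.index holds today.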

2. There will be some delay from commit until the binlog data is readable
externally from the file. This is inherent in the goal of speeding up
binlogging precisely by delaying the physical disk write (aka fsync()). Using
mysqlbinlog --read-from-remote-server will work as before (e.g. it will be
able to see committed transactions immediately). The code will, however, try
to flush binlog pages to disk with high priority, so the delay will usually
be small.

3. Binlog rotations, which are quite complex in the legacy binlog, will be
mostly invisible. Binlog tablespace files are pre-allocated in the
background, and will always have a fixed size (--max-binlog-size). Binlog
writes pass seamlessly from the end of one file to the start of the next,
and replication events can be split across binlog files.
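
To make the "seamless" part concrete, here is a hedged Python sketch of how
a single write could be divided across fixed-size files. It deliberately
ignores page headers and any other per-file overhead, which the real
tablespace format will have; the function name and the chunk representation
are hypothetical.

```python
from typing import List, Tuple

def split_write(start_file: int, start_offset: int, length: int,
                file_size: int) -> List[Tuple[int, int, int]]:
    """Split a write into (file_no, offset, nbytes) chunks across fixed-size files.

    Assumption: every file holds exactly file_size payload bytes.
    """
    if file_size <= 0 or not 0 <= start_offset < file_size or length < 0:
        raise ValueError("invalid file size, offset or length")
    chunks = []
    file_no, offset, remaining = start_file, start_offset, length
    while remaining > 0:
        # Write as much as fits in the current file, then continue in the next.
        nbytes = min(file_size - offset, remaining)
        chunks.append((file_no, offset, nbytes))
        remaining -= nbytes
        file_no += 1
        offset = 0
    return chunks
```

For example, with 1000-byte files, a 250-byte event starting at offset 900 of
file 3 becomes 100 bytes at the end of file 3 plus 150 bytes at the start of
file 4, with no rotation step visible to the writer.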

4. I am thinking of requiring GTID mode in the new binlog format, disallowing
slaves from connecting using filename/offset. This is not a firm decision
yet; technically I think it is not too hard to keep filename/offset. But
removing it will reduce complexity and potentially allow future storage
engines to implement their own binlog format that does not map well to
filename/offset.

5. A more controversial thought is to drop support for semi-sync
replication. I think many users use semi-sync believing it does more than it
actually does. Instead of semi-sync, users can always just SELECT
MASTER_GTID_WAIT(@@last_gtid) on a slave to get arguably better
functionality. And the semi-sync implementation has always been problematic
(IMHO), what with sending the actual binlog filename string back and forth
with every commit, and causing much complexity and many bugs. Less
controversial will be to release the first version without semi-sync support
and add it later.

6. Large event groups (configurable, currently using --binlog-cache-size)
will be written out-of-band into the binlog during query execution. This
means the event group for the transaction can be binlogged in different
pieces that can be interleaved with other event groups. This removes the
limitation that even a huge transaction must be binlogged as a single
consecutive event group in a single binlog file (which can stall other
commits). It also allows a future (not in first release) enhancement where
optimistic parallel replication could optionally start applying a large
transaction on the slave while it is still executing on the master. In the
first version, the dump thread on the master will assemble the pieces before
sending to the slave.
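
The assembly step can be pictured with the following Python sketch. It is
not the actual on-disk format, which this mail does not describe: the record
kinds "oob" and "commit", the transaction ids and the tuple layout are all
hypothetical, chosen only to show pieces of one event group being interleaved
with other groups and then put back together.

```python
from typing import Dict, Iterable, List, Tuple

Record = Tuple[str, str, bytes]  # (kind, transaction id, payload); hypothetical

def assemble_event_groups(records: Iterable[Record]) -> List[Tuple[str, bytes]]:
    """Reassemble complete event groups, in commit order, from interleaved pieces."""
    pending: Dict[str, List[bytes]] = {}
    groups = []
    for kind, trx, payload in records:
        if kind == "oob":
            # Out-of-band piece written during query execution.
            pending.setdefault(trx, []).append(payload)
        elif kind == "commit":
            # The commit record completes the group; prepend the earlier pieces.
            groups.append((trx, b"".join(pending.pop(trx, [])) + payload))
        else:
            raise ValueError(f"unknown record kind: {kind!r}")
    return groups

records = [
    ("oob", "A", b"a1"),
    ("commit", "B", b"b"),   # small transaction commits between A's pieces
    ("oob", "A", b"a2"),
    ("oob", "C", b"c1"),     # C never commits (e.g. it is rolled back)
    ("commit", "A", b"a3"),
]
groups = assemble_event_groups(records)
```

Here B is emitted before A even though A started writing first, and the
pieces of the uncommitted transaction C are never sent.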

This means that if an active binlog file N contains a commit that references
event data written to file (N-k), then binlog purge will be blocked not
just from N, but from N-k. It also means that if a large transaction ends up
being rolled back, then this will leave extra unused data in the binlog
files until purged. I think this is a good trade-off, but it's easy to add
an option to disable the out-of-band binlogging, if desired for some special
use cases.
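
The purge rule can be sketched as follows in Python. This is only an
illustration under simplifying assumptions: the per-file map of "earliest
file referenced by a commit in this file" is a hypothetical bookkeeping
structure, and purge is assumed to always remove a prefix of the file
sequence.

```python
from typing import Dict

def first_file_to_keep(oldest_needed: int, newest: int,
                       earliest_ref: Dict[int, int]) -> int:
    """Return the lowest file number that purge must not remove.

    oldest_needed: oldest file still needed for its own content (file N).
    newest:        the newest existing binlog file.
    earliest_ref:  file number -> earliest file (N-k) holding out-of-band
                   data referenced by a commit in that file (hypothetical).
    """
    floor = oldest_needed
    for file_no in range(oldest_needed, newest + 1):
        # A kept file pins every file back to its earliest out-of-band reference.
        floor = min(floor, earliest_ref.get(file_no, file_no))
    return floor
```

With files 10 to 12 still needed and a commit in file 11 that references
out-of-band data in file 7, purge has to stop before file 7 rather than
before file 10.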

7. For migration to the new binlog, I want to allow the old part of the
binlog to be in the legacy format while the new part uses the new
implementation. This is to allow switching a replication setup to the new
implementation by simply stopping and restarting the master with the new
option --binlog-storage-engine=innodb, after which the slaves can pick up
from where they left off.

I also want to leave a way to roll back, that is for users to disable the
new binlog and go back to the legacy one in case of problems. But I want to
avoid a binlog that goes back-and-forth between different formats (only
allow a single point where it switches from legacy to new). So current
thinking is that rolling back to the legacy format will be with a script
that converts any newly written binlog files in the new format to the legacy
format while traffic is stopped.

As always, comments and suggestions very welcome.

 - Kristian.

Kristian Nielsen