
As the new binlog format (MDEV-34705 Binlog-in-engine) is taking shape, I
wanted to write up some more details on a number of the low-level design
decisions made so far. Re-implementing the binlog format is a huge
undertaking due to the way it interacts with many different parts of the
server code, as well as user-visible interfaces. I am not sure how many
people actually read these detailed write-ups, but I think it is still
important to at least try to present the design and give others a chance to
comment. So many other large changes to or around the server code are being
done without even an attempt at proper design review, often with serious
defects as a result.

All of the design elements mentioned in this mail can be seen in the code
available in the branch "knielsen_binlog_in_engine" on Github:

  https://github.com/MariaDB/server/commits/knielsen_binlog_in_engine

File format
===========

It was clear from the start that a page-based file format should be used.
Initially, the idea was to use a new type of InnoDB tablespace. But after
discussions with Marko, this was revised to use a simplified custom format
for the binlog files. This avoids much of the overhead of a full InnoDB
tablespace and page format, but at the cost of having to implement new code
to take over the functionality of the InnoDB buffer pool and recovery code.
The InnoDB redo log is still used, with only minor modifications.

The page size is currently 4kB; I am considering 16kB to reduce per-page
runtime overhead. The page size is independent of the InnoDB page size and
can easily be made configurable. The first page of the file is reserved as a
header page; the rest of the pages have only a four-byte overhead for a
CRC32. The CRC32 replaces the iffy per-event checksum in the old legacy
binlog.

Using a page-based format means it is possible to random-access read the
binlog file (this is not possible in the legacy binlog, as there is no easy
way to determine the boundaries between events).
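As a rough sketch of the arithmetic such a layout implies (the constants and
the placement of the CRC are illustrative assumptions here, not taken from
the actual code), mapping a logical binlog byte offset to its on-disk page:

```cpp
#include <cassert>
#include <cstdint>

// Illustrative constants only; the real code has its own definitions.
constexpr uint64_t PAGE_SIZE = 4096;       // current page size
constexpr uint64_t CRC_BYTES = 4;          // per-page CRC32 overhead
constexpr uint64_t PAYLOAD_PER_PAGE = PAGE_SIZE - CRC_BYTES;

struct page_pos { uint64_t page_no; uint64_t offset_in_page; };

// Map the N'th payload byte of a binlog file to its page, skipping the
// reserved header page (page 0).
page_pos locate(uint64_t payload_offset) {
  return page_pos{
    1 + payload_offset / PAYLOAD_PER_PAGE,   // +1 skips the header page
    payload_offset % PAYLOAD_PER_PAGE
  };
}

// Usable payload bytes in a file of `pages` total pages (header included).
uint64_t file_capacity(uint64_t pages) {
  return (pages - 1) * PAYLOAD_PER_PAGE;
}
```

With fixed-size pages, seeking to an arbitrary offset is O(1), which is what
enables the binary search over binlog files described below.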
The GTID indexes are thus no longer necessary; instead, we simply
binary-search the binlog file for the starting GTID position. To facilitate
this binary search, the current GTID binlog state is written (compressed)
every --innodb-binlog-state-interval bytes.

Similarly, at restart, binary search is used to find the position at which
to continue writing the last binlog file (as opposed to forcing a binlog
rotation as in the legacy binlog). A similar mechanism is used to recover
the current gtid_binlog_state without the need for a master-bin.state file.

In the new binlog implementation, binlog files are identified solely by
their number (starting from 0). The binlog file name is engine-specific, not
user configurable. This considerably simplifies the binlog handling,
avoiding the need for the binlog index file. A new --binlog-directory option
still allows the binlog to be placed on a file system of choice.

Generally, 64-bit numbers are used to identify binlog file numbers, sizes,
offsets, etc., to avoid limitations like the legacy binlog's 32-bit file
offsets. A simple compression scheme is used to store small numbers in few
bytes. See storage/innobase/include/ut0compr_int.h for the implementation,
which is written to be efficient on modern superscalar CPUs and helps avoid
a 64-bit penalty on data storage sizes.

In the current design, the new binlog format is not very InnoDB specific,
but it is still implemented inside the storage engine, in principle in an
engine-dependent format. The advantage of this is that another engine has
the flexibility to implement a different format, perhaps even interleaving
the binlog data with other data in existing data files. On the other hand,
fixing the binlog file format like it is with InnoDB would allow moving part
of the code to the server layer and sharing it between engines. I believe
this decision would be best revisited when/if another engine binlog
implementation becomes relevant.
The function ibb_write_header_page() shows how the header page is written.
The function innodb_binlog_prealloc_thread() runs the background thread that
pre-allocates new binlog files and closes fully written files.

Binlog recovery
===============

Just two InnoDB tablespace IDs (for redo records) are reserved for binlog
files, used alternately for odd/even files. To be able to unambiguously
match a redo record to a binlog file during recovery, we make sure that only
the two most recent binlog files will need recovery. When the writer
switches to binlog file N, we first pre-allocate (in a background thread)
file N+1, and then we fsync() file N-1 to disk.

The header page contains an InnoDB LSN corresponding to the start of the
binlog data in the file. Thus, when binlog recovery processes a redo record,
it can select the last or second-to-last existing binlog file with a
matching tablespace id and compare the record's LSN against the LSN in the
file header. If the record's LSN is smaller, then we know this record
corresponds to a binlog file that was already durably fsync()ed to disk, so
we can skip it; otherwise, the record is applied to the file (possibly
creating the file anew if required).

This way, we avoid any need to stall during commits to fsync() any binlog
data (except when binlog write load starts exceeding the disk I/O capacity).
And InnoDB checkpoints can span an arbitrary number of binlog files, so that
we also avoid stalling binlog rotation while waiting for a checkpoint to
complete.

As a special case, RESET MASTER (or the initial binlog creation at first
server startup) fsync()s both the InnoDB redo log and the newly created file
header to disk before starting to write data. This simplifies the binlog
recovery, as we will then always have at least one file header available
with a starting LSN to compare against the LSN in redo records.
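The file-selection step of this recovery logic can be sketched as follows
(made-up constants and function names, purely to illustrate the odd/even
tablespace id scheme and the LSN comparison; the real code differs):

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical values for the two reserved binlog tablespace ids.
// Assumption for this sketch: binlog file N uses the id with parity (N & 1).
constexpr uint32_t BINLOG_SPACE_ID_0 = 0xFFFFFFF0;
constexpr uint32_t BINLOG_SPACE_ID_1 = 0xFFFFFFF1;

// Of the two most recent binlog files, exactly one has a matching
// (odd/even) tablespace id; pick it.
uint64_t file_for_record(uint32_t space_id, uint64_t last_file_no) {
  uint64_t parity = (space_id == BINLOG_SPACE_ID_1) ? 1 : 0;
  return ((last_file_no & 1) == parity) ? last_file_no : last_file_no - 1;
}

// A record with an LSN smaller than the start-LSN stored in the selected
// file's header page belongs to an earlier, already durably fsync()ed
// file and can be skipped.
bool must_apply(uint64_t record_lsn, uint64_t file_header_start_lsn) {
  return record_lsn >= file_header_start_lsn;
}
```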
In the case of --innodb-flush-log-at-trx-commit=0|2, binlog recovery will
truncate extra binlog data for transactions that were not durably written to
the redo log. This allows the code to write binlog data out to disk
immediately, without waiting for the redo log sync to happen first, unlike
for InnoDB table/index data. This works because binlog files are always
written strictly append-only; data is never modified in place.

In fact, each page is only ever written once to the file system, except for
one special case. During an InnoDB checkpoint, binlog data prior to the
checkpoint LSN needs to be synced durably to disk. If there is not enough
binlog traffic during this to fill up the last page, we might need to write
a partial last page to disk. When this happens, the page is marked, and the
next write to that page will redo-log the complete page (as opposed to just
the appended data). This avoids the need for any double-write buffer for the
binlog writes.

The class binlog_recovery implements the binlog recovery algorithm.

Out-of-band binlogging
======================

The new binlog implementation removes the limitation in the legacy binlog
that each event group (transaction) must be written consecutively to a
single binlog file. When the binlog transaction cache is full, instead of
spilling it to a temporary file, the cache data is written directly into the
binlog as so-called out-of-band data. At commit time, large transactions
only need to binlog a small commit record that back-references the
previously written partial event data.

This avoids a large stall at commit time of huge transactions. It also makes
it possible to keep a fixed binlog file size and still allow transactions
much larger than that. If a transaction rolls back, any already written
out-of-band data is simply left in the binlog file as garbage, to disappear
when that file is eventually purged.

The individual out-of-band records are structured on disk as a forest of
complete binary trees.
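One way such a forest can be maintained append-only (shown here purely as an
illustration; the actual on-disk scheme may differ) is to treat appends like
incrementing a binary counter: each new record starts as a one-leaf tree,
and the two right-most trees are merged whenever they have equal size. After
n appends, the forest then has one complete tree per set bit in n, so a
commit record only needs to reference a logarithmic number of tree roots:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Return the leaf counts of the trees in the forest after appending
// n_leaves records, left to right. Merging never rewrites old records;
// only a new interior node is appended, keeping the file append-only.
std::vector<uint64_t> forest_after(uint64_t n_leaves) {
  std::vector<uint64_t> sizes;   // leaf count of each complete tree
  for (uint64_t i = 0; i < n_leaves; i++) {
    sizes.push_back(1);          // new record = new one-leaf tree
    while (sizes.size() >= 2 &&
           sizes[sizes.size() - 1] == sizes[sizes.size() - 2]) {
      uint64_t s = sizes.back();
      sizes.pop_back();
      sizes.back() += s;         // merge under a new interior node
    }
  }
  return sizes;
}
```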
This is used to allow efficient reading of the data starting from the commit
record (used to re-assemble complete event groups to send to slaves from the
binlog dump threads), while still allowing the binlog files to be written
strictly append-only, without needing to go back and update already written
records (eg. with forward-pointers).

The header page of each binlog file contains a reference to any earlier
binlog file that may contain out-of-band data referenced by commit records
in this file. This is used to prevent the purge of old binlog files that may
still be needed by active dump threads now reading a newer file. This
mechanism can also be used in a future version to implement proper XA
replication support, following the design described in MDEV-32020.

The class innodb_binlog_oob_reader implements the logic that re-assembles
the out-of-band pieces of data into a complete event group for the reader.
The struct binlog_oob_context is the corresponding object that keeps track
of the construction of the forest of complete binary trees in the writer.

Page FIFO
=========

The binlog implements a page FIFO as a replacement for the InnoDB buffer
pool used for other tablespaces. As the binlog is written strictly
append-only, the "buffer pool" functionality is much simplified. Pages are
only added at the end and can always be written out in sequence from the
start, hence "FIFO".

The page FIFO is accessed by three main parts of the code:

 1. By threads writing to the binlog, allocating new pages and writing data
    into them in-memory. Binlog writes are serialized by
    LOCK_commit_ordered.

 2. By binlog dump thread readers (eg. connected slaves). Readers must read
    data from pages in the page FIFO until they have been written to disk
    (and removed from the FIFO); after that, the data must be read from the
    file through the file system.

 3.
    By the flush thread, which writes out completed pages to the binlog
    files in the background; and by a couple of other similar
    non-performance-critical places, such as tablespace close and InnoDB
    checkpoint, which flush binlog files to disk.

A simple latching mechanism is used to allow the writer and readers to
access pages concurrently, to avoid readers stalling the writer. The atomic
variable binlog_cur_end_offset[] is used to ensure readers never read ahead
of the writer, without requiring mutex locking during the read; see also the
description of binlog_cur_durable_offset[] below under "Crash-safe
replication". The writer and readers (1) and (2) are prioritised over
background flushing (3); the latter will wait for a page to no longer be
latched.

A global mutex is used to protect the latching and the data structure
integrity of the page FIFO. This mutex is only held for a brief time (a few
instructions), to keep the page FIFO scalable.

The page FIFO is implemented in the class fsp_binlog_page_fifo.

Crash-safe replication
======================

One of the goals of the new binlog format is to make the binlog and
replication crash-safe also when --innodb-flush-log-at-trx-commit=0|2. To
achieve this, it is necessary to make sure that binlog events are not sent
to slaves until they have become durable on the master; otherwise, if the
master crashes, the slave might have a transaction that no longer exists on
the master when it comes back up after crash recovery, and replication will
diverge. (This is the default setup; the converse property, where a slave is
always ahead of the master and any master crash demotes the old master to a
slave, can be provided in a later version using semi-synchronous replication
with an "AFTER-PREPARE" configuration.)

Binlog events become durable (together with the table-data part of the
associated transaction) when the corresponding LSN in the InnoDB redo log
has been durably flushed to disk.
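The bookkeeping this requires can be sketched as follows (hypothetical names
and data structures, not the actual implementation): each commit registers
its binlog end offset together with its redo LSN, and when the redo log is
known to be flushed up to some LSN, the durable binlog offset is advanced
past every entry at or below that LSN.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>

struct pending_entry { uint64_t binlog_end_offset; uint64_t lsn; };

struct pending_fifo {
  std::deque<pending_entry> entries;  // in binlog (and thus LSN) order
  uint64_t durable_offset = 0;        // readers must not pass this point

  // Called when a commit record has been written to the binlog.
  void record_commit(uint64_t end_offset, uint64_t lsn) {
    entries.push_back({end_offset, lsn});
  }

  // Called when the redo log has been durably flushed up to flushed_lsn:
  // everything binlogged at or below that LSN is now durable too.
  void redo_flushed_to(uint64_t flushed_lsn) {
    while (!entries.empty() && entries.front().lsn <= flushed_lsn) {
      durable_offset = entries.front().binlog_end_offset;
      entries.pop_front();
    }
  }
};
```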
The binlog code keeps track of each binlog write and the associated redo log
LSN in the singleton class pending_lsn_fifo. A binlog write of a commit
record inserts an entry into this FIFO. Then, when that LSN has been durably
written, the atomic binlog_cur_durable_offset[] is updated to reflect the
point up to which the binlog is now known to be durable. The dump thread
readers (connected slaves) use this to not read too far ahead. When a reader
reaches the end of the durable part of the binlog, it can initiate an InnoDB
redo log flush to make more of the binlog durable and allow the read to
continue.

This way, transactions can complete without waiting for an fsync per commit
(which can massively increase commits-per-second, depending on disk
latency); at the same time, waiting readers trigger a redo log sync
immediately, providing low latency to slaves.

Backup
======

With the new binlog implementation, binlog files are maintained
transactionally, using the same InnoDB crash recovery mechanism used for its
other tablespaces. This allows mariabackup to include the binlog files in
the backup, something that was not easily available before without using
something like LVM snapshots. The binlog files are copied at the end of the
backup, after releasing backup locks. The backup LSN is used together with
the InnoDB/binlog recovery algorithm (during mariabackup --prepare) to make
the binlog in the backup consistent with the table data.

Apart from the obvious desirability of having a backup facility for the
binlogs, this also provides a better way to get a corresponding binlog
position when using the backup to provision a new replication slave. After
setting up a new slave from a backup of the master, the new slave can simply
be started (using GTID) from the @@gtid_binlog_pos restored with the binlog
(CHANGE MASTER TO ... MASTER_DEMOTE_TO_SLAVE can be used for this).
When setting up from a backup of another slave, the starting GTID position
is directly available from the restored mysql.gtid_slave_pos table (as is
already the case in existing MariaDB versions).

This avoids the need for the special code in InnoDB that transactionally
stores the current binlog position in the undo segments, as well as the need
to hold backup locks while querying @@gtid_binlog_pos or SHOW MASTER STATUS.

Storage engine API
==================

The initial implementation of the new binlog is inside InnoDB, but the
intention is that another (transactional) storage engine could do its own
implementation to achieve similar benefits as InnoDB. Some effort has gone
into making a clean and well-defined interface between the server/binlog
layer and the storage engine to facilitate this. The API consists of new
handlerton methods in struct handlerton, as well as the new classes
handler_binlog_reader, handler_binlog_event_group_info, and
handler_binlog_purge_info.

Here is an overview of the new handlerton methods:

  bool (*binlog_init)(size_t binlog_size, const char *directory);

This is used at server startup to initialize the engine's binlog data
structures, when --binlog-storage-engine=ENGINE is configured by the user.

  bool (*binlog_write_direct_ordered)(IO_CACHE *cache,
           handler_binlog_event_group_info *binlog_info,
           const rpl_gtid *gtid);
  bool (*binlog_write_direct)(IO_CACHE *cache,
           handler_binlog_event_group_info *binlog_info,
           const rpl_gtid *gtid);

These are called by the server to write binlog data into the
engine-implemented binlog. They are dual methods, similar to
commit_ordered() and commit(): the _ordered variant runs serialized (in
binlog and GTID order) under LOCK_commit_ordered, while the non-ordered
variant runs without holding locks. In addition, the engine is expected to
binlog each transaction itself as part of binlog group commit (during
commit_ordered), obtaining the binlog data to write from the server-provided
binlog_get_cache() function.
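As a toy illustration of this dual-method pattern (hypothetical code, not
the actual API): the _ordered part does only the minimal serialized work,
here reserving the next binlog position in commit order under a global
mutex, while the expensive copying of event data can happen outside the
lock, concurrently with other transactions.

```cpp
#include <cassert>
#include <cstdint>
#include <mutex>

static std::mutex commit_ordered_mutex;  // stands in for LOCK_commit_ordered
static uint64_t next_binlog_offset = 0;

// The _ordered step must be cheap: it only reserves the transaction's
// binlog position, in commit order, under the global mutex.
uint64_t toy_write_ordered(uint64_t event_size) {
  std::lock_guard<std::mutex> guard(commit_ordered_mutex);
  uint64_t pos = next_binlog_offset;
  next_binlog_offset += event_size;
  return pos;
}

// The non-ordered step does the expensive work (copying event data into
// the reserved slot) without holding any lock.
void toy_write_unordered(uint64_t /*pos*/, const unsigned char * /*data*/,
                         uint64_t /*event_size*/) {
  // ... copy data into the binlog at the reserved position ...
}
```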
An object of the class handler_binlog_event_group_info is used to hold
context data related to the event group while it is being binlogged.

  void (*binlog_group_commit_ordered)(THD *thd,
           handler_binlog_event_group_info *binlog_info);

This is called at the end of binlog group commit (after releasing
LOCK_commit_ordered), with the binlog_info of the last commit in the group.
It is used to optimise the handling of the crash-safe replication described
above, so that the engine can perform operations that need only be done once
per group commit (not once per commit), and which can be done without
holding the performance-critical LOCK_commit_ordered mutex.

  bool (*binlog_oob_data_ordered)(THD *thd, const unsigned char *data,
           size_t data_len, void **engine_data);
  bool (*binlog_oob_data)(THD *thd, const unsigned char *data,
           size_t data_len, void **engine_data);

These are called by the server layer to binlog large event groups
(transactions) in multiple pieces that can be interleaved with other events
in the binlog. This allows large transactions to span multiple binlog files,
avoids stalls around commits of large transactions, and allows event groups
larger than the maximum size of an InnoDB mini-transaction.

  void (*binlog_oob_reset)(THD *thd, void **engine_data);
  void (*binlog_oob_free)(THD *thd, void *engine_data);

These are simple functions to coordinate (re-)initialisation and
de-allocation of engine-specific data associated with the binlogging of
event groups.

  handler_binlog_reader * (*get_binlog_reader)(bool wait_durable);

This is the main function used to implement the reading of the binlog by
binlog dump threads (eg. connected slaves). The class handler_binlog_reader
implements an interface for the server to read binlog data, and for the dump
threads to wait for new data to arrive when they reach the current EOF.

  void (*binlog_status)(uint64_t *out_fileno, uint64_t *out_pos);

This is used to implement SHOW MASTER STATUS.
The file number and offset are mostly informational; the new binlog format
is intended primarily for GTID use.

  void (*get_filename)(char name[FN_REFLEN], uint64_t file_no);

In the new binlog implementation, binlog files are identified only by number
(the file name is engine-specific and not user configurable). This function
allows the server to obtain the engine-specific file name, for informational
purposes to the user.

  binlog_file_entry * (*get_binlog_file_list)(MEM_ROOT *mem_root);

This is used to implement SHOW BINARY LOGS.

  bool (*binlog_flush)();

This is used to implement FLUSH BINARY LOGS.

  bool (*binlog_get_init_state)(rpl_binlog_state_base *out_state);

This is used to implement FLUSH BINARY LOGS DELETE_DOMAIN_ID, to get the
binlog state at the start of the current (non-purged) binlog.

  bool (*reset_binlogs)();

This is used to implement RESET MASTER.

  int (*binlog_purge)(handler_binlog_purge_info *purge_info);

This is used to implement PURGE BINARY LOGS as well as auto-purge. The class
handler_binlog_purge_info is used to pass the various purge options to the
engine, as well as to inform the engine about active dump threads that would
prevent purging past their current read position.
 - Kristian Nielsen