[MariaDB developers] Design of new binlog file format (MDEV-34705)

21 Apr 2025

As the MDEV-34705 binlog-in-engine project matures, I'm finalising the file
format of the new binlog, and will present it here in case of any final
comments or requests for changes. The file format is of course an important
part of the design, as file formats in general are more difficult to change
going forward than many other aspects of the code.

Following previous discussions on the list with Marko, the file format of
the new binlog is now simplified, and mostly independent of InnoDB. The
InnoDB buffer pool is no longer used for binlog files, and InnoDB
recovery uses separate code for binlog pages, distinct from the code that
recovers other tablespace pages. The binlog implementation reserves two
special tablespace IDs in the redo log to denote binlog files, but otherwise
binlog files share little code with other InnoDB tablespaces.

1. Binlog file names
--------------------

The InnoDB binlog files have a fixed naming scheme:

  binlog-000000.ibb
  binlog-000001.ibb
  binlog-000002.ibb
  ...

By default they are written to the data directory, but a separate directory
can be configured with --binlog-directory. Thus, there is no longer any need
for a binlog index file.
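
As a small illustration of the naming scheme, here is a sketch that maps a
file number to its name and back. The helper names are hypothetical, and the
six-digit zero padding is inferred from the example names above; what happens
beyond file number 999999 is not covered.

```python
import re

# Pattern inferred from the example names (binlog-000000.ibb, ...).
_BINLOG_NAME = re.compile(r"^binlog-(\d{6})\.ibb$")

def binlog_file_name(file_no: int) -> str:
    """Return the file name for a binlog file number (hypothetical helper)."""
    if not 0 <= file_no <= 999999:
        raise ValueError("file number outside the range covered by this sketch")
    return f"binlog-{file_no:06d}.ibb"

def binlog_file_no(name: str) -> int:
    """Return the file number encoded in a binlog file name."""
    m = _BINLOG_NAME.match(name)
    if m is None:
        raise ValueError(f"not a binlog file name: {name!r}")
    return int(m.group(1))
```

Because the name is a pure function of the file number, a directory listing
is enough to enumerate the binlog files in order.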

New binlog files are pre-allocated to the configured --max-binlog-size in
the background. Unused pages contain zeros until written to. Each binlog
file is always written strictly append-only; bytes are never overwritten or
written out of order.

2. Binlog file header
---------------------

Each binlog file is page-based. The page size is 4k, fixed in the initial
release, but planned to be configurable in a future release.

Pages are checksummed (same algorithm as InnoDB full_crc32). The last four
bytes of each page contain the CRC32; the rest is available for data (no
special page header is needed).

The first page of the file is reserved as a file header. It contains the
following data (little-endian format):

  Offset  Length  Data
       0       4  "Magic" file identifier (0x010dfefe)
       4       4  Log2 of page size (ie. 12 for 4k pages)
       8       4  Major version number (0)
      12       4  Minor version number (0)
      16       8  The file number from the binlog name, consistency check
      24       8  File length, in number of pages
      32       8  LSN corresponding to the start of this binlog file
      40       8  The "GTID state interval" N. Used to speed up GTID lookup.
      48       8  File number containing out-of-band data referenced from
                  this file.
      56       8  File number containing XA data referenced from this file
                  (for future expansion to support XA).
     508       4  Extra crc32 checksum, to support down to 512 bytes page size.

The idea with version numbers is that new minor versions are readable by
older code, while new major versions are not and older code must refuse to
read the file.
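
To make the header layout concrete, here is a minimal Python sketch that
unpacks the fields from the table above and applies the major-version rule.
The function name, the dictionary keys and SUPPORTED_MAJOR are made up for
the illustration, and the sketch does not verify any checksum.

```python
import struct

MAGIC = 0x010dfefe
SUPPORTED_MAJOR = 0  # major version understood by this hypothetical reader

# Four 32-bit fields followed by six 64-bit fields, all little-endian,
# covering offsets 0..63 of the header page.
_HEADER = struct.Struct("<4I6Q")

def parse_file_header(page: bytes) -> dict:
    """Unpack the fixed header fields from the first page of a binlog file."""
    (magic, page_size_log2, major, minor, file_no, n_pages,
     start_lsn, gtid_state_interval, oob_file_no, xa_file_no
     ) = _HEADER.unpack_from(page, 0)
    if magic != MAGIC:
        raise ValueError("bad magic, not a binlog file")
    if major > SUPPORTED_MAJOR:
        # Newer major versions must be refused; newer minor versions are fine.
        raise ValueError("file uses a newer major version")
    return {
        "page_size": 1 << page_size_log2,
        "major": major,
        "minor": minor,
        "file_no": file_no,
        "n_pages": n_pages,
        "start_lsn": start_lsn,
        "gtid_state_interval": gtid_state_interval,
        "oob_file_no": oob_file_no,
        "xa_file_no": xa_file_no,
    }
```

A caller would typically also compare file_no against the number in the file
name, since the header field is there as a consistency check.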

The LSN in the header is used to do recovery correctly. Any redo record with
smaller LSN will be ignored during recovery and not applied to this file.
This way only two tablespace IDs are needed for recovery of binlog files.
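
The LSN rule can be sketched as a simple filter. This is only an illustration
of the comparison described above; the function name and the (lsn, payload)
representation of a redo record are assumptions, not the actual recovery code.

```python
def records_to_apply(redo_records, file_start_lsn):
    """Keep the redo records that are applied to a given binlog file.

    redo_records: iterable of (lsn, payload) pairs (hypothetical shape).
    Records with an LSN smaller than the LSN in the file header are ignored.
    """
    return [(lsn, payload) for lsn, payload in redo_records
            if lsn >= file_start_lsn]
```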

The GTID state interval (--innodb-binlog-state-interval) is a replacement
for the GTID indexes of the legacy binlog. Every N pages in a binlog file,
the current binlog GTID state is written. This makes it possible to quickly
find a GTID position for the slave to start at, by doing a binary search on
the GTID state records and then scanning ahead over the last few (<=N) pages
to find the starting position. The first GTID state in the file stores the
full state; the remaining state records only store the difference to the
first state.
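
The lookup can be sketched as follows. This is a deliberately simplified
illustration: it assumes a single replication domain, so that each GTID state
record reduces to one increasing seq_no, and it takes the state records as an
already decoded list of (page_no, seq_no) pairs. The function name and the
input shape are hypothetical.

```python
from bisect import bisect_right

def find_scan_start(state_records, target_seq_no):
    """Return the page to start scanning from, or None.

    state_records: list of (page_no, seq_no) in file order, one entry per
    GTID state record (single domain assumed for this sketch).
    target_seq_no: the last seq_no the slave has already applied.
    """
    seq_nos = [seq for _, seq in state_records]
    # Last state record whose state does not go past the target position.
    i = bisect_right(seq_nos, target_seq_no) - 1
    if i < 0:
        return None  # the target precedes the first state in this file
    return state_records[i][0]
```

From the returned page, a reader would scan forward over at most N pages to
find the exact starting point.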

The out-of-band (OOB) file number reference is used to manage when large
transactions are written into the binlog files interleaved with other
transaction data. If a large transaction starts binlogging in
binlog-000005.ibb say, and then completes and writes the commit record to
file binlog-000008.ibb, then the file header for file_no=8 will contain
oob-reference=5, to mark that file_no=5 is needed by dump threads (slave
connections) reading from file_no=8.
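
As a tiny illustration of how the reference can be used, the sketch below
computes the earliest file that must still be available for a dump thread
starting in a given file. The function name is made up, and how the header
encodes "no out-of-band reference" is not covered here.

```python
def earliest_file_needed(start_file_no: int, oob_ref_file_no: int) -> int:
    """Earliest binlog file needed when reading from start_file_no.

    oob_ref_file_no is the out-of-band reference from that file's header.
    """
    return min(start_file_no, oob_ref_file_no)
```

With the example above, earliest_file_needed(8, 5) gives 5, so file_no=5 has
to be kept for as long as a reader may start from file_no=8.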

An extra crc32 checksum is stored at offset 508 in the header page, that is,
in the last four bytes of the first 512 bytes. This is to be able to support
a configurable page size of 512 bytes (or larger) later. It makes it
possible to read the first 512 bytes of the file, verify the integrity, and
read the actual page size at offset 4. The initial release is planned to
have a fixed 4k page size.
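
The two-step read could look roughly like this. It is a sketch under stated
assumptions: the extra checksum is taken from offset 508 as in the header
table, it is assumed to cover bytes 0..507 and to be stored little-endian,
and the checksum function itself is passed in by the caller, since the exact
CRC variant is not spelled out here. The function name and the sanity limits
on the page size are hypothetical.

```python
import struct

def read_header_page(f, checksum=None):
    """Read the header page from a binary file object positioned at offset 0.

    checksum: optional callable bytes -> int (hypothetical hook) used to
    verify the first 512 bytes before trusting the page size field.
    Returns (page_size, header_page_bytes).
    """
    prefix = f.read(512)
    if len(prefix) != 512:
        raise ValueError("file shorter than 512 bytes")
    if checksum is not None:
        # Assumption: stored value covers bytes 0..507, little-endian.
        (stored,) = struct.unpack_from("<I", prefix, 508)
        if checksum(prefix[:508]) != stored:
            raise ValueError("header checksum mismatch")
    (page_size_log2,) = struct.unpack_from("<I", prefix, 4)
    if not 9 <= page_size_log2 <= 16:  # arbitrary sanity limits for the sketch
        raise ValueError("implausible page size")
    page_size = 1 << page_size_log2
    rest = f.read(page_size - 512)
    if len(rest) != page_size - 512:
        raise ValueError("truncated header page")
    return page_size, prefix + rest
```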

3. Binlog data page format
--------------------------

Overall, the binlog contains a logical sequence of "records". Conceptually,
the binlog is one single stream of such records; one record can span two
files, and file boundaries are not meaningful for the binlog content.

Each binlog record is written as a single InnoDB mini-transaction (mtr).
This means binlog records will be recovered atomically during InnoDB crash
recovery; a partial record will not be found at the end of a binlog file. It
also means the size of one record is limited by the maximum size of an mtr.

Each record is split into one or more "chunks". A chunk always fits in one
page.

Each binlog page (apart from the first header page) contains one or more
chunks. Then follows 0-3 filler bytes 0xff to pad any remaining space, and
last the 4 bytes of CRC32 checksum. Here is the byte layout of a page (A and
B are the data lengths of the first two chunks):

Offset  Length  Data
     0     A+3  <type> <len1> <len2> <data 0> <data 1> ... <data A-1>
   A+3     B+3  <type> <len1> <len2> <data 0> <data 1> ... <data B-1>
...
  4089       1  <filler (optional)>
  4090       1  <filler (optional)>
  4091       1  <filler (optional)>
  4092       4  <CRC0> <CRC1> <CRC2> <CRC3>

Thus, each chunk has a simple format with one byte denoting the chunk type,
two bytes (little-endian) denoting the length, and then the data.

Chunk types:

     0   Empty (not yet used data, has no length field)
     1   Commit record
     2   GTID state record, ie. for --innodb-binlog-state-interval
     3   Out-of-band binlog data for large transactions
     4   Dummy record (used to fill up the last page when FLUSH BINARY LOGS)
  0xff   Filler byte, to pad page when there is no room for a whole chunk
         (no length field).
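
To illustrate the page layout, here is a minimal sketch that walks the chunks
of one data page. The function name is hypothetical. It returns the type byte
exactly as stored, without interpreting any of its bits, it stops at the
first empty (0) or filler (0xff) byte, and it does not verify the trailing
CRC32.

```python
def iter_chunks(page: bytes):
    """Yield (type_byte, data) for each chunk in a binlog data page."""
    payload = page[:-4]  # the last four bytes hold the page CRC32
    pos = 0
    while pos < len(payload):
        type_byte = payload[pos]
        if type_byte == 0x00 or type_byte == 0xFF:
            break  # unused space or filler: no more chunks in this page
        if pos + 3 > len(payload):
            raise ValueError("truncated chunk header")
        # Two-byte little-endian length follows the type byte.
        length = int.from_bytes(payload[pos + 1:pos + 3], "little")
        end = pos + 3 + length
        if end > len(payload):
            raise ValueError("chunk extends past the end of the page")
        yield type_byte, payload[pos + 3:end]
        pos = end
```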

Here for example is a commit record, type=1:

00001000:           41c8 0000 918e 0168 a201 0000      A......h....
00001010: 0026 0000 0000 0000 0008 0001 0000 0000  .&..............
00001020: 0000 0000 0000 0029 0000 0000 0000 918e  .......)........
00001030: 0168 0201 0000 00a1 0000 0000 0000 0000  .h..............
00001040: 0007 0000 0000 0000 0004 0000 2300 0000  ............#...
00001050: 0000 0101 0000 2054 0000 0000 0603 7374  ...... T......st
00001060: 6404 0800 0800 0800 818c 0000 0000 0000  d...............
00001070: 0074 6573 7400 4352 4541 5445 2054 4142  .test.CREATE TAB
00001080: 4c45 2074 3120 2861 2049 4e54 204e 4f54  LE t1 (a INT NOT
00001090: 204e 554c 4c2c 2062 2049 4e54 204e 4f54   NULL, b INT NOT
000010a0: 204e 554c 4c2c 2063 2054 4558 542c 2050   NULL, c TEXT, P
000010b0: 5249 4d41 5259 204b 4559 2861 2c20 6229  RIMARY KEY(a, b)
000010c0: 2920 454e 4749 4e45 3d49 6e6e 6f44 42  ) ENGINE=InnoDB

The 0x41 is the type, and the 0xc8 0x00 is the length=0x00c8. Then follows
the raw event data, in the same format used in the existing binlog
(sql/log_event.h).

The type byte has two additional flag bits (hence 0x41 and not 0x01):

  Bit 6 (0x40)  This is the last chunk of a record
  Bit 7 (0x80)  This is a continuation chunk (not first) of a record

Thus a commit record can consist of a single chunk:

  type=0x41

or two chunks:

  type=0x01 type=0xc1

or more than two chunks:

  type=0x01 type=0x81 type=0x81 ... type=0xc1

Chunks are thus used to split a record across pages (a page will not contain
more than one chunk for a single record).
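
The flag bits can be used to reassemble records from a sequence of chunks, as
in the sketch below. It assumes that the chunks of one record appear
consecutively in the stream and that the record type sits in the low six
bits of the type byte; the function name and the (type_byte, data) input
shape are hypothetical.

```python
LAST_CHUNK = 0x40    # bit 6: last chunk of a record
CONTINUATION = 0x80  # bit 7: continuation (not first) chunk of a record
TYPE_MASK = 0x3F     # assumption: the remaining bits hold the chunk type

def assemble_records(chunks):
    """Join chunks into records.

    chunks: iterable of (type_byte, data) in stream order.
    Returns a list of (record_type, record_bytes). A trailing record whose
    last chunk has not been seen yet is left out.
    """
    records = []
    cur_type, parts = None, []
    for type_byte, data in chunks:
        rec_type = type_byte & TYPE_MASK
        if type_byte & CONTINUATION:
            if cur_type != rec_type:
                raise ValueError("continuation chunk without a first chunk")
        else:
            if cur_type is not None:
                raise ValueError("new record started before the last chunk")
            cur_type, parts = rec_type, []
        parts.append(data)
        if type_byte & LAST_CHUNK:
            records.append((cur_type, b"".join(parts)))
            cur_type, parts = None, []
    return records
```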

4. Chunk data contents
----------------------

The commit record, type=1, has the following header data, consisting of 5
64-bit numbers:

  num_oob           Number of out-of-band records in this transaction
  oob_first_file_no The file_no of the first oob record
  oob_first_offset  The offset into oob_first_file_no
  oob_last_file_no  The file_no of the last oob record
  oob_last_offset   The offset into oob_last_file_no

For small event groups with no out-of-band data, only num_oob=0 is stored.
The numbers are stored in a compressed format that saves space for small
numbers, see storage/innobase/include/ut0compr_int.h.

After this follows the raw binlog event data. For a small transaction with
no out-of-band records, this is the whole event group
(GTID, Query/rows, XID/Commit). For large transactions it is just the GTID
event. See above for an example commit record.

The GTID state record, type=2, is just a list of GTIDs, similar to the
existing Gtid_list_log_event:

  number_of_gtids
  domain_id_0, server_id_0, seq_no_0
  domain_id_1, server_id_1, seq_no_1
  ...
  domain_id_N-1, server_id_N-1, seq_no_N-1

Like for commit records, the numbers are stored compressed to save space for
small numbers.

Here is an example GTID state record containing one GTID 0-1-575 (see
ut0compr_int.h for how to interpret the compressed numbers 0x8 0x0 0x8
0x11f9 as 1, 0, 1, 575):

00001000: 4205 0008 0008 f911                      B.......        
00001010: 
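
For illustration, here is a sketch that decodes the data part of this example
record. The number encoding used is inferred only from the example bytes (the
low three bits of the first byte give the number of bytes minus one, and the
value is the little-endian integer shifted right by three); it is not taken
from ut0compr_int.h, which remains the authoritative definition, especially
for large values. The function names are hypothetical.

```python
def read_compressed(buf: bytes, pos: int):
    """Decode one number at buf[pos]; return (value, next_pos).

    Encoding inferred from the example record only. Tags 0 and 1 are
    confirmed by the example; tags 2..6 are extrapolated, and tag 7 is
    not handled by this sketch.
    """
    tag = buf[pos] & 0x07
    if tag == 7:
        raise ValueError("encoding not covered by this sketch")
    n = tag + 1
    if pos + n > len(buf):
        raise ValueError("truncated number")
    value = int.from_bytes(buf[pos:pos + n], "little") >> 3
    return value, pos + n

def decode_gtid_state(data: bytes):
    """Decode a GTID state record into (domain_id, server_id, seq_no) tuples."""
    count, pos = read_compressed(data, 0)
    gtids = []
    for _ in range(count):
        domain_id, pos = read_compressed(data, pos)
        server_id, pos = read_compressed(data, pos)
        seq_no, pos = read_compressed(data, pos)
        gtids.append((domain_id, server_id, seq_no))
    return gtids
```

Applied to the five data bytes of the example, 08 00 08 f9 11, this yields
the single GTID 0-1-575.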

The OOB data record, type=3, stores one node in a binary tree of event group
data. The out-of-band data is structured as binary trees in a way that can
be written strictly append-only, and read efficiently by dump
threads/connected slaves.

The OOB node is defined by 5 numbers:

  index          The identification of the node (nodes are numbered 0, 1, 2...)
  left_file_no   The file_no of the left child of the node
  left_offset    The offset into left_file_no of the left child. Zero
                 denotes a leaf node.
  right_file_no  The file_no of the right child of the node
  right_offset   The offset into the right_file_no.

After this follows the raw event data.
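
As a data-structure illustration only, an OOB node could be modelled like
this. The class name and the data field are made up for the sketch; how the
trees are built and traversed is not shown.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class OobNode:
    """One out-of-band record: five numbers followed by raw event data."""
    index: int
    left_file_no: int
    left_offset: int
    right_file_no: int
    right_offset: int
    data: bytes = b""

    @property
    def is_leaf(self) -> bool:
        # A zero left_offset denotes a leaf node.
        return self.left_offset == 0
```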

5. Conclusion
-------------

So this is how the new binlog format is planned to look, barring any
last-minute design change requirements or design review comments. I tried to
include a good amount of detail; feel free to ask about any specific details
that I omitted, if interested.

Hope this helps,

 - Kristian.

Kristian Nielsen