
As the MDEV-34705 binlog-in-engine project matures, I'm finalising the
file format of the new binlog, and will present it here in case of any
final comments or requests for changes. The file format is of course an
important part of the design, as file formats in general are more
difficult to change going forward than many other aspects of the code.

Following previous discussions on the list with Marko, the file format
of the new binlog is now simplified, and mostly independent of InnoDB.
The InnoDB buffer pool is no longer used for binlog files, and InnoDB
recovery uses separate code to recover binlog pages than it uses for
other tablespace pages. The binlog implementation reserves two special
tablespace IDs in the redo log to denote binlog files, but otherwise
binlog files share little code with other InnoDB tablespaces.

1. Binlog file names
--------------------

The InnoDB binlog files have a fixed naming scheme:

  binlog-000000.ibb
  binlog-000001.ibb
  binlog-000002.ibb
  ...

By default they are written to the data directory, but a separate
directory can be configured with --binlog-directory. Thus, there is no
longer any need for a binlog index file.

New binlog files are pre-allocated to the configured --max-binlog-size
in the background. Unused pages contain zeros until written to. Each
binlog file is always written strictly append-only; bytes are never
overwritten or written out-of-order.

2. Binlog file header
---------------------

Each binlog file is page-based. The page size is 4k, fixed in the
initial release, but planned to be configurable in a future release.
Pages are checksummed (same algorithm as InnoDB full_crc32). The last
four bytes of each page contain the CRC32; the rest is available for
data (no special page header is needed).

The first page of the file is reserved as a file header. It contains the
following data (little-endian format):

  Offset  Length  Data
       0       4  "Magic" file identifier (0x010dfefe)
       4       4  Log2 of page size (i.e. 12 for 4k pages)
       8       4  Major version number (0)
      12       4  Minor version number (0)
      16       8  The file number from the binlog name, consistency check
      24       8  File length, in number of pages
      32       8  LSN corresponding to the start of this binlog file
      40       8  The "GTID state interval" N. Used to speed up GTID
                  lookup.
      48       8  File number containing out-of-band data referenced
                  from this file.
      56       8  File number containing XA data referenced from this
                  file (for future expansion to support XA).
     508       4  Extra crc32 checksum, to support down to 512 bytes
                  page size.

The idea with version numbers is that new minor versions are readable by
older code, while new major versions are not, and older code must refuse
to read the file.

The LSN in the header is used to do recovery correctly. Any redo record
with a smaller LSN will be ignored during recovery and not applied to
this file. This way only two tablespace IDs are needed for recovery of
binlog files.

The GTID state interval (--innodb-binlog-state-interval) is a
replacement for the GTID indexes of the legacy binlog. Every N pages in
a binlog file, the current binlog GTID state is written. This makes it
possible to quickly find a GTID position for the slave to start at, by
doing a binary search on the GTID state records and then scanning ahead
over the last few (<=N) pages to find the starting position. The first
GTID state in the file stores the full state; the remaining GTID state
records store only the difference from the first state.

The out-of-band (OOB) file number reference is used to manage large
transactions that are written into the binlog files interleaved with
other transaction data. If a large transaction starts binlogging in,
say, binlog-000005.ibb, and then completes and writes the commit record
to file binlog-000008.ibb, then the file header for file_no=8 will
contain oob-reference=5, to mark that file_no=5 is needed by dump
threads (slave connections) reading from file_no=8.

An extra crc32 checksum is stored at offset 508 in the header page, i.e.
in the last four bytes of the first 512 bytes.
This is so that a configurable page size of 512 bytes (or larger) can be
supported later: a reader can read the first 512 bytes of the file,
verify their integrity, and read the actual page size at offset 4. The
initial release is planned to have a fixed 4k page size.

3. Binlog data page format
--------------------------

Overall, the binlog contains a logical sequence of "records".
Conceptually, the binlog is one single stream of such records; one
record can span two files, and the file boundary is not meaningful for
the binlog content.

Each binlog record is written as a single InnoDB mini-transaction (mtr).
This means binlog records will be recovered atomically during InnoDB
crash recovery; a partial record will not be found at the end of a
binlog file. It also means the size of one record is limited by the
maximum size of an mtr.

Each record is split into one or more "chunks". A chunk always fits in
one page. Each binlog page (apart from the first header page) contains
one or more chunks. After the chunks follow 0-3 filler bytes (0xff) to
pad any remaining space, and finally the 4 bytes of CRC32 checksum.

Here is the byte layout of a page:

  Offset  Length   Data
       0  len+3    <type> <len1> <len2> <data 0> <data 1> ... <data len-1>
   len+3  A+3+B+3  <type> <len1> <len2> <data 0> <data 1> ... <data len-1>
     ...
    4089  1        <filler (optional)>
    4090  1        <filler (optional)>
    4091  1        <filler (optional)>
    4092  4        <CRC0> <CRC1> <CRC2> <CRC3>

Thus, each chunk has a simple format with one byte denoting the chunk
type, two bytes (little-endian) denoting the length, and then the data.

Chunk types:

     0  Empty (not yet used data, has no length field)
     1  Commit record
     2  GTID state record, i.e. for --innodb-binlog-state-interval
     3  Out-of-band binlog data for large transactions
     4  Dummy record (used to fill up the last page on FLUSH BINARY
        LOGS)
  0xff  Filler byte, to pad the page when there is no room for a whole
        chunk (no length field).

Here for example is a commit record, type=1:

  00001000: 41c8 0000 918e 0168 a201 0000            A......h....
  00001010: 0026 0000 0000 0000 0008 0001 0000 0000  .&..............
  00001020: 0000 0000 0000 0029 0000 0000 0000 918e  .......)........
  00001030: 0168 0201 0000 00a1 0000 0000 0000 0000  .h..............
  00001040: 0007 0000 0000 0000 0004 0000 2300 0000  ............#...
  00001050: 0000 0101 0000 2054 0000 0000 0603 7374  ...... T......st
  00001060: 6404 0800 0800 0800 818c 0000 0000 0000  d...............
  00001070: 0074 6573 7400 4352 4541 5445 2054 4142  .test.CREATE TAB
  00001080: 4c45 2074 3120 2861 2049 4e54 204e 4f54  LE t1 (a INT NOT
  00001090: 204e 554c 4c2c 2062 2049 4e54 204e 4f54   NULL, b INT NOT
  000010a0: 204e 554c 4c2c 2063 2054 4558 542c 2050   NULL, c TEXT, P
  000010b0: 5249 4d41 5259 204b 4559 2861 2c20 6229  RIMARY KEY(a, b)
  000010c0: 2920 454e 4749 4e45 3d49 6e6e 6f44 42    ) ENGINE=InnoDB

The 0x41 is the type, and the 0xc8 0x00 is the length=0x00c8. Then
follows the raw event data, in the same format used in the existing
binlog (sql/log_event.h).

The type byte has two additional flag bits (hence 0x41 and not 0x01):

  Bit 6 (0x40)  This is the last chunk of a record
  Bit 7 (0x80)  This is a continuation chunk (not first) of a record

Thus a commit record can consist of a single chunk:

  type=0x41

or two chunks:

  type=0x01 type=0xc1

or more than two chunks:

  type=0x01 type=0x81 type=0x81 ... type=0xc1

Chunks are thus used to split a record across pages (a page will not
contain more than one chunk for a single record).

4. Chunk data contents
----------------------

The commit record, type=1, has the following header data, consisting of
5 64-bit numbers:

  num_oob            Number of out-of-band records in this transaction
  oob_first_file_no  The file_no of the first oob record
  oob_first_offset   The offset into oob_first_file_no
  oob_last_file_no   The file_no of the last oob record
  oob_last_offset    The offset into oob_last_file_no

For small event groups with no out-of-band data, only num_oob=0 is
stored.
The numbers are stored in a compressed format that saves space for small
numbers, see storage/innobase/include/ut0compr_int.h.

After the header follows the raw binlog event data. For a small
transaction with no out-of-band records, this is the whole event group
(GTID, Query/rows, XID/Commit). For large transactions it is just the
GTID event. See the example commit record above.

The GTID state record, type=2, is just a list of GTIDs, similar to the
existing Gtid_list_log_event:

  number_of_gtids
  domain_id_0, server_id_0, seq_no_0
  domain_id_1, server_id_1, seq_no_1
  ...
  domain_id_N-1, server_id_N-1, seq_no_N-1

As for commit records, the numbers are stored compressed to save space
for small numbers. Here is an example GTID state record containing one
GTID 0-1-575 (see ut0compr_int.h for how to interpret the compressed
numbers 0x8 0x0 0x8 0x11f9 as 1, 0, 1, 575):

  00001000: 4205 0008 0008 f911                      B.......
  00001010:

The OOB data record, type=3, stores one node in a binary tree of event
group data. The out-of-band data is structured as binary trees in a way
that can be written strictly append-only, and read efficiently by dump
threads/connected slaves. The OOB node is defined by 5 numbers:

  index          The identification of the node (nodes are numbered
                 0, 1, 2...)
  left_file_no   The file_no of the left child of the node
  left_offset    The offset into left_file_no of the left child.
                 Zero denotes a leaf node.
  right_file_no  The file_no of the right child of the node
  right_offset   The offset into the right_file_no.

After this follows the raw event data.

5. Conclusion
-------------

So this is how the new binlog format is planned to look, barring any
last-minute design change requirements or design review comments.

I tried to include a good amount of detail; feel free to ask about any
specific details that I omitted, if interested.

Hope this helps,

 - Kristian.
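
For illustration only, here is a minimal Python sketch of how a reader
might decode the file header fields and walk the chunks of one data
page, following the layouts described above. It is not taken from the
server source: the function and constant names are hypothetical, the
6-bit type mask is an assumption (the description only fixes bits 6 and
7 as flags), and checksum verification is left out.

```python
import struct

PAGE_SIZE = 4096      # fixed 4k page size of the initial release
CRC_LEN = 4           # the last four bytes of every page hold the CRC32
FLAG_LAST = 0x40      # bit 6: last chunk of a record
FLAG_CONT = 0x80      # bit 7: continuation (not first) chunk of a record
TYPE_MASK = 0x3F      # assumption: the remaining low six bits are the type
TYPE_EMPTY = 0x00     # unused space, no length field
TYPE_FILLER = 0xFF    # page padding, no length field

# Hypothetical field names for the header entries at offsets 0..63.
HEADER_FIELDS = ("magic", "page_size_log2", "major_version", "minor_version",
                 "file_no", "num_pages", "start_lsn", "gtid_state_interval",
                 "oob_ref_file_no", "xa_ref_file_no")


def parse_file_header(page):
    """Decode the little-endian header: four 32-bit then six 64-bit fields."""
    return dict(zip(HEADER_FIELDS, struct.unpack_from("<4I6Q", page, 0)))


def iter_chunks(page):
    """Yield (type, is_first, is_last, data) for each chunk of a data page."""
    pos, end = 0, len(page) - CRC_LEN
    while pos < end:
        type_byte = page[pos]
        if type_byte in (TYPE_EMPTY, TYPE_FILLER):
            break  # only unused space or padding follows
        if pos + 3 > end:
            raise ValueError("truncated chunk header at offset %d" % pos)
        (length,) = struct.unpack_from("<H", page, pos + 1)
        if pos + 3 + length > end:
            raise ValueError("chunk at offset %d overruns the page" % pos)
        yield (type_byte & TYPE_MASK,
               not (type_byte & FLAG_CONT),
               bool(type_byte & FLAG_LAST),
               bytes(page[pos + 3:pos + 3 + length]))
        pos += 3 + length


# Usage: the example GTID state record (0x42, length 5), placed at the
# start of an otherwise zero-filled page.
record = bytes.fromhex("420500080008f911")
page = record + bytes(PAGE_SIZE - len(record))
for chunk in iter_chunks(page):
    print(chunk)
```

The loop yields a single chunk of type 2 that is both the first and the
last chunk of its record, carrying the five compressed data bytes. A
real reader would additionally verify the page CRC32 and reassemble
records whose chunks span several pages.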