
As the new binlog format (MDEV-34705 Binlog-in-engine) is taking shape, I
wanted to write up some more details on a number of the low-level design
decisions made so far. Re-implementing the binlog format is a huge
undertaking due to the way it interacts with many different parts of the
server code, as well as user-visible interfaces. I am not sure how many
people actually read these detailed write-ups, but I think it is still
important to at least try to present the design and give others a chance to
comment. So many other large changes to or around the server code are being
done without even an attempt at proper design review, often with serious
defects as a result.

All of the design elements mentioned in this mail can be seen in the code
available in the branch "knielsen_binlog_in_engine" on Github:

  https://github.com/MariaDB/server/commits/knielsen_binlog_in_engine

File format
===========

It was clear from the start that a page-based file format should be used.
Initially, the idea was to use a new type of InnoDB tablespace. But after
discussions with Marko, this was revised to use a simplified custom format
for the binlog files. This avoids much of the overhead of a full InnoDB
tablespace and page format, but at the cost of having to implement new code
to take over the functionality of the InnoDB buffer pool and recovery code.
The InnoDB redo log is still used, with only minor modifications.

The page size is currently 4kB; I am considering 16kB to reduce per-page
runtime overhead. The page size is independent of the InnoDB page size and
can easily be made configurable. The first page of the file is reserved as a
header page; the rest of the pages have only a four-byte overhead for a
CRC32. The CRC32 replaces the iffy per-event checksum in the old legacy
binlog.

Using a page-based format means it is possible to random-access read the
binlog file (this is not possible in the legacy binlog, as there is no easy
way to determine the boundaries between events).
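As a rough sketch of the arithmetic such a layout implies (the constants and
the placement of the CRC are illustrative assumptions here, not taken from
the actual code), mapping a logical binlog byte offset to its on-disk page:

```cpp
#include <cassert>
#include <cstdint>

// Illustrative constants only; the real code has its own definitions.
constexpr uint64_t PAGE_SIZE = 4096;       // current page size
constexpr uint64_t CRC_BYTES = 4;          // per-page CRC32 overhead
constexpr uint64_t PAYLOAD_PER_PAGE = PAGE_SIZE - CRC_BYTES;

struct page_pos { uint64_t page_no; uint64_t offset_in_page; };

// Map the N'th payload byte of a binlog file to its page, skipping the
// reserved header page (page 0).
page_pos locate(uint64_t payload_offset) {
  return page_pos{
    1 + payload_offset / PAYLOAD_PER_PAGE,   // +1 skips the header page
    payload_offset % PAYLOAD_PER_PAGE
  };
}

// Usable payload bytes in a file of `pages` total pages (header included).
uint64_t file_capacity(uint64_t pages) {
  return (pages - 1) * PAYLOAD_PER_PAGE;
}
```

With fixed-size pages, seeking to an arbitrary offset is O(1), which is what
enables the binary search over binlog files described below.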
The GTID indexes are thus no longer necessary; instead, we simply
binary-search the binlog file for the starting GTID position. To facilitate
this binary search, the current GTID binlog state is written (compressed)
every --innodb-binlog-state-interval bytes.

Similarly, at restart, binary search is used to find the position at which
to continue writing the last binlog file (as opposed to forcing a binlog
rotation as in the legacy binlog). A similar mechanism is used to recover
the current gtid_binlog_state without the need for a master-bin.state file.

In the new binlog implementation, binlog files are identified solely by
their number (starting from 0). The binlog file name is engine-specific, not
user configurable. This considerably simplifies the binlog handling,
avoiding the need for the binlog index file. A new --binlog-directory option
still allows the binlog to be placed on a file system of choice.

Generally, 64-bit numbers are used to identify binlog file numbers, sizes,
offsets, etc., to avoid limitations like the legacy binlog's 32-bit file
offsets. A simple compression scheme is used to store small numbers in few
bytes. See storage/innobase/include/ut0compr_int.h for the implementation,
which is written to be efficient on modern superscalar CPUs and helps avoid
a 64-bit penalty on data storage sizes.

In the current design, the new binlog format is not very InnoDB specific,
but it is still implemented inside the storage engine, in principle in an
engine-dependent format. The advantage of this is that another engine has
the flexibility to implement a different format, perhaps even interleaving
the binlog data with other data in existing data files. On the other hand,
fixing the binlog file format like it is with InnoDB would allow moving part
of the code to the server layer and sharing it between engines. I believe
this decision would be best revisited when/if another engine binlog
implementation becomes relevant.
The function ibb_write_header_page() shows how the header page is written.
The function innodb_binlog_prealloc_thread() runs the background thread that
pre-allocates new binlog files and closes fully written files.

Binlog recovery
===============

Just two InnoDB tablespace IDs (for redo records) are reserved for binlog
files, used alternately for odd/even files. To be able to unambiguously
match a redo record to a binlog file during recovery, we make sure that only
the two most recent binlog files will need recovery. When the writer
switches to binlog file N, we first pre-allocate (in a background thread)
file N+1, and then we fsync() file N-1 to disk.

The header page contains an InnoDB LSN corresponding to the start of the
binlog data in the file. Thus, when binlog recovery processes a redo record,
it can select the last or second-to-last existing binlog file with a
matching tablespace id and compare the record's LSN against the LSN in the
file header. If the record's LSN is smaller, then we know this record
corresponds to a binlog file that was already durably fsync()ed to disk, so
we can skip it; otherwise, the record is applied to the file (possibly
creating the file anew if required).

This way, we avoid any need to stall during commits to fsync() any binlog
data (except when binlog write load starts exceeding the disk I/O capacity).
And InnoDB checkpoints can span an arbitrary number of binlog files, so that
we also avoid stalling binlog rotation while waiting for a checkpoint to
complete.

As a special case, RESET MASTER (or the initial binlog creation at first
server startup) fsync()s both the InnoDB redo log and the newly created file
header to disk before starting to write data. This simplifies the binlog
recovery, as we will then always have at least one file header available
with a starting LSN to compare against the LSN in redo records.
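The file-selection step of this recovery logic can be sketched as follows
(made-up constants and function names, purely to illustrate the odd/even
tablespace id scheme and the LSN comparison; the real code differs):

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical values for the two reserved binlog tablespace ids.
// Assumption for this sketch: binlog file N uses the id with parity (N & 1).
constexpr uint32_t BINLOG_SPACE_ID_0 = 0xFFFFFFF0;
constexpr uint32_t BINLOG_SPACE_ID_1 = 0xFFFFFFF1;

// Of the two most recent binlog files, exactly one has a matching
// (odd/even) tablespace id; pick it.
uint64_t file_for_record(uint32_t space_id, uint64_t last_file_no) {
  uint64_t parity = (space_id == BINLOG_SPACE_ID_1) ? 1 : 0;
  return ((last_file_no & 1) == parity) ? last_file_no : last_file_no - 1;
}

// A record with an LSN smaller than the start-LSN stored in the selected
// file's header page belongs to an earlier, already durably fsync()ed
// file and can be skipped.
bool must_apply(uint64_t record_lsn, uint64_t file_header_start_lsn) {
  return record_lsn >= file_header_start_lsn;
}
```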
In the case of --innodb-flush-log-at-trx-commit=0|2, binlog recovery will
truncate extra binlog data for transactions that were not durably written to
the redo log. This allows the code to write binlog data out to disk
immediately, without waiting for the redo log sync to happen first, unlike
for InnoDB table/index data. This works because binlog files are always
written strictly append-only; data is never modified in place.

In fact, each page is only ever written once to the file system, except for
one special case. During an InnoDB checkpoint, binlog data prior to the
checkpoint LSN needs to be synced durably to disk. If there is not enough
binlog traffic during this to fill up the last page, we might need to write
a partial last page to disk. When this happens, the page is marked, and the
next write to that page will redo-log the complete page (as opposed to just
the appended data). This avoids the need for any double-write buffer for the
binlog writes.

The class binlog_recovery implements the binlog recovery algorithm.

Out-of-band binlogging
======================

The new binlog implementation removes the limitation in the legacy binlog
that each event group (transaction) must be written consecutively to a
single binlog file. When the binlog transaction cache is full, instead of
spilling it to a temporary file, the cache data is written directly into the
binlog as so-called out-of-band data. At commit time, large transactions
only need to binlog a small commit record that back-references the
previously written partial event data.

This avoids a large stall at commit time of huge transactions. It also makes
it possible to keep a fixed binlog file size and still allow transactions
much larger than that. If a transaction rolls back, any already written
out-of-band data is simply left in the binlog file as garbage, to disappear
when that file is eventually purged.

The individual out-of-band records are structured on disk as a forest of
complete binary trees.
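One way such a forest can be maintained append-only (shown here purely as an
illustration; the actual on-disk scheme may differ) is to treat appends like
incrementing a binary counter: each new record starts as a one-leaf tree,
and the two right-most trees are merged whenever they have equal size. After
n appends, the forest then has one complete tree per set bit in n, so a
commit record only needs to reference a logarithmic number of tree roots:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Return the leaf counts of the trees in the forest after appending
// n_leaves records, left to right. Merging never rewrites old records;
// only a new interior node is appended, keeping the file append-only.
std::vector<uint64_t> forest_after(uint64_t n_leaves) {
  std::vector<uint64_t> sizes;   // leaf count of each complete tree
  for (uint64_t i = 0; i < n_leaves; i++) {
    sizes.push_back(1);          // new record = new one-leaf tree
    while (sizes.size() >= 2 &&
           sizes[sizes.size() - 1] == sizes[sizes.size() - 2]) {
      uint64_t s = sizes.back();
      sizes.pop_back();
      sizes.back() += s;         // merge under a new interior node
    }
  }
  return sizes;
}
```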
This is used to allow efficient reading of the data starting from the commit
record (used to re-assemble complete event groups to send to slaves from the
binlog dump threads), while still allowing the binlog files to be written
strictly append-only, without needing to go back and update already written
records (eg. with forward-pointers).

The header page of each binlog file contains a reference to any earlier
binlog file that may contain out-of-band data referenced by commit records
in this file. This is used to prevent the purge of old binlog files that may
still be needed by active dump threads now reading a newer file. This
mechanism can also be used in a future version to implement proper XA
replication support, following the design described in MDEV-32020.

The class innodb_binlog_oob_reader implements the logic that re-assembles
the out-of-band pieces of data into a complete event group for the reader.
The struct binlog_oob_context is the corresponding object that keeps track
of the construction of the forest of complete binary trees in the writer.

Page FIFO
=========

The binlog implements a page FIFO as a replacement for the InnoDB buffer
pool used for other tablespaces. As the binlog is written strictly
append-only, the "buffer pool" functionality is much simplified. Pages are
only added at the end and can always be written out in sequence from the
start, hence "FIFO".

The page FIFO is accessed by three main parts of the code:

 1. By threads writing to the binlog, allocating new pages and writing data
    into them in-memory. Binlog writes are serialized by
    LOCK_commit_ordered.

 2. By binlog dump thread readers (eg. connected slaves). Readers must read
    data from pages in the page FIFO until they have been written to disk
    (and removed from the FIFO); after that, the data must be read from the
    file through the file system.

 3.
    By the flush thread, which writes out completed pages to the binlog
    files in the background; and by a couple of other similar
    non-performance-critical places, such as tablespace close and InnoDB
    checkpoint, which flush binlog files to disk.

A simple latching mechanism is used to allow the writer and readers to
access pages concurrently, to avoid readers stalling the writer. The atomic
variable binlog_cur_end_offset[] is used to ensure readers never read ahead
of the writer, without requiring mutex locking during the read; see also the
description of binlog_cur_durable_offset[] below under "Crash-safe
replication". The writer and readers (1) and (2) are prioritised over
background flushing (3); the latter will wait for a page to no longer be
latched.

A global mutex is used to protect the latching and the data structure
integrity of the page FIFO. This mutex is only held for a brief time (a few
instructions), to keep the page FIFO scalable.

The page FIFO is implemented in the class fsp_binlog_page_fifo.

Crash-safe replication
======================

One of the goals of the new binlog format is to make the binlog and
replication crash-safe also when --innodb-flush-log-at-trx-commit=0|2. To
achieve this, it is necessary to make sure that binlog events are not sent
to slaves until they have become durable on the master; otherwise, if the
master crashes, the slave might have a transaction that no longer exists on
the master when it comes back up after crash recovery, and replication will
diverge. (This is the default setup; the converse property, where a slave is
always ahead of the master and any master crash demotes the old master to a
slave, can be provided in a later version using semi-synchronous replication
with an "AFTER-PREPARE" configuration.)

Binlog events become durable (together with the table-data part of the
associated transaction) when the corresponding LSN in the InnoDB redo log
has been durably flushed to disk.
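The bookkeeping this requires can be sketched as follows (hypothetical names
and data structures, not the actual implementation): each commit registers
its binlog end offset together with its redo LSN, and when the redo log is
known to be flushed up to some LSN, the durable binlog offset is advanced
past every entry at or below that LSN.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>

struct pending_entry { uint64_t binlog_end_offset; uint64_t lsn; };

struct pending_fifo {
  std::deque<pending_entry> entries;  // in binlog (and thus LSN) order
  uint64_t durable_offset = 0;        // readers must not pass this point

  // Called when a commit record has been written to the binlog.
  void record_commit(uint64_t end_offset, uint64_t lsn) {
    entries.push_back({end_offset, lsn});
  }

  // Called when the redo log has been durably flushed up to flushed_lsn:
  // everything binlogged at or below that LSN is now durable too.
  void redo_flushed_to(uint64_t flushed_lsn) {
    while (!entries.empty() && entries.front().lsn <= flushed_lsn) {
      durable_offset = entries.front().binlog_end_offset;
      entries.pop_front();
    }
  }
};
```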
The binlog code keeps track of each binlog write and the associated redo log
LSN in the singleton class pending_lsn_fifo. A binlog write of a commit
record inserts an entry into this FIFO. Then, when that LSN has been durably
written, the atomic binlog_cur_durable_offset[] is updated to reflect the
point up to which the binlog is now known to be durable. The dump thread
readers (connected slaves) use this to not read too far ahead. When a reader
reaches the end of the durable part of the binlog, it can initiate an InnoDB
redo log flush to make more of the binlog durable and allow the read to
continue.

This way, transactions can complete without waiting for an fsync per commit
(which can massively increase commits-per-second, depending on disk
latency); at the same time, waiting readers trigger a redo log sync
immediately, providing low latency to slaves.

Backup
======

With the new binlog implementation, binlog files are maintained
transactionally, using the same InnoDB crash recovery mechanism used for its
other tablespaces. This allows mariabackup to include the binlog files in
the backup, something that was not easily available before without using
something like LVM snapshots. The binlog files are copied at the end of the
backup, after releasing backup locks. The backup LSN is used together with
the InnoDB/binlog recovery algorithm (during mariabackup --prepare) to make
the binlog in the backup consistent with the table data.

Apart from the obvious desirability of having a backup facility for the
binlogs, this also provides a better way to get a corresponding binlog
position when using the backup to provision a new replication slave. After
setting up a new slave from a backup of the master, the new slave can simply
be started (using GTID) from the @@gtid_binlog_pos restored with the binlog
(CHANGE MASTER TO ... MASTER_DEMOTE_TO_SLAVE can be used for this).
When setting up from a backup of another slave, the starting GTID position
is directly available from the restored mysql.gtid_slave_pos table (as is
already the case in existing MariaDB versions).

This avoids the need for the special code in InnoDB that transactionally
stores the current binlog position in the undo segments, as well as the need
to hold backup locks while querying @@gtid_binlog_pos or SHOW MASTER STATUS.

Storage engine API
==================

The initial implementation of the new binlog is inside InnoDB, but the
intention is that another (transactional) storage engine could do its own
implementation to achieve similar benefits as InnoDB. Some effort has gone
into making a clean and well-defined interface between the server/binlog
layer and the storage engine to facilitate this. The API consists of new
handlerton methods in struct handlerton, as well as the new classes
handler_binlog_reader, handler_binlog_event_group_info, and
handler_binlog_purge_info.

Here is an overview of the new handlerton methods:

  bool (*binlog_init)(size_t binlog_size, const char *directory);

This is used at server startup to initialize the engine's binlog data
structures, when --binlog-storage-engine=ENGINE is configured by the user.

  bool (*binlog_write_direct_ordered)(IO_CACHE *cache,
           handler_binlog_event_group_info *binlog_info,
           const rpl_gtid *gtid);
  bool (*binlog_write_direct)(IO_CACHE *cache,
           handler_binlog_event_group_info *binlog_info,
           const rpl_gtid *gtid);

These are called by the server to write binlog data into the
engine-implemented binlog. They are dual methods, similar to
commit_ordered() and commit(): the _ordered variant runs serialized (in
binlog and GTID order) under LOCK_commit_ordered, while the non-ordered
variant runs without holding locks. In addition, the engine is expected to
binlog each transaction itself as part of binlog group commit (during
commit_ordered), obtaining the binlog data to write from the server-provided
binlog_get_cache() function.
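As a toy illustration of this dual-method pattern (hypothetical code, not
the actual API): the _ordered part does only the minimal serialized work,
here reserving the next binlog position in commit order under a global
mutex, while the expensive copying of event data can happen outside the
lock, concurrently with other transactions.

```cpp
#include <cassert>
#include <cstdint>
#include <mutex>

static std::mutex commit_ordered_mutex;  // stands in for LOCK_commit_ordered
static uint64_t next_binlog_offset = 0;

// The _ordered step must be cheap: it only reserves the transaction's
// binlog position, in commit order, under the global mutex.
uint64_t toy_write_ordered(uint64_t event_size) {
  std::lock_guard<std::mutex> guard(commit_ordered_mutex);
  uint64_t pos = next_binlog_offset;
  next_binlog_offset += event_size;
  return pos;
}

// The non-ordered step does the expensive work (copying event data into
// the reserved slot) without holding any lock.
void toy_write_unordered(uint64_t /*pos*/, const unsigned char * /*data*/,
                         uint64_t /*event_size*/) {
  // ... copy data into the binlog at the reserved position ...
}
```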
An object of the class handler_binlog_event_group_info is used to hold
context data related to the event group while it is being binlogged.

  void (*binlog_group_commit_ordered)(THD *thd,
           handler_binlog_event_group_info *binlog_info);

This is called at the end of binlog group commit (after releasing
LOCK_commit_ordered), with the binlog_info of the last commit in the group.
It is used to optimise the handling of the crash-safe replication described
above, so that the engine can perform operations that need only be done once
per group commit (not once per commit), and which can be done without
holding the performance-critical LOCK_commit_ordered mutex.

  bool (*binlog_oob_data_ordered)(THD *thd, const unsigned char *data,
           size_t data_len, void **engine_data);
  bool (*binlog_oob_data)(THD *thd, const unsigned char *data,
           size_t data_len, void **engine_data);

These are called by the server layer to binlog large event groups
(transactions) in multiple pieces that can be interleaved with other events
in the binlog. This allows large transactions to span multiple binlog files,
avoids stalls around commits of large transactions, and allows event groups
larger than the maximum size of an InnoDB mini-transaction.

  void (*binlog_oob_reset)(THD *thd, void **engine_data);
  void (*binlog_oob_free)(THD *thd, void *engine_data);

These are simple functions to coordinate (re-)initialisation and
de-allocation of engine-specific data associated with the binlogging of
event groups.

  handler_binlog_reader * (*get_binlog_reader)(bool wait_durable);

This is the main function used to implement the reading of the binlog by
binlog dump threads (eg. connected slaves). The class handler_binlog_reader
implements an interface for the server to read binlog data, and for the dump
threads to wait for new data to arrive when they reach the current EOF.

  void (*binlog_status)(uint64_t *out_fileno, uint64_t *out_pos);

This is used to implement SHOW MASTER STATUS.
The file number and offset are mostly informational; the new binlog format
is intended primarily for GTID use.

  void (*get_filename)(char name[FN_REFLEN], uint64_t file_no);

In the new binlog implementation, binlog files are identified only by number
(the file name is engine-specific and not user configurable). This function
allows the server to obtain the engine-specific file name, for informational
purposes to the user.

  binlog_file_entry * (*get_binlog_file_list)(MEM_ROOT *mem_root);

This is used to implement SHOW BINARY LOGS.

  bool (*binlog_flush)();

This is used to implement FLUSH BINARY LOGS.

  bool (*binlog_get_init_state)(rpl_binlog_state_base *out_state);

This is used to implement FLUSH BINARY LOGS DELETE_DOMAIN_ID, to get the
binlog state at the start of the current (non-purged) binlog.

  bool (*reset_binlogs)();

This is used to implement RESET MASTER.

  int (*binlog_purge)(handler_binlog_purge_info *purge_info);

This is used to implement PURGE BINARY LOGS as well as auto-purge. The class
handler_binlog_purge_info is used to pass the various purge options to the
engine, as well as to inform the engine about active dump threads that would
prevent purging past their current read position.
 - Kristian Nielsen