Current status on MDEV-34705 binlog-in-engine
Hi Brandon (and others interested),

It is great that you want to look closer at the MDEV-34705 binlog-in-engine project. I wanted to write up a high-level summary of the current status to help you get an overview.

For reference, this is a project to implement a new binlog format that solves some of the complexity and performance problems of the old format, which is not integrated with the storage engine transaction log (e.g. the InnoDB redo log).

The code is being developed on the branch knielsen_binlog_in_engine on GitHub:

  https://github.com/MariaDB/server/commits/knielsen_binlog_in_engine

There have been some previous design discussions on the mailing list, e.g.:

  https://lists.mariadb.org/hyperkitty/list/developers@lists.mariadb.org/threa...
  https://lists.mariadb.org/hyperkitty/list/developers@lists.mariadb.org/threa...

 - Kristian.

Current status:

- The feature is off by default; it is enabled with --binlog-storage-engine=innodb.
- Binlog files are now identified solely by their file_no, starting from 0 (no more master-bin.index). Binlog file names are hard-coded by the engine (binlog-NNNNNN.ibb); the directory can be selected with --binlog-directory.
- Binlog files are pre-allocated to --max-binlog-size, so their size is fixed. Event groups, and even individual events, can span multiple binlog files.
- The new binlog format is page-based. On server start, binary search is used to find the current position (no need to scan a whole binlog file). The GTID start position is likewise found by binary search, with no need for a separate GTID index.
- Binlog files are written strictly append-only; data is never updated after the initial write. The InnoDB redo log is used to ensure that files are recovered into a state consistent with the engine data after a crash.
- Large event groups (larger than --binlog-cache-size) are split into pieces that are binlogged "out-of-band", meaning they are written in separate records during transaction execution, before commit. The final commit record merely holds a reference to the out-of-band data. Out-of-band data for rolled-back transactions remains as "garbage" until binlog purge.
- The binlog dump thread re-assembles out-of-band data into consecutive event groups for the slaves. Thus, slaves see event data in the same format as before, and no slave code changes are needed. Out-of-band data is structured as complete binary trees for efficient reading.
- Basic slave replication is working, though not much of it is tested. It is currently GTID mode only; it might make sense not to support filename/position mode at all, for simplicity.
- RESET MASTER, FLUSH BINARY LOGS, and PURGE BINARY LOGS are implemented.
- Initial benchmarking shows a good speedup (x2) in durable configuration, and a potentially huge speedup (3 to several hundred times faster) in crash-safe configuration.

A main challenge with this project is that the binlog internals are visible in so many places, both to users and to other parts of the code. Here is an (incomplete) list of major issues still to implement:

- The current implementation uses the normal InnoDB buffer pool and checkpoint mechanism. See the mailing list discussions for Marko's ideas for implementing a simpler integration with the InnoDB redo log.
- When --sync-binlog=1, implement that binlog events are not sent to slaves until they are durably written to the redo log. When --sync-binlog=0, implement that events are not sent until the redo log is (non-durably) written (--innodb-flush-log-at-trx-commit=2).
- mysqlbinlog ability to read the new binlog format. The code is there; it is mostly a matter of finding a good way to share code between the server/InnoDB and mysqlbinlog.cc.
- Disable all parts of the legacy binlog when --binlog-storage-engine is set (in the current code some parts still remain active).
- SHOW BINLOG EVENTS.
- Binlog checksum and encryption. This will checksum and encrypt the individual pages, not use the legacy per-event checksums or encryption.
- Storage engine API. The functionality is implemented, but the design of the API and interfaces might need review and cleanup, i.e. maybe some parts should be implemented as services (include/service_versions.h etc.).
- Lots of smaller ToDo:s, some marked in comments in the code.
- There is a --suite=binlog_in_engine, but a lot more testing is needed.

Some parts may not be supported in the first release, but implemented later:

- Semi-synchronous replication. Should be straightforward to implement later. It might make sense to send GTIDs back and forth instead of filename/offset. AFTER_SYNC no longer makes sense, but we could consider implementing AFTER_PREPARE.
- External XA, to be implemented following MDEV-32020. The out-of-band mechanism will be used so that XA COMMIT can reference the binlog data and replicate without the need for any transactions to be in XA PREPARED state on the slave or in backups.
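To make the "binary search on server start" point under the current status a bit more concrete, here is a rough sketch of the general idea in Python. It is not taken from the branch and makes several assumptions of its own: a fixed page size of 4096 bytes, unwritten pages in the pre-allocated file being all zero, and written pages forming a contiguous prefix (which follows from the append-only property). All names in it (PAGE_SIZE, page_is_written, find_first_unwritten_page, make_preallocated_file) are hypothetical.

```python
import io

PAGE_SIZE = 4096  # assumption for this sketch, not the real page size


def page_is_written(f, page_no, page_size=PAGE_SIZE):
    """Return True if the page holds any non-zero byte.

    Assumption: pre-allocated but unused pages are all zero.
    """
    f.seek(page_no * page_size)
    data = f.read(page_size)
    return any(data)


def find_first_unwritten_page(f, num_pages, page_size=PAGE_SIZE):
    """Binary search for the first unwritten page in a pre-allocated file.

    Works because the file is append-only: written pages form a prefix,
    so "is this page written?" is true up to some point and false after it.
    Reads O(log num_pages) pages instead of scanning the whole file.
    """
    lo, hi = 0, num_pages  # the answer is always within [lo, hi]
    while lo < hi:
        mid = (lo + hi) // 2
        if page_is_written(f, mid, page_size):
            lo = mid + 1  # the write position is after this page
        else:
            hi = mid  # this page, or an earlier one, is the first empty page
    return lo


def make_preallocated_file(num_pages, written_pages, page_size=PAGE_SIZE):
    """Build an in-memory stand-in for a pre-allocated binlog file."""
    buf = bytearray(num_pages * page_size)
    for p in range(written_pages):
        buf[p * page_size] = 1  # mark the page as non-empty
    return io.BytesIO(bytes(buf))
```

For example, find_first_unwritten_page(make_preallocated_file(64, 37), 64) returns 37 after reading only a handful of pages. The real code additionally has to locate the position within the last page and the newest file_no, which this sketch leaves out.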
participants (1)
-
Kristian Nielsen