I wanted to give an update on the progress of my work on MDEV-34705, which is a task to implement a new binlog format that is stored as an InnoDB tablespace (or other engine that chooses to implement it).

To recap, the motivation includes removing the costly 2-phase commit between binlog and InnoDB; making replication crash-safe even when --innodb-flush-log-at-trx-commit=0 (or 2) and --sync-binlog=0; removing unnecessary complexity in the legacy binlog implementation; and removing limitations in the legacy binlog to facilitate future developments for replication.

The design is described in https://jira.mariadb.org/browse/MDEV-34705 , and the code is developed in https://github.com/MariaDB/server/commits/knielsen_binlog_in_engine.

A few weeks ago I reached a major milestone with the first working replication from InnoDB-implemented binlog on the master to a slave. I'm currently halfway through the last big piece of the puzzle, which is to be able to split event groups into multiple pieces interleaved with other event groups in the binlog. After that there will still be many details to be implemented, as the binlog implementation is visible in many user-facing places (which is one part of the problem with the legacy binlog). So good progress, but also lots of work left still.

I want to point out some design decisions that significantly change how the new binlog works compared to the legacy one, to facilitate the discussion. Remember that an important goal is to remove some of the unnecessary complexity of the legacy binlog, so support for some things will be dropped, which will be controversial, but the design is still open to suggestions with solid technical arguments.

1. I intend to remove the option to set the base name of binlog files. File names will be set by the storage engine (I'm using "binlog-NNNNNN.ibb" in the current code), and identified only by their (64-bit) number. This avoids the need for the master-bin.index file.
The need to keep track of different base file names for different binlog files creates a _lot_ of complexity in the legacy binlog (and there are still a number of bugs due to this). It must still be possible to set the directory containing the binlog files (but it will not be possible to split the binlog amongst multiple directories).

2. There will be some delay from commit until the binlog data is readable externally from the file. This is kind of inherent in the desire to speed up binlogging exactly by delaying the physical disk write (aka fsync()). Using mysqlbinlog --read-from-remote-server will work as before (e.g. it will be able to see committed transactions immediately). The code will, however, try to flush binlog pages to disk with high priority, so the delay will usually be small.

3. Binlog rotations, which are quite complex in the legacy binlog, will be mostly invisible. Binlog tablespace files are pre-allocated in the background, and will always have a fixed size (--max-binlog-size). Binlog writes pass seamlessly from the end of one file to the start of the next, and replication events can be split across binlog files.

4. I am thinking of requiring GTID mode in the new binlog format, disallowing slaves to connect using filename/offset. This is not a firm decision yet; technically I think it is not too hard to keep this. But removing this will reduce complexity and potentially allow other storage engines in the future to implement their own binlog format that does not map well to filename/offset.

5. A more controversial thought is to drop support for semi-sync replication. I think many users use semi-sync believing it does more than it actually does. Instead of semi-sync, users can always just SELECT MASTER_GTID_WAIT(@@last_gtid) on a slave to get arguably better functionality. And the semi-sync implementation has always been problematic (IMHO), what with sending the actual binlog filename string back and forth with every commit, and causing much complexity and many bugs.
A less controversial option would be to release the first version without semi-sync support and add it later.

6. Large event groups (configurable, currently using --binlog-cache-size) will be written out-of-band into the binlog during query execution. This means the event group for the transaction can be binlogged in different pieces that can be interleaved with other event groups. This removes the limitation that even a huge transaction must be binlogged as a single consecutive event group in a single binlog file (which can stall other commits). It also allows a future (not in first release) enhancement where optimistic parallel replication could optionally start applying a large transaction on the slave while it is still executing on the master. In the first version, the dump thread on the master will assemble the pieces before sending to the slave. This means that if an active binlog file N contains a commit that references event data written to file (N-k), then binlog purge will be blocked not just from N, but from N-k. It also means that if a large transaction ends up being rolled back, then this will leave extra unused data in the binlog files until purged. I think this is a good trade-off, but it's easy to add an option to disable the out-of-band binlogging, if desired in some special uses.

7. For migration to the new binlog, I want to allow that the old part of the binlog is in the legacy format, and the new part is using the new implementation. This is to allow switching a replication setup to use the new implementation by simply stopping and restarting the master with the new option --binlog-storage-engine=innodb, and the slaves can pick up from where they left off. I also want to leave a way to roll back, that is, for users to disable the new binlog and go back to the legacy one in case of problems. But I want to avoid a binlog that goes back-and-forth between different formats (only allow a single point where it switches from legacy to new).
So the current thinking is that rolling back to the legacy format will be done with a script that converts any newly written binlog files in the new format to the legacy format while traffic is stopped.

As always, comments and suggestions are very welcome.

 - Kristian.
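
Illustration for points 1 and 6: the file naming and the purge rule can be sketched in a few lines of Python. This is only a rough model, not the server code. The function and parameter names are hypothetical, and the six-digit zero padding is an assumption read off the "binlog-NNNNNN.ibb" pattern.

```python
from typing import Dict


def binlog_file_name(file_no: int) -> str:
    """Build a file name from the binlog file number alone.

    Assumption: "NNNNNN" means the number zero-padded to six digits.
    """
    return f"binlog-{file_no:06d}.ibb"


def purge_horizon(oldest_needed: int, oob_refs: Dict[int, int]) -> int:
    """Return the lowest file number that must be kept on purge.

    oldest_needed: lowest file number N that is still in use.
    oob_refs: hypothetical map from a file number to the lowest
        earlier file number (N-k) holding out-of-band event data
        referenced by a commit in that file.
    """
    horizon = oldest_needed
    for file_no, oldest_ref in oob_refs.items():
        # Only files that are themselves still needed can pin older files.
        if file_no >= oldest_needed:
            horizon = min(horizon, oldest_ref)
    return horizon


if __name__ == "__main__":
    # File 10 is still needed and has a commit referencing data in file 7,
    # so purge has to stop before file 7 rather than before file 10.
    keep_from = purge_horizon(10, {10: 7, 12: 9, 5: 2})
    print(binlog_file_name(keep_from))
```

In this model the reference from file 5 is ignored because file 5 itself is no longer needed, which is why only files at or above the in-use point can hold back the purge.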