On Wed, Jan 8, 2025 at 6:25 PM Marko Mäkelä <marko.makela@mariadb.com> wrote:
I think that we could allow the binlog layer to write directly to the 4096-byte blocks that are allocated from the InnoDB buffer pool. The binlog page cleaner thread might even be writing the last (incomplete) block concurrently while we are adding more data to it.
We might simplify the format even further and make it mostly independent of block sizes, similar to how in MDEV-14425 I removed the 512-byte block structure of ib_logfile0 and made each mini-transaction a "block" of its own. That is, the binlog writer would compute CRC-32C on the event snippets or groups and include it in the data that it passes to InnoDB. InnoDB would write entire pages without reserving any header or footer. The InnoDB block size could simply be innodb_page_size. The write granularity from InnoDB could be 4096 bytes, to be compatible with the requirements of O_DIRECT.

If we go down this route, then encryption would have to be implemented in the binlog writer, before computing the CRC-32C (which I think should be computed on the encrypted data).

In the binlog file, the only additional structure would be a file header block that identifies the format and stores the creation LSN. I would propose to reserve 4096 bytes for this (independently of innodb_page_size).

In that way, even if there is a race between an asynchronous write into the file system, and a binlog producer appending records to the last (incomplete) binlog block, any external tool could handle the situation just fine, simply by stopping when a CRC-32C validation fails.

	Marko
--
Marko Mäkelä, Lead Developer InnoDB
MariaDB plc
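
To illustrate the framing idea above, here is a rough Python sketch, not a proposal for the actual on-disk layout. Only the 4096-byte header block, the creation LSN, and the use of CRC-32C come from the description above. The magic string, the field order, the 4-byte length prefix, and all function names are hypothetical and chosen only for the example. The payload is assumed to be already encrypted (if encryption is enabled) before the checksum is computed.

```python
import struct

HEADER_SIZE = 4096        # header block size, independent of innodb_page_size
MAGIC = b"BINLOGv0"       # hypothetical format identifier


def _make_crc32c_table():
    """Build the lookup table for CRC-32C (Castagnoli), reflected form."""
    poly = 0x82F63B78
    table = []
    for n in range(256):
        c = n
        for _ in range(8):
            c = (c >> 1) ^ poly if c & 1 else c >> 1
        table.append(c)
    return table


_TABLE = _make_crc32c_table()


def crc32c(data: bytes) -> int:
    """Plain table-driven CRC-32C; real code would use hardware instructions."""
    crc = 0xFFFFFFFF
    for b in data:
        crc = _TABLE[(crc ^ b) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF


def make_header(creation_lsn: int) -> bytes:
    """Hypothetical header layout: magic, 64-bit creation LSN, zero padding."""
    body = MAGIC + struct.pack("<Q", creation_lsn)
    return body.ljust(HEADER_SIZE, b"\0")


def frame(payload: bytes) -> bytes:
    """Hypothetical record: length, (already encrypted) payload, CRC-32C."""
    head = struct.pack("<I", len(payload))
    return head + payload + struct.pack("<I", crc32c(head + payload))


def read_records(image: bytes):
    """Return the payloads up to the first record that fails validation."""
    pos = HEADER_SIZE
    records = []
    while pos + 4 <= len(image):
        (length,) = struct.unpack_from("<I", image, pos)
        end = pos + 4 + length
        # A zero length (unwritten tail) or a record running past the
        # end of the image marks the end of the valid data.
        if length == 0 or end + 4 > len(image):
            break
        (stored,) = struct.unpack_from("<I", image, end)
        if stored != crc32c(image[pos:end]):
            break  # torn or incomplete write: stop here
        records.append(image[pos + 4:end])
        pos = end + 4
    return records
```

A reader built this way needs no knowledge of the page size: it skips the header block and walks the records, and a partially written last block simply ends the scan at the first checksum mismatch.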