Hi Kristian,

On Fri, Jan 3, 2025 at 10:23 AM Kristian Nielsen <knielsen@knielsen-hq.org> wrote:
> But things have progressed, and I have now reached the point where most of the basic groundwork is implemented. Event groups are binlogged to InnoDB tablespaces. Binlog dump thread can read the binlog and send to slave, and replication is working. Large event groups are split into pieces, bounding the amount of data that needs to be atomically written in mini-transactions and at commit time. There are still many details left, but mostly in the server-layer replication code which should be manageable, just will take some time to get completed.
>
> I think now is a good time for you to take a first real look at the InnoDB part of the changes, I would really value your input.
This is great. I will try to find some time for this before the FOSDEM weekend.
> The main part of the InnoDB code is in two files:
> 1. handler/handler0binlog.cc for the high-level part that deals mostly with the new binlog file format and interfacing to the server layer.
> 2. fsp/fsp0binlog.cc for the low-level part that most tightly interacts with the InnoDB mini-transactions and buffer pool.
Side note: I think that we can abandon Heikki Tuuri's convention when naming new files. That is, just drop the meaningless handler0 and fsp0 prefixes.
> - We previously discussed removing some of the page header overhead for binlog tablespaces. Currently the code just leaves alone the first FIL_PAGE_DATA bytes (38) and the last FIL_PAGE_DATA_END (8 IIRC).
Right. As far as I can tell, the minimum that we need is a page checksum. I would use CRC-32C in big-endian format. Because the binlog tablespaces will be append-only and never updated in place (except for the last block), we will not need any per-page LSN field.
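To make the checksum idea concrete, here is a minimal sketch of a 4096-byte block that carries only a CRC-32C stored in big-endian byte order. Placing the checksum in the last 4 bytes, and the names BINLOG_PAGE_SIZE, page_store_checksum() and page_verify_checksum(), are assumptions for illustration only. The bitwise loop is a portable stand-in, not the hardware-accelerated CRC-32C that a server would use.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumed layout (illustration only): payload followed by a 4-byte checksum.
constexpr std::size_t BINLOG_PAGE_SIZE = 4096;
constexpr std::size_t CHECKSUM_SIZE = 4;
constexpr std::size_t PAYLOAD_SIZE = BINLOG_PAGE_SIZE - CHECKSUM_SIZE;

// Bitwise CRC-32C (Castagnoli), reflected polynomial 0x82F63B78.
inline std::uint32_t crc32c(const unsigned char *data, std::size_t len)
{
  std::uint32_t crc = 0xFFFFFFFFU;
  for (std::size_t i = 0; i < len; i++) {
    crc ^= data[i];
    for (int bit = 0; bit < 8; bit++)
      crc = (crc >> 1) ^ (0x82F63B78U & (0U - (crc & 1U)));
  }
  return ~crc;
}

// Compute the checksum over the payload and store it big-endian at the end.
inline void page_store_checksum(unsigned char *page)
{
  assert(page != nullptr);
  const std::uint32_t crc = crc32c(page, PAYLOAD_SIZE);
  page[PAYLOAD_SIZE + 0] = static_cast<unsigned char>(crc >> 24);
  page[PAYLOAD_SIZE + 1] = static_cast<unsigned char>(crc >> 16);
  page[PAYLOAD_SIZE + 2] = static_cast<unsigned char>(crc >> 8);
  page[PAYLOAD_SIZE + 3] = static_cast<unsigned char>(crc);
}

// Recompute the checksum and compare it with the stored big-endian value.
inline bool page_verify_checksum(const unsigned char *page)
{
  assert(page != nullptr);
  const std::uint32_t stored =
      (static_cast<std::uint32_t>(page[PAYLOAD_SIZE + 0]) << 24) |
      (static_cast<std::uint32_t>(page[PAYLOAD_SIZE + 1]) << 16) |
      (static_cast<std::uint32_t>(page[PAYLOAD_SIZE + 2]) << 8) |
      static_cast<std::uint32_t>(page[PAYLOAD_SIZE + 3]);
  return stored == crc32c(page, PAYLOAD_SIZE);
}
```

With this layout the per-page overhead would shrink from 38 + 8 bytes to 4 bytes. Storing the checksum at the start of the block instead would work the same way.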
> - We discussed previously to write the current LSN at the start of the tablespace, and use this in recovery to handle that we have only two tablespace IDs that are reused. So we need code in recovery that checks the LSN at the start of the tablespace, and skips redo records with LSN smaller than this.
In addition to the creation LSN, any tablespace attributes, such as encryption parameters or format version, would have to be stored in the first page. When it comes to encryption, I think that it is easiest to allow key version changes or key rotation only when switching binlog tablespaces.

I would always use a 4096-byte page size for the binlog tablespace. The InnoDB buffer pool only supports one innodb_page_size at a time, but we could simply allocate innodb_page_size blocks (4 KiB to 64 KiB) and write the last page up to the required multiple of 4096 bytes. Because I would like to simplify and optimize the page format, we must implement some special I/O handling of the binlog tablespace anyway.
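The two rules above can be sketched roughly as follows, under stated assumptions: a redo record for the binlog tablespace is skipped when its LSN is older than the creation LSN found in the first page, and a partially filled buffer block is written only up to the next multiple of 4096 bytes. The types binlog_header and redo_record and the helpers must_apply() and write_length() are hypothetical and do not exist in InnoDB under these names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

using lsn_t = std::uint64_t;

constexpr std::size_t BINLOG_IO_UNIT = 4096;

// Hypothetical first-page contents; only the creation LSN is modelled here.
struct binlog_header {
  lsn_t creation_lsn;
};

// Hypothetical, heavily simplified redo log record.
struct redo_record {
  std::uint32_t space_id;
  lsn_t lsn;
};

// Records for other tablespaces are unaffected. Records for the binlog
// tablespace that predate its creation belong to an earlier use of the
// same (recycled) tablespace ID and must be skipped.
inline bool must_apply(const redo_record &rec, std::uint32_t binlog_space_id,
                       const binlog_header &hdr)
{
  if (rec.space_id != binlog_space_id)
    return true;
  return rec.lsn >= hdr.creation_lsn;
}

// Number of bytes to write for a buffer block holding used_bytes of data:
// round up to the next multiple of the 4096-byte I/O unit.
inline std::size_t write_length(std::size_t used_bytes)
{
  assert(used_bytes <= SIZE_MAX - (BINLOG_IO_UNIT - 1));
  return (used_bytes + BINLOG_IO_UNIT - 1) / BINLOG_IO_UNIT * BINLOG_IO_UNIT;
}
```

For example, with innodb_page_size=16k and 5000 bytes of binlog data in the last block, write_length(5000) gives 8192, so only the first two 4096-byte units of the 16 KiB buffer block would be written.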
> - We want to avoid the double-write buffer for binlog pages, at least for the first page write (most pages will only be written as full pages). You mentioned an idea to completely avoid the double-write buffer and instead do some specific code for recovery in the uncommon case where a partial binlog page is written to disk due to low commit activity.
The idea is simple: Ensure that recovery will be able to read complete blocks, or to read log records that will completely initialize the blocks. We need to tweak the InnoDB log checkpoint somehow to guarantee this.

For regular persistent tablespaces, the current requirement is less strict: Any page write completion will "increment" buf_pool.get_oldest_modification() by "shrinking" buf_pool.flush_list, and we only care that there are no pending writes with LSN less than the checkpoint LSN. The current LSN could be megabytes or even gigabytes ahead of the old or the new checkpoint LSN.

An alternative to the doublewrite buffer would be to "copy data to the log across the checkpoint", like we do for the FILE_MODIFY records that are needed for discovering *.ibd files on recovery. I do not have any idea how to implement this efficiently. I think that it is simplest to implement some additional synchronization on log checkpoint, to ensure that any pending binlog writes have completed and been fsync-ed.

After a checkpoint or on server startup, we must never overwrite the last written (partially filled) block, but just leave a zero-filled gap at the end of it. The next write would start a new block. In that way, recovery should be guaranteed to work. If we are writing a new block, the redo log records will start at offset 0 (or 4 if we store the checksum at the start of the block), and recovery will not have to read anything from the binlog tablespace. In fact, it could be a recovery error if the log records for the binlog tablespace are not starting at offset 0.

Did you have any plans of updating the binlog file in place? Anything like a directory structure within the file, or updating the status of a binlog event group in some header after a transaction has been committed? If the format cannot be strictly append-only, it will be harder to avoid using a doublewrite buffer.
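The "never overwrite a partially filled block after a checkpoint" rule could be modelled roughly like this. It is a minimal sketch with hypothetical names (binlog_appender, seal_partial_block(), first_redo_offset_is_valid()), and it assumes that no bytes at the start of a block are reserved for a checksum, so that the first record of a new block lands at offset 0.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint64_t BINLOG_BLOCK_SIZE = 4096;

// Tracks the logical end of an append-only binlog file (illustration only).
struct binlog_appender {
  std::uint64_t end_offset = 0;

  // To be called at log checkpoint and on startup: abandon the remainder of
  // a partially filled block, leaving a zero-filled gap, so that the block
  // already on disk is never rewritten.
  void seal_partial_block()
  {
    const std::uint64_t used = end_offset % BINLOG_BLOCK_SIZE;
    if (used != 0)
      end_offset += BINLOG_BLOCK_SIZE - used;
  }

  // Reserve len bytes at the current end and return their start offset.
  std::uint64_t append(std::uint64_t len)
  {
    assert(len > 0);
    const std::uint64_t start = end_offset;
    end_offset += len;
    return start;
  }
};

// Recovery-side check: the first redo record seen for a binlog block after
// the checkpoint must initialize the block from its beginning.
inline bool first_redo_offset_is_valid(std::uint64_t file_offset)
{
  return file_offset % BINLOG_BLOCK_SIZE == 0;
}
```

In this model, an append of 150 bytes followed by a checkpoint makes the next append start at offset 4096, and recovery would treat a first record at any other in-block offset as an error.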
> - The flushing of binlog pages to disk currently happens in a dedicated thread in the background. I'd welcome ideas on how to do this differently. It is good to flush binlog pages quickly and re-use their buffer pool entries for something better. Also writing the pages to disk quickly (not necessarily fsync()'ing) makes the data readable by mysqlbinlog.
It could make sense to introduce a separate list to manage binlog blocks, and keep those blocks out of buf_pool.LRU altogether. Maybe also keep them out of buf_pool.flush_list as well as mtr_t::m_memo, so that any code that deals with those lists can continue to assume that the pages use the InnoDB format. Separate flushing logic seems to be unavoidable. We might also introduce a new data member in mtr_t for keeping track of binlog blocks, so that mtr_t::m_memo would be something for the regular buf_pool.flush_list. If there was no foreseeable need to write both InnoDB data and binlog in the same atomic mini-transaction (mainly, to have an atomic commit of an InnoDB-only persistent transaction), it could make sense to replace mtr_t with something binlog specific.

It could make sense to avoid O_DIRECT on the binlog files and to issue posix_fadvise(POSIX_FADV_DONTNEED) to avoid file system cache pollution. Maybe there should be some configuration parameters for this. We probably want asynchronous writes, possibly with the RWF_UNCACHED flag when(ever) it becomes available: https://lore.kernel.org/linux-fsdevel/20241220154831.1086649-1-axboe@kernel....
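For the posix_fadvise() idea, a minimal sketch on Linux could look like the following. The helper name write_block_and_drop() is hypothetical. The write is synchronous here for brevity, whereas the real thing would presumably be asynchronous. The fsync() before the hint is there because POSIX_FADV_DONTNEED is only expected to discard pages that are already clean.

```cpp
#include <cassert>
#include <cstddef>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

// Write one block through the file system cache (no O_DIRECT), make it
// durable, and then hint to the kernel that the cached copy is not needed.
// Returns the number of bytes written, or -1 on a write or sync failure.
inline ssize_t write_block_and_drop(int fd, const void *buf, std::size_t len,
                                    off_t offset)
{
  assert(fd >= 0);
  const ssize_t written = pwrite(fd, buf, len, offset);
  if (written < 0 || static_cast<std::size_t>(written) != len)
    return -1;
  // Dirty pages are not dropped by the hint, so write them back first.
  if (fsync(fd) != 0)
    return -1;
  // The call is purely advisory; a failure here is not an I/O error.
  (void) posix_fadvise(fd, offset, static_cast<off_t>(len),
                       POSIX_FADV_DONTNEED);
  return written;
}
```

Whether to pay for the fsync() on every block or to issue the hint later from the flushing thread is the kind of choice that a configuration parameter could cover.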
> - Checksum and encryption should use the standard InnoDB mechanism. I assume checksum is already handled in the code through using the buffer pool and mini-transactions to read/write pages. Not sure about encryption. I need to implement that the code handles checksum and decryption when reading the pages manually from the file (not through buffer pool).
The buffer pool stores clear-text pages. Checksums are computed right before a page is written. For encryption, a separate buffer will be reserved right before writing out the page. I think that we must implement this logic separately for the binlog tablespace files. It does not need to be as complicated as for the InnoDB data files, with multiple format options.

I don't think it makes any sense to implement any page_compressed compression for the binlog tablespace. If you want compression, that would be best done at the binlog event level, similar to how the compressed BLOBs in InnoDB ROW_FORMAT=COMPRESSED work; see btr_store_big_rec_extern_fields(). This would have to be done before the bytes reach the InnoDB buffer pool.

Marko
-- 
Marko Mäkelä, Lead Developer InnoDB
MariaDB plc