Marko Mäkelä via developers <developers@lists.mariadb.org> writes:
On Fri, Jan 3, 2025 at 10:23 AM Kristian Nielsen <knielsen@knielsen-hq.org> wrote:
I think now is a good time for you to take a first real look at the InnoDB part of the changes; I would really value your input.
This is great. I will try to find some time for this before the FOSDEM weekend.
Much appreciated, thanks for your comments so far.
- We want to avoid the double-write buffer for binlog pages, at least for the first write of a page (most pages will only ever be written as full pages). You mentioned an idea to avoid the double-write buffer completely and instead add some specific recovery code for the uncommon case where a partial binlog page is written to disk due to low commit activity.
The idea is simple: Ensure that recovery will be able to read complete blocks, or to read log records that will completely initialize the blocks.
Right. Here is my conceptual understanding of how recovery should work. In most cases, the "page create" as well as all the writes to the page will be redo logged before the page is written to the file system - we only write full pages. Only if there is no or little binlog activity for a long time would it be necessary/desirable to write a partial page.

For the first write of a page to the file system after "page create", there is no need for a double-write buffer, right? Since there is no existing data that can be corrupted by a torn write. Only in the uncommon case where we decide to write out a partial page is there an issue with the subsequent write of the same page.

Here is an idea for how to handle this in a simple way completely inside the binlog code, without the need for either buffer pool or special recovery code: What if the binlog code simply keeps track of whenever the current page of the binlog gets partially written to the file system? When this happens, the next mtr_t write to that page will simply re-write all the data from the start of the page. This way the recovery code can always assume that the page is valid on disk prior to each redo record, and should be set to zeros following the record. I think it's literally just replacing this line:

    mtr->memcpy(*block, page_offset, size+3);

with this in the rare case after the page was partially written:

    mtr->memcpy(*block, 0, size+3);

Would this work, or are there some details of recovery I do not understand that make this unsafe?
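Just to make the idea concrete, here is a rough sketch of the bookkeeping I have in mind (the struct and the partially_written flag are hypothetical names, only the mtr_t::memcpy() call is existing InnoDB API, and the exact length computation is just for illustration):

    /* Hypothetical bookkeeping for the current end-of-binlog page. */
    struct binlog_page_state {
      uint32_t page_no;            /* page number of the current page        */
      uint32_t used_bytes;         /* bytes of valid binlog data on the page */
      bool partially_written;      /* page was flushed to disk before full   */
    };

    /* Append a record of 'size' bytes at 'page_offset' within the block. */
    static void binlog_log_append(mtr_t *mtr, buf_block_t *block,
                                  binlog_page_state &st,
                                  uint32_t page_offset, uint32_t size)
    {
      if (st.partially_written)
      {
        /* The page already exists partially on disk.  Redo-log everything
           from the start of the page up to the end of the new record, so
           recovery can rebuild the page from the log alone and never has
           to trust a possibly torn on-disk copy. */
        mtr->memcpy(*block, 0, page_offset + size + 3);
        st.partially_written= false;
      }
      else
      {
        /* Common case: only the newly appended bytes are redo-logged. */
        mtr->memcpy(*block, page_offset, size + 3);
      }
      st.used_bytes= page_offset + size + 3;
    }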
An alternative to the doublewrite buffer would be to "copy data to the log across the checkpoint", like we do for the FILE_MODIFY records that are needed for discovering *.ibd files on recovery. I do not have any idea how to implement this efficiently.
I'm unsure about exactly how a checkpoint is made. But it seems to me that a checkpoint must somehow span some LSN interval: starting at an LSN1, then flushing out all pages modified before LSN1, then ending the checkpoint at a later LSN2. Is that correct? Then as long as the current binlog page gets written out full between LSN1 and LSN2 (which should be the common case), there is no need for double-write or log-copy across the checkpoint, is there? And in the uncommon case, doing the above mtr->memcpy() from the start of the page should effectively implement the copy-across-checkpoint?
I think that it is simplest to implement some additional synchronization on log checkpoint, to ensure that any pending binlog writes have completed and been fsync-ed. After a checkpoint or on server startup, we must never overwrite the last written (partially filled) block, but just leave a zero-filled gap at the end of it.
This could also work (my code uses 0xff bytes to fill gaps to distinguish from end-of-file, but that is a detail).
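Something like the following is what I mean by sealing the partial page (again just a sketch reusing the hypothetical binlog_page_state from above; the exact mtr_t::memset() overload may differ):

    /* At a checkpoint (or on startup), if the current page was written out
       only partially, never append into it again: pad the unused tail with
       0xff filler bytes and continue on the next page.  0xff filler is
       distinguishable from the all-zero pages that mark end-of-binlog. */
    static void binlog_seal_partial_page(mtr_t *mtr, buf_block_t *block,
                                         binlog_page_state &st,
                                         uint32_t page_size)
    {
      if (uint32_t gap= page_size - st.used_bytes)
        mtr->memset(block, st.used_bytes, gap, 0xff);
      st.page_no++;                /* the next record starts on a fresh page */
      st.used_bytes= 0;
      st.partially_written= false;
    }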
Did you have any plans to update the binlog file in place?
No, I plan to make the binlog tablespaces strictly append-only. Even if I needed to, e.g., write some information at server shutdown to remember the current state (I do not need that in the current code), I would write a record at the end of the binlog rather than update, e.g., the file header in place, and then binary-search for the end of the binlog at server restart.
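The restart logic would be something along these lines (a standalone sketch over an in-memory view of the file; none of these names are existing InnoDB code):

    #include <cstdint>
    #include <cstddef>

    /* Because the binlog tablespace is strictly append-only, its used pages
       form a prefix of the file.  A binary search over the page numbers
       finds the first never-written (all-zero) page, i.e. the end of the
       binlog, without scanning the whole file. */
    static uint32_t binlog_find_end(const unsigned char *file,
                                    uint32_t n_pages, uint32_t page_size)
    {
      auto page_is_unused= [&](uint32_t page_no) {
        const unsigned char *p= file + size_t{page_no} * page_size;
        for (uint32_t i= 0; i < page_size; i++)
          if (p[i])
            return false;
        return true;
      };
      uint32_t lo= 0, hi= n_pages;       /* first unused page is in [lo, hi] */
      while (lo < hi)
      {
        uint32_t mid= lo + (hi - lo) / 2;
        if (page_is_unused(mid))
          hi= mid;
        else
          lo= mid + 1;
      }
      return lo;                         /* number of pages containing data */
    }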
I would always use a 4096-byte page size for the binlog tablespace.
Interesting. Why do you think it is beneficial to have a different page size for the binlog? From the point of view of the binlog code, a 4k page size should be fine; there is not a lot of difference between different page sizes. A smaller page size makes the per-page overhead more significant, but that overhead will be minimized for the binlog tablespaces, as you described.
It could make sense to introduce a separate list to manage binlog blocks, and keep those blocks out of buf_pool.LRU altogether. Maybe also keep them out of buf_pool.flush_list as well as mtr_t::m_memo, so that any code that deals with those lists can continue to assume that the pages use the InnoDB format. Separate flushing logic seems to be unavoidable.
Ok, sounds reasonable. The flushing of binlog pages is conceptually quite simple: we simply write out pages in page-number order, one by one, one tablespace after the other. So we do not even need any explicit LRU list. The only thing is that there is no need to write out the current end-of-binlog page until it is full, unless we need to do so to complete a checkpoint; that will be rare, since it is unlikely that a checkpoint will be needed without also writing at least one full page of binlog data. Maybe we would want to write out the last partial page if there is, say, 1 second of inactivity, just to make it available to external programs; again, that will be rare.
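In pseudocode, the flushing is roughly this (all names are made up, and the real thing of course needs locking and error handling):

    #include <cstdint>

    /* Hypothetical per-tablespace flush state. */
    struct binlog_space_t
    {
      uint32_t first_unflushed;    /* lowest page not yet written to the file */
      uint32_t current_page;       /* the page currently being appended to    */
      uint32_t current_page_bytes; /* bytes of data on the current page       */
    };

    void write_page(binlog_space_t &space, uint32_t page_no); /* real I/O elsewhere */

    /* Write out full pages strictly in page-number order.  The partially
       filled end-of-binlog page is written only when explicitly requested
       (to complete a checkpoint, or after ~1 second of inactivity). */
    void binlog_flush(binlog_space_t &space, bool flush_partial_last_page)
    {
      while (space.first_unflushed < space.current_page)
        write_page(space, space.first_unflushed++);   /* full pages, in order */

      if (flush_partial_last_page && space.current_page_bytes)
        write_page(space, space.current_page);  /* may be rewritten when full */
    }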
If there were no foreseeable need to write both InnoDB data and binlog in the same atomic mini-transaction (mainly, to have an atomic commit of an InnoDB-only persistent transaction), it could make sense to replace mtr_t with something binlog-specific.
Right, but having the binlog commit record in the same mtr_t as the transaction commit record is of course a crucial point, to avoid the need for 2-phase commit. Maybe it would make sense to use something binlog-specific for binlog writes that are not commit records? That includes non-transactional/DDL stuff, as well as out-of-band binlog data that gets written before the commit, for example batches of row-based replication events for large transactions.
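For the commit record itself, the point is simply that one mini-transaction covers both page writes; conceptually something like this (only mtr_t and trx_t are real InnoDB types, the two helpers are made-up names):

    /* Hypothetical helpers; the real code lives in the binlog and
       transaction commit paths respectively. */
    void binlog_append_commit_record(mtr_t *mtr, const unsigned char *event, size_t len);
    void trx_write_commit_record(trx_t *trx, mtr_t *mtr);

    /* Commit an InnoDB transaction together with its binlog commit record.
       Because both page writes are covered by the same mini-transaction,
       they become durable atomically and no 2-phase commit is needed. */
    void commit_with_binlog(trx_t *trx, const unsigned char *event, size_t len)
    {
      mtr_t mtr;
      mtr.start();
      binlog_append_commit_record(&mtr, event, len); /* redo-logged binlog page write */
      trx_write_commit_record(trx, &mtr);            /* InnoDB commit record, same mtr */
      mtr.commit();                  /* one atomic redo log write covers both */
    }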
It could make sense to avoid O_DIRECT on the binlog files and to issue
Any reason you want to avoid O_DIRECT?
posix_fadvise(POSIX_FADV_DONTNEED) to avoid file system cache pollution. Maybe there should be some configuration parameters for ...
Right, this probably makes sense in many cases, where the slave dump threads immediately read the binlog data to send to the slaves from the buffer pool, before the pages get written to the filesystem and evicted from the buffer pool. On the other hand, if a dump thread ends up reading the data from the file system, the file system cache would make sense. But I think slave dump threads are usually fully up-to-date and can read the data from the buffer pool before it gets evicted, so maybe avoiding the file system cache for the writes makes sense.
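For the non-O_DIRECT case, the write path could look roughly like this (plain POSIX calls, nothing InnoDB-specific; error handling omitted):

    #include <fcntl.h>
    #include <unistd.h>

    /* Write one binlog page with buffered I/O, then tell the kernel we will
       not need it again, so the page cache is not polluted by binlog data
       that the dump threads normally serve straight from the buffer pool. */
    void binlog_write_page(int fd, const void *page, size_t page_size, off_t offset)
    {
      if (pwrite(fd, page, page_size, offset) != (ssize_t) page_size)
        return;                               /* error handling omitted    */
      (void) fdatasync(fd);                   /* data must be durable; the
                                                 DONTNEED hint is ignored
                                                 for dirty pages           */
      (void) posix_fadvise(fd, offset, (off_t) page_size, POSIX_FADV_DONTNEED);
    }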
The buffer pool stores clear-text pages. Checksums are computed right before a page is written. For encryption, a separate buffer will be reserved right before writing out the page. I think that we must implement this logic separately for the binlog tablespace files.
In my current design, the new binlog format (e.g. the page format) is specific to the storage engine, i.e. InnoDB. One reason for this is to be able to re-use as much of the existing InnoDB code as possible, e.g. for the buffer pool, checksums, encryption, etc. As we discuss implementing more and more of this InnoDB code specially for the binlog, I wonder if it would be feasible to have the page format implemented on the server level, common to all engines that want to implement the binlog. The interface to InnoDB would then be a lower-level API that somehow exposes the mtr_t and recovery logic, rather than a high-level API that reads and writes binlog records, as currently. I'm not sure.

Another advantage of the current design is that it gives InnoDB more flexibility to implement things in the best way possible. A server-level page format really only matters for sharing more code between different engines implementing the binlog, which only matters if there will ever be another binlog engine implementation. So I think the current design is ok, but it is something that should be considered, at least.
In addition to the creation LSN, any tablespace attributes, such as encryption parameters or format version, would have to be stored in the first page.
Agree.
When it comes to encryption, I think that it is easiest to allow key version changes or key rotation only when switching binlog tablespaces.
Yes. This is also how it works for the legacy binlog.
I don't think it makes any sense to implement any page_compressed compression for the binlog tablespace. If you want compression, that ...
Agree. There is already some compression support in the replication events.
Side note: I think that we can abandon Heikki Tuuri's convention when naming new files. That is, just drop the meaningless handler0 and fsp0 prefixes.
Ack.

Hm, this became a long mail, hope it makes sense.

 - Kristian.