Hi Marko,

Time flies; somehow it's already more than a year since our first discussions on implementing the binlog in InnoDB and avoiding the extra fsync() and the complexity of the two-phase commit between InnoDB and the binlog. But things have progressed, and I have now reached the point where most of the basic groundwork is implemented. Event groups are binlogged to InnoDB tablespaces. The binlog dump thread can read the binlog and send it to a slave, and replication is working. Large event groups are split into pieces, bounding the amount of data that needs to be written atomically in mini-transactions and at commit time. There are still many details left, but mostly in the server-layer replication code, which should be manageable; it will just take some time to complete.

I think now is a good time for you to take a first real look at the InnoDB part of the changes; I would really value your input. The main part of the InnoDB code is in two files:

1. handler/handler0binlog.cc for the high-level part that deals mostly with the new binlog file format and interfacing with the server layer.

2. fsp/fsp0binlog.cc for the low-level part that interacts most tightly with the InnoDB mini-transactions and buffer pool.

The most interesting part for you to look at is fsp/fsp0binlog.cc (~1k lines), though I'm happy to hear comments on any part of the patch, of course. The code is pushed to GitHub in the branch knielsen_binlog_in_engine:

https://github.com/MariaDB/server/commits/knielsen_binlog_in_engine

and I've also attached the complete patch.

This is my first major patch for InnoDB, so there will undoubtedly be a number of style changes required. But the overall structure of the code should now be close to what I imagine the final result will be, with some pending ToDo steps marked in comments in the code and detailed in the list below, some of which we have already discussed a bit.

I hope you will take a look at the patch and let me know of any questions or other things you need from me. Maybe we can also find a chance to discuss further if you come to FOSDEM at the start of February, or I could visit sometime in Finland.

 - Kristian.

----

Some known outstanding issues in the InnoDB part:

- We previously discussed removing some of the page header overhead for binlog tablespaces. Currently the code just leaves alone the first FIL_PAGE_DATA bytes (38) and the last FIL_PAGE_DATA_END (8 IIRC).

- We discussed previously writing the current LSN at the start of the tablespace, and using this in recovery to handle the fact that we have only two tablespace IDs that are reused. So we need code in recovery that checks the LSN at the start of the tablespace and skips redo records with an LSN smaller than this.

- We want to avoid the double-write buffer for binlog pages, at least for the first page write (most pages will only be written as full pages). You mentioned an idea to avoid the double-write buffer completely and instead add some specific recovery code for the uncommon case where a partial binlog page is written to disk due to low commit activity.

- The flushing of binlog pages to disk currently happens in a dedicated background thread. I'd welcome ideas on how to do this differently. It is good to flush binlog pages quickly and re-use their buffer pool entries for something better. Also, writing the pages to disk quickly (not necessarily fsync()'ing them) makes the data readable by mysqlbinlog.

- Checksum and encryption should use the standard InnoDB mechanisms. I assume checksums are already handled in the code through using the buffer pool and mini-transactions to read/write pages. Not sure about encryption. I need to implement checksum verification and decryption for when the code reads pages manually from the file (not through the buffer pool).
Hi Kristian,

On Fri, Jan 3, 2025 at 10:23 AM Kristian Nielsen <knielsen@knielsen-hq.org> wrote:
But things have progressed, and I have now reached the point where most of the basic groundwork is implemented. Event groups are binlogged to InnoDB tablespaces. Binlog dump thread can read the binlog and send to slave, and replication is working. Large event groups are split into pieces, bounding the amount of data that needs to be atomically written in mini-transactions and at commit time. There are still many details left, but mostly in the server-layer replication code which should be manageable, just will take some time to get completed.
I think now is a good time for you to take a first real look at the InnoDB part of the changes, I would really value your input.
This is great. I will try to find some time for this before the FOSDEM weekend.
The main part of the InnoDB code is in two files:
1. handler/handler0binlog.cc for the high-level part that deals mostly with the new binlog file format and interfacing to the server layer.
2. fsp/fsp0binlog.cc for the low-level part that most tightly interacts with the InnoDB mini-transactions and buffer pool.
Side note: I think that we can abandon Heikki Tuuri's convention when naming new files. That is, just drop the meaningless handler0 and fsp0 prefixes.
- We previously discussed removing some of the page header overhead for binlog tablespaces. Currently the code just leaves alone the first FIL_PAGE_DATA bytes (38) and the last FIL_PAGE_DATA_END (8 IIRC).
Right. As far as I can tell, the minimum that we need is a page checksum. I would use CRC-32C in big endian format. Because the binlog tablespaces will be append only and never update-in-place (except for the last block), we will not need any per-page LSN field.
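[Editor's note: as a concrete illustration of this proposal, here is a minimal sketch of sealing a binlog block with a big-endian CRC-32C. Storing the 4 checksum bytes at the start of the block is an assumption based on the later remark about offset 4; InnoDB would use its own hardware-accelerated CRC-32C rather than this bitwise reference version.]

    #include <cstddef>
    #include <cstdint>

    // Reference CRC-32C (Castagnoli), reflected, polynomial 0x82F63B78.
    static uint32_t crc32c(const uint8_t *data, size_t len)
    {
      uint32_t crc = 0xFFFFFFFFU;
      for (size_t i = 0; i < len; i++)
      {
        crc ^= data[i];
        for (int k = 0; k < 8; k++)
          crc = (crc >> 1) ^ (0x82F63B78U & (0U - (crc & 1U)));
      }
      return ~crc;
    }

    // Seal a binlog block: checksum the payload and store the CRC in
    // big-endian byte order in the first 4 bytes of the block.
    static void binlog_block_seal(uint8_t *block, size_t block_size)
    {
      const uint32_t crc = crc32c(block + 4, block_size - 4);
      block[0] = uint8_t(crc >> 24);
      block[1] = uint8_t(crc >> 16);
      block[2] = uint8_t(crc >> 8);
      block[3] = uint8_t(crc);
    }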
- We discussed previously to write the current LSN at the start of the tablespace, and use this in recovery to handle that we have only two tablespace IDs that are reused. So we need code in recovery that checks the LSN at the start of the tablespace, and skips redo records with LSN smaller than this.
In addition to the creation LSN, any tablespace attributes, such as encryption parameters or format version, would have to be stored in the first page. When it comes to encryption, I think that it is easiest to allow key version changes or key rotation only when switching binlog tablespaces.

I would always use a 4096-byte page size for the binlog tablespace. The InnoDB buffer pool only supports one innodb_page_size at a time, but we could simply allocate innodb_page_size blocks (4 KiB to 64 KiB) and write the last page up to the required multiple of 4096 bytes. Because I would like to simplify and optimize the page format, we must implement some special I/O handling of the binlog tablespace anyway.
- We want to avoid the double-write buffer for binlog pages, at least for the first page write (most pages will only be written as full pages). You mentioned an idea to completely avoid the double-write buffer and instead do some specific code for recovery in the uncommon case where a partial binlog page is written to disk due to low commit activity.
The idea is simple: Ensure that recovery will be able to read complete blocks, or to read log records that will completely initialize the blocks. We need to tweak the InnoDB log checkpoint somehow to guarantee this. For regular persistent tablespaces, the current requirement is less strict: Any page write completion will "increment" buf_pool.get_oldest_modification() by "shrinking" buf_pool.flush_list, and we only care that there are no pending writes with an LSN less than the checkpoint LSN. The current LSN could be megabytes or even gigabytes ahead of the old or the new checkpoint LSN.

An alternative to the doublewrite buffer would be to "copy data to the log across the checkpoint", like we do for the FILE_MODIFY records that are needed for discovering *.ibd files on recovery. I do not have any idea how to implement this efficiently.

I think that it is simplest to implement some additional synchronization on log checkpoint, to ensure that any pending binlog writes have completed and been fsync-ed. After a checkpoint or on server startup, we must never overwrite the last written (partially filled) block, but just leave a zero-filled gap at the end of it. The next write would start a new block. In that way, recovery should be guaranteed to work. If we are writing a new block, the redo log records will start at offset 0 (or 4 if we store the checksum at the start of the block), and recovery will not have to read anything from the binlog tablespace. In fact, it could be a recovery error if the log records for the binlog tablespace do not start at offset 0.

Did you have any plans of updating the binlog file in place? Anything like a directory structure within the file, or updating the status of a binlog event group in some header after a transaction has been committed? If the format cannot be strictly append-only, it will be harder to avoid using a doublewrite buffer.
- The flushing of binlog pages to disk currently happens in a dedicated thread in the background. I'd welcome ideas on how to do this differently. It is good to flush binlog pages quickly and re-use their buffer pool entries for something better. Also writing the pages to disk quickly (not necessarily fsync()'ing) makes the data readable by mysqlbinlog.
It could make sense to introduce a separate list to manage binlog blocks, and keep those blocks out of buf_pool.LRU altogether. Maybe also keep them out of buf_pool.flush_list as well as mtr_t::m_memo, so that any code that deals with those lists can continue to assume that the pages use the InnoDB format. Separate flushing logic seems to be unavoidable. We might also introduce a new data member in mtr_t for keeping track of binlog blocks, so that mtr_t::m_memo would remain something for the regular buf_pool.flush_list. If there was no foreseeable need to write both InnoDB data and binlog in the same atomic mini-transaction (mainly, to have an atomic commit of an InnoDB-only persistent transaction), it could make sense to replace mtr_t with something binlog specific.

It could make sense to avoid O_DIRECT on the binlog files and to issue posix_fadvise(POSIX_FADV_DONTNEED) to avoid file system cache pollution. Maybe there should be some configuration parameters for this. We probably want asynchronous writes, possibly with the RWF_UNCACHED flag when(ever) it becomes available:

https://lore.kernel.org/linux-fsdevel/20241220154831.1086649-1-axboe@kernel....
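[Editor's note: a minimal sketch of that fadvise idea, assuming a plain file descriptor for the binlog file; the function name is made up. POSIX_FADV_DONTNEED is purely advisory and only drops clean pages, so in practice it would be issued after writeback has completed.]

    #include <fcntl.h>
    #include <unistd.h>

    // Append a block to the binlog file without keeping it in the file
    // system cache longer than necessary.
    static ssize_t binlog_append_block(int fd, const void *block,
                                       size_t size, off_t offset)
    {
      const ssize_t written = pwrite(fd, block, size, offset);
      if (written == (ssize_t) size)
        // Advisory: tell the kernel we will not read this range back.
        posix_fadvise(fd, offset, (off_t) size, POSIX_FADV_DONTNEED);
      return written;
    }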
- Checksum and encryption should use the standard InnoDB mechanism. I assume checksum is already handled in the code through using the buffer pool and mini-transactions to read/write pages. Not sure about encryption. I need to implement that the code handles checksum and decryption when reading the pages manually from the file (not through buffer pool).
The buffer pool stores clear-text pages. Checksums are computed right before a page is written. For encryption, a separate buffer will be reserved right before writing out the page. I think that we must implement this logic separately for the binlog tablespace files. It does not need to be as complicated as for the InnoDB data files, with multiple format options.

I don't think it makes any sense to implement any page_compressed compression for the binlog tablespace. If you want compression, that would be best done at the binlog event level, similar to how the compressed BLOBs in InnoDB ROW_FORMAT=COMPRESSED work; see btr_store_big_rec_extern_fields(). This would have to be done before the bytes reach the InnoDB buffer pool.

Marko

--
Marko Mäkelä, Lead Developer InnoDB
MariaDB plc
Marko Mäkelä via developers <developers@lists.mariadb.org> writes:
On Fri, Jan 3, 2025 at 10:23 AM Kristian Nielsen <knielsen@knielsen-hq.org> wrote:
I think now is a good time for you to take a first real look at the InnoDB part of the changes, I would really value your input.
This is great. I will try to find some time for this before the FOSDEM weekend.
Much appreciated, thanks for your comments so far.
- We want to avoid the double-write buffer for binlog pages, at least for the first page write (most pages will only be written as full pages). You mentioned an idea to completely avoid the double-write buffer and instead do some specific code for recovery in the uncommon case where a partial binlog page is written to disk due to low commit activity.
The idea is simple: Ensure that recovery will be able to read complete blocks, or to read log records that will completely initialize the blocks.
Right. Here is my conceptual understanding of how recovery should work.

In most cases, the "page create" as well as all the writes to the page will be redo logged before the page is written to the file system - we only write full pages. Only if there is no/little binlog activity for a long time would it be necessary/desirable to write a partial page.

For the first write of a page to the file system after "page create", there is no need for a double-write buffer, right? Since there is no existing data that can be corrupted by a torn write. Only in the uncommon case where we decide to write out a partial page is there an issue with the subsequent write of the same page.

Here is an idea for handling this in a simple way completely inside the binlog code, without the need for either buffer pool or special recovery code: What if the binlog code simply keeps track of whenever the current page of the binlog gets partially written to the file system? And when this happens, the next mtr_t write to that page will simply re-write all the data from the start of the page? This way the recovery code can always assume that the page is valid on disk prior to each redo record, and should be set to zeros following the record.

I think it's literally just replacing this line:

    mtr->memcpy(*block, page_offset, size+3);

with this in the rare case after the page was partially written:

    mtr->memcpy(*block, 0, size+3);

Would this work, or are there some details of recovery I do not understand that make this not safe?
An alternative to the doublewrite buffer would be to "copy data to the log across the checkpoint", like we do for the FILE_MODIFY records that are needed for discovering *.ibd files on recovery. I do not have any idea how to implement this efficiently.
I'm unsure about exactly how a checkpoint is made. But it seems to me that somehow a checkpoint must span some LSN interval: starting at an LSN1, then flushing out all pages modified before LSN1, then ending the checkpoint at a later LSN2. Is that correct? Then as long as the current binlog page gets written full between LSN1 and LSN2 (which should be the common case), there is no need for double-write or log-copy across the checkpoint, is there? And in the uncommon case, doing the following mtr->memcpy() from the start of the page should effectively implement the copy-across-checkpoint?
I think that it is simplest to implement some additional synchronization on log checkpoint, to ensure that any pending binlog writes have completed and been fsync-ed. After a checkpoint or on server startup, we must never overwrite the last written (partially filled) block, but just leave a zero-filled gap at the end of it. The
This could also work (my code uses 0xff bytes to fill gaps to distinguish from end-of-file, but that is a detail).
Did you have any plans of updating the binlog file in place? Anything
No, I plan to make the binlog tablespaces strictly append-only. Even if I would need to eg. write some information at server shutdown to remember the current state (I do not need that in current code), I would write a record at the end of the binlog rather than update eg. the file header in-place, and then binary-search the end of the binlog at server restart.
I would always use a 4096-byte page size for the binlog tablespace.
Interesting. Why do you think it is beneficial to have a different page size for the binlog? From the point of view of the binlog code, a 4k page size should be fine, there is not a lot of difference between different page sizes. A smaller page size makes the per-page overhead more significant, but that overhead will be minimized for the binlog tablespaces, as you described.
It could make sense to introduce a separate list to manage binlog blocks, and keep those blocks out of buf_pool.LRU altogether. Maybe also keep them out of buf_pool.flush_list as well as mtr_t::m_memo, so that any code that deals with those lists can continue to assume that the pages use the InnoDB format. Separate flushing logic seems to be unavoidable.
Ok, sounds reasonable. The flushing of binlog pages is conceptually quite simple. We will simply write out pages in page number order one by one, one tablespace after the other. So we don't even need any explicit LRU list. The only thing is that there is no need to write out the current end-of-binlog page until it is full. Unless we need to do so to complete a checkpoint; this will be rare, it is unlikely that a checkpoint will be needed without also writing at least one page of binlog data. Maybe we would want to write out the last partial page if there is say 1 second of inactivity, just to make it available to external programs; again that will be rare.
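[Editor's note: a sketch of how simple that flush loop could be, under the assumptions above; binlog_space_t and its members are illustrative names, not from the patch.]

    #include <algorithm>
    #include <cstdint>
    #include <functional>

    // Illustrative bookkeeping for one binlog tablespace.
    struct binlog_space_t
    {
      uint32_t first_unflushed_page = 0;  // next page to write out
      uint32_t first_incomplete_page = 0; // current end-of-binlog page
      std::function<void(uint32_t)> write_page; // pwrite() one page image
    };

    // Write out completed binlog pages in ascending page-number order.
    // The last, partially filled page is included only when needed, e.g.
    // to complete a checkpoint, or after a timeout to make the data
    // readable by external programs.
    static void binlog_flush(binlog_space_t &space, bool include_partial)
    {
      const uint32_t end =
        space.first_incomplete_page + (include_partial ? 1 : 0);
      for (uint32_t p = space.first_unflushed_page; p < end; p++)
        space.write_page(p);
      // A partially written last page may be rewritten later, so it is
      // not counted as flushed.
      space.first_unflushed_page = std::min(end, space.first_incomplete_page);
    }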
regular buf_pool.flush_list. If there was no foreseeable need to write both InnoDB data and binlog in the same atomic mini-transaction (mainly, to have an atomic commit of an InnoDB-only persistent transaction), it could make sense to replace mtr_t with something binlog specific.
Right, but having the binlog commit record in the same mtr_t as the transaction commit record is of course a crucial point, to avoid the need for 2-phase commit. Maybe it would make sense to use something binlog specific for binlog writes that are not commit records? This includes non-transactional/DDL stuff, as well as out-of-band binlog data that gets written before the commit, for example batches of row-based replication events for large transactions.
It could make sense to avoid O_DIRECT on the binlog files and to issue
Any reason you want to avoid O_DIRECT?
posix_fadvise(POSIX_FADV_DONTNEED) to avoid file system cache pollution. Maybe there should be some configuration parameters for
Right, this probably makes sense in many cases, where the slave dump threads immediately read out the binlog data to send to slaves from the buffer pool, before the pages get written to the file system and evicted from the buffer pool. On the other hand, if a dump thread ends up reading the data from the file system, the file system cache would make sense. But I think slave dump threads are usually fully up-to-date and can read the data from the buffer pool before it gets evicted, so maybe avoiding the file system cache for the writes makes sense.
The buffer pool stores clear-text pages. Checksums are computed right before a page is written. For encryption, a separate buffer will be reserved right before writing out the page. I think that we must implement this logic separately for the binlog tablespace files. It
In my current design, the new binlog format (eg. the page format) is specific to the storage engine, ie. InnoDB. One reason for this is to be able to re-use as much of the existing InnoDB code as possible, eg. for the buffer pool, checksums, encryption, etc.

As we discuss implementing more and more of this InnoDB code specially for the binlog, I wonder if it would be feasible to implement the page format at the server level, common to all engines that want to implement the binlog. The interface to InnoDB would then be a lower-level API that somehow exposes the mtr_t and recovery logic, rather than a high-level API that reads and writes binlog records, as currently. I'm not sure. Another advantage of the current design is that it gives InnoDB more flexibility to implement things in the best way possible. Sharing the page format really only matters for sharing more code between different engines implementing the binlog, which only matters if there will ever be another binlog engine implementation. So I think the current design is ok, but it is something that should be considered, at least.
In addition to the creation LSN, any tablespace attributes, such as encryption parameters or format version, would have to be stored in the first page.
Agree.
When it comes to encryption, I think that it is easiest to allow key version changes or key rotation only when switching binlog tablespaces.
Yes. This is also how it works for the legacy binlog.
I don't think it makes any sense to implement any page_compressed compression for the binlog tablespace. If you want compression, that
Agree. There is already some compression support in the replication events.
Side note: I think that we can abandon Heikki Tuuri's convention when naming new files. That is, just drop the meaningless handler0 and fsp0 prefixes.
Ack.

Hm, this became a long mail, hope it makes sense.

 - Kristian.
Kristian Nielsen via developers <developers@lists.mariadb.org> writes:
What if the binlog code simply keeps track of whenever the current page of the binlog gets partially written to the file system? And when this happens, the next mtr_t write to that page will simply re-write all the data from the start of the page? This way the recovery code can always assume that the page is valid on disk prior to each redo record, and should be set to zeros following the record.
I think it's literally just replacing this line:
mtr->memcpy(*block, page_offset, size+3);
with this in the rare case after the page was partially written:
mtr->memcpy(*block, 0, size+3);
That code line was a bit sloppy and not correct. What I had in mind is more something like this:

    if (page_offset > FIL_PAGE_DATA &&
        block->page.oldest_modification() <= 1)
    {
      // Adding to a page that was already flushed. Redo log all the data to
      // protect recovery against torn page on subsequent page write.
      mtr->memcpy(*block, FIL_PAGE_DATA, (page_offset - FIL_PAGE_DATA) + size+3);
    }
    else
      mtr->memcpy(*block, page_offset, size+3);

I wonder if we could do a test case for this. Some DBUG injection in the code that writes the page to disk, which instead writes garbage to the page and crashes the server, simulating a power outage that corrupts the page write. Then we would need to somehow arrange for the page to be first partially written and then written again with the DBUG injection active.

 - Kristian.
On Mon, Jan 6, 2025 at 2:15 PM Kristian Nielsen via developers <developers@lists.mariadb.org> wrote:
That code line was a bit sloppy and not correct. What I had in mind is more something like this:
    if (page_offset > FIL_PAGE_DATA &&
        block->page.oldest_modification() <= 1)
    {
      // Adding to a page that was already flushed. Redo log all the data to
      // protect recovery against torn page on subsequent page write.
      mtr->memcpy(*block, FIL_PAGE_DATA, (page_offset - FIL_PAGE_DATA) + size+3);
    }
    else
      mtr->memcpy(*block, page_offset, size+3);
I wonder if we could do a test case for this.
Yes, something like that could work. For testing, my preference would be to use DEBUG_SYNC, possibly together with DBUG_EXECUTE_IF, to prohibit page writes, and then use Perl code to corrupt the data file. We have a number of tests that make use of no_checkpoint_start.inc and sometimes $ENV{MTR_SUITE_DIR}/include/crc32.pl to compute valid checksums for intentionally corrupted pages. Here we could just overwrite the last binlog block with NUL bytes.

I think that we could allow the binlog layer to write directly to the 4096-byte blocks that are allocated from the InnoDB buffer pool. The binlog page cleaner thread might even be writing the last (incomplete) block concurrently while we are adding more data to it. If that is an issue for external tools that are trying to read a consistent copy of all of the binlog, then it could be better to use page latches properly, like we do for InnoDB data pages. Crash recovery would not have a problem with such racy writes, provided that the ib_logfile0 will always completely initialize the pages. That's normally signalled by writing an INIT_PAGE record before the WRITE record.

Marko

--
Marko Mäkelä, Lead Developer InnoDB
MariaDB plc
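[Editor's note: a sketch of what the write-side debug injection could look like. DBUG_EXECUTE_IF is the existing MariaDB debug facility; the injection keyword and the function are made up for illustration.]

    #include <unistd.h>
    #include "my_dbug.h"  // DBUG_EXECUTE_IF

    // Hypothetical write path for the last binlog block. Under the debug
    // keyword, pretend the block never reached the disk, so that the test
    // can corrupt it externally and simulate a torn write at a crash.
    static void binlog_write_last_block(int fd, const void *block,
                                        off_t offset)
    {
      DBUG_EXECUTE_IF("binlog_lose_last_block_write", return;);
      pwrite(fd, block, 4096, offset);
    }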
On Wed, Jan 8, 2025 at 6:25 PM Marko Mäkelä <marko.makela@mariadb.com> wrote:
I think that we could allow the binlog layer to write directly to the 4096-byte blocks that are allocated from the InnoDB buffer pool. The binlog page cleaner thread might even be writing the last (incomplete) block concurrently while we are adding more data to it.
We might simplify the format even further and make it mostly independent of block sizes, similar to how in MDEV-14425 I removed the 512-byte block structure of ib_logfile0 and made each mini-transaction a "block" of its own. That is, the binlog writer would compute CRC-32C on the event snippets or groups and include it in the data that it passes to InnoDB. InnoDB would write entire pages without reserving any header or footer. The InnoDB block size could simply be innodb_page_size. The write granularity from InnoDB could be 4096 bytes, to be compatible with the requirements of O_DIRECT.

If we go down this route, then encryption would have to be implemented in the binlog writer, before computing the CRC-32C (which I think should be computed on the encrypted data). In the binlog file, the only additional structure would be a file header block that identifies the format and stores the creation LSN. I would propose to reserve 4096 bytes for this (independently of innodb_page_size).

In that way, even if there is a race between an asynchronous write into the file system and a binlog producer appending records to the last (incomplete) binlog block, any external tool could handle the situation just fine, simply by stopping when a CRC-32C validation fails.

Marko

--
Marko Mäkelä, Lead Developer InnoDB
MariaDB plc
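[Editor's note: a sketch of the writer side of this scheme, assuming, as suggested above, that encryption happens first and the CRC-32C is computed on the encrypted bytes. The big-endian trailer placement and the function name are illustrative.]

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // crc32c() as in the earlier checksum sketch.
    uint32_t crc32c(const uint8_t *data, size_t len);

    // Seal one (already encrypted) event snippet or group before handing
    // it to InnoDB: append the CRC-32C of the payload, so that a reader
    // can stop cleanly when it reaches a torn or incomplete last block.
    static void binlog_seal_group(std::vector<uint8_t> &group)
    {
      const uint32_t crc = crc32c(group.data(), group.size());
      for (int shift = 24; shift >= 0; shift -= 8)
        group.push_back(uint8_t(crc >> shift));
    }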
Hi Kristian,

Finally, I got some more time to think about this. I’m trying to summarize from the InnoDB point of view what we discussed today.

* We do not need any InnoDB buf_block_t or fil_space_t for any binlog files.
* The binlog layer can simply append pages to binlog files (or rewrite the last page), whenever it pleases, in its preferred format.
* The redo log records will cover writes to the binlog blocks before any encryption.
** As the first cut, only WRITE and possibly MEMSET records, covering the entire block (excluding checksum)
* For durability, it is the InnoDB log write that matters.
* Based on the binlog file creation LSN (in the first block, say, 4096 bytes), InnoDB recovery will:
** ignore files that are older than the checkpoint LSN
** delete files that are newer than the last recovered LSN
** recover any other files (re-apply writes or trim the contents after the last write)
** not read anything from the files
** invoke a pwrite() like binlog API that takes care of any encryption
** make sure that there are only WRITE, MEMSET, MEMMOVE records, in strictly sequential order
* InnoDB log checkpoint will be tweaked as follows:
** The log checkpoint must not "split" a binlog write.
** InnoDB must remember the start LSN of the last partial binlog block write.
** Checkpoint_LSN = min(last_start_LSN, buf_pool.get_oldest_modification())
** Before or after fil_flush_file_spaces(), the last binlog file must be durably written, last block padded.

I will start to implement the log writing and recovery logic.

Marko

--
Marko Mäkelä, Lead Developer InnoDB
MariaDB plc
Marko Mäkelä <marko.makela@mariadb.com> writes:
Finally, I got some more time to think about this. I’m trying to summarize from the InnoDB point of view what we discussed today.
Thanks Marko! This is an interesting development. If we can bypass using the buffer pool and associated machinery it could simplify the logic a lot, and potentially perhaps also further improve performance.
* Based on the binlog file creation LSN (in the first block, say, 4096 bytes), InnoDB recovery will:
** ignore files that are older than the checkpoint LSN
** delete files that are newer than the last recovered LSN
** recover any other files (re-apply writes or trim the contents after the last write)
Agree. I was thinking that the binlog layer would need to be informed of up to which LSN the redo log has been durably written to disk (in case of --innodb-flush-log-at-trx-commit=0|2). But does "trim the contents" imply that the binlog is free to write pages to the file system even ahead of the redo log, because any data beyond the eventually recovered LSN will then be cleared during recovery? This could be quite neat and simplify things, and also reduce the need for synchronisation between the redo log and the binlog code.
** not read anything from the files
Agree, this sounds good.
** invoke a pwrite() like binlog API that takes care of any encryption
Yes. I am thinking that recovery can simply pass the data into the binlog pwrite-like API, leaving exact details of how data will then be written into the file system to the binlog code.
** make sure that there are only WRITE, MEMSET, MEMMOVE records, in strictly sequential order
Agree with strict sequential order; I have tried very hard to preserve this property so far. Not sure how WRITE and MEMMOVE differ, but the binlog code essentially only needs the ability to log a byte string appended to a page, like what is done in mtr_t::memcpy(const buf_block_t &b, ulint ofs, ulint len). Possibly also MEMSET, just to make the log record shorter when filling a byte string with identical bytes.
* InnoDB log checkpoint will be tweaked as follows:
** The log checkpoint must not "split" a binlog write.
** InnoDB must remember the start LSN of the last partial binlog block write.
I'm unsure here what "binlog write" refers to. Does it refer to the write of the page to the file system layer (eg. pwrite())? Or does it refer to the redo logging of appended data to a page, similar to currently mtr.start(); mtr.memcpy(); mtr.commit(); ?
** Checkpoint_LSN = min(last_start_LSN, buf_pool.get_oldest_modification())
** Before or after fil_flush_file_spaces(), the last binlog file must be durably written, last block padded.
Binlog can pad the last block easily (it is also done in case of FLUSH BINARY LOGS which truncates the currently active binlog). If we can allow the last_start_LSN to be the end of the last full block (before any LSN for writing to the current, partial block), we could avoid having to pad a block for every checkpoint. Then the binlog needs to ensure that the checkpoint LSN can always advance (eg. padding or at least fully rewriting the last block if it has not completed since the last checkpoint or some timeout).
I will start to implement the log writing and recovery logic.
Great!

As I promised, here is a first draft of a possible API between the binlog and redo/recovery code along the lines discussed. It is mostly based on what I see the binlog code will need, and changes will probably be needed to suit the redo/recovery part that I am still not very familiar with.

API for binlog to append data to binlog tablespace files:

binlog_record_begin()
    Start an atomically recovered logging group. Optionally part of an
    existing mtr (ie. InnoDB trx commit) for atomic recovery.

binlog_record_memcpy(tablespace, page, in_page_offset, length, data)
    Redo log a byte string (maybe a memset() variant too). Always
    strictly append-only to a page. Optionally part of an existing mtr
    (ie. InnoDB trx commit) for atomic recovery. If the offset is 0,
    this implicitly does INIT_PAGE.

binlog_record_end()
    End an atomically recovered logging group. Returns the
    corresponding LSN.

binlog_tablespace_create(tablespace, length_in_pages)
    Create a tablespace. Register the new tablespace file for redo
    logging.

binlog_tablespace_close(tablespace)
    Close a tablespace. Marks to redo logging that this tablespace
    file is now fully durably written to disk and will not receive any
    further updates.

binlog_tablespace_truncate(tablespace, new_length_in_pages)
    Truncate a binlog tablespace (like mtr.trim_pages() and
    mtr.commit_shrink()). Can be independent, does not need to be part
    of a logging group with any other operations.

API for interacting with InnoDB checkpointing and recovery. This is based on what I see as minimal needs from the binlog point of view. Probably need something more here, eg. to supply the last_start_LSN you mentioned:

binlog_write_up_to(lsn)
    Request the binlog to durably write ASAP all data needed up to the
    specified lsn. Could be called by InnoDB checkpointing code,
    similar to fil_flush_file_spaces() perhaps.

binlog_report_lsn(lsn)
    Called by binlog code to inform redo logging that all binlog data
    prior to that LSN is now durably written to disk. Could also be a
    synchronous return from binlog_write_up_to() if that fits better.

binlog_recover_data(tablespace_id, page_no, in_page_offset, length, buffer)
    During crash recovery, passes recovered data to the binlog layer.
    Recovered data is supplied in the same order that it was originally
    written to the redo log. All data following the last
    binlog_report_lsn() is guaranteed to be recovered. Data before that
    LSN may or may not be recovered; the binlog code needs to handle
    either case.

binlog_recover_tablespace_create(tablespace, length_in_pages)
binlog_recover_tablespace_truncate(tablespace, new_length_in_pages)
    Recovers a tablespace creation or truncation event.

 - Kristian.
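[Editor's note: for concreteness, a rough C++ header sketch of how this draft could be declared. It is a sketch only; the types are placeholders and none of the signatures are settled.]

    #include <cstdint>

    using lsn_t = uint64_t;                // placeholder types
    using binlog_tablespace_t = uint32_t;  // 2 alternating IDs

    // Binlog -> InnoDB: appending data to the redo log.
    void binlog_record_begin();
    void binlog_record_memcpy(binlog_tablespace_t tablespace, uint32_t page,
                              uint32_t in_page_offset, uint32_t length,
                              const void *data);
    lsn_t binlog_record_end();
    void binlog_tablespace_create(binlog_tablespace_t tablespace,
                                  uint32_t length_in_pages);
    void binlog_tablespace_close(binlog_tablespace_t tablespace);
    void binlog_tablespace_truncate(binlog_tablespace_t tablespace,
                                    uint32_t new_length_in_pages);

    // InnoDB -> binlog: checkpointing and recovery.
    void binlog_write_up_to(lsn_t lsn);
    void binlog_report_lsn(lsn_t lsn);
    void binlog_recover_data(binlog_tablespace_t tablespace_id,
                             uint32_t page_no, uint32_t in_page_offset,
                             uint32_t length, const void *buffer);
    void binlog_recover_tablespace_create(binlog_tablespace_t tablespace,
                                          uint32_t length_in_pages);
    void binlog_recover_tablespace_truncate(binlog_tablespace_t tablespace,
                                            uint32_t new_length_in_pages);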
Hi Kristian,

On Wed, Jan 29, 2025 at 2:08 PM Kristian Nielsen <knielsen@knielsen-hq.org> wrote:
I was thinking that the binlog layer would need to be informed of up to which LSN the redo log has been durably written to disk (in case of --innodb-flush-log-at-trx-commit=0|2). But does "trim the contents" imply that the binlog is free to write pages to the file system even ahead of the redo log, because any data beyond the eventually recovered LSN will then be cleared during recovery? This could be quite neat and simplify things, and also reduce the need for synchronisation between the redo log and the binlog code.
Yes, that was my idea. I think that the InnoDB API for appending something into the binlog could return the start or end LSN of the mini-transaction. Usually we would be interested in the end LSN, but the InnoDB checkpoint logic (to prevent a checkpoint from "splitting" writes to a binlog page) would be interested in the start LSN.

In the binlog layer, I think that the only time when the LSN is of interest is the creation of a binlog file. I think that it could be something like the following:

1. Create an empty binlog file with the next available name.
2. Ensure that the file was durably created. (At least fdatasync() the file; I don't think we currently care about syncing directories.)
3. Write InnoDB redo log for creating the file. (This probably does not need to include the file name; it could be just a WRITE covering the header page data.)
4. Write the header block to the file.

If crash recovery encounters a binlog file where the header page is not complete, it would either recover that file from the redo log, or it would delete the file if the ib_logfile0 had not been durably written. Recovery would never create binlog files on its own; that is why the file needs to be durably created before an InnoDB log record is written.
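[Editor's note: a minimal sketch of steps 1-4 above, assuming POSIX file I/O. binlog_redo_log_header() is an illustrative stand-in for the draft API, and the header field offset is an assumption, not a defined layout.]

    #include <fcntl.h>
    #include <unistd.h>
    #include <cstdint>

    using lsn_t = uint64_t;

    // Illustrative stand-in for step 3: redo-log the header page payload
    // and return the end LSN of that mini-transaction.
    lsn_t binlog_redo_log_header(const uint8_t *header, size_t size);

    // Steps 1-4: durably create the next binlog file before any redo
    // record refers to it, then log and write the 4096-byte header block.
    static int binlog_file_create(const char *path,
                                  uint8_t *header /* 4096 bytes */)
    {
      int fd = open(path, O_CREAT | O_EXCL | O_WRONLY, 0660); // step 1
      if (fd < 0)
        return -1;
      fdatasync(fd);                                          // step 2
      const lsn_t lsn = binlog_redo_log_header(header, 4096); // step 3
      // Store the creation LSN in the header before writing it out
      // (field offset 8 is an assumption; the layout is not settled).
      for (int i = 0; i < 8; i++)
        header[8 + i] = uint8_t(lsn >> (56 - 8 * i));
      if (pwrite(fd, header, 4096, 0) != 4096)                // step 4
      {
        close(fd);
        return -1;
      }
      return fd;
    }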
Yes. I am thinking that recovery can simply pass the data into the binlog pwrite-like API, leaving exact details of how data will then be written into the file system to the binlog code.
Exactly.
Not sure how WRITE and MEMMOVE differ, but the binlog code essentially only needs the ability to log a byte string appended to a page, like what is done in mtr_t::memcpy(const buf_block_t &b, ulint ofs, ulint len). Possibly also MEMSET, just to make the log record shorter when filling a byte string with identical bytes,
Since we are defining a new format for this page oriented binlog, the MEMMOVE records are probably not going to be that useful. Those records allow some very rudimentary compression of the ib_logfile0 when some data is being written multiple times to the same page. It would basically be a special WRITE that says "copy these bytes from an earlier binlog record in the same page", instead of repeating the same bytes verbatim.
* InnoDB log checkpoint will be tweaked as follows: ** The log checkpoint must not "split" a binlog write. ** InnoDB must remember the start LSN of the last partial binlog block write.
I'm unsure here what "binlog write" refers to.
Does it refer to the write of the page to the file system layer (eg. pwrite())?
Or does it refer to the redo logging of appended data to a page, similar to currently mtr.start(); mtr.memcpy(); mtr.commit(); ?
It refers to the latter: a set of mini-transactions that are appending data to the same binlog block. Because the recovery will not read any binlog blocks, it will only be able to deal with situations where, after a checkpoint, the first record for writing into a binlog block starts from offset 0. We could have multiple separate mini-transactions, like this:

(1) Write 123 bytes to offset 0 of binlog block 123
(2) Write 123 bytes to offset 123 of binlog block 123
(3) Write 3846 bytes to offset 246 of binlog block 123
(4) Write 1234 bytes to offset 0 of binlog block 124

The checkpoint LSN may be advanced to anywhere before (1), or between (3) and (4), but not anywhere else with respect to these.

When the binlog layer is asked to write and fdatasync() everything during a checkpoint, it will also fully pad the last binlog block, so that InnoDB knows that it can reset last_start_LSN=LSN_MAX so that the next checkpoint will be able to move further. The binlog layer has to guarantee that the next write will be to offset 0 of a new block (125 in the above example). InnoDB will be able to enforce this with a debug assertion.
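[Editor's note: a hypothetical rendering of that checkpoint rule in code, with latching omitted; the names follow Marko's earlier summary, but nothing here is from an actual implementation.]

    #include <algorithm>
    #include <cstdint>

    using lsn_t = uint64_t;
    constexpr lsn_t LSN_MAX = ~lsn_t{0};

    // Start LSN of the first mini-transaction writing into the current,
    // still incomplete binlog block; LSN_MAX once that block is full or
    // has been padded.
    static lsn_t last_start_lsn = LSN_MAX;

    // Called for each mini-transaction that appends to the binlog.
    static void binlog_note_write(lsn_t start_lsn, bool block_now_full)
    {
      if (last_start_lsn == LSN_MAX)
        last_start_lsn = start_lsn;  // first write into this block
      if (block_now_full)
        last_start_lsn = LSN_MAX;    // next write starts a new block
    }

    // The checkpoint must not "split" the writes into a binlog block:
    // Checkpoint_LSN = min(last_start_LSN, oldest dirty page LSN).
    static lsn_t checkpoint_lsn_limit(lsn_t oldest_modification)
    {
      return std::min(last_start_lsn, oldest_modification);
    }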
** Checkpoint_LSN=min(last_start_LSN,buf_pool.get_oldest_modification()): ** Before or after fil_flush_file_spaces(), the last binlog file must be durably written. last block padded.
Binlog can pad the last block easily (it is also done in case of FLUSH BINARY LOGS which truncates the currently active binlog).
If we can allow the last_start_LSN to be the end of the last full block (before any LSN for writing to the current, partial block), we could avoid having to pad a block for every checkpoint. Then the binlog needs to ensure that the checkpoint LSN can always advance (eg. padding or at least fully rewriting the last block if it has not completed since the last checkpoint or some timeout).
We can allow that, but then the binlog layer must resubmit WRITE records covering everything from the start of the last (incomplete) binlog block. This would have the benefit that the page oriented format would not need to tolerate any "padding" in the middle of the binlog.
As I promised, here is a first draft of a possible API between the binlog and redo/recovery code along the lines discussed. It is mostly based on what I see the binlog code will need, and changes will probably be needed to suit the redo/recovery part that I am still not very familiar with.

[snip]

binlog_tablespace_create(tablespace, length_in_pages)
    Create a tablespace. Register the new tablespace file for redo
    logging.
We don't need to register any tablespace metadata in InnoDB. We can simply hard-code two tablespace IDs to refer to the binlog files. A tablespace object was needed in the earlier prototype, because all data was being written through the InnoDB buffer pool. In InnoDB, this only needs to write a WRITE record with the payload of the header page. Recovery will additionally know the end LSN of the mini-transaction, which will be what this API will return, so that you can write the file creation LSN to the binlog file header. This could also return the 1 bit of InnoDB pseudo tablespace ID, which could be written to the header page.
binlog_tablespace_close(tablespace)
    Close a tablespace. Marks to redo logging that this tablespace
    file is now fully durably written to disk and will not receive any
    further updates.
All this needs to do in InnoDB is to assign last_start_lsn=LSN_MAX so that the checkpoint can be advanced. Possibly we will need last_start_lsn[2], to correspond to both binlog tablespace ID values that the InnoDB redo log knows about.
binlog_tablespace_truncate(tablespace, new_length_in_pages)
    Truncate a binlog tablespace (like mtr.trim_pages() and
    mtr.commit_shrink()). Can be independent, does not need to be part
    of a logging group with any other operations.
When would this be invoked? My understanding is that InnoDB only needs to identify 2 files: the one that is being written to, and another one that is being created when the old file is about to fill up. For these, an alternating tablespace ID will be assigned. What the binlog might do with older binlog files (such as moving them to an archive location, or removing them) does not interest InnoDB. See also below.
API for interacting with InnoDB checkpointing and recovery. This is based on what I see as minimal needs from the binlog point of view. Probably need something more here, eg. to supply the last_start_LSN you mentioned:
binlog_write_up_to(lsn)
    Request the binlog to durably write ASAP all data needed up to the
    specified lsn. Could be called by InnoDB checkpointing code,
    similar to fil_flush_file_spaces() perhaps.
Right. This call could also pass the previously completed checkpoint LSN, which would give a permission to delete or archive any older binlog files. In this way, the binlog layer could safely remove or archive the last-but-one binlog file, and only retain 1 file if that is desirable. We could also include a separate call for indicating the latest checkpoint LSN. That would typically be invoked soon after the binlog_write_up_to(lsn).
binlog_report_lsn(lsn)
    Called by binlog code to inform redo logging that all binlog data
    prior to that LSN is now durably written to disk. Could also be a
    synchronous return from binlog_write_up_to() if that fits better.
The dedicated InnoDB buf_flush_page_cleaner() thread could invoke binlog_write_up_to(lsn) at the start of a page write batch and then invoke a function to wait for the completion later. I think that it would be simplest to use the same thread for both (a kind of "push" interface for both, instead of "push" for one and "pull" for the other). Or we could just merge these two interfaces for now.
binlog_recover_data(tablespace_id, page_no, in_page_offset, length, buffer)
    During crash recovery, passes recovered data to the binlog layer.
    Recovered data is supplied in the same order that it was originally
    written to the redo log. All data following the last
    binlog_report_lsn() is guaranteed to be recovered. Data before that
    LSN may or may not be recovered; the binlog code needs to handle
    either case.
binlog_recover_tablespace_create(tablespace, length_in_pages)
binlog_recover_tablespace_truncate(tablespace, new_length_in_pages)
    Recovers a tablespace creation or truncation event.
The creation could be merged into binlog_recover_data() as well.

Because you also mentioned binlog_tablespace_truncate(), for which I do not see any need, I wonder what the intended purpose of binlog_recover_tablespace_truncate() would be.

We do need something that would trim the end of a binlog file, to discard anything that was not recovered via the ib_logfile0. That could be implemented as part of the binlog_recover_data() logic, exploiting the fact that all writes are going to be in ascending order of page number and byte offset, with the possible exception of starting a rewrite of the last block from byte offset 0.

We would seem to need a call that would inform the binlog of the latest recovered LSN, so that any file that carries a newer creation LSN will be deleted by the binlog recovery logic.

Marko

--
Marko Mäkelä, Lead Developer InnoDB
MariaDB plc
Marko Mäkelä <marko.makela@mariadb.com> writes:
On Wed, Jan 29, 2025 at 2:08 PM Kristian Nielsen <knielsen@knielsen-hq.org> wrote:
binlog_tablespace_truncate(tablespace, new_length_in_pages)
    Truncate a binlog tablespace (like mtr.trim_pages() and
    mtr.commit_shrink()). Can be independent, does not need to be part
    of a logging group with any other operations.
When would this be invoked?
My understanding is that InnoDB only needs to identify 2 files: the one that is being written to, and another one that is being
Yes. Truncate is used only on the one that is being written to. It is used to implement FLUSH BINARY LOGS, which is used to close the currently written file early and move on to the next binlog file. This is used in certain cases, for example to be able to remove old binlog data without having to wait for the current binlog file to be written full. The truncate always happens on a page boundary. If it is a problem to implement truncate, binlog can instead just pad the rest of the binlog file with dummy data. If we can have a truncate record in the redo log for recovery, we can avoid this dummy data and binlog can simply ftruncate() the file during recovery.
binlog_write_up_to(lsn)
    Request the binlog to durably write ASAP all data needed up to the
    specified lsn. Could be called by InnoDB checkpointing code,
    similar to fil_flush_file_spaces() perhaps.
Right. This call could also pass the previously completed checkpoint LSN, which would give a permission to delete or archive any older binlog files. In this way, the binlog layer could safely remove or archive the last-but-one binlog file, and only retain 1 file if that is desirable.
Ah, good point, I had not thought about that. The user command for this on the SQL layer is PURGE BINARY LOGS. This command will not remove files that are still active or could be used in recovery. This could be extended to also not remove any file that was still active at the last checkpoint LSN.
We do need something that would trim the end of a binlog file, to discard anything that was not recovered via the ib_logfile0. That could be implemented as part of the binlog_recover_data() logic, exploiting the fact that all writes are going to be in ascending order of page number and byte offset, with the possible exception of starting a rewrite of the last block from byte offset 0.
Yes. Any bytes in the file after the last recovered WRITE record can simply be overwritten with zeros.

Thanks,

 - Kristian.
On Wed, Jan 29, 2025 at 4:55 PM Kristian Nielsen <knielsen@knielsen-hq.org> wrote:
Yes. Truncate is used only on the one that is being written to. It is used to implement FLUSH BINARY LOGS, which is used to close the currently written file early and move on to the next binlog file.
If the binlog files would normally be preallocated on creation, it would indeed be helpful to explicitly log file size changes. We could also log that by a WRITE that rewrites the binlog file header block, which could specify the allocated size of the file. For this, we would have to overwrite the first binlog block in place. To avoid problems with torn writes, it could be a good idea to reserve the header block payload within the first 512 or fewer bytes of the 4096-byte block. In that way, any risk of the data being corrupted in the case of an interrupted write should be minimal. It would then be up to the binlog layer to interpret the contents of the WRITE record of page 0.

We might also write an (EXTENDED,TRIM_PAGES) record for trimming the size, but it is not strictly needed. For InnoDB tablespaces, which are not append-only, these records are necessary so that any earlier log records that would write beyond the trimmed size of the tablespace can be discarded. The binlog would be strictly append-only, and FLUSH BINARY LOGS would never "overwrite" or discard any previously written data for that file.
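[Editor's note: an illustrative layout for such a header block under the constraint described above. Only the format version, creation LSN, and allocated size were discussed in the thread; the rest of the structure is an assumption.]

    #include <cstdint>

    // Illustrative only: the mutable payload fits within the first 512
    // bytes of the 4096-byte header block, so that rewriting it in place
    // is a single-sector write with minimal torn-write exposure.
    struct binlog_header_block
    {
      uint32_t crc32c;           // checksum of the 512-byte payload
      uint32_t format_version;   // binlog file format identifier
      uint64_t creation_lsn;     // InnoDB LSN at file creation
      uint64_t allocated_pages;  // current preallocated file size
      uint8_t  reserved[512 - 24];  // rest of the single-sector payload
      uint8_t  unused[4096 - 512];  // remainder of the header block
    };
    static_assert(sizeof(binlog_header_block) == 4096,
                  "header block is one 4096-byte block");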
If it is a problem to implement truncate, binlog can instead just pad the rest of the binlog file with dummy data. If we can have a truncate record in the redo log for recovery, we can avoid this dummy data and binlog can simply ftruncate() the file during recovery.
If after recovery we would continue to use the last binlog file and we are preallocating the binlog files, some padding with NUL bytes will have to be implemented anyway. If we are going to always move to the next file, then we might as well trim the last recovered binlog file at the last recovered position.

The POSIX interfaces for these would be posix_fallocate() and ftruncate(). Some existing code in InnoDB prefers fallocate() and falls back to pwrite() with NUL bytes. While fallocate() requires special support from the underlying file system and requires a fallback to regular writes, ftruncate() should always be available.

Marko

--
Marko Mäkelä, Lead Developer InnoDB
MariaDB plc
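[Editor's note: a sketch of that preallocation pattern for a freshly created binlog file. posix_fallocate() and ftruncate() are the standard POSIX calls; the fallback loop mirrors what the email describes InnoDB doing with pwrite() and NUL bytes, and the function name is illustrative.]

    #include <fcntl.h>
    #include <unistd.h>
    #include <cstring>

    // Preallocate a newly created binlog file to 'size' bytes, falling
    // back to explicit zero-filled writes if the file system does not
    // support preallocation.
    static int binlog_preallocate(int fd, off_t size)
    {
      if (posix_fallocate(fd, 0, size) == 0)
        return 0;
      char zeros[4096];
      memset(zeros, 0, sizeof zeros);
      for (off_t ofs = 0; ofs < size; ofs += (off_t) sizeof zeros)
        if (pwrite(fd, zeros, sizeof zeros, ofs) != (ssize_t) sizeof zeros)
          return -1;
      return 0;
    }

    // After recovery, the last file can instead be trimmed at the last
    // recovered position: ftruncate(fd, last_recovered_offset);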