Hi Kristian,

On Wed, Jan 29, 2025 at 2:08 PM Kristian Nielsen <knielsen@knielsen-hq.org> wrote:
I was thinking that the binlog layer would need to be informed of up to which LSN the redo log has been durably written to disk (in case of --innodb-flush-log-at-trx-commit=0|2). But does "trim the contents" imply that the binlog is free to write pages to the file system even ahead of the redo log, because any data beyond the eventually recovered LSN will then be cleared during recovery? This could be quite neat and simplify things, and also reduce the need for synchronisation between the redo log and the binlog code.
Yes, that was my idea. I think that the InnoDB API for appending something into the binlog could return the start or end LSN of the mini-transaction. Usually we would be interested in the end LSN, but the InnoDB checkpoint logic (to prevent a checkpoint from "splitting" writes to a binlog page) would be interested in the start LSN.

In the binlog layer, I think that the only time when the LSN is of interest is the creation of a binlog file. I think that it could be something like the following:

1. Create an empty binlog file with the next available name.
2. Ensure that the file was durably created. (At least fdatasync() the file; I don't think we currently care about syncing directories.)
3. Write InnoDB redo log for creating the file. (This probably does not need to include the file name; it could be just a WRITE covering the header page data.)
4. Write the header block to the file.

If crash recovery encounters a binlog file where the header page is not complete, it would either recover that file from the redo log, or it would delete the file if the ib_logfile0 had not been durably written. Recovery would never create binlog files on its own; that is why the file needs to be durably created before an InnoDB log record is written.
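
The four creation steps could be sketched roughly as follows. This is only an illustration under stated assumptions: log_binlog_header_write() is a hypothetical stand-in for whatever InnoDB call would write the WRITE record and return the end LSN of the mini-transaction, binlog_create_file() is an invented name, and the 4096-byte page size is assumed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <fcntl.h>
#include <unistd.h>

using lsn_t = uint64_t;
constexpr size_t BINLOG_PAGE_SIZE = 4096;  // assumed page size

// Hypothetical stand-in for the InnoDB redo log call: a real implementation
// would write a WRITE record covering the header page and return the end LSN
// of that mini-transaction. Here we only simulate an advancing LSN.
static lsn_t fake_end_lsn = 1000;
lsn_t log_binlog_header_write(const unsigned char *header, size_t len)
{
  (void) header;
  fake_end_lsn += len;
  return fake_end_lsn;
}

// Creates a new binlog file following steps 1 to 4.
// Returns an open file descriptor, or -1 on failure.
int binlog_create_file(const char *name, const unsigned char *header,
                       lsn_t *create_lsn)
{
  assert(name && header && create_lsn);
  // 1. Create an empty file; O_EXCL refuses to reuse an existing name.
  int fd = open(name, O_CREAT | O_EXCL | O_RDWR, 0640);
  if (fd < 0)
    return -1;
  // 2. Make the creation durable before any redo log record refers to it.
  bool ok = fdatasync(fd) == 0;
  if (ok)
  {
    // 3. Log the header page contents (simulated).
    *create_lsn = log_binlog_header_write(header, BINLOG_PAGE_SIZE);
    // 4. Only now write the header block itself.
    ok = pwrite(fd, header, BINLOG_PAGE_SIZE, 0) ==
      static_cast<ssize_t>(BINLOG_PAGE_SIZE);
  }
  if (!ok)
  {
    close(fd);
    return -1;
  }
  return fd;
}
```

The ordering is the point of the sketch: because recovery never creates files, the fdatasync() in step 2 has to complete before the log record in step 3 can exist.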
Yes. I am thinking that recovery can simply pass the data into the binlog pwrite-like API, leaving exact details of how data will then be written into the file system to the binlog code.
Exactly.
Not sure how WRITE and MEMMOVE differ, but the binlog code essentially only needs the ability to log a byte string appended to a page, like what is done in mtr_t::memcpy(const buf_block_t &b, ulint ofs, ulint len). Possibly also MEMSET, just to make the log record shorter when filling a byte string with identical bytes.
Since we are defining a new format for this page oriented binlog, the MEMMOVE records are probably not going to be that useful. Those records allow some very rudimentary compression of the ib_logfile0 when some data is being written multiple times to the same page. It would basically be a special WRITE that says "copy these bytes from an earlier binlog record in the same page", instead of repeating the same bytes verbatim.
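
To illustrate how the three record types differ, here is a minimal sketch of applying them to an in-memory page. The page_rec structure, the apply() function and the 4096-byte page size are hypothetical; this is not the ib_logfile0 encoding, only a model of the semantics described above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

constexpr size_t BINLOG_PAGE_SIZE = 4096;  // assumed page size

enum class rec_type { WRITE, MEMSET, MEMMOVE };

// Hypothetical in-memory form of a page-level record.
struct page_rec
{
  rec_type type;
  size_t offset;                    // destination offset within the page
  size_t length;                    // number of bytes affected
  std::vector<unsigned char> data;  // WRITE: the bytes; MEMSET: one fill byte
  size_t source;                    // MEMMOVE only: source offset in the page
};

static bool in_page(size_t offset, size_t length)
{
  return offset <= BINLOG_PAGE_SIZE && length <= BINLOG_PAGE_SIZE - offset;
}

// Applies one record to the page; returns false if the record is malformed.
bool apply(unsigned char *page, const page_rec &r)
{
  assert(page);
  if (!in_page(r.offset, r.length))
    return false;
  if (r.length == 0)
    return true;
  switch (r.type) {
  case rec_type::WRITE:    // the payload is carried verbatim in the record
    if (r.data.size() != r.length) return false;
    std::memcpy(page + r.offset, r.data.data(), r.length);
    return true;
  case rec_type::MEMSET:   // one byte in the record, repeated length times
    if (r.data.size() != 1) return false;
    std::memset(page + r.offset, r.data[0], r.length);
    return true;
  case rec_type::MEMMOVE:  // no payload: copy bytes already in the same page
    if (!in_page(r.source, r.length)) return false;
    std::memmove(page + r.offset, page + r.source, r.length);
    return true;
  }
  return false;
}
```

Only WRITE carries the full payload, which is why MEMSET and MEMMOVE can make a log record shorter.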
* InnoDB log checkpoint will be tweaked as follows:
** The log checkpoint must not "split" a binlog write.
** InnoDB must remember the start LSN of the last partial binlog block write.
I'm unsure here what "binlog write" refers to.
Does it refer to the write of the page to the file system layer (eg. pwrite())?
Or does it refer to the redo logging of appended data to a page, similar to currently mtr.start(); mtr.memcpy(); mtr.commit(); ?
It refers to the latter: a set of mini-transactions that are appending data to the same binlog block. Because the recovery will not read any binlog blocks, it will only be able to deal with situations where after a checkpoint, the first record for writing into a binlog block starts from offset 0. We could have multiple separate mini-transactions, like this:

(1) Write 123 bytes to offset 0 of binlog block 123
(2) Write 123 bytes to offset 123 of binlog block 123
(3) Write 3846 bytes to offset 246 of binlog block 123
(4) Write 1234 bytes to offset 0 of binlog block 124

The checkpoint LSN may be advanced to anywhere before (1), or between (3) and (4), but not anywhere else with respect to these.

When the binlog layer is asked to write and fdatasync() everything during a checkpoint, it will also fully pad the last binlog block, so that InnoDB knows that it can reset last_start_LSN=LSN_MAX so that the next checkpoint will be able to move further. The binlog layer has to guarantee that the next write will be to offset 0 of a new block (125 in the above example). InnoDB will be able to enforce this with a debug assertion.
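
The bookkeeping for the example (1) to (4) could look roughly like the sketch below, which combines it with the min() rule from the earlier proposal. The structure and method names are hypothetical, and the LSN values used with it are made up for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>

using lsn_t = uint64_t;
constexpr lsn_t LSN_MAX = UINT64_MAX;

// Hypothetical tracker for the start LSN of the last partial binlog block.
struct binlog_checkpoint_guard
{
  lsn_t last_start_lsn = LSN_MAX;

  // Called for each mini-transaction that appends to a binlog block.
  void on_write(lsn_t start_lsn, size_t offset)
  {
    if (offset == 0)
      last_start_lsn = start_lsn;  // a new block starts here
    else
      // Debug check: an append must continue a block that was started
      // after the last reset.
      assert(last_start_lsn != LSN_MAX && start_lsn > last_start_lsn);
  }

  // Called after the binlog has padded and fdatasync()ed its last block.
  void on_padded_and_synced() { last_start_lsn = LSN_MAX; }

  // The checkpoint may not advance past the start of the partial block.
  lsn_t checkpoint_lsn(lsn_t oldest_modification) const
  {
    return std::min(last_start_lsn, oldest_modification);
  }
};
```

With writes (1) to (3) recorded at increasing LSNs, checkpoint_lsn() stays at the start LSN of (1); once (4) is recorded at offset 0, the limit moves to the start LSN of (4), that is, to a point between (3) and (4).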
** Checkpoint_LSN=min(last_start_LSN,buf_pool.get_oldest_modification()):
** Before or after fil_flush_file_spaces(), the last binlog file must be durably written, with the last block padded.
Binlog can pad the last block easily (it is also done in case of FLUSH BINARY LOGS which truncates the currently active binlog).
If we can allow the last_start_LSN to be the end of the last full block (before any LSN for writing to the current, partial block), we could avoid having to pad a block for every checkpoint. Then the binlog needs to ensure that the checkpoint LSN can always advance (eg. padding or at least fully rewriting the last block if it has not completed since the last checkpoint or some timeout).
We can allow that, but then the binlog layer must resubmit WRITE records covering everything from the start of the last (incomplete) binlog block. This would have the benefit that the page oriented format would not need to tolerate any "padding" in the middle of the binlog.
As I promised, here is a first draft of a possible API between the binlog and redo/recovery code along the lines discussed. It is mostly based on what I see the binlog code will need, and changes will probably be needed to suit the redo/recovery part that I am still not very familiar with.

[snip]

binlog_tablespace_create(tablespace, length_in_pages)
  Create a tablespace. Register the new tablespace file for redo logging.
We don't need to register any tablespace metadata in InnoDB. We can simply hard-code two tablespace IDs to refer to the binlog files. A tablespace object was needed in the earlier prototype, because all data was being written through the InnoDB buffer pool. In InnoDB, this only needs to write a WRITE record with the payload of the header page. Recovery will additionally know the end LSN of the mini-transaction, which will be what this API will return, so that you can write the file creation LSN to the binlog file header. This could also return the 1 bit of InnoDB pseudo tablespace ID, which could be written to the header page.
binlog_tablespace_close(tablespace)
  Close a tablespace. Marks to redo logging that this tablespace file is now fully durably written to disk and will not receive any further updates.
All this needs to do in InnoDB is to assign last_start_lsn=LSN_MAX so that the checkpoint can be advanced. Possibly we will need last_start_lsn[2], to correspond to both binlog tablespace ID values that the InnoDB redo log knows about.
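
A minimal sketch of the last_start_lsn[2] idea, assuming the two hard-coded pseudo tablespace IDs are simply 0 and 1 and that the checkpoint is limited by whichever slot is older. The structure and method names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

using lsn_t = uint64_t;
constexpr lsn_t LSN_MAX = UINT64_MAX;

// One slot per binlog pseudo tablespace ID (assumed to be 0 and 1).
struct binlog_lsn_slots
{
  lsn_t last_start_lsn[2] = {LSN_MAX, LSN_MAX};

  // A mini-transaction started writing at offset 0 of a block in file `id`.
  void write_at_block_start(unsigned id, lsn_t start_lsn)
  {
    assert(id < 2);
    last_start_lsn[id] = start_lsn;
  }

  // What binlog_tablespace_close() would do on the InnoDB side.
  void close(unsigned id)
  {
    assert(id < 2);
    last_start_lsn[id] = LSN_MAX;
  }

  // The checkpoint must not pass either partial block.
  lsn_t checkpoint_limit() const
  {
    return std::min(last_start_lsn[0], last_start_lsn[1]);
  }
};
```

Closing the old file releases its slot, so the limit is then determined only by the file that is still being written.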
binlog_tablespace_truncate(tablespace, new_length_in_pages)
  Truncate a binlog tablespace (like mtr.trim_pages() and mtr.commit_shrink()). Can be independent, does not need to be part of a logging group with any other operations.
When would this be invoked? My understanding is that InnoDB only needs to identify 2 files: the one that is being written to, and another one that is being created when the old file is about to fill up. For these, an alternating tablespace ID will be assigned. What the binlog might do with older binlog files (such as moving them to an archive location, or removing them) does not interest InnoDB. See also below.
API for interacting with InnoDB checkpointing and recovery. This is based on what I see as minimal needs from the binlog point of view. Probably need something more here, eg. to supply the last_start_LSN you mentioned:
binlog_write_up_to(lsn)
  Request the binlog to durably write ASAP all data needed up to the specified lsn. Could be called by InnoDB checkpointing code, similar to fil_flush_file_spaces() perhaps.
Right. This call could also pass the previously completed checkpoint LSN, which would give a permission to delete or archive any older binlog files. In this way, the binlog layer could safely remove or archive the last-but-one binlog file, and only retain 1 file if that is desirable. We could also include a separate call for indicating the latest checkpoint LSN. That would typically be invoked soon after the binlog_write_up_to(lsn).
binlog_report_lsn(lsn)
  Called by binlog code to inform redo logging that all binlog data prior to that LSN is now durably written to disk. Could also be a synchronous return from binlog_write_up_to() if that fits better.
The dedicated InnoDB buf_flush_page_cleaner() thread could invoke binlog_write_up_to(lsn) at the start of a page write batch and then invoke a function to wait for the completion later. I think that it would be simplest to use the same thread for both (kind of "push" interface for both instead of "push" for one and "pull" for the other). Or we could just merge these two interfaces for now.
binlog_recover_data(tablespace_id, page_no, in_page_offset, length, buffer)
  During crash recovery, passes recovered data to the binlog layer. Recovered data is supplied in the same order that it was originally written to the redo log. All data following the last binlog_report_lsn() is guaranteed to be recovered. Data before that LSN may or may not be recovered; the binlog code needs to handle either case.
binlog_recover_tablespace_create(tablespace, length_in_pages)
binlog_recover_tablespace_truncate(tablespace, new_length_in_pages)
  Recovers a tablespace creation or truncation event.
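
One way the binlog side could consume binlog_recover_data() calls is sketched below, with a std::vector standing in for the file contents. The structure name, the omission of the tablespace ID, the 4096-byte page size, and the policy of leaving a file untouched when nothing was recovered for it are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

constexpr size_t BINLOG_PAGE_SIZE = 4096;  // assumed page size

// Hypothetical model of one binlog file during crash recovery.
struct recovered_binlog_file
{
  std::vector<unsigned char> bytes;  // stands in for the file on disk
  bool any_recovered = false;        // did the redo log cover this file?
  size_t recovered_end = 0;          // end of the most recent recovered write

  // Modeled on the proposed binlog_recover_data(); tablespace_id is omitted.
  void recover_data(uint32_t page_no, size_t in_page_offset, size_t length,
                    const unsigned char *buffer)
  {
    assert(in_page_offset <= BINLOG_PAGE_SIZE &&
           length <= BINLOG_PAGE_SIZE - in_page_offset);
    const size_t start = size_t{page_no} * BINLOG_PAGE_SIZE + in_page_offset;
    // Writes arrive in ascending order, except that a block may be
    // rewritten starting from byte offset 0.
    assert(!any_recovered || start >= recovered_end || in_page_offset == 0);
    if (bytes.size() < start + length)
      bytes.resize(start + length);
    if (length)
      std::memcpy(bytes.data() + start, buffer, length);
    any_recovered = true;
    recovered_end = start + length;
  }

  // After the last record: discard anything beyond the recovered data.
  // If nothing was recovered, the file is left as it is (an assumption).
  void trim()
  {
    if (any_recovered)
      bytes.resize(recovered_end);
  }
};
```

Because the writes are ordered, the end of the most recent write is enough to decide where the file should end; no separate truncation record is needed in this model.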
The creation could be merged to binlog_recover_data() as well. Because you also mentioned binlog_tablespace_truncate(), for which I do not see any need, I wonder what the intended purpose of binlog_recover_tablespace_truncate() would be.

We do need something that would trim the end of a binlog file, to discard anything that was not recovered via the ib_logfile0. That could be implemented as part of the binlog_recover_data() logic, exploiting the fact that all writes are going to be in ascending order of page number and byte offset, with the possible exception of starting a rewrite of the last block from byte offset 0.

We would seem to need a call that would inform the binlog of the latest recovered LSN, so that any file that carries a newer creation LSN will be deleted by the binlog recovery logic.

Marko
--
Marko Mäkelä, Lead Developer InnoDB
MariaDB plc