[MariaDB developers] Re: Next step on MDEV-34705, implement binlog in InnoDB

29 Jan 2025

Hi Kristian,

On Wed, Jan 29, 2025 at 2:08 PM Kristian Nielsen
<knielsen@knielsen-hq.org> wrote:
...
I was thinking that the binlog layer would need to be informed of up to which
LSN the redo log has been durably written to disk (in case of
--innodb-flush-log-at-trx-commit=0|2). But does "trim the contents" imply
that the binlog is free to write pages to the file system even ahead of the
redo log, because any data beyond the eventually recovered LSN will then be
cleared during recovery? This could be quite neat and simplify things, and
also reduce the need for synchronisation between the redo log and the binlog
code.
Yes, that was my idea. I think that the InnoDB API for appending
something into the binlog could return the start or end LSN of the
mini-transaction. Usually we would be interested in the end LSN, but
the InnoDB checkpoint logic (to prevent a checkpoint from "splitting"
writes to a binlog page) would be interested in the start LSN.

In the binlog layer, I think that the only time when the LSN is of
interest is the creation of a binlog file. I think that it could be
something like the following:
1. Create an empty binlog file with the next available name.
2. Ensure that the file was durably created. (At least fdatasync() the
file; I don't think we currently care about syncing directories.)
3. Write InnoDB redo log for creating the file. (This probably does
not need to include the file name; it could be just a WRITE covering
the header page data.)
4. Write the header block to the file.

If crash recovery encounters a binlog file where the header page is
not complete, it would either recover that file from the redo log, or
it would delete the file if the ib_logfile0 had not been durably
written.
Recovery would never create binlog files on its own; that is why the
file needs to be durably created before an InnoDB log record is
written.
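
The four creation steps above could be sketched roughly as follows. This is only an illustration under assumptions: binlog_file_create(), redo_log_header_write(), HEADER_SIZE and the fake LSN counter are hypothetical names of mine, not existing InnoDB or binlog functions, and the redo log write is a stand-in for a real mini-transaction.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <fcntl.h>
#include <unistd.h>

// All names here are hypothetical; this is not an existing InnoDB API.
using lsn_t = std::uint64_t;
constexpr std::size_t HEADER_SIZE = 4096;  // assumed header page size

// Stand-in for step 3. A real implementation would write a WRITE record
// covering the header page in a mini-transaction and return its end LSN.
inline lsn_t redo_log_header_write(const unsigned char *, std::size_t len) {
  static lsn_t fake_lsn = 1000;  // placeholder for the real redo log position
  fake_lsn += len;
  return fake_lsn;
}

// Returns the end LSN of the logged header write, or 0 on failure.
inline lsn_t binlog_file_create(const char *path, const unsigned char *header) {
  // 1. Create an empty file. O_EXCL makes sure we never reuse an old file.
  int fd = open(path, O_CREAT | O_EXCL | O_WRONLY, 0640);
  if (fd < 0) return 0;
  // 2. Make the creation durable before any redo log record refers to it,
  //    because recovery never creates binlog files on its own.
  if (fdatasync(fd) != 0) {
    close(fd);
    unlink(path);
    return 0;
  }
  // 3. Log the header page contents to the redo log.
  const lsn_t end_lsn = redo_log_header_write(header, HEADER_SIZE);
  // 4. Write the header block to the file itself.
  const ssize_t n = pwrite(fd, header, HEADER_SIZE, 0);
  close(fd);
  return n == static_cast<ssize_t>(HEADER_SIZE) ? end_lsn : 0;
}
```

The ordering is the point here: the file exists durably before step 3, so a crash between steps 3 and 4 leaves recovery with an existing file whose header it can rewrite from the log.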
...
Yes. I am thinking that recovery can simply pass the data into the binlog
pwrite-like API, leaving exact details of how data will then be written into
the file system to the binlog code.
Exactly.
...
Not sure how WRITE and MEMMOVE differ, but the binlog code essentially only
needs the ability to log a byte string appended to a page, like what is done
in mtr_t::memcpy(const buf_block_t &b, ulint ofs, ulint len). Possibly also
MEMSET, just to make the log record shorter when filling a byte string with
identical bytes,
Since we are defining a new format for this page oriented binlog, the
MEMMOVE records are probably not going to be that useful. Those
records allow some very rudimentary compression of the ib_logfile0
when some data is being written multiple times to the same page. It
would basically be a special WRITE that says "copy these bytes from an
earlier binlog record in the same page", instead of repeating the same
bytes verbatim.
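
To illustrate how the three record types differ when applied to a page image, here is a minimal sketch. The function names, page_t and PAGE_SIZE are hypothetical and this is not the actual redo log record format; it only shows the effect of each record on the target page.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical appliers working on an in-memory page image.
constexpr std::size_t PAGE_SIZE = 4096;  // assumed page size
using page_t = std::vector<unsigned char>;

// WRITE: the record carries the bytes verbatim.
inline void apply_write(page_t &page, std::size_t ofs,
                        const unsigned char *data, std::size_t len) {
  assert(ofs + len <= page.size());
  std::memcpy(page.data() + ofs, data, len);
}

// MEMSET: the record carries a single byte value and a length.
inline void apply_memset(page_t &page, std::size_t ofs,
                         unsigned char value, std::size_t len) {
  assert(ofs + len <= page.size());
  std::memset(page.data() + ofs, value, len);
}

// MEMMOVE: the record carries only a source offset within the same page,
// so bytes that were already logged earlier are not repeated.
inline void apply_memmove(page_t &page, std::size_t ofs,
                          std::size_t src_ofs, std::size_t len) {
  assert(ofs + len <= page.size() && src_ofs + len <= page.size());
  std::memmove(page.data() + ofs, page.data() + src_ofs, len);
}
```

For an append-only binlog page, apply_write() and possibly apply_memset() would cover everything; apply_memmove() only pays off when the same bytes recur within one page.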
...
...
* InnoDB log checkpoint will be tweaked as follows:
** The log checkpoint must not "split" a binlog write.
** InnoDB must remember the start LSN of the last partial binlog block write.
I'm unsure here what "binlog write" refers to.
Does it refer to the write of the page to the file system layer (eg.
pwrite())?
Or does it refer to the redo logging of appended data to a page, similar to
currently mtr.start(); mtr.memcpy(); mtr.commit(); ?
It refers to the latter: a set of mini-transactions that are appending
data to the same binlog block. Because the recovery will not read any
binlog blocks, it will only be able to deal with situations where
after a checkpoint, the first record for writing into a binlog block
starts from offset 0. We could have multiple separate
mini-transactions, like this:

(1) Write 123 bytes to offset 0 of binlog block 123
(2) Write 123 bytes to offset 123 of binlog block 123
(3) Write 3846 bytes to offset 246 of binlog block 123
(4) Write 1234 bytes to offset 0 of binlog block 124

The checkpoint LSN may be advanced to anywhere before (1), or between
(3) and (4), but not anywhere else with respect to these. When the
binlog layer is asked to write and fdatasync() everything during a
checkpoint, it will also fully pad the last binlog block, so that
InnoDB knows that it can reset last_start_LSN=LSN_MAX so that the next
checkpoint will be able to move further. The binlog layer has to
guarantee that the next write will be to offset 0 of a new block (125
in the above example). InnoDB will be able to enforce this with a
debug assertion.
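
The rule for writes (1) to (4) could be tracked along these lines. This is a sketch with hypothetical names (binlog_checkpoint_limit, on_write(), on_padded()); the LSN values in any usage are made up, and the real bookkeeping would live inside the InnoDB log code.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

using lsn_t = std::uint64_t;
constexpr lsn_t LSN_MAX = UINT64_MAX;

// Hypothetical tracker for the checkpoint restriction described above.
struct binlog_checkpoint_limit {
  lsn_t last_start_lsn = LSN_MAX;  // LSN_MAX: no partially written block
  std::uint32_t cur_block = 0;
  std::uint32_t next_ofs = 0;      // expected offset of the next append

  // Called for each mini-transaction that appends to a binlog block;
  // start_lsn is the start LSN of that mini-transaction.
  void on_write(lsn_t start_lsn, std::uint32_t block, std::uint32_t ofs,
                std::uint32_t len) {
    if (ofs == 0) {
      // First write to a block: a checkpoint may land right before it.
      last_start_lsn = start_lsn;
      cur_block = block;
    } else {
      // Debug assertions: after padding, the next write must start a new
      // block at offset 0, and appends must be contiguous in one block.
      assert(last_start_lsn != LSN_MAX);
      assert(block == cur_block && ofs == next_ofs);
    }
    next_ofs = ofs + len;
  }

  // Called once the binlog layer has padded and fdatasync()ed the last block.
  void on_padded() { last_start_lsn = LSN_MAX; }

  // Upper bound for the checkpoint LSN, given the oldest dirty page LSN.
  lsn_t checkpoint_limit(lsn_t oldest_modification) const {
    return std::min(last_start_lsn, oldest_modification);
  }
};
```

With this, after (1) to (3) the limit stays at the start LSN of (1); once (4) arrives at offset 0 of block 124, the limit moves to the start LSN of (4), which is the "between (3) and (4)" case.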
...
...
** Checkpoint_LSN=min(last_start_LSN,buf_pool.get_oldest_modification()):
** Before or after fil_flush_file_spaces(), the last binlog file must
be durably written, with its last block padded.
Binlog can pad the last block easily (it is also done in case of FLUSH
BINARY LOGS which truncates the currently active binlog).
If we can allow the last_start_LSN to be the end of the last full block
(before any LSN for writing to the current, partial block), we could avoid
having to pad a block for every checkpoint. Then the binlog needs to ensure
that the checkpoint LSN can always advance (eg. padding or at least fully
rewriting the last block if it has not completed since the last checkpoint
or some timeout).
We can allow that, but then the binlog layer must resubmit WRITE
records covering everything from the start of the last (incomplete)
binlog block. This would have the benefit that the page oriented
format would not need to tolerate any "padding" in the middle of the
binlog.
...
As I promised, here is a first draft of a possible API between the binlog
and redo/recovery code along the lines discussed. It is mostly based on what
I see the binlog code will need, and changes will probably be needed to
suit the redo/recovery part that I am still not very familiar with.
[snip]
binlog_tablespace_create(tablespace, length_in_pages)
  Create a tablespace. Register the new tablespace file for redo logging.
We don't need to register any tablespace metadata in InnoDB. We can
simply hard-code two tablespace IDs to refer to the binlog files. A
tablespace object was needed in the earlier prototype, because all
data was being written through the InnoDB buffer pool.

In InnoDB, this only needs to write a WRITE record with the payload of
the header page. Recovery will additionally know the end LSN of the
mini-transaction, which will be what this API will return, so that you
can write the file creation LSN to the binlog file header. This could
also return the 1 bit of InnoDB pseudo tablespace ID, which could be
written to the header page.
...
binlog_tablespace_close(tablespace)
  Close a tablespace. Marks to redo logging that this tablespace file is
  now fully durably written to disk and will not receive any further updates.
All this needs to do in InnoDB is to assign last_start_lsn=LSN_MAX so
that the checkpoint can be advanced.
Possibly we will need last_start_lsn[2], to correspond to both binlog
tablespace ID values that the InnoDB redo log knows about.
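
A sketch of what the two-slot variant might look like, assuming the two hard-coded binlog tablespace IDs map to indexes 0 and 1. The struct and its member names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

using lsn_t = std::uint64_t;
constexpr lsn_t LSN_MAX = UINT64_MAX;

// Hypothetical per-binlog-file state kept by the redo log code.
struct binlog_redo_state {
  // One slot per binlog pseudo tablespace ID (0 or 1).
  lsn_t last_start_lsn[2] = {LSN_MAX, LSN_MAX};

  // The ID alternates between 0 and 1 for each newly created binlog file.
  static unsigned next_id(unsigned id) { return id ^ 1; }

  // A write to offset 0 of a block in file `id` started at start_lsn.
  void block_started(unsigned id, lsn_t start_lsn) {
    assert(id < 2);
    last_start_lsn[id] = start_lsn;
  }

  // binlog_tablespace_close(): the file is fully durable, so it no longer
  // holds back the checkpoint.
  void tablespace_close(unsigned id) {
    assert(id < 2);
    last_start_lsn[id] = LSN_MAX;
  }

  // The checkpoint must not pass the older of the two partial blocks.
  lsn_t checkpoint_limit(lsn_t oldest_modification) const {
    return std::min({last_start_lsn[0], last_start_lsn[1],
                     oldest_modification});
  }
};
```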
...
binlog_tablespace_truncate(tablespace, new_length_in_pages)
  Truncate a binlog tablespace (like mtr.trim_pages() and
  mtr.commit_shrink()). Can be independent, does not need to be part of a
  logging group with any other operations.
When would this be invoked?

My understanding is that InnoDB only needs to identify 2 files:
the one that is being written to, and another one that is being
created when the old file is about to fill up.
For these, an alternating tablespace ID will be assigned.

What the binlog might do with older binlog files (such as moving them
to an archive location, or removing them) does not interest InnoDB.
See also below.
...
API for interacting with InnoDB checkpointing and recovery. This is based
on what I see as minimal needs from the binlog point of view. Probably need
something more here, eg. to supply the last_start_LSN you mentioned:
binlog_write_up_to(lsn)
  Request the binlog to durably write ASAP all data needed up to the specified LSN.
  Could be called by InnoDB checkpointing code, similar to
  fil_flush_file_spaces() perhaps.
Right. This call could also pass the previously completed checkpoint
LSN, which would give a permission to delete or archive any older
binlog files.
In this way, the binlog layer could safely remove or archive the
last-but-one binlog file, and only retain 1 file if that is desirable.

We could also include a separate call for indicating the latest
checkpoint LSN. That would typically be invoked soon after the
binlog_write_up_to(lsn).
...
binlog_report_lsn(lsn):
  Called by binlog code to inform redo logging that all binlog data prior to
  that LSN is now durably written to disk. Could also be a synchronous
  return from binlog_write_up_to() if that fits better.
The dedicated InnoDB buf_flush_page_cleaner() thread could invoke
binlog_write_up_to(lsn) at the start of a page write batch and then
invoke a function to wait for the completion later.
I think that it would be simplest to use the same thread for both
(kind of "push" interface for both instead of "push" for one and
"pull" for the other).
Or we could just merge these two interfaces for now.
...
binlog_recover_data(tablespace_id, page_no, in_page_offset, length, buffer)
  During crash recovery, passes recovered data to the binlog layer.
  Recovered data is supplied in same order that it was originally written
  to the redo log.
  All data following the last binlog_report_lsn() is guaranteed to be
  recovered. Data before that LSN may or may not be recovered; the binlog
  code needs to handle either case.
binlog_recover_tablespace_create(tablespace, length_in_pages)
binlog_recover_tablespace_truncate(tablespace, new_length_in_pages)
  Recovers a tablespace creation or truncation event.
The creation could be merged to binlog_recover_data() as well.

Because you also mentioned binlog_tablespace_truncate(), for which I
do not see any need, I wonder what the intended purpose of
binlog_recover_tablespace_truncate() would be.

We do need something that would trim the end of a binlog file, to
discard anything that was not recovered via the ib_logfile0. That
could be implemented as part of the binlog_recover_data() logic,
exploiting the fact that all writes are going to be in ascending order
of page number and byte offset, with the possible exception of
starting a rewrite of the last block from byte offset 0.

We would seem to need a call that would inform the binlog of the
latest recovered LSN, so that any file that carries a newer creation
LSN will be deleted by the binlog recovery logic.
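
The trimming and deletion logic could be sketched as follows. All names (binlog_recovery_end, on_data(), end_position(), must_delete()) and the page size are assumptions for illustration; the sketch only tracks where the recovered data ends and leaves the actual truncation to the binlog code.

```cpp
#include <cassert>
#include <cstdint>

using lsn_t = std::uint64_t;
constexpr std::uint64_t PAGE_SIZE = 4096;  // assumed binlog page size

// Hypothetical tracker fed by binlog_recover_data() for one binlog file.
struct binlog_recovery_end {
  std::uint32_t page_no = 0;
  std::uint32_t ofs = 0;  // first byte past the recovered data in page_no
  bool any = false;       // whether any data was recovered for this file

  void on_data(std::uint32_t page, std::uint32_t in_page_offset,
               std::uint32_t length) {
    // Writes arrive in ascending order of page number and byte offset,
    // except that the last block may be rewritten from byte offset 0.
    assert(!any || page > page_no ||
           (page == page_no && (in_page_offset >= ofs || in_page_offset == 0)));
    page_no = page;
    ofs = in_page_offset + length;
    any = true;
  }

  // Byte position after which file contents were not recovered via the
  // redo log and can be discarded.
  std::uint64_t end_position() const {
    assert(any);
    return page_no * PAGE_SIZE + ofs;
  }
};

// A file created after the latest recovered LSN was never durably logged.
inline bool must_delete(lsn_t creation_lsn, lsn_t recovered_lsn) {
  return creation_lsn > recovered_lsn;
}
```

If no data was recovered for a file (any == false), this sketch says nothing about trimming it; such a file may simply be complete as of the checkpoint.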

Marko
-- 
Marko Mäkelä, Lead Developer InnoDB
MariaDB plc