[MariaDB developers] Re: Next step on MDEV-34705, implement binlog in InnoDB

29 Jan 2025

Marko Mäkelä <marko.makela@mariadb.com> writes:
...
> Finally, I got some more time to think about this. I’m trying to
> summarize from the InnoDB point of view what we discussed today.

Thanks Marko! This is an interesting development. If we can bypass using the
buffer pool and associated machinery it could simplify the logic a lot, and
potentially also improve performance further.
...
> * Based on the binlog file creation LSN (in the first block, say, 4096
> bytes), InnoDB recovery will:
> ** ignore files that are older than the checkpoint LSN
> ** delete files that are newer than the last recovered LSN
> ** recover any other files (re-apply writes or trim the contents after
> the last write)

Agree.

I was thinking that the binlog layer would need to be informed of up to which
LSN the redo log has been durably written to disk (in case of
--innodb-flush-log-at-trx-commit=0|2). But does "trim the contents" imply
that the binlog is free to write pages to the file system even ahead of the
redo log, because any data beyond the eventually recovered LSN will then be
cleared during recovery? This could be quite neat and simplify things, and
also reduce the need for synchronisation between the redo log and the binlog
code.
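
The three recovery cases quoted above can be sketched as a small
classification function. This is an illustration only, not code from the
server. The thread does not settle how "older than the checkpoint LSN" is
determined, so the sketch assumes a hypothetical last_write_lsn input; the
names BinlogFileInfo, RecoveryAction and classify_binlog_file are likewise
made up for the example.

```cpp
#include <cstdint>

using lsn_t = std::uint64_t;  // stand-in for InnoDB's LSN type

enum class RecoveryAction { Ignore, Delete, Recover };

struct BinlogFileInfo {
  lsn_t creation_lsn;    // stored in the first block of the file
  lsn_t last_write_lsn;  // hypothetical: LSN of the last redo record for the file
};

// Decide what recovery does with one binlog file.
RecoveryAction classify_binlog_file(const BinlogFileInfo &file,
                                    lsn_t checkpoint_lsn,
                                    lsn_t last_recovered_lsn) {
  // Created after the end of the recovered redo log: the file never
  // "existed" from the point of view of recovery, so it is deleted.
  if (file.creation_lsn > last_recovered_lsn)
    return RecoveryAction::Delete;
  // Every write to the file precedes the checkpoint: nothing to replay.
  if (file.last_write_lsn < checkpoint_lsn)
    return RecoveryAction::Ignore;
  // Otherwise re-apply the logged writes and trim anything after them.
  return RecoveryAction::Recover;
}
```

With a checkpoint at LSN 300 and recovery ending at LSN 500, a file with
(creation, last write) = (100, 200) is ignored, (250, 450) is recovered, and
(600, 650) is deleted.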
...
> ** not read anything from the files

Agree, this sounds good.
...
> ** invoke a pwrite() like binlog API that takes care of any encryption

Yes. I am thinking that recovery can simply pass the data into the binlog
pwrite-like API, leaving exact details of how data will then be written into
the file system to the binlog code.
...
> ** make sure that there are only WRITE, MEMSET, MEMMOVE records, in
> strictly sequential order

Agree with strict sequential order, I have tried very hard to preserve this
property so far.

Not sure how WRITE and MEMMOVE differ, but the binlog code essentially only
needs the ability to log a byte string appended to a page, like what is done
in mtr_t::memcpy(const buf_block_t &b, ulint ofs, ulint len). Possibly also
MEMSET, just to make the log record shorter when filling a byte string with
identical bytes.
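
A minimal sketch of what "append a byte string to a page, in strictly
sequential order" could look like when the records are applied. Everything
here is hypothetical (BinlogPage, append_write, append_memset, and the 4096
byte page size); it only illustrates that a WRITE-like and a MEMSET-like
record both extend the used part of the page and that any out-of-order offset
is rejected.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

constexpr std::size_t kPageSize = 4096;  // assumed page size

struct BinlogPage {
  std::vector<unsigned char> data = std::vector<unsigned char>(kPageSize, 0);
  std::size_t used = 0;  // bytes appended so far
};

// A record is valid only if it starts exactly where the previous one ended
// and fits in the page.
static bool can_append(const BinlogPage &page, std::size_t ofs,
                       std::size_t len) {
  return ofs == page.used && len <= kPageSize - ofs;
}

// WRITE-like record: append an arbitrary byte string.
bool append_write(BinlogPage &page, std::size_t ofs, const void *src,
                  std::size_t len) {
  if (!can_append(page, ofs, len)) return false;
  std::memcpy(page.data.data() + ofs, src, len);
  page.used = ofs + len;
  return true;
}

// MEMSET-like record: append len copies of one byte. The logged record only
// needs the value and the length, not the repeated bytes.
bool append_memset(BinlogPage &page, std::size_t ofs, unsigned char value,
                   std::size_t len) {
  if (!can_append(page, ofs, len)) return false;
  std::memset(page.data.data() + ofs, value, len);
  page.used = ofs + len;
  return true;
}
```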
...
> * InnoDB log checkpoint will be tweaked as follows:
> ** The log checkpoint must not "split" a binlog write.
> ** InnoDB must remember the start LSN of the last partial binlog block write.

I'm unsure here what "binlog write" refers to.

Does it refer to the write of the page to the file system layer (eg.
pwrite())?

Or does it refer to the redo logging of appended data to a page, similar to
currently mtr.start(); mtr.memcpy(); mtr.commit(); ?
...
> ** Checkpoint_LSN=min(last_start_LSN,buf_pool.get_oldest_modification()):
> ** Before or after fil_flush_file_spaces(), the last binlog file must
> be durably written. last block padded.

Binlog can pad the last block easily (it is also done in case of FLUSH
BINARY LOGS which truncates the currently active binlog).

If we can allow the last_start_LSN to be the end of the last full block
(before any LSN for writing to the current, partial block), we could avoid
having to pad a block for every checkpoint. Then the binlog needs to ensure
that the checkpoint LSN can always advance (eg. padding or at least fully
rewriting the last block if it has not completed since the last checkpoint
or some timeout).
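
To make the variant above concrete, here is a rough sketch, under the
assumption that last_start_LSN only moves forward when a full block has been
durably written. BinlogCheckpointTracker and its methods are hypothetical
names; the oldest_modification argument stands in for the value of
buf_pool.get_oldest_modification().

```cpp
#include <algorithm>
#include <cstdint>

using lsn_t = std::uint64_t;  // stand-in for InnoDB's LSN type

class BinlogCheckpointTracker {
 public:
  explicit BinlogCheckpointTracker(lsn_t start_lsn)
      : last_start_lsn_(start_lsn) {}

  // Called when a complete block is durably on disk. end_lsn is the LSN at
  // the end of the redo records that filled that block. Writes into the
  // current partial block do not move the limit.
  void on_full_block_durable(lsn_t end_lsn) {
    last_start_lsn_ = std::max(last_start_lsn_, end_lsn);
  }

  lsn_t last_start_lsn() const { return last_start_lsn_; }

  // Checkpoint_LSN = min(last_start_LSN, oldest modification in buffer pool)
  lsn_t checkpoint_lsn(lsn_t oldest_modification) const {
    return std::min(last_start_lsn_, oldest_modification);
  }

 private:
  lsn_t last_start_lsn_;
};
```

In this model the checkpoint stalls at last_start_lsn() for as long as the
current block stays partial, which is why the binlog would have to pad or
rewrite the last block after some time.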
...
> I will start to implement the log writing and recovery logic.

Great!

As I promised, here is a first draft of a possible API between the binlog
and redo/recovery code along the lines discussed. It is mostly based on what
I see the binlog code will need, and changes will probably be needed to
suit the redo/recovery part that I am still not very familiar with.

API for binlog to append data to binlog tablespace files:

binlog_record_begin()
  Start an atomically recovered logging group.
  Optionally part of an existing mtr (ie. InnoDB trx commit) for atomic recovery.

binlog_record_memcpy(tablespace, page, in_page_offset, length, data)
  Redo log a byte string (maybe a memset() variant too).
  Always strictly append-only to a page.
  Optionally part of an existing mtr (ie. InnoDB trx commit) for atomic recovery.
  If the offset is 0, this implicitly INIT_PAGE.

binlog_record_end()
  End an atomically recovered logging group.
  Returns the corresponding LSN.

binlog_tablespace_create(tablespace, length_in_pages)
  Create a tablespace. Register the new tablespace file for redo logging.

binlog_tablespace_close(tablespace)
  Close a tablespace. Marks to redo logging that this tablespace file is
  now fully durably written to disk and will not receive any further updates.

binlog_tablespace_truncate(tablespace, new_length_in_pages)
  Truncate a binlog tablespace (like mtr.trim_pages() and
  mtr.commit_shrink()). Can be independent, does not need to be part of a
  logging group with any other operations.
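
For illustration, the begin/memcpy/end part of the draft can be mocked in
memory as below. This is a sketch, not a proposed implementation: the
parameter types, the MockRedoLog class, and the rule that the LSN advances by
the payload length are all assumptions made for the example (a real LSN
would also count record headers).

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

using lsn_t = std::uint64_t;  // stand-in for InnoDB's LSN type

struct RedoRecord {
  std::uint32_t tablespace;
  std::uint32_t page;
  std::uint32_t in_page_offset;
  std::string data;
};

class MockRedoLog {
 public:
  // Start an atomically recovered logging group.
  void binlog_record_begin() {
    group_.clear();
    in_group_ = true;
  }

  // Buffer one append; nothing reaches the log until the group ends.
  bool binlog_record_memcpy(std::uint32_t tablespace, std::uint32_t page,
                            std::uint32_t in_page_offset, std::size_t length,
                            const void *data) {
    if (!in_group_) return false;
    group_.push_back({tablespace, page, in_page_offset,
                      std::string(static_cast<const char *>(data), length)});
    return true;
  }

  // Publish the whole group at once and return the LSN after it.
  lsn_t binlog_record_end() {
    for (RedoRecord &r : group_) {
      lsn_ += r.data.size();  // assumption: LSN counts payload bytes only
      log_.push_back(std::move(r));
    }
    group_.clear();
    in_group_ = false;
    return lsn_;
  }

  const std::vector<RedoRecord> &log() const { return log_; }

 private:
  std::vector<RedoRecord> group_;
  std::vector<RedoRecord> log_;
  lsn_t lsn_ = 0;
  bool in_group_ = false;
};
```

A caller would do binlog_record_begin(), one or more binlog_record_memcpy()
calls, then binlog_record_end(), and keep the returned LSN to compare
against later durability reports.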

API for interacting with InnoDB checkpointing and recovery. This is based
on what I see as minimal needs from the binlog point of view. Probably need
something more here, eg. to supply the last_start_LSN you mentioned:

binlog_write_up_to(lsn)
  Request the binlog to durably write ASAP all data needed up to specified lsn.
  Could be called by InnoDB checkpointing code, similar to
  fil_flush_file_spaces() perhaps.

binlog_report_lsn(lsn)
  Called by binlog code to inform redo logging that all binlog data prior to
  that LSN is now durably written to disk. Could also be a synchronous
  return from binlog_write_up_to() if that fits better.

binlog_recover_data(tablespace_id, page_no, in_page_offset, length, buffer)
  During crash recovery, passes recovered data to the binlog layer.
  Recovered data is supplied in same order that it was originally written
  to the redo log.
  All data following the last binlog_report_lsn() is guaranteed to be
  recovered. Data before that LSN may or may not be recovered; the binlog
  code needs to handle either case.

binlog_recover_tablespace_create(tablespace, length_in_pages)
binlog_recover_tablespace_truncate(tablespace, new_length_in_pages)
  Recovers a tablespace creation or truncation event.
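
A sketch of how the binlog side of binlog_recover_data() might behave, using
an in-memory buffer in place of the file. The class name, the 4096 byte page
size, and the choice to clear everything after each replayed write are
assumptions for illustration. Replaying a record whose data is already on
disk just rewrites the same bytes, which is one way to tolerate the "may or
may not be recovered" case.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

constexpr std::size_t kPageSize = 4096;  // assumed page size

class RecoveredBinlogFile {
 public:
  explicit RecoveredBinlogFile(std::size_t pages)
      : pages_(pages), bytes_(pages * kPageSize, 0) {}

  // Apply one recovered record. Records arrive in original log order.
  bool recover_data(std::uint32_t page_no, std::uint32_t in_page_offset,
                    std::size_t length, const unsigned char *buffer) {
    if (page_no >= pages_ || in_page_offset > kPageSize ||
        length > kPageSize - in_page_offset)
      return false;
    const std::size_t pos =
        static_cast<std::size_t>(page_no) * kPageSize + in_page_offset;
    std::memcpy(bytes_.data() + pos, buffer, length);
    // Trim: the file is append-only, so anything after this write was
    // written ahead of the recovered redo log and is cleared. A real
    // implementation would likely do this once, after the last record.
    std::fill(bytes_.begin() + static_cast<std::ptrdiff_t>(pos + length),
              bytes_.end(), static_cast<unsigned char>(0));
    return true;
  }

  std::vector<unsigned char> &bytes() { return bytes_; }

 private:
  std::size_t pages_;
  std::vector<unsigned char> bytes_;
};
```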

 - Kristian.

Kristian Nielsen