[MariaDB developers] Next step on MDEV-34705, implement binlog in InnoDB

3 Jan 2025

      Hi Marko,

Time flies; somehow it has already been more than a year since our first
discussions on implementing the binlog in InnoDB and avoiding the extra
fsync() and the complexity of the two-phase commit between InnoDB and the
binlog.

But things have progressed, and I have now reached the point where most of
the basic groundwork is implemented. Event groups are binlogged to InnoDB
tablespaces. The binlog dump thread can read the binlog and send it to the
slave, and replication is working. Large event groups are split into pieces,
bounding the amount of data that needs to be atomically written in
mini-transactions and at commit time. There are still many details left, but
they are mostly in the server-layer replication code, which should be
manageable; it will just take some time to complete.
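
To illustrate the idea of splitting a large event group into bounded pieces,
here is a minimal sketch. It is not the code from the patch: the names Chunk
and split_event_group and the max_chunk parameter are hypothetical, and the
real piece size and bookkeeping in the branch may differ.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical description of one piece of an event group.
struct Chunk {
  std::size_t offset;  // byte offset of the piece within the event group
  std::size_t length;  // number of bytes in the piece
};

// Split an event group of total_size bytes into consecutive pieces of at
// most max_chunk bytes each, so that no single write exceeds max_chunk.
// Illustrative only; max_chunk is an assumed tunable, not a server variable.
std::vector<Chunk> split_event_group(std::size_t total_size,
                                     std::size_t max_chunk) {
  assert(max_chunk > 0);
  std::vector<Chunk> chunks;
  for (std::size_t off = 0; off < total_size; off += max_chunk) {
    chunks.push_back({off, std::min(max_chunk, total_size - off)});
  }
  return chunks;
}
```

With this sketch, a 10-byte group and a 4-byte limit yield pieces of 4, 4 and
2 bytes, and only the last piece can be shorter than the limit.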

I think now is a good time for you to take a first real look at the InnoDB
part of the changes; I would really value your input.

The main part of the InnoDB code is in two files:

1. handler/handler0binlog.cc for the high-level part that deals mostly with
the new binlog file format and interfacing to the server layer.

2. fsp/fsp0binlog.cc for the low-level part that most tightly interacts with
the InnoDB mini-transactions and buffer pool.

The most interesting part for you to look at is fsp/fsp0binlog.cc (~1k
lines), though I'm happy to hear comments on any part of the patch, of
course.

The code is pushed to GitHub in the branch knielsen_binlog_in_engine:

  https://github.com/MariaDB/server/commits/knielsen_binlog_in_engine

and I've also attached the complete patch.

This is my first major patch for InnoDB, so there will undoubtedly be a
number of style changes required. But the overall structure of the code
should now be close to what I imagine would be the final result, with some
pending ToDo steps marked in comments in the code, and detailed in the below
list, some of which we discussed a bit already.

I hope you will take a look at the patch and let me know of any questions or
other things you need from me. Maybe we can also find a chance to discuss
further if you come to FOSDEM at the start of February, or I could visit
Finland sometime.

 - Kristian.

----
Some known outstanding issues in the InnoDB part:

 - We previously discussed removing some of the page header overhead for
   binlog tablespaces. Currently the code just leaves the first
   FIL_PAGE_DATA bytes (38) and the last FIL_PAGE_DATA_END bytes (8, IIRC)
   of each page untouched.

 - We previously discussed writing the current LSN at the start of the
   tablespace and using it in recovery to handle the fact that we have only
   two tablespace IDs, which are reused. So we need code in recovery that
   checks the LSN at the start of the tablespace and skips redo records with
   an LSN smaller than this.

 - We want to avoid the double-write buffer for binlog pages, at least for
   the first page write (most pages will only be written as full pages). You
   mentioned an idea to completely avoid the double-write buffer and instead
   do some specific code for recovery in the uncommon case where a partial
   binlog page is written to disk due to low commit activity.

 - The flushing of binlog pages to disk currently happens in a dedicated
   thread in the background. I'd welcome ideas on how to do this
   differently. It is good to flush binlog pages quickly and re-use their
   buffer pool entries for something better. Also writing the pages to disk
   quickly (not necessarily fsync()'ing) makes the data readable by
   mysqlbinlog.

 - Checksum and encryption should use the standard InnoDB mechanism. I
   assume checksum is already handled in the code through using the buffer
   pool and mini-transactions to read/write pages. I am not sure about
   encryption. I still need to make the code handle checksums and
   decryption when reading pages directly from the file (not through the
   buffer pool).
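
The recovery check from the second item above (skip redo records older than
the LSN stored at the start of the tablespace) could look roughly like the
sketch below. This is only an illustration under assumptions: RedoRecord,
should_apply and records_to_apply are hypothetical names, lsn_t is a local
stand-in for a 64-bit LSN, and nothing here is taken from the actual InnoDB
recovery code.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

using lsn_t = std::uint64_t;  // local stand-in for a 64-bit log sequence number

// Hypothetical, simplified redo record: which tablespace it targets and
// the LSN at which it was logged.
struct RedoRecord {
  std::uint32_t space_id;
  lsn_t lsn;
};

// A record is applied only if it is not older than the LSN written at the
// start of the current incarnation of the tablespace. Older records belong
// to a previous use of the same (reused) tablespace ID and are skipped.
bool should_apply(const RedoRecord &rec, lsn_t tablespace_start_lsn) {
  return rec.lsn >= tablespace_start_lsn;
}

// Return the records for space_id that survive the LSN check.
std::vector<RedoRecord> records_to_apply(const std::vector<RedoRecord> &log,
                                         std::uint32_t space_id,
                                         lsn_t tablespace_start_lsn) {
  std::vector<RedoRecord> out;
  for (const RedoRecord &rec : log) {
    if (rec.space_id == space_id && should_apply(rec, tablespace_start_lsn)) {
      out.push_back(rec);
    }
  }
  return out;
}
```

The point of the comparison is that with only two reused tablespace IDs, the
ID alone cannot tell recovery whether a redo record refers to the current
binlog file or to an earlier one that used the same ID.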

Kristian Nielsen