Marko Mäkelä <marko.makela@mariadb.com> writes:
I think that what you have written so far should be useful for an initial feasibility study, for measuring the performance. We do not need recovery to actually work when running the initial tests.
Yes, agreed. Thanks for your comments; it's good to know that I'm on the right track, and they should help me understand more of the details of InnoDB as I develop the patch further.
As we discussed a week ago, some more changes would be needed around writing the GTID. We might want to assign the GTID in mtr_t::do_write() under the protection of an exclusive log_sys.latch, to ensure that transactions are made durable in the GTID order.
Yes, I will try to get this done next, and that could already be a good basis for the initial benchmarking you suggest. We don't need an implementation of slave dump threads reading the binlog tablespaces to learn something about the performance on the master (as long as the data written is close to what it would be in a full implementation). So the questions below relate to a later step with a full implementation; they don't need to be finalized for initial testing.
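But first, to check that I understand the ordering requirement, here is a rough sketch of how I picture the GTID assignment under the exclusive latch. This is toy code only: std::mutex and a plain counter stand in for log_sys.latch and the server's GTID state, and nothing here is an actual InnoDB interface.

// Toy model only: std::mutex stands in for the exclusive log_sys.latch,
// and a plain counter stands in for the server's GTID state.  In InnoDB
// itself this would happen inside mtr_t::do_write() on the commit path.
#include <cstdint>
#include <mutex>
#include <string>
#include <vector>

struct Gtid { uint32_t domain_id; uint32_t server_id; uint64_t seq_no; };

std::mutex log_latch;                  // stand-in for exclusive log_sys.latch
uint64_t   next_seq_no = 1;            // stand-in for the GTID sequence state
std::vector<std::string> redo_buffer;  // stand-in for the redo log buffer

// Called once per committing transaction that carries binlog data.
Gtid commit_with_gtid(const std::string &binlog_event)
{
  std::lock_guard<std::mutex> guard(log_latch);

  // Assigning the GTID while holding the exclusive latch guarantees that
  // the GTID order matches the order in which the commits (and hence the
  // binlog pages) become durable.
  Gtid gtid{0 /* domain_id */, 1 /* server_id */, next_seq_no++};

  // Append the event together with its GTID under the same latch, so no
  // other commit can interleave between GTID assignment and the write.
  redo_buffer.push_back("GTID 0-1-" + std::to_string(gtid.seq_no) +
                        ": " + binlog_event);
  return gtid;
}

The point I want to capture is just that the GTID assignment and the append to the log happen under the same exclusive latch, so no other commit can slip in between the two.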
The InnoDB redo log identifies files by a tablespace ID. I think that we would want to reserve 2 tablespace IDs for the page-oriented binlog files. We do not need to write the binlog file names into the redo log; we can hard-code a pattern for them, and we can write the 1-bit tablespace ID into the first page of the file. When switching tablespace files, we would toggle this ID bit.
I would tweak the log checkpoint to ensure that all pages of the "previous" binlog tablespace are written back before we can advance the log checkpoint.
Conversely, we would then also need to wait for a log checkpoint before we can rotate to a new binlog tablespace, right? Because if more than two binlog tablespaces were actively written between log checkpoints, it would be ambiguous which tablespace a log record should be applied to. I think log checkpoints can be relatively infrequent, to improve transaction throughput and reduce I/O (at the cost of longer recovery), right? Then this would mean that each binlog tablespace would need to grow as needed and could not have a specified maximum size. But I'm not 100% sure that I understand all the details around recovery and log checkpoints here.

Allocating a set of "normal" space ids and reusing them (once the tablespace has been fully synced to disk and a new log checkpoint created) could remove this dependency and allow binlog tablespace rotation independent of the last log checkpoint. But it would be nice to avoid keeping track of allocated tablespace ids and just have two fixed ids for this; I like that approach if it can be made to work. In any case, I'm sure this issue can be solved in some way; for now I'm just trying to understand what the constraints are.
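To make sure we mean the same thing, here is a tiny sketch of the rotation constraint as I currently understand it. The space IDs and the checkpoint hook are made up for illustration; only the idea of two fixed IDs differing in one bit comes from your suggestion.

// Illustration only: the two space IDs and the checkpoint hook are made
// up, not existing InnoDB interfaces.  The point is that with just two
// fixed IDs, a rotation may only reuse an ID after a log checkpoint has
// been taken since the previous rotation.
#include <cassert>
#include <cstdint>

constexpr uint32_t BINLOG_SPACE_FIRST  = 0xFFFFFFF0;  // hypothetical reserved ID
constexpr uint32_t BINLOG_SPACE_SECOND = 0xFFFFFFF1;  // differs only in the low bit
static_assert((BINLOG_SPACE_FIRST ^ 1) == BINLOG_SPACE_SECOND,
              "the two reserved IDs differ only in the low bit");

struct BinlogRotation
{
  uint32_t active_space = BINLOG_SPACE_FIRST;
  bool     checkpoint_since_rotation = true;  // no earlier file to worry about yet

  // Switch to the other tablespace ID.  Without a checkpoint in between,
  // recovery could not tell whether a redo record for this ID belongs to
  // the old file or to the new one.
  uint32_t rotate()
  {
    assert(checkpoint_since_rotation &&
           "must wait for a log checkpoint before reusing the other ID");
    active_space ^= 1;                        // toggle the 1-bit tablespace ID
    checkpoint_since_rotation = false;
    return active_space;
  }

  // Called after a log checkpoint that (with the tweak you describe)
  // guarantees all pages of the previous binlog file are written back.
  void on_log_checkpoint() { checkpoint_since_rotation = true; }
};

If rotation is always gated on a checkpoint like this, redo records for a given ID can only belong to its most recent file, which I think is what recovery needs to stay unambiguous.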
For the final implementation, I would bypass as much as possible of the "middleware" that resides above the buffer pool. For normal InnoDB tablespaces, there are page headers and footers that waste quite a bit of space, and there is also management of allocated pages within the tablespace.
Ok, sounds good; we can get into the details of this later.
The minimum that we actually need is a 4-byte checksum at the end of each page, and possibly also the 8-byte log sequence number that is normally stored at FIL_PAGE_LSN. If you can guarantee that the binlog is always being written sequentially, we may do away with one or both, and remove the "apply the log unless the page is newer" logic, that is, unconditionally apply any log to the binlog file on recovery.
Yes, this is a good idea. Unconditionally applying redo log should work, I think.
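To make this concrete for myself, here is a toy sketch of what a stripped-down binlog page and the unconditional apply could look like. FIL_PAGE_LSN is the real InnoDB field name; everything else (the sizes, the trivial checksum standing in for CRC-32C, the helper names) is just for illustration.

// Toy model of a stripped-down binlog page: an 8-byte LSN at the start
// (corresponding to FIL_PAGE_LSN; keeping it is optional) and a 4-byte
// checksum at the very end, with everything in between as payload.
// The checksum here is a trivial stand-in for the real CRC-32C.
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

constexpr size_t PAGE_SIZE   = 16384;
constexpr size_t LSN_OFFSET  = 0;               // like FIL_PAGE_LSN
constexpr size_t DATA_OFFSET = 8;
constexpr size_t CSUM_OFFSET = PAGE_SIZE - 4;
constexpr size_t DATA_SIZE   = CSUM_OFFSET - DATA_OFFSET;

using Page = std::vector<uint8_t>;              // always PAGE_SIZE bytes long

uint32_t toy_checksum(const Page &page)         // stand-in for CRC-32C
{
  uint32_t sum = 0;
  for (size_t i = 0; i < CSUM_OFFSET; i++)
    sum = sum * 31 + page[i];
  return sum;
}

// Recovery path: apply a redo record unconditionally, i.e. without the
// usual "only if the record is newer than FIL_PAGE_LSN" check.  This is
// only safe because binlog pages are written strictly sequentially and
// are never overwritten with older data.
void apply_redo(Page &page, uint64_t lsn, size_t offset,
                const uint8_t *payload, size_t len)
{
  assert(page.size() == PAGE_SIZE && offset + len <= DATA_SIZE);
  std::memcpy(page.data() + DATA_OFFSET + offset, payload, len);
  std::memcpy(page.data() + LSN_OFFSET, &lsn, sizeof lsn);
  const uint32_t csum = toy_checksum(page);
  std::memcpy(page.data() + CSUM_OFFSET, &csum, sizeof csum);
}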
And maybe there is a way to pin the current page in the buffer pool so buf_page_get_gen() is not needed for every write?
There is, and it is called buffer-fixing. However, while a page is being loaded into the buffer pool, we must acquire a page latch so that we can ensure that the page has been fully loaded before we access it.
Ok, thanks for explaining these details. I'll need to get a better understanding of the use of the buffer pool when it's time to expand on this part. I think I will be able to guarantee that readers will never access ranges that are being written concurrently, and that writers will never write the same data concurrently. I am wondering if it would make sense to fix the page in the buffer pool already at fsp_page_create(), keep it fixed until it has been written (not necessarily fsync()'ed), and after that have the slave dump threads read the data from the file through the OS, so the page can be dropped from the buffer pool. That would reduce the load on the buffer pool, and especially avoid binlog pages being evicted from the pool and then later re-read. But these are probably details that can be handled later.
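Just to illustrate what I have in mind (all names below are invented; only the buffer-fix concept itself is from InnoDB): the active page would be fixed once when it is created and only unfixed after it has been written out, so the write path never needs another buf_page_get_gen() lookup.

// Toy illustration of keeping the current binlog page buffer-fixed from
// creation until it has been written out.  None of these names exist in
// InnoDB; they only sketch the lifetime I described above.
#include <atomic>
#include <cstddef>
#include <cstdint>

struct ToyPage
{
  std::atomic<int> buf_fix_count{0};  // > 0 means the page must stay in the pool
  bool written_to_file = false;
};

struct ActiveBinlogPage
{
  ToyPage *page = nullptr;

  // Fetch/create the page once (the analogue of fsp_page_create()) and
  // keep it fixed; later appends reuse the pointer without another
  // buffer-pool lookup.
  void open(ToyPage *p)
  {
    page = p;
    page->buf_fix_count.fetch_add(1);
  }

  void append(const uint8_t *data, size_t len)
  {
    // ... copy data into the fixed page and write redo for it ...
    (void) data;
    (void) len;
  }

  // Once the page has been written (not necessarily fsync()'ed), drop the
  // fix so the buffer pool may evict it; dump threads then read the data
  // back from the file through the OS.
  void close_after_write()
  {
    page->written_to_file = true;
    page->buf_fix_count.fetch_sub(1);
    page = nullptr;
  }
};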
Another assertion was fixed by doing mtr.set_named_space() before writing. Again, I'm not sure what this does exactly or if it's appropriate?
The purpose of that is to ensure that FILE_MODIFY records are written, to allow recovery to construct a mapping between tablespace IDs and file names. We do not use this for the InnoDB system tablespace or the undo log tablespaces, and we do not want it for the binlog tablespaces either.
Ok, thanks for the explanation. At some point, then, I will need to find out how to avoid this without triggering the assertion. - Kristian.