Hi Kristian, This is great, a years-old dream finally moving a little forward. On Mon, Feb 26, 2024 at 2:40 PM Kristian Nielsen <knielsen@knielsen-hq.org> wrote:
My general approach to writing is to use fsp_page_create() for the first write to a page, and then buf_page_get_gen() for subsequent writes. But maybe this should be refined for the actual implementation.
I'm thinking if perhaps fsp_page_create() does too much
For the final implementation, I would bypass as much of the "middleware" that resides above the buffer pool. For normal InnoDB tablespaces, there are page headers and footers that are wasting quite a bit of space, and there also is management of allocated pages within the tablespace. The minimum that we actually need is a 4-byte checksum at the end of each page, and possibly also the 8-byte log sequence number that is normally stored at FIL_PAGE_LSN. If you can guarantee that the binlog is always being written sequentially, we may do away with one or both, and remove the "apply the log unless the page is newer" logic, that is, unconditionally apply any log to the binlog file on recovery. The InnoDB redo log identifies files by a tablespace ID. I think that we would want to reserve 2 tablespace IDs for the page-oriented binlog files. We do not need to write the binlog file names into the redo log; we can hard-code a pattern for them, and we can write the 1-bit tablespace ID into the first page of the file. When switching tablespace files, we would toggle this ID bit.
And maybe there is a way to pin the current page in the buffer pool so buf_page_get_gen() is not needed for every write?
There is, and it is called buffer-fixing. However, a page is being loaded into the buffer pool, we must acquire a page latch so that we can ensure that the page has been fully loaded before we access it. This was in fact what caused a serious regression MDEV-31767: https://github.com/MariaDB/server/commit/b102872ad50cce5959ad95369740766d14e... A buffer-fix will prevent the page from being removed from the buffer pool. The page may be concurrently modified by other threads, or modifications may be concurrently written back to the file system. In some special cases such as some related to the undo log (see MDEV-32050), it is safe to read the contents of a page while only holding a buffer-fix. Basically, you must be sure that the range that you are reading cannot be concurrently overwritten by other threads. For an append-only binlog tablespace this property would be trivially guaranteed: an earlier written part of the binlog can be sent to a replica while more events are being appended to the binlog page.
I'm currently passing RW_SX_LATCH to buf_page_get_gen() (otherwise I got an assertion when writing). I'm not sure though how these latches work, or if binlog writing would need such latches; maybe it makes more sense to have a simple mutex protecting page access?
The rw-lock or Shared/Update/eXclusive lock on the block descriptor is the simplest that we have. The U or SX latch is the weakest available option for writes. It will allow read latches to be granted to the page concurrently.
Another assertion was fixed by doing mtr.set_named_space() before writing. Again, I'm not sure what this does exactly or if it's appropriate?
The purpose of that is to ensure that FILE_MODIFY records are being written to allow recovery to construct a mapping between tablespace ID and file names. We do not use this for the InnoDB system tablespace or the undo log tablespaces, and we do not want this for the binlog tablespaces either.
I tried in this patch to reserve 2 "special" tablespace ids for the binlog tablespaces. Idea would be to cycle between them, keeping at most the two last tablespaces active. But do the tablespace IDs appear in the redo log and used for recovery?
I would tweak the log checkpoint to ensure that all pages of the "previous" binlog tablespace must be written back before we can advance the log checkpoint. The tablespace ID (actually just 1 bit of it) would have to be written to the file header. Recovery would find the last binlog tablespace file by some filtering of opendir()/readdir()/closedir() and determine the tablespace ID by reading the last 2 files. I think that what you have written so far should be useful for an initial feasibility study, for measuring the performance. We do not need recovery to actually work when running the initial tests. I expect the difference to be drastic when using the only safe setting sync_binlog=1. Like we discussed a week ago, some more changes would be needed around writing the GTID. We might want to assign the GTID in mtr_t::do_write() under the protection of an exclusive log_sys.latch, to ensure that transactions are made durable in the GTID order. As this would allow us to be crash-safe even when using innodb_flush_log_at_trx_commit=0 and no fdatasync() except for write barriers around log checkpoints. The setting innodb_flush_log_at_trx_commit=1 would only be necessary for full durability, and the group commit that was improved in https://jira.mariadb.org/browse/MDEV-26789 would work out of the box. Marko -- Marko Mäkelä, Lead Developer InnoDB MariaDB plc