Marko Mäkelä <marko.makela@mariadb.com> writes:
I think that what you have written so far should be useful for an initial feasibility study, for measuring the performance. We do not need recovery to actually work when running the initial tests.
Yes, agreed. Thanks for your comments; it's good to know that I'm on the right track, and they should help me understand more of the details of InnoDB as I develop the patch further.
As we discussed a week ago, some more changes would be needed around writing the GTID. We might want to assign the GTID in mtr_t::do_write() under the protection of an exclusive log_sys.latch, to ensure that transactions are made durable in the GTID order.
Yes, I will try to get this done next, and that could already be a good basis for the initial benchmarking you suggest. We don't need an implementation of slave dump threads reading the binlog tablespaces to learn something about the performance on the master (as long as the data written is close to what it would be in a full implementation). So the questions below relate to a later step with a full implementation; they don't need to be finalized for initial testing.
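But first, to check that I understand the ordering requirement, here is a rough sketch of how I picture the GTID assignment under the exclusive latch. This is toy code only: std::mutex and a plain counter stand in for log_sys.latch and the server's GTID state, and nothing here is an actual InnoDB interface.

// Toy model only: std::mutex stands in for the exclusive log_sys.latch,
// and a plain counter stands in for the server's GTID state.  In InnoDB
// itself this would happen inside mtr_t::do_write() on the commit path.
#include <cstdint>
#include <mutex>
#include <string>
#include <vector>

struct Gtid { uint32_t domain_id; uint32_t server_id; uint64_t seq_no; };

std::mutex log_latch;                  // stand-in for exclusive log_sys.latch
uint64_t   next_seq_no = 1;            // stand-in for the GTID sequence state
std::vector<std::string> redo_buffer;  // stand-in for the redo log buffer

// Called once per committing transaction that carries binlog data.
Gtid commit_with_gtid(const std::string &binlog_event)
{
  std::lock_guard<std::mutex> guard(log_latch);

  // Assigning the GTID while holding the exclusive latch guarantees that
  // the GTID order matches the order in which the commits (and hence the
  // binlog pages) become durable.
  Gtid gtid{0 /* domain_id */, 1 /* server_id */, next_seq_no++};

  // Append the event together with its GTID under the same latch, so no
  // other commit can interleave between GTID assignment and the write.
  redo_buffer.push_back("GTID 0-1-" + std::to_string(gtid.seq_no) +
                        ": " + binlog_event);
  return gtid;
}

The point I want to capture is just that the GTID assignment and the append to the log happen under the same exclusive latch, so no other commit can slip in between the two.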
The InnoDB redo log identifies files by a tablespace ID. I think that we would want to reserve 2 tablespace IDs for the page-oriented binlog files. We do not need to write the binlog file names into the redo log; we can hard-code a pattern for them, and we can write the 1-bit tablespace ID into the first page of the file. When switching tablespace files, we would toggle this ID bit.
I would tweak the log checkpoint to ensure that all pages of the "previous" binlog tablespace are written back before we can advance the log checkpoint.
Conversely, we would then also need to wait for a log checkpoint before we can rotate to a new binlog tablespace, right? Because if more than two binlog tablespaces were actively written between log checkpoints, it would be ambiguous which tablespace a log record should be applied to. I think log checkpoints can be relatively infrequent, to improve transaction throughput and reduce I/O (at the cost of longer recovery), right? Then this would mean that each binlog tablespace would need to grow as needed and could not have a specified maximum size. But I'm not 100% sure that I understand all the details around recovery and log checkpoints here.

Allocating a set of "normal" space ids and reusing them (once the tablespace has been fully synced to disk and a new log checkpoint created) could remove this dependency and allow binlog tablespace rotation independent of the last log checkpoint. But it would be nice to avoid keeping track of allocated tablespace ids and just have two fixed ids for this; I like that approach if it can be made to work. In any case, I'm sure this issue can be solved in some way; for now I'm just trying to understand what the constraints are.
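To make sure we mean the same thing, here is a tiny sketch of the rotation constraint as I currently understand it. The space IDs and the checkpoint hook are made up for illustration; only the idea of two fixed IDs differing in one bit comes from your suggestion.

// Illustration only: the two space IDs and the checkpoint hook are made
// up, not existing InnoDB interfaces.  The point is that with just two
// fixed IDs, a rotation may only reuse an ID after a log checkpoint has
// been taken since the previous rotation.
#include <cassert>
#include <cstdint>

constexpr uint32_t BINLOG_SPACE_FIRST  = 0xFFFFFFF0;  // hypothetical reserved ID
constexpr uint32_t BINLOG_SPACE_SECOND = 0xFFFFFFF1;  // differs only in the low bit
static_assert((BINLOG_SPACE_FIRST ^ 1) == BINLOG_SPACE_SECOND,
              "the two reserved IDs differ only in the low bit");

struct BinlogRotation
{
  uint32_t active_space = BINLOG_SPACE_FIRST;
  bool     checkpoint_since_rotation = true;  // no earlier file to worry about yet

  // Switch to the other tablespace ID.  Without a checkpoint in between,
  // recovery could not tell whether a redo record for this ID belongs to
  // the old file or to the new one.
  uint32_t rotate()
  {
    assert(checkpoint_since_rotation &&
           "must wait for a log checkpoint before reusing the other ID");
    active_space ^= 1;                        // toggle the 1-bit tablespace ID
    checkpoint_since_rotation = false;
    return active_space;
  }

  // Called after a log checkpoint that (with the tweak you describe)
  // guarantees all pages of the previous binlog file are written back.
  void on_log_checkpoint() { checkpoint_since_rotation = true; }
};

If rotation is always gated on a checkpoint like this, redo records for a given ID can only belong to its most recent file, which I think is what recovery needs to stay unambiguous.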
For the final implementation, I would bypass as much as possible of the "middleware" that resides above the buffer pool. For normal InnoDB tablespaces, there are page headers and footers that waste quite a bit of space, and there is also management of allocated pages within the tablespace.
Ok, sounds good; we can get into the details of this later.
The minimum that we actually need is a 4-byte checksum at the end of each page, and possibly also the 8-byte log sequence number that is normally stored at FIL_PAGE_LSN. If you can guarantee that the binlog is always being written sequentially, we may do away with one or both, and remove the "apply the log unless the page is newer" logic, that is, unconditionally apply any log to the binlog file on recovery.
Yes, this is a good idea. Unconditionally applying redo log should work, I think.
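To make this concrete for myself, here is a toy sketch of what a stripped-down binlog page and the unconditional apply could look like. FIL_PAGE_LSN is the real InnoDB field name; everything else (the sizes, the trivial checksum standing in for CRC-32C, the helper names) is just for illustration.

// Toy model of a stripped-down binlog page: an 8-byte LSN at the start
// (corresponding to FIL_PAGE_LSN; keeping it is optional) and a 4-byte
// checksum at the very end, with everything in between as payload.
// The checksum here is a trivial stand-in for the real CRC-32C.
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

constexpr size_t PAGE_SIZE   = 16384;
constexpr size_t LSN_OFFSET  = 0;               // like FIL_PAGE_LSN
constexpr size_t DATA_OFFSET = 8;
constexpr size_t CSUM_OFFSET = PAGE_SIZE - 4;
constexpr size_t DATA_SIZE   = CSUM_OFFSET - DATA_OFFSET;

using Page = std::vector<uint8_t>;              // always PAGE_SIZE bytes long

uint32_t toy_checksum(const Page &page)         // stand-in for CRC-32C
{
  uint32_t sum = 0;
  for (size_t i = 0; i < CSUM_OFFSET; i++)
    sum = sum * 31 + page[i];
  return sum;
}

// Recovery path: apply a redo record unconditionally, i.e. without the
// usual "only if the record is newer than FIL_PAGE_LSN" check.  This is
// only safe because binlog pages are written strictly sequentially and
// are never overwritten with older data.
void apply_redo(Page &page, uint64_t lsn, size_t offset,
                const uint8_t *payload, size_t len)
{
  assert(page.size() == PAGE_SIZE && offset + len <= DATA_SIZE);
  std::memcpy(page.data() + DATA_OFFSET + offset, payload, len);
  std::memcpy(page.data() + LSN_OFFSET, &lsn, sizeof lsn);
  const uint32_t csum = toy_checksum(page);
  std::memcpy(page.data() + CSUM_OFFSET, &csum, sizeof csum);
}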
And maybe there is a way to pin the current page in the buffer pool so buf_page_get_gen() is not needed for every write?
There is, and it is called buffer-fixing. However, while a page is being loaded into the buffer pool, we must acquire a page latch so that we can ensure that the page has been fully loaded before we access it.
Ok, thanks for explaining these details. I'll need to get a better understanding of the use of the buffer pool when it's time to expand on this part. I think I will be able to guarantee that readers will never access ranges that are being written concurrently, and that writers will never write the same data concurrently. I am wondering if it would make sense to fix the page in the buffer pool already at fsp_page_create(), keep it fixed until it has been written (not necessarily fsync()'ed), and after that have the slave dump threads read the data from the file through the OS, so the page can be dropped from the buffer pool. That would reduce the load on the buffer pool, and especially avoid binlog pages being evicted from the pool and then later re-read. But these are probably details that can be handled later.
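Just to illustrate what I have in mind (all names below are invented; only the buffer-fix concept itself is from InnoDB): the active page would be fixed once when it is created and only unfixed after it has been written out, so the write path never needs another buf_page_get_gen() lookup.

// Toy illustration of keeping the current binlog page buffer-fixed from
// creation until it has been written out.  None of these names exist in
// InnoDB; they only sketch the lifetime I described above.
#include <atomic>
#include <cstddef>
#include <cstdint>

struct ToyPage
{
  std::atomic<int> buf_fix_count{0};  // > 0 means the page must stay in the pool
  bool written_to_file = false;
};

struct ActiveBinlogPage
{
  ToyPage *page = nullptr;

  // Fetch/create the page once (the analogue of fsp_page_create()) and
  // keep it fixed; later appends reuse the pointer without another
  // buffer-pool lookup.
  void open(ToyPage *p)
  {
    page = p;
    page->buf_fix_count.fetch_add(1);
  }

  void append(const uint8_t *data, size_t len)
  {
    // ... copy data into the fixed page and write redo for it ...
    (void) data;
    (void) len;
  }

  // Once the page has been written (not necessarily fsync()'ed), drop the
  // fix so the buffer pool may evict it; dump threads then read the data
  // back from the file through the OS.
  void close_after_write()
  {
    page->written_to_file = true;
    page->buf_fix_count.fetch_sub(1);
    page = nullptr;
  }
};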
Another assertion was fixed by doing mtr.set_named_space() before writing. Again, I'm not sure what this does exactly or if it's appropriate?
The purpose of that is to ensure that FILE_MODIFY records are written, to allow recovery to construct a mapping between tablespace IDs and file names. We do not use this for the InnoDB system tablespace or the undo log tablespaces, and we do not want it for the binlog tablespaces either.
Ok, thanks for the explanation. At some point, then, I will need to find out how to avoid this without triggering the assertion. - Kristian.