Marko Mäkelä <marko.makela@mariadb.com> writes:
Finally, I got some more time to think about this. I’m trying to summarize from the InnoDB point of view what we discussed today.
Thanks Marko! This is an interesting development. If we can bypass the buffer pool and its associated machinery, it could simplify the logic a lot, and possibly also improve performance further.
* Based on the binlog file creation LSN (in the first block, say, 4096 bytes), InnoDB recovery will:
** ignore files that are older than the checkpoint LSN
** delete files that are newer than the last recovered LSN
** recover any other files (re-apply writes or trim the contents after the last write)
Agree. I was thinking that the binlog layer would need to be informed of up to which LSN the redo log has been durably written to disk (in case of --innodb-flush-log-at-trx-commit=0|2). But does "trim the contents" imply that the binlog is free to write pages to the file system even ahead of the redo log, because any data beyond the eventually recovered LSN will then be cleared during recovery? This could be quite neat and simplify things, and also reduce the need for synchronisation between the redo log and the binlog code.
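To check my understanding of the three rules above, here is a minimal sketch of the per-file decision. It takes the rules literally; `classify_binlog_file`, `RecoveryAction` and the handling of the boundary cases (creation LSN equal to the checkpoint LSN or to the last recovered LSN) are my own assumptions, not existing InnoDB code.

```cpp
#include <cassert>
#include <cstdint>

using lsn_t = std::uint64_t;  // stand-in for InnoDB's LSN type

enum class RecoveryAction { Ignore, Delete, Recover };

// Hypothetical helper: decide what recovery does with one binlog file,
// based only on the creation LSN stored in the file's first block.
RecoveryAction classify_binlog_file(lsn_t creation_lsn,
                                    lsn_t checkpoint_lsn,
                                    lsn_t last_recovered_lsn)
{
  // Created after the end of the recoverable redo log: the file never
  // "happened" from the point of view of recovery, so it is deleted.
  if (creation_lsn > last_recovered_lsn)
    return RecoveryAction::Delete;
  // Created before the checkpoint: ignored by recovery.
  if (creation_lsn < checkpoint_lsn)
    return RecoveryAction::Ignore;
  // Anything else: re-apply writes, or trim after the last write.
  return RecoveryAction::Recover;
}
```

With a checkpoint at LSN 100 and redo recovered up to LSN 200, a file created at LSN 50 would be ignored, one created at LSN 150 recovered, and one created at LSN 250 deleted.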
** not read anything from the files
Agree, this sounds good.
** invoke a pwrite() like binlog API that takes care of any encryption
Yes. I am thinking that recovery can simply pass the data into the binlog pwrite-like API, leaving exact details of how data will then be written into the file system to the binlog code.
** make sure that there are only WRITE, MEMSET, MEMMOVE records, in strictly sequential order
Agree with strict sequential order, I have tried very hard to preserve this property so far. I am not sure how WRITE and MEMMOVE differ, but the binlog code essentially only needs the ability to log a byte string appended to a page, like what is done in mtr_t::memcpy(const buf_block_t &b, ulint ofs, ulint len). Possibly also MEMSET, just to make the log record shorter when filling a byte string with identical bytes.
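To illustrate what I mean by "strictly append-only", here is a rough sketch of a page that only accepts WRITE and MEMSET style records starting exactly at its current end. `PageBuffer` and its members are hypothetical names for illustration only; the real record format and page handling would be up to the redo log code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical page that only accepts records in strictly sequential order.
struct PageBuffer {
  std::vector<unsigned char> bytes;
  std::size_t used = 0;  // end of the data appended so far

  explicit PageBuffer(std::size_t size) : bytes(size, 0) {}

  // WRITE-like record: append `len` bytes from `data` at offset `ofs`.
  bool append_write(std::size_t ofs, const void* data, std::size_t len) {
    if (ofs != used || len > bytes.size() - used)
      return false;  // not sequential, or does not fit in the page
    if (len != 0)
      std::memcpy(bytes.data() + ofs, data, len);
    used += len;
    return true;
  }

  // MEMSET-like record: append `len` copies of `value` at offset `ofs`.
  // The record itself stays short no matter how long the fill is.
  bool append_memset(std::size_t ofs, unsigned char value, std::size_t len) {
    if (ofs != used || len > bytes.size() - used)
      return false;
    if (len != 0)
      std::memset(bytes.data() + ofs, value, len);
    used += len;
    return true;
  }
};
```

A record whose offset is not the current end of the page is rejected, which is the property I would like both normal logging and recovery to be able to rely on.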
* InnoDB log checkpoint will be tweaked as follows:
** The log checkpoint must not "split" a binlog write.
** InnoDB must remember the start LSN of the last partial binlog block write.
I'm unsure what "binlog write" refers to here. Does it refer to the write of the page to the file system layer (eg. pwrite())? Or does it refer to the redo logging of data appended to a page, similar to the current mtr.start(); mtr.memcpy(); mtr.commit(); ?
** Checkpoint_LSN=min(last_start_LSN,buf_pool.get_oldest_modification())
** Before or after fil_flush_file_spaces(), the last binlog file must be durably written, last block padded.
Binlog can pad the last block easily (it is also done in case of FLUSH BINARY LOGS which truncates the currently active binlog). If we can allow the last_start_LSN to be the end of the last full block (before any LSN for writing to the current, partial block), we could avoid having to pad a block for every checkpoint. Then the binlog needs to ensure that the checkpoint LSN can always advance (eg. padding or at least fully rewriting the last block if it has not completed since the last checkpoint or some timeout).
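As a sketch of how I read the checkpoint rule together with my suggestion above: the first function is just the min() formula, and the second is one possible padding policy. `checkpoint_lsn_bound`, `must_pad_partial_block` and their parameters are hypothetical names; a timeout-based variant is not shown.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

using lsn_t = std::uint64_t;  // stand-in for InnoDB's LSN type

// The checkpoint may not pass the start of the last partial binlog block
// write, nor the oldest modification still pending in the buffer pool.
lsn_t checkpoint_lsn_bound(lsn_t last_start_lsn, lsn_t oldest_modification)
{
  return std::min(last_start_lsn, oldest_modification);
}

// Assumed policy: pad (or fully rewrite) the current block only when a
// partial block is pending and the binlog bound has not moved past the
// previous checkpoint, so that the checkpoint LSN can always advance.
bool must_pad_partial_block(lsn_t last_start_lsn,
                            lsn_t previous_checkpoint_lsn,
                            bool partial_block_pending)
{
  return partial_block_pending && last_start_lsn <= previous_checkpoint_lsn;
}
```

In the common case where blocks keep filling up between checkpoints, the second function returns false and no padding is needed.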
I will start to implement the log writing and recovery logic.
Great!

As I promised, here is a first draft of a possible API between the binlog and redo/recovery code along the lines discussed. It is mostly based on what I see the binlog code will need, and changes will probably be needed to suit the redo/recovery part that I am still not very familiar with.

API for binlog to append data to binlog tablespace files:

  binlog_record_begin()
    Start an atomically recovered logging group.
    Optionally part of an existing mtr (ie. InnoDB trx commit) for atomic recovery.

  binlog_record_memcpy(tablespace, page, in_page_offset, length, data)
    Redo log a byte string (maybe a memset() variant too).
    Always strictly append-only to a page.
    Optionally part of an existing mtr (ie. InnoDB trx commit) for atomic recovery.
    If the offset is 0, this implicitly does INIT_PAGE.

  binlog_record_end()
    End an atomically recovered logging group.
    Returns the corresponding LSN.

  binlog_tablespace_create(tablespace, length_in_pages)
    Create a tablespace.
    Register the new tablespace file for redo logging.

  binlog_tablespace_close(tablespace)
    Close a tablespace.
    Marks to redo logging that this tablespace file is now fully durably written to disk and will not receive any further updates.

  binlog_tablespace_truncate(tablespace, new_length_in_pages)
    Truncate a binlog tablespace (like mtr.trim_pages() and mtr.commit_shrink()).
    Can be independent, does not need to be part of a logging group with any other operations.

API for interacting with InnoDB checkpointing and recovery. This is based on what I see as minimal needs from the binlog point of view. Probably need something more here, eg. to supply the last_start_LSN you mentioned:

  binlog_write_up_to(lsn)
    Request the binlog to durably write ASAP all data needed up to the specified lsn.
    Could be called by InnoDB checkpointing code, similar to fil_flush_file_spaces() perhaps.

  binlog_report_lsn(lsn)
    Called by binlog code to inform redo logging that all binlog data prior to that LSN is now durably written to disk.
    Could also be a synchronous return from binlog_write_up_to() if that fits better.

  binlog_recover_data(tablespace_id, page_no, in_page_offset, length, buffer)
    During crash recovery, passes recovered data to the binlog layer.
    Recovered data is supplied in the same order that it was originally written to the redo log.
    All data following the last binlog_report_lsn() is guaranteed to be recovered. Data before that LSN may or may not be recovered, binlog code needs to handle that in either case.

  binlog_recover_tablespace_create(tablespace, length_in_pages)
  binlog_recover_tablespace_truncate(tablespace, new_length_in_pages)
    Recovers a tablespace creation or truncation event.

 - Kristian.
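PS. To make the recovery side of the draft a bit more concrete, here is a toy in-memory model of how the binlog layer could consume the three recover calls. It is only a sketch: the class `InMemoryBinlogRecovery`, the 4096-byte `kPageSize` and the integer types are assumptions for illustration, and a real implementation would write to files through the pwrite-like API instead. The point is that positional writes make re-delivery of data from before the last binlog_report_lsn() harmless.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <map>
#include <vector>

constexpr std::size_t kPageSize = 4096;  // assumed page size

// Hypothetical in-memory stand-in for the binlog side of recovery.
class InMemoryBinlogRecovery {
 public:
  void recover_tablespace_create(std::uint32_t tablespace_id,
                                 std::uint32_t length_in_pages) {
    files_[tablespace_id].assign(std::size_t{length_in_pages} * kPageSize, 0);
  }

  void recover_tablespace_truncate(std::uint32_t tablespace_id,
                                   std::uint32_t new_length_in_pages) {
    auto it = files_.find(tablespace_id);
    if (it != files_.end())
      it->second.resize(std::size_t{new_length_in_pages} * kPageSize, 0);
  }

  // Positional write: applying the same record twice gives the same file.
  bool recover_data(std::uint32_t tablespace_id, std::uint32_t page_no,
                    std::size_t in_page_offset, std::size_t length,
                    const unsigned char* buffer) {
    auto it = files_.find(tablespace_id);
    if (it == files_.end())
      return false;  // unknown tablespace
    if (in_page_offset > kPageSize || length > kPageSize - in_page_offset)
      return false;  // record would cross the page boundary
    if (page_no >= it->second.size() / kPageSize)
      return false;  // page beyond the end of the file
    if (length != 0)
      std::memcpy(it->second.data() + page_no * kPageSize + in_page_offset,
                  buffer, length);
    return true;
  }

  const std::vector<unsigned char>& contents(std::uint32_t tablespace_id) const {
    return files_.at(tablespace_id);
  }

 private:
  std::map<std::uint32_t, std::vector<unsigned char>> files_;
};
```

Replaying the same binlog_recover_data() record twice leaves the contents unchanged, so the binlog code does not need to know whether a given record was already durable before the crash.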