Storing binlog in InnoDB/engine
Hi Marko,

I looked a bit more at the idea that we have discussed a couple times, of storing the binlog in an InnoDB tablespace to avoid the need for two-phase commit between binlog and InnoDB and save one (or potentially both) fsync()s during a commit.

I managed to get some InnoDB code written that is able to create a tablespace and write to it, patch below or on github:

https://github.com/MariaDB/server/commits/knielsen_binlog_in_engine
https://github.com/MariaDB/server/commit/95958c3842ebf0f7e358d6c3f51b887bd99...

This is far from a complete patch, just an exercise for me on how InnoDB tablespaces and mini-transactions work in detail. But it turned out to be useful for me to get started and find a few questions to ask.

My general approach to writing is to use fsp_page_create() for the first write to a page, and then buf_page_get_gen() for subsequent writes. But maybe this should be refined for the actual implementation. I'm thinking if perhaps fsp_page_create() does too much, you mentioned earlier that some parts of the page header could be simplified/omitted. And maybe there is a way to pin the current page in the buffer pool so buf_page_get_gen() is not needed for every write?

I'm currently passing RW_SX_LATCH to buf_page_get_gen() (otherwise I got an assertion when writing). I'm not sure though how these latches work, or if binlog writing would need such latches; maybe it makes more sense to have a simple mutex protecting page access?

Another assertion was fixed by doing mtr.set_named_space() before writing. Again, I'm not sure what this does exactly or if it's appropriate?

I tried in this patch to reserve 2 "special" tablespace ids for the binlog tablespaces. Idea would be to cycle between them, keeping at most the two last tablespaces active. But do the tablespace IDs appear in the redo log and used for recovery? In that case, I assume that all binlog tablespaces written since the last InnoDB checkpoint will need a unique tablespace ID?
So maybe 2 is too few. I was thinking maybe the binlog could allocate new tablespace IDs as necessary, but re-use them after each InnoDB checkpoint. This would avoid wasting ids and eventually hitting the 2**32 limit.

 - Kristian.

commit 95958c3842ebf0f7e358d6c3f51b887bd9948845 (HEAD -> binlog_in_inno, origin/knielsen_binlog_in_engine)
Author: Kristian Nielsen <knielsen@knielsen-hq.org>
Date:   Sun Feb 25 17:41:50 2024 +0100

    Binlog in Engine: Very first sketch, able to create and write an InnoDB tablespace

    Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>

diff --git a/mysql-test/suite/binlog/t/binlog_in_engine.test b/mysql-test/suite/binlog/t/binlog_in_engine.test
new file mode 100644
index 00000000000..947139c9bcc
--- /dev/null
+++ b/mysql-test/suite/binlog/t/binlog_in_engine.test
@@ -0,0 +1,11 @@
+--source include/have_innodb.inc
+--source include/have_binlog_format_mixed.inc
+
+CREATE TABLE t1 (a INT PRIMARY KEY) ENGINE=InnoDB;
+INSERT INTO t1 VALUES (1);
+BEGIN;
+INSERT INTO t1 VALUES (2);
+INSERT INTO t1 VALUES (3);
+COMMIT;
+SELECT * FROM t1 ORDER BY a;
+DROP TABLE t1;
diff --git a/storage/innobase/buf/buf0flu.cc b/storage/innobase/buf/buf0flu.cc
index bd43429eb5d..09d8fdaff39 100644
--- a/storage/innobase/buf/buf0flu.cc
+++ b/storage/innobase/buf/buf0flu.cc
@@ -1626,7 +1626,8 @@ static ulint buf_flush_list(ulint max_n= ULINT_UNDEFINED,
 bool buf_flush_list_space(fil_space_t *space, ulint *n_flushed)
 {
   const auto space_id= space->id;
-  ut_ad(space_id <= SRV_SPACE_ID_UPPER_BOUND);
+  ut_ad(space_id <= SRV_SPACE_ID_UPPER_BOUND ||
+        space_id == SRV_SPACE_ID_BINLOG0 || space_id == SRV_SPACE_ID_BINLOG1);
 
   bool may_have_skipped= false;
   ulint max_n_flush= srv_io_capacity;
diff --git a/storage/innobase/fil/fil0fil.cc b/storage/innobase/fil/fil0fil.cc
index 0ce54df6574..e83225a4883 100644
--- a/storage/innobase/fil/fil0fil.cc
+++ b/storage/innobase/fil/fil0fil.cc
@@ -184,7 +184,7 @@ it is an absolute path. */
 const char* fil_path_to_mysql_datadir;
 
 /** Common InnoDB file extensions */
-const char* dot_ext[] = { "", ".ibd", ".isl", ".cfg" };
+const char* dot_ext[] = { "", ".ibd", ".isl", ".cfg", ".ibb" };
 
 /** Number of pending tablespace flushes */
 Atomic_counter<ulint> fil_n_pending_tablespace_flushes;
@@ -1044,6 +1044,9 @@ fil_space_t *fil_space_t::create(uint32_t id, uint32_t flags,
     if (UNIV_LIKELY(id <= fil_system.max_assigned_id)) {
       break;
     }
+    if (id == SRV_SPACE_ID_BINLOG0 || id == SRV_SPACE_ID_BINLOG1) {
+      break;
+    }
     if (UNIV_UNLIKELY(srv_operation == SRV_OPERATION_BACKUP)) {
       break;
     }
@@ -1603,9 +1606,10 @@ inline void mtr_t::log_file_op(mfile_type_t type, uint32_t space_id,
   ut_ad(!(byte(type) & 15));
 
   /* fil_name_parse() requires that there be at least one path
-  separator and that the file path end with ".ibd". */
+  separator and that the file path end with ".ibd" or "ibb". */
   ut_ad(strchr(path, '/'));
-  ut_ad(!strcmp(&path[strlen(path) - strlen(DOT_IBD)], DOT_IBD));
+  ut_ad(!strcmp(&path[strlen(path) - strlen(DOT_IBD)], DOT_IBD) ||
+        !strcmp(&path[strlen(path) - strlen(DOT_IBB)], DOT_IBB));
 
   m_modifications= true;
   if (!is_logged())
diff --git a/storage/innobase/fsp/fsp0fsp.cc b/storage/innobase/fsp/fsp0fsp.cc
index 787bda53895..f1bb42b3a15 100644
--- a/storage/innobase/fsp/fsp0fsp.cc
+++ b/storage/innobase/fsp/fsp0fsp.cc
@@ -3763,3 +3763,129 @@ void fsp_shrink_temp_space()
   mtr.commit();
   sql_print_information("InnoDB: Temporary tablespace truncated successfully");
 }
+
+
+
+fil_space_t* binlog_space;
+buf_block_t *binlog_cur_block;
+uint32_t binlog_cur_page_no;
+uint32_t binlog_cur_page_offset;
+
+/** Create a binlog tablespace file
+@param[in] name file name
+@return DB_SUCCESS or error code */
+dberr_t fsp_binlog_tablespace_create(const char* name)
+{
+  pfs_os_file_t fh;
+  bool ret;
+
+  uint32_t size= (1<<20) >> srv_page_size_shift /* ToDo --max-binlog-size */;
+  if(srv_read_only_mode)
+    return DB_ERROR;
+
+  os_file_create_subdirs_if_needed(name);
+
+  /* ToDo: Do we need here an mtr.log_file_op(FILE_CREATE) like in fil_ibd_create(()? */
+  fh = os_file_create(
+    innodb_data_file_key,
+    name,
+    OS_FILE_CREATE | OS_FILE_ON_ERROR_NO_EXIT,
+    OS_FILE_AIO, OS_DATA_FILE, srv_read_only_mode, &ret);
+
+  if (!ret) {
+    os_file_close(fh);
+    return DB_ERROR;
+  }
+
+  /* ToDo: Enryption? */
+  fil_encryption_t mode= FIL_ENCRYPTION_OFF;
+  fil_space_crypt_t* crypt_data= nullptr;
+
+  /* We created the binlog file and now write it full of zeros */
+  if (!os_file_set_size(name, fh,
+                        os_offset_t{size} << srv_page_size_shift)) {
+    ib::error() << "Unable to allocate " << name;
+    os_file_close(fh);
+    os_file_delete(innodb_data_file_key, name);
+    return DB_ERROR;
+  }
+
+  mysql_mutex_lock(&fil_system.mutex);
+  uint32_t space_id= SRV_SPACE_ID_BINLOG0;
+  if (!(binlog_space= fil_space_t::create(space_id,
+                                          ( FSP_FLAGS_FCRC32_MASK_MARKER |
+                                            FSP_FLAGS_FCRC32_PAGE_SSIZE()),
+                                          FIL_TYPE_TABLESPACE, crypt_data,
+                                          mode, true))) {
+    mysql_mutex_unlock(&fil_system.mutex);
+    return DB_ERROR;
+  }
+
+  fil_node_t* node = binlog_space->add(name, fh, size, false, true);
+  IF_WIN(node->find_metadata(), node->find_metadata(fh, true));
+  mysql_mutex_unlock(&fil_system.mutex);
+
+  binlog_cur_page_no= 0;
+  binlog_cur_page_offset= FIL_PAGE_DATA;
+  return DB_SUCCESS;
+}
+
+void fsp_binlog_write_start(uint32_t page_no,
+                            const uchar *data, uint32_t len, mtr_t *mtr)
+{
+  buf_block_t *block= fsp_page_create(binlog_space, page_no, mtr);
+  mtr->memcpy<mtr_t::MAYBE_NOP>(*block, FIL_PAGE_DATA + block->page.frame,
+                                data, len);
+  binlog_cur_block= block;
+}
+
+void fsp_binlog_write_offset(uint32_t page_no, uint32_t offset,
+                             const uchar *data, uint32_t len, mtr_t *mtr)
+{
+  dberr_t err;
+  /* ToDo: Is RW_SX_LATCH appropriate here? */
+  buf_block_t *block= buf_page_get_gen(page_id_t{binlog_space->id, page_no},
+                                       0, RW_SX_LATCH, binlog_cur_block,
+                                       BUF_GET, mtr, &err);
+  ut_a(err == DB_SUCCESS);
+  mtr->memcpy<mtr_t::MAYBE_NOP>(*block,
+                                offset + block->page.frame,
+                                data, len);
+}
+
+void fsp_binlog_append(const uchar *data, uint32_t len, mtr_t *mtr)
+{
+  ut_ad(binlog_cur_page_offset <= srv_page_size - FIL_PAGE_DATA_END);
+  uint32_t remain= ((uint32_t)srv_page_size - FIL_PAGE_DATA_END) -
+    binlog_cur_page_offset;
+  // ToDo: Some kind of mutex to protect binlog access.
+  while (len > 0) {
+    if (remain < 4) {
+      binlog_cur_page_offset= FIL_PAGE_DATA;
+      remain= ((uint32_t)srv_page_size - FIL_PAGE_DATA_END) -
+        binlog_cur_page_offset;
+      ++binlog_cur_page_no;
+    }
+    uint32_t this_len= std::min<uint32_t>(len, remain);
+    if (binlog_cur_page_offset == FIL_PAGE_DATA)
+      fsp_binlog_write_start(binlog_cur_page_no, data, this_len, mtr);
+    else
+      fsp_binlog_write_offset(binlog_cur_page_no, binlog_cur_page_offset,
+                              data, this_len, mtr);
+    len-= this_len;
+    data+= this_len;
+    binlog_cur_page_offset+= this_len;
+  }
+}
+
+
+void fsp_binlog_test(const uchar *data, uint32_t len)
+{
+  mtr_t mtr;
+  mtr.start();
+  if (!binlog_space)
+    fsp_binlog_tablespace_create("./binlog-000000.ibb");
+  mtr.set_named_space(binlog_space);
+  fsp_binlog_append(data, len, &mtr);
+  mtr.commit();
+}
diff --git a/storage/innobase/handler/ha_innodb.cc b/storage/innobase/handler/ha_innodb.cc
index 93127bb1c3a..df2bd07d2dc 100644
--- a/storage/innobase/handler/ha_innodb.cc
+++ b/storage/innobase/handler/ha_innodb.cc
@@ -4481,6 +4481,10 @@ innobase_commit(
   if (commit_trx
       || (!thd_test_options(thd, OPTION_NOT_AUTOCOMMIT | OPTION_BEGIN))) {
 
+    /* ToDo: This is just a random very initial test of writing
+    something into a binlog tablespace. */
+    if (!opt_bootstrap)
+      fsp_binlog_test((const uchar *)"Hulubulu!!?!", 12);
     /* Run the fast part of commit if we did not already. */
     if (!trx->active_commit_ordered) {
       innobase_commit_ordered_2(trx, thd);
diff --git a/storage/innobase/include/fil0fil.h b/storage/innobase/include/fil0fil.h
index f3660eff7c6..17b35f2f892 100644
--- a/storage/innobase/include/fil0fil.h
+++ b/storage/innobase/include/fil0fil.h
@@ -1129,10 +1129,12 @@ enum ib_extention {
   NO_EXT = 0,
   IBD = 1,
   ISL = 2,
-  CFG = 3
+  CFG = 3,
+  IBB = 4
 };
 extern const char* dot_ext[];
 #define DOT_IBD dot_ext[IBD]
+#define DOT_IBB dot_ext[IBB]
 #define DOT_ISL dot_ext[ISL]
 #define DOT_CFG dot_ext[CFG]
 
diff --git a/storage/innobase/include/fsp0fsp.h b/storage/innobase/include/fsp0fsp.h
index ddc45e53fe6..26a45518ba2 100644
--- a/storage/innobase/include/fsp0fsp.h
+++ b/storage/innobase/include/fsp0fsp.h
@@ -579,6 +579,8 @@ void fsp_system_tablespace_truncate();
 /** Truncate the temporary tablespace */
 void fsp_shrink_temp_space();
 
+extern void fsp_binlog_test(const uchar *data, uint32_t len);
+
 #ifndef UNIV_DEBUG
 # define fsp_init_file_page(space, block, mtr) fsp_init_file_page(block, mtr)
 #endif
diff --git a/storage/innobase/include/fsp0types.h b/storage/innobase/include/fsp0types.h
index 757ead55d03..e3d45796190 100644
--- a/storage/innobase/include/fsp0types.h
+++ b/storage/innobase/include/fsp0types.h
@@ -27,8 +27,12 @@ Created May 26, 2009 Vasil Dimov
 #pragma once
 #include "ut0byte.h"
 
-/** All persistent tablespaces have a smaller fil_space_t::id than this. */
+/** All persistent tablespaces (except binlog tablespaces) have a smaller
+fil_space_t::id than this. */
 constexpr uint32_t SRV_SPACE_ID_UPPER_BOUND= 0xFFFFFFF0U;
+/** Binlog tablespaces. */
+constexpr uint32_t SRV_SPACE_ID_BINLOG0 = SRV_SPACE_ID_UPPER_BOUND + 1;
+constexpr uint32_t SRV_SPACE_ID_BINLOG1 = SRV_SPACE_ID_UPPER_BOUND + 2;
 /** The fil_space_t::id of the innodb_temporary tablespace. */
 constexpr uint32_t SRV_TMP_SPACE_ID= 0xFFFFFFFEU;
 
diff --git a/storage/innobase/include/mtr0log.h b/storage/innobase/include/mtr0log.h
index e2419309764..86f6e3794f6 100644
--- a/storage/innobase/include/mtr0log.h
+++ b/storage/innobase/include/mtr0log.h
@@ -25,7 +25,7 @@ Mini-transaction log record encoding and decoding
 #include "mtr0mtr.h"
 
 /** The smallest invalid page identifier for persistent tablespaces */
-constexpr page_id_t end_page_id{SRV_SPACE_ID_UPPER_BOUND, 0};
+constexpr page_id_t end_page_id{SRV_SPACE_ID_BINLOG1, 0};
 
 /** The minimum 2-byte integer (0b10xxxxxx xxxxxxxx) */
 constexpr uint32_t MIN_2BYTE= 1 << 7;
diff --git a/storage/innobase/log/log0recv.cc b/storage/innobase/log/log0recv.cc
index ef31a4d00c1..310acb73071 100644
--- a/storage/innobase/log/log0recv.cc
+++ b/storage/innobase/log/log0recv.cc
@@ -2973,7 +2973,8 @@ recv_sys_t::parse_mtr_result recv_sys_t::parse(source &l, bool if_exists)
         if (is_predefined_tablespace(space_id))
           goto file_rec_error;
 
-        if (fnend - fn < 4 || memcmp(fnend - 4, DOT_IBD, 4))
+        if (fnend - fn < 4 ||
+            (memcmp(fnend - 4, DOT_IBD, 4) && memcmp(fnend - 4, DOT_IBB, 4)))
           goto file_rec_error;
 
         if (UNIV_UNLIKELY(!recv_needed_recovery && srv_read_only_mode))
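As a rough illustration of the page-splitting loop in fsp_binlog_append(), here is a standalone sketch that appends bytes to in-memory pages instead of going through mini-transactions. Everything in it is hypothetical scaffolding: BinlogWriter, PAGE_SIZE, PAGE_DATA, PAGE_DATA_END and MIN_REMAIN are made-up names, and the header/trailer sizes are assumed values standing in for FIL_PAGE_DATA and FIL_PAGE_DATA_END. Unlike the posted loop, which computes `remain` once before the loop and appears not to reduce it after a partial write, the sketch recomputes the remaining space on every iteration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical stand-ins for the InnoDB page layout (assumed values).
constexpr uint32_t PAGE_SIZE= 4096;
constexpr uint32_t PAGE_DATA= 38;     // assumed page header size
constexpr uint32_t PAGE_DATA_END= 8;  // assumed page trailer size
constexpr uint32_t PAYLOAD_END= PAGE_SIZE - PAGE_DATA_END;
constexpr uint32_t MIN_REMAIN= 4;     // same threshold as "remain < 4" in the patch

struct BinlogWriter
{
  std::vector<std::vector<uint8_t>> pages;  // in-memory stand-in for the file
  uint32_t cur_page_no= 0;
  uint32_t cur_offset= PAGE_DATA;

  // Append len bytes, moving to a fresh page when the current one is full.
  void append(const uint8_t *data, uint32_t len)
  {
    while (len > 0)
    {
      assert(cur_offset <= PAYLOAD_END);
      // Recompute the free space on every iteration.
      uint32_t remain= PAYLOAD_END - cur_offset;
      if (remain < MIN_REMAIN)
      {
        ++cur_page_no;
        cur_offset= PAGE_DATA;
        remain= PAYLOAD_END - cur_offset;
      }
      if (pages.size() <= cur_page_no)
        pages.resize(cur_page_no + 1, std::vector<uint8_t>(PAGE_SIZE, 0));
      const uint32_t this_len= std::min(len, remain);
      std::memcpy(pages[cur_page_no].data() + cur_offset, data, this_len);
      data+= this_len;
      len-= this_len;
      cur_offset+= this_len;
    }
  }
};
```

With these assumed sizes a page holds 4050 payload bytes, so appending 5000 bytes to an empty writer fills page 0 and leaves the write position at offset 988 of page 1.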
Hi Kristian,

This is great, a years-old dream finally moving a little forward.

On Mon, Feb 26, 2024 at 2:40 PM Kristian Nielsen <knielsen@knielsen-hq.org> wrote:
My general approach to writing is to use fsp_page_create() for the first write to a page, and then buf_page_get_gen() for subsequent writes. But maybe this should be refined for the actual implementation.
I'm thinking if perhaps fsp_page_create() does too much
For the final implementation, I would bypass as much as possible of the "middleware" that resides above the buffer pool. For normal InnoDB tablespaces, there are page headers and footers that are wasting quite a bit of space, and there also is management of allocated pages within the tablespace. The minimum that we actually need is a 4-byte checksum at the end of each page, and possibly also the 8-byte log sequence number that is normally stored at FIL_PAGE_LSN. If you can guarantee that the binlog is always being written sequentially, we may do away with one or both, and remove the "apply the log unless the page is newer" logic, that is, unconditionally apply any log to the binlog file on recovery.

The InnoDB redo log identifies files by a tablespace ID. I think that we would want to reserve 2 tablespace IDs for the page-oriented binlog files. We do not need to write the binlog file names into the redo log; we can hard-code a pattern for them, and we can write the 1-bit tablespace ID into the first page of the file. When switching tablespace files, we would toggle this ID bit.
And maybe there is a way to pin the current page in the buffer pool so buf_page_get_gen() is not needed for every write?
There is, and it is called buffer-fixing. However, while a page is being loaded into the buffer pool, we must acquire a page latch so that we can ensure that the page has been fully loaded before we access it. This was in fact what caused the serious regression MDEV-31767: https://github.com/MariaDB/server/commit/b102872ad50cce5959ad95369740766d14e...

A buffer-fix will prevent the page from being removed from the buffer pool. The page may be concurrently modified by other threads, or modifications may be concurrently written back to the file system. In some special cases, such as some related to the undo log (see MDEV-32050), it is safe to read the contents of a page while only holding a buffer-fix. Basically, you must be sure that the range that you are reading cannot be concurrently overwritten by other threads. For an append-only binlog tablespace this property would be trivially guaranteed: an earlier written part of the binlog can be sent to a replica while more events are being appended to the binlog page.
I'm currently passing RW_SX_LATCH to buf_page_get_gen() (otherwise I got an assertion when writing). I'm not sure though how these latches work, or if binlog writing would need such latches; maybe it makes more sense to have a simple mutex protecting page access?
The rw-lock or Shared/Update/eXclusive lock on the block descriptor is the simplest that we have. The U or SX latch is the weakest available option for writes. It will allow read latches to be granted to the page concurrently.
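The compatibility rules behind that statement can be tabulated in a few lines. This is only a minimal sketch of the general S/U/X scheme, not InnoDB's actual lock implementation; the enum `Latch` and the function `compatible()` are hypothetical names introduced here for illustration.

```cpp
#include <cassert>

// Hypothetical model of the three latch modes: Shared, Update (SX), eXclusive.
enum class Latch { S, U, X };

// Can a latch in mode `requested` be granted while another thread already
// holds the latch in mode `held`?
inline bool compatible(Latch held, Latch requested)
{
  // X excludes everything, in both directions.
  if (held == Latch::X || requested == Latch::X)
    return false;
  // Only one U (SX) holder at a time; U coexists with S readers.
  if (held == Latch::U && requested == Latch::U)
    return false;
  return true;
}
```

So a single binlog writer holding U (SX) would not block readers that take S, while a second writer asking for U or X would have to wait.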
Another assertion was fixed by doing mtr.set_named_space() before writing. Again, I'm not sure what this does exactly or if it's appropriate?
The purpose of that is to ensure that FILE_MODIFY records are being written to allow recovery to construct a mapping between tablespace ID and file names. We do not use this for the InnoDB system tablespace or the undo log tablespaces, and we do not want this for the binlog tablespaces either.
I tried in this patch to reserve 2 "special" tablespace ids for the binlog tablespaces. Idea would be to cycle between them, keeping at most the two last tablespaces active. But do the tablespace IDs appear in the redo log and used for recovery?
I would tweak the log checkpoint to ensure that all pages of the "previous" binlog tablespace must be written back before we can advance the log checkpoint. The tablespace ID (actually just 1 bit of it) would have to be written to the file header. Recovery would find the last binlog tablespace file by some filtering of opendir()/readdir()/closedir() and determine the tablespace ID by reading the last 2 files.

I think that what you have written so far should be useful for an initial feasibility study, for measuring the performance. We do not need recovery to actually work when running the initial tests. I expect the difference to be drastic when using the only safe setting sync_binlog=1.

Like we discussed a week ago, some more changes would be needed around writing the GTID. We might want to assign the GTID in mtr_t::do_write() under the protection of an exclusive log_sys.latch, to ensure that transactions are made durable in the GTID order. As this would allow us to be crash-safe even when using innodb_flush_log_at_trx_commit=0 and no fdatasync() except for write barriers around log checkpoints, the setting innodb_flush_log_at_trx_commit=1 would only be necessary for full durability, and the group commit that was improved in https://jira.mariadb.org/browse/MDEV-26789 would work out of the box.

Marko
--
Marko Mäkelä, Lead Developer InnoDB
MariaDB plc
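The directory-filtering step of recovery mentioned in this message could look roughly like the sketch below. It works on a list of directory entry names (as readdir() would yield them) rather than touching the file system, and it assumes the "binlog-NNNNNN.ibb" naming used in the test patch; the functions `parse_binlog_name()` and `last_two_binlog_files()` are hypothetical. Reading the 1-bit tablespace ID from each file header is not shown.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Parse a name of the assumed form "binlog-NNNNNN.ibb".
// Returns true and stores the file number on success.
bool parse_binlog_name(const std::string &name, uint64_t *file_no)
{
  static const std::string prefix= "binlog-", suffix= ".ibb";
  if (name.size() <= prefix.size() + suffix.size() ||
      name.compare(0, prefix.size(), prefix) != 0 ||
      name.compare(name.size() - suffix.size(), suffix.size(), suffix) != 0)
    return false;
  uint64_t no= 0;
  for (size_t i= prefix.size(); i < name.size() - suffix.size(); i++)
  {
    if (name[i] < '0' || name[i] > '9')
      return false;
    no= no * 10 + uint64_t(name[i] - '0');
  }
  *file_no= no;
  return true;
}

// Return the numbers of the (at most) two newest binlog files, oldest first.
std::vector<uint64_t>
last_two_binlog_files(const std::vector<std::string> &dir_entries)
{
  std::vector<uint64_t> nos;
  for (const std::string &entry : dir_entries)
  {
    uint64_t no;
    if (parse_binlog_name(entry, &no))
      nos.push_back(no);
  }
  std::sort(nos.begin(), nos.end());
  if (nos.size() > 2)
    nos.erase(nos.begin(), nos.end() - 2);
  return nos;
}
```

Recovery would then open only those two files and read the ID bit from each header to decide which reserved tablespace ID maps to which file.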
Marko Mäkelä <marko.makela@mariadb.com> writes:
I think that what you have written so far should be useful for an initial feasibility study, for measuring the performance. We do not need recovery to actually work when running the initial tests. I
Yes, agree. Thanks for your comments, it's good to know that I'm on the right track, and it should help me understand more of the details of InnoDB as I develop the patch further.
Like we discussed a week ago, some more changes would be needed around writing the GTID. We might want to assign the GTID in mtr_t::do_write() under the protection of an exclusive log_sys.latch, to ensure that transactions are made durable in the GTID order. As
Yes, I will try to get this done next, and that could already be a good basis for initial benchmarking as you suggest. We don't need to have an implementation of slave dump threads reading the binlog tablespaces to learn something about the performance on the master (as long as the data written is close to what it would be in a full implementation).

So the following questions are related to a later step with a full implementation; they don't need to be finalized for initial testing:
The InnoDB redo log identifies files by a tablespace ID. I think that we would want to reserve 2 tablespace IDs for the page-oriented binlog files. We do not need to write the binlog file names into the redo log; we can hard-code a pattern for them, and we can write the 1-bit tablespace ID into the first page of the file. When switching tablespace files, we would toggle this ID bit.
I would tweak the log checkpoint to ensure that all pages of the "previous" binlog tablespace must be written back before we can advance the log checkpoint. The tablespace ID (actually just 1 bit of
Conversely, we would then also need to wait for a log checkpoint before we can rotate to a new binlog tablespace, right? Because if more than two binlog tablespaces would be actively written between log checkpoints, it would be ambiguous which tablespace a log record should be applied to.

I think log checkpoints can be relatively infrequent, to improve transaction throughput and reduce I/O (but increasing the time for recovery), right? Then this would mean that each binlog tablespace would need to grow as needed and could not have a specified maximum size. But I am not 100% sure that I understand all the details around recovery and log checkpoints here.

Allocating a set of "normal" space ids and reusing them (when the tablespace has been fully synced to disk and a new log checkpoint created) could remove this dependency and allow binlog tablespace rotation independent of the last log checkpoint. But it would be nice to avoid the need to keep track of allocated tablespace ids and just have two fixed ids for this; I like that approach if it can be made to work.

In any case, I'm sure this issue can be solved in some way; for now I'm just trying to understand what the constraints are.
For the final implementation, I would bypass as much of the "middleware" that resides above the buffer pool. For normal InnoDB tablespaces, there are page headers and footers that are wasting quite a bit of space, and there also is management of allocated pages within the tablespace.
Ok, sounds good, we can take the details on this later.
The minimum that we actually need is a 4-byte checksum at the end of each page, and possibly also the 8-byte log sequence number that is normally stored at FIL_PAGE_LSN. If you can guarantee that the binlog is always being written sequentially, we may do away with one or both, and remove the "apply the log unless the page is newer" logic, that is, unconditionally apply any log to the binlog file on recovery.
Yes, this is a good idea. Unconditionally applying redo log should work, I think.
And maybe there is a way to pin the current page in the buffer pool so buf_page_get_gen() is not needed for every write?
There is, and it is called buffer-fixing. However, while a page is being loaded into the buffer pool, we must acquire a page latch so that we can ensure that the page has been fully loaded before we access it.
Ok, thanks for explaining these details. I'll need to get a better understanding of the use of the buffer pool when it's time to expand on this part. I think I will be able to guarantee that readers will never access ranges that would be written concurrently, and that writers will never write the same data concurrently.

I am wondering if it would make sense to fix the page in the buffer pool already at fsp_page_create(). Keep the page fixed until it has been written (not necessarily fsync()'ed); and after that have the slave dump threads read the data from the file through the OS, so the page can be dropped from the buffer pool. This would reduce the load on the buffer pool, and especially avoid binlog pages being purged from the pool and then later re-read. But these are probably details that can be handled later.
Another assertion was fixed by doing mtr.set_named_space() before writing. Again, I'm not sure what this does exactly or if it's appropriate?
The purpose of that is to ensure that FILE_MODIFY records are being written to allow recovery to construct a mapping between tablespace ID and file names. We do not use this for the InnoDB system tablespace or the undo log tablespaces, and we do not want this for the binlog tablespaces either.
Ok, thanks for the explanation. At some point I will find out how to avoid this without getting the assertion.

 - Kristian.
Hi Kristian,

On Mon, Feb 26, 2024 at 8:31 PM Kristian Nielsen <knielsen@knielsen-hq.org> wrote:
I would tweak the log checkpoint to ensure that all pages of the "previous" binlog tablespace must be written back before we can advance the log checkpoint. The tablespace ID (actually just 1 bit of
Conversely, we would then also need to wait for a log checkpoint before we can rotate to a new binlog tablespace, right?
That is not necessary. We only have to completely write back the changes from the buffer pool to the last-but-one binlog file, whose tablespace ID we are about to reuse for the new file. That can be done by invoking buf_flush_list_space(). It does not matter if there are pending changes to other tablespaces that will prevent the log checkpoint from being advanced. If we write the last modification LSN to the first page of the binlog tablespace, recovery can simply skip all log records for the binlog tablespace that are older than the LSN.
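The "skip older log records" rule at the end of this paragraph can be sketched as a simple filter. This is only an illustration under stated assumptions: `RedoRecord`, `BinlogFile`, `should_apply()` and `records_to_apply()` are hypothetical names, real redo records carry far more than a tablespace ID and an LSN, and whether a record with an LSN exactly equal to the header LSN is applied is a choice made here, not something the thread specifies.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical, heavily simplified redo log record.
struct RedoRecord
{
  uint32_t space_id;  // one of the two reserved binlog tablespace IDs
  uint64_t lsn;       // LSN at which the record was written
};

// Hypothetical view of what recovery reads from the first page of a file.
struct BinlogFile
{
  uint32_t space_id;    // reserved ID currently used by this file
  uint64_t header_lsn;  // LSN stored in the first page
};

// A record is applied only if it targets this file's ID and is not older
// than the LSN in the file header; older records belong to a previous file
// that used the same reserved ID.
inline bool should_apply(const RedoRecord &rec, const BinlogFile &file)
{
  return rec.space_id == file.space_id && rec.lsn >= file.header_lsn;
}

std::vector<RedoRecord> records_to_apply(const std::vector<RedoRecord> &log,
                                         const BinlogFile &file)
{
  std::vector<RedoRecord> out;
  for (const RedoRecord &rec : log)
    if (should_apply(rec, file))
      out.push_back(rec);
  return out;
}
```

The point of the filter is that reusing a reserved ID stays unambiguous without waiting for a log checkpoint: stale records for the old file are recognized by their LSN alone.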
I think log checkpoints can be relatively infrequent, to improve transaction throughput and reduce I/O (but increasing the time for recovery), right?
Checkpoints can actually occur once per second or even more frequently, depending on the workload and the log capacity. If there is lots of free space in the buffer pool and in the redo log, or if writes are infrequent, then checkpoints could occur less often.
I am wondering if it would make sense to fix the page in the buffer pool already at fsp_page_create().
Created pages are fixed in the buffer pool until the mtr_t::commit() that would release the page latch and the buffer-fix. Simply by invoking buf_page_t::io_fix() before mtr_t::commit() you can extend the buffer-fix, to reuse the page in a subsequent mini-transaction. For example, purge_sys_t::iterator::free_history_rseg() is making use of that:

    rseg_hdr->fix();
    //...
    mtr.commit();
    //...
    mtr.start();
    rseg_hdr->page.lock.x_lock();
    mtr.memo_push(rseg_hdr, MTR_MEMO_PAGE_X_FIX);
Keep the page fixed until it has been written (not necessarily fsync()'ed); and after that have the slave dump threads read the data from the file through the OS, so the page can be dropped from the buffer pool. To reduce the load on the buffer pool, and especially avoid binlog pages being purged from the pool and then later re-read.
If a page is buffer-fixed for an unbounded time, it could interfere with an attempt to shrink the buffer pool or to respond to a memory pressure event. Some interface for releasing those pages would be nice to have.

Marko
--
Marko Mäkelä, Lead Developer InnoDB
MariaDB plc
Marko Mäkelä <marko.makela@mariadb.com> writes:
Hi Kristian,
On Mon, Feb 26, 2024 at 8:31 PM Kristian Nielsen <knielsen@knielsen-hq.org> wrote:
I would tweak the log checkpoint to ensure that all pages of the "previous" binlog tablespace must be written back before we can advance the log checkpoint. The tablespace ID (actually just 1 bit of
Conversely, we would then also need to wait for a log checkpoint before we can rotate to a new binlog tablespace, right?
That is not necessary. We only have to completely write back the changes from the buffer pool to the last-but-one binlog file, whose tablespace ID we are about to reuse for the new file. That can be done
checkpoint from being advanced. If we write the last modification LSN to the first page of the binlog tablespace, recovery can simply skip all log records for the binlog tablespace that are older than the LSN.
Ah! Yes, I see, that seems a good solution. And users will want to have binlog data written to the file as quickly as possible anyway, to be visible with external tools (mariadb-binlog), so that fits perfectly. So this looks perfect, I like the approach of cycling between two reserved tablespace IDs.
Created pages are fixed in the buffer pool until the mtr_t::commit() that would release the page latch and the buffer-fix. Simply by invoking buf_page_t::io_fix() before mtr_t::commit() you can extend the buffer-fix, to reuse the page in a subsequent mini-transaction. For example,
Ok, thanks for the explanation, sounds useful.
If a page is buffer-fixed for an unbounded time, it could interfere with an attempt to shrink the buffer pool or to respond to a memory pressure event. Some interface for releasing those pages would be nice
Right. In this case, the idea would be to fix at most one page at a time, the last partial page that is currently being appended to.

 - Kristian.
Hi Kristian,

On Wed, Feb 28, 2024 at 2:27 PM Kristian Nielsen <knielsen@knielsen-hq.org> wrote:
If we write the last modification LSN to the first page of the binlog tablespace, recovery can simply skip all log records for the binlog tablespace that are older than the LSN.
Ah! Yes, I see, that seems a good solution. And users will want to have binlog data written to the file as quickly as possible anyway, to be visible with external tools (mariadb-binlog), so that fits perfectly.
I realized later that we might as well write the LSN at the time of the file creation. That would remove the need to make the first page of the binlog file dirty whenever anything in the file is being modified.

File creation may have to durably write the first page for this scheme to work. For normal .ibd file creation, this kind of a step was removed in https://jira.mariadb.org/browse/MDEV-24626.

Marko
--
Marko Mäkelä, Lead Developer InnoDB
MariaDB plc