Re: [MariaDB commits] [PATCH] MDEV-33487 Feature non-blocking binlog
Here is my review of the MDEV-33487 / PR#3087 patch.

First, some general remarks. This patch addresses a specific bottleneck when binlogging huge transactions (eg. multi-gigabyte table updates logged in ROW mode). The root cause is a fundamental limitation in the current binlog format that requires all transactions to be binlogged as a single event group, using consecutive bytes in a single binlog file. Thus, while the large transaction is being written to the binlog during commit, all other transaction commits will have to wait before they can commit.

Eventually, we should move towards an enhanced binlog format that doesn't have this limitation, so that transactions can be partially binlogged during execution, avoiding this bottleneck. There are a number of other problems caused by this limitation that are not addressed by this patch. Changing the binlog format is a complex task not easily done, so this patch should be seen as a temporary solution to one specific bottleneck, until if/when this underlying limitation can hopefully be removed.

Below is my detailed review of the code in the patch. This patch touches core code of the binlog group commit mechanism, which is complex code with a number of tricky issues to consider, and very important to keep correct and maintainable. I have tried to give concrete suggestions on ways to write the code, as well as some necessary extra tests.

I think if MariaDB PLC wants this patch to be merged, they need to assign a core developer to take responsibility for it, to ensure that any problems that turn up later will be handled in a reasonable time frame (eg. prioritise time to work with the author to debug and get fixes reviewed and merged).
From: poempeng <poempeng@tencent.com>
During the commit stage of a large transaction, it may cost too much time to write binlog cache and block all subsequent transactions for a long time. One possible solution of this issue is to write the binlog of the large transaction to a temporary file and then rename it to the next new binlog file.
It's not really the cost of writing (which is similar with and without the patch); the problem is doing costly I/O while holding the global mutex LOCK_log, which blocks other commits. I suggest making that clear in the commit message, for example:

  Very large transactions (eg. gigabytes) can stall other transactions for
  a long time because the data is written while holding LOCK_log, which
  blocks other commits from binlogging. One possible solution to this issue
  is to write a large transaction to its own binlog file outside of holding
  LOCK_log, and then force a binlog rotation after obtaining LOCK_log,
  simply renaming the new binlog file in place.
diff --git a/sql/log.cc b/sql/log.cc
index 460cefea47b..9de1c6118cd 100644
--- a/sql/log.cc
+++ b/sql/log.cc
@@ -74,6 +74,9 @@
 #include <utility>                    // pair
 #endif
+#include <atomic>
+#include <chrono>
These should not be needed after changes mentioned below.
@@ -3727,7 +3730,10 @@ bool MYSQL_BIN_LOG::open(const char *log_name,
                          enum cache_type io_cache_type_arg,
                          ulong max_size_arg,
                          bool null_created_arg,
-                         bool need_mutex)
+                         bool need_mutex,
+                         const char *file_to_rename,
+                         my_off_t file_size,
+                         group_commit_entry *entry)
+  if (file_to_rename)
+  {
+    if (write_gtid_and_skip_event(file_size, entry))
+      goto err;
+  }
+
I don't think you need to write these here, inside MYSQL_BIN_LOG::open(). Instead you can do it in the caller after open() returns, right? Then you can avoid the extra arguments file_size and entry, and the conditional here.
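Roughly like this in the caller (just a sketch, error handling elided; parameter names here are placeholders for open()'s original argument list):

  /* Caller, sketch: open() keeps its original argument list ... */
  if (mysql_bin_log.open(log_name, new_name, next_log_number, io_cache_type,
                         max_size, null_created, need_mutex))
    goto err;
  /* ... and the GTID and skip events are written here instead: */
  if (write_gtid_and_skip_event(file_size, entry))
    goto err;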
@@ -6826,7 +6850,6 @@ MYSQL_BIN_LOG::write_gtid_event(THD *thd, bool standalone,
     thd->variables.server_id= global_system_variables.server_id;
   }
 #endif
-
Avoid unrelated changes in the diff like this one (to simplify later merges and history search).
@@ -7849,11 +7872,18 @@ int Event_log::write_cache_raw(THD *thd, IO_CACHE *cache)
-  mysql_mutex_assert_owner(&LOCK_log);
+
+  IO_CACHE *out_file= f;
+  if (likely(f == nullptr))
+  {
+    mysql_mutex_assert_owner(&LOCK_log);
+    out_file= get_log_file();
+  }
Don't add another conditional here. Instead just pass in the IO_CACHE to use as a parameter in each call site, eg. make the `f` parameter mandatory, not optional.
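Ie. something like this, with each call site passing the cache explicitly (sketch; `temp_log_file` is a hypothetical name for the temporary file's IO_CACHE):

  int Event_log::write_cache_raw(THD *thd, IO_CACHE *cache, IO_CACHE *out_file);

  /* Existing call sites, which must hold LOCK_log: */
  write_cache_raw(thd, cache, get_log_file());

  /* The new non-blocking path, writing outside of LOCK_log: */
  write_cache_raw(thd, cache, &temp_log_file);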
@@ -8581,7 +8612,23 @@ MYSQL_BIN_LOG::queue_for_group_commit(group_commit_entry *orig_entry)
 bool
 MYSQL_BIN_LOG::write_transaction_to_binlog_events(group_commit_entry *entry)
 {
-  int is_leader= queue_for_group_commit(entry);
+  int is_leader;
+  if (unlikely(entry->cache_mngr->trx_cache.get_byte_position() >=
+               non_blocking_binlog_threshold) ||
+      DBUG_IF("non_blocking_binlog_ignore_cache_size"))
+  {
+    if (!can_use_non_blocking_binlog(entry) ||
+        write_non_blocking_binlog(entry))
+      goto original_commit_path;
+    /* thread using non-blocking binlog is treated as a single group */
+    is_leader= 1;
+    entry->non_blocking_log= true;
+    entry->next= nullptr;
+    goto group_commit_leader;
+  }
+original_commit_path:
+
+  is_leader= queue_for_group_commit(entry);
Ok, so when we force a binlog rotation due to binlog_non_blocking_threshold, we do this by itself, not as part of a group commit. I think that is sensible.

The way this gets integrated in the existing code is not very nice, with all these extra goto's and conditionals on entry->non_blocking_log. Can we instead make the decision on binlog_non_blocking_threshold in the function MYSQL_BIN_LOG::write_transaction_to_binlog()? And then call into MYSQL_BIN_LOG::trx_group_commit_leader() directly, skipping the group commit code with `is_leader` in write_transaction_to_binlog_events(). That should greatly simplify the logic. Probably trx_group_commit_leader() needs to be refactored a bit, split in two, where the first part deals with grabbing the group commit queue, while only the second part is called in the binlog_non_blocking_threshold case, to do the actual binlogging. You might also consider using a separate function for the binlogging in the binlog_non_blocking_threshold case, extracting pieces from trx_group_commit_leader() into smaller shared functions; it depends on what gives the cleanest and simplest code, and is best decided as part of the refactoring of trx_group_commit_leader().

Another related thing is this: How does the code ensure that commit ordering (on the slave) is preserved? This is the wait_for_prior_commit() mechanism, and it is needed to ensure on the slave that the GTIDs are binlogged in the right order. This is normally handled by queue_for_group_commit(), which is skipped in the binlog_non_blocking_threshold case. It was not clear to me from the code how this gets handled. This probably also needs a test case to test that binlog order will be correct on the slave when multiple transactions commit in parallel (eg. optimistic parallel replication), and several of them use the binlog_non_blocking_threshold case.
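To sketch the overall shape I have in mind (can_use_non_blocking_binlog() and write_non_blocking_binlog() are from the patch; trx_commit_one_group() is a hypothetical name for the split-off second half of trx_group_commit_leader()):

bool MYSQL_BIN_LOG::write_transaction_to_binlog(group_commit_entry *entry)
{
  if (unlikely(entry->cache_mngr->trx_cache.get_byte_position() >=
               non_blocking_binlog_threshold) &&
      can_use_non_blocking_binlog(entry))
  {
    /*
      Large transaction: binlog it to its own file outside of LOCK_log,
      then rotate. Commit ordering (wait_for_prior_commit()) must still be
      enforced here, since queue_for_group_commit() is skipped.
    */
    if (write_non_blocking_binlog(entry))
      return true;
    return trx_commit_one_group(entry);
  }
  return write_transaction_to_binlog_events(entry);
}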
@@ -12459,6 +12519,7 @@ mysql_bin_log_commit_pos(THD *thd, ulonglong *out_pos, const char **out_file)
 }
 #endif /* INNODB_COMPATIBILITY_HOOKS */
+mysql_rwlock_t binlog_checksum_rwlock;
 static void
 binlog_checksum_update(MYSQL_THD thd, struct st_mysql_sys_var *var,
@@ -12468,6 +12529,7 @@ binlog_checksum_update(MYSQL_THD thd, struct st_mysql_sys_var *var,
   bool check_purge= false;
   ulong UNINIT_VAR(prev_binlog_id);
+  mysql_rwlock_wrlock(&binlog_checksum_rwlock);
   mysql_mutex_lock(mysql_bin_log.get_log_lock());
   if(mysql_bin_log.is_open())
   {
Ok, so you introduce a new lock to protect changing binlog_checksum_options. This is otherwise protected by LOCK_log, but we obviously don't want to be holding that while writing the big transaction. But did you check all places where binlog_checksum_options is accessed? Your patch doesn't use the new binlog_checksum_rwlock anywhere except in binlog_checksum_update(). There might be some places that currently use LOCK_log to protect the access and could instead use the new binlog_checksum_rwlock (reducing contention on LOCK_log). Or did you check already, and there was no such opportunity?
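For what it's worth, I would expect the read side in the non-blocking path to look something like this (sketch):

  mysql_rwlock_rdlock(&binlog_checksum_rwlock);
  enum_binlog_checksum_alg alg=
      (enum_binlog_checksum_alg) binlog_checksum_options;
  /* ... write the transaction's events to the temporary file using alg;
     the rdlock keeps binlog_checksum_update() out meanwhile ... */
  mysql_rwlock_unlock(&binlog_checksum_rwlock);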
+bool MYSQL_BIN_LOG::can_use_non_blocking_binlog(group_commit_entry *entry)
+{
+  DBUG_ASSERT(entry->cache_mngr->trx_cache.get_byte_position() >=
+                  non_blocking_binlog_threshold ||
+              DBUG_IF("non_blocking_binlog_ignore_cache_size"));
+  THD *thd= entry->thd;
+  if (unlikely(!is_open()) || encrypt_binlog ||
+      !entry->cache_mngr->stmt_cache.empty() ||
+      entry->cache_mngr->trx_cache.has_incident() || thd->slave_thread ||
+      thd->wait_for_commit_ptr)
Oh. So you disable this on the slave. Is that really necessary? Ok, on the slave, the following transactions are in many cases blocked from committing before the large one anyway, so I see that this could be acceptable. It does seem a very special-purpose use case though. So if this is kept disabled for slave threads, it needs to be very clearly documented, and there should be a comment here in the code as well about why this is done.
+std::string MYSQL_BIN_LOG::generate_random_file_name()
"generate_tmp_binlog_file_name" ?
+  if (unlikely(!binlog_dir_inited.load(std::memory_order_acquire)))
+  {
+    char dev[FN_REFLEN];
+    size_t dev_length;
+    mysql_mutex_lock(&LOCK_log);
+    if (!binlog_dir_inited.load(std::memory_order_relaxed) && name != nullptr)
This is not necessary. Just initialize what you need when the binlog is initialized, just like other binlog initialization, avoiding the check for binlog_dir_inited. See for example init_server_components() in sql/mysqld.cc
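Ie. just something like this, done once at startup (sketch; init_binlog_dir() is a hypothetical helper that computes and caches the binlog directory name):

  /* In init_server_components() in sql/mysqld.cc, with the other
     binlog initialization: */
  if (opt_bin_log)
    mysql_bin_log.init_binlog_dir();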
+ std::string temp_file_name;
Normally we don't use std::string in the server source, as it makes it harder to control the memory allocation and/or avoid dynamic memory allocation. In this case we would normally use a fixed-size buffer of length FN_REFLEN, or the class String from sql/sql_string.h with a fixed stack-allocated initial buffer for the common case of a not-too-long name, to avoid a dynamic memory allocation.
+  temp_file_name.reserve(FN_REFLEN);
+  auto now_in_sys= std::chrono::system_clock::now().time_since_epoch();
+  auto now_in_ms=
+      std::chrono::duration_cast<std::chrono::milliseconds>(now_in_sys)
+          .count();
You don't need to add this millisecond timestamp, do you? The file name will already be unique from temp_bin_counter?
+ auto count= temp_bin_counter.fetch_add(1);
This by default uses std::memory_order_seq_cst, which is overly restrictive. You can just use class Atomic_counter from include/my_counter.h to get a simple atomic counter with std::memory_order_relaxed semantics.
+  temp_file_name.append(binlog_dir);
+  temp_file_name.append("_temp_bin_");
+  temp_file_name.append(std::to_string(now_in_ms));
+  temp_file_name.push_back('_');
+  temp_file_name.append(std::to_string(count));
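Putting the last three comments together, the name generation could be reduced to something like this (sketch; my_snprintf and Atomic_counter as suggested above, binlog_dir as in the patch, and generate_tmp_binlog_file_name as the suggested rename):

#include "my_counter.h"

static Atomic_counter<uint64> temp_bin_counter(0);

bool MYSQL_BIN_LOG::generate_tmp_binlog_file_name(char *buf)
{
  /* buf must have room for FN_REFLEN bytes; the counter alone makes the
     name unique, no timestamp needed. */
  size_t len= my_snprintf(buf, FN_REFLEN, "%s_temp_bin_%llu",
                          binlog_dir, (ulonglong) temp_bin_counter++);
  return len >= FN_REFLEN - 1;   /* true: possibly truncated, caller
                                    falls back to the normal path */
}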
+template <typename F= std::function<void()>> class Scoped_guard
+{
+public:
+  Scoped_guard(F f);
+  ~Scoped_guard();
+
+private:
+  F func;
+};
Hm. Yes, this is nice, but it is not good if each new feature implements its own specific class like this. So either this should go somewhere shared where it can be used in the future by other code (did you check that there isn't already something similar that can be used, in the MariaDB source code or in the C++ standard library?), preferably as a separate commit. Or alternatively stay with the existing old-fashioned style for error handling used elsewhere in sql/log.cc.
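If it is kept, a shared version could be as small as this (sketch; note also that defaulting F to std::function<void()> as in the patch forces a heap allocation for the captured lambda, while deducing F directly avoids that):

#include <utility>

template <typename F> class Scope_exit
{
public:
  explicit Scope_exit(F f_) : f(std::move(f_)), active(true) {}
  Scope_exit(Scope_exit &&other) : f(std::move(other.f)), active(other.active)
  { other.active= false; }
  ~Scope_exit() { if (active) f(); }
  void release() { active= false; }     /* cancel on the success path */
  Scope_exit &operator=(const Scope_exit &)= delete;
private:
  F f;
  bool active;
};

template <typename F> Scope_exit<F> make_scope_exit(F f)
{ return Scope_exit<F>(std::move(f)); }

Used as eg. auto guard= make_scope_exit([&]() { my_close(fd, MYF(0)); });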
+  size_t skip_event_len= non_blocking_binlog_reserved_size - skip_event_start;
+  skip_event_buf=
+      (uchar *) my_malloc(PSI_INSTRUMENT_ME, skip_event_len, MYF(MY_ZEROFILL));
+  generate_skip_event(skip_event_buf, skip_event_len, skip_event_start,
+                      entry->thd);
+  if (my_b_write(&log_file, skip_event_buf, skip_event_len) != 0)
+    return true;
Do we really need to generate this skip event? It seems quite hacky to have a random dummy event in the middle of the binlog. The size you need to reserve should be possible to pre-calculate:

 - Format_description_log_event
 - Binlog_checkpoint_event
 - Gtid_list_log_event
 - Gtid_log_event

The only issue should be the Gtid_list_log_event. This can grow in size, but only if a new GTID domain id or server id gets added during the writing of the temporary file. This is very unlikely to happen, and the code can check if this happens and in that case fall back to writing the data to the binlog file normally under LOCK_log. This way, we can avoid this strange skip event. See the sketch below the list for the rough calculation.
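To illustrate, the reservation could be computed up-front roughly like this (sketch; fd_event_size, gtid_event_size, gtid_count and log_name are illustrative, the real sizes come from the respective Log_event classes, plus 4 bytes per event when checksums are enabled):

  /* Each Gtid_list element is domain_id(4) + server_id(4) + seq_no(8). */
  my_off_t reserve=
      fd_event_size                                /* Format_description  */
    + LOG_EVENT_HEADER_LEN + 4 + strlen(log_name)  /* Binlog_checkpoint   */
    + LOG_EVENT_HEADER_LEN + 4 + 16 * gtid_count   /* Gtid_list           */
    + gtid_event_size;                             /* this trx's Gtid     */
  /* If the Gtid_list grows past the reservation while the temporary file
     is being written (new domain/server id), fall back to the normal
     LOCK_log path. */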
+  if (temp_file_name.size() >= FN_REFLEN)
+  {
+    sql_print_warning("The name of temp file '%s' is too long!",
+                      temp_file_name.c_str());
+    return true;
This will go in the server log for the user/DBA to wonder about. So add a bit of context to the message so that it is clear that it is about the temporary file used for binlog_non_blocking_threshold.
+ thd->push_internal_handler(&dummy_error_handler);
Hm. Why do you need to do this? If this is really needed, then it needs a good comment explaining why. But I suspect you might not need this at all. Is it to avoid assertions when an error is handled on the file operations? If you don't want mysql_file_open() and so on to set an error with my_error() (because you want to handle the error yourself somehow), then just don't pass MY_WME as a flag to the function.
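Ie. simply (sketch):

  /* No MY_WME: errors are handled here, not reported via my_error(),
     so no internal handler is needed to swallow them. */
  File fd= mysql_file_open(key_file_binlog, temp_file_name,
                           O_CREAT | O_WRONLY | O_BINARY, MYF(0));
  if (fd < 0)
    return true;                       /* fall back to the normal path */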
+  if (unlikely(generate_new_name(new_name, name, 0)))
+    abort();
+  new_name_ptr= new_name;
+  /*
+    We log the whole file name for log file as the user may decide
+    to change base names at some point.
+  */
+  Rotate_log_event r(new_name + dirname_length(new_name), 0,
+                     LOG_EVENT_OFFSET, 0);
+  if (unlikely(write_event(&r, alg)))
+    abort();
+  if (unlikely(flush_io_cache(&log_file) != 0))
+    abort();
No, don't crash the server like this. Errors during writing the binlog are indeed very tricky, but they shouldn't crash the server. In the worst case we may be left with a corrupt binlog, but the error should be handled and reported to the user/DBA, similar to how it is done in other parts of the code.

How does this code work with respect to the binlog checkpoint mechanism? The binlog checkpoint mechanism is a way to ensure that sufficient binlog files are kept to ensure correct crash recovery, in an asynchronous way that does not require syncing the InnoDB redo log at binlog rotation. It uses the struct xid_count_per_binlog to keep track of which binlogs are active, and uses this to write binlog checkpoint events as appropriate. There are some tricky concurrency issues around this mechanism, so it is important to consider carefully how it will be handled in different cases by the code. For example, if the binlog checkpoint gets delayed and spans multiple binlog files including one or more generated by this patch; or what happens if RESET MASTER is run in parallel, etc.

I don't see any changes related to xid_count_per_binlog in the patch. Is this because this is all handled correctly by MYSQL_BIN_LOG::open()? I would like to see a comment somewhere explaining how this is working. Also, I think there need to be some test cases testing this, eg. delayed binlog checkpointing when several binlog_non_blocking_threshold binlog rotations are running in parallel.
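For the error handling, something along these lines instead of abort() (sketch):

  if (unlikely(write_event(&r, alg)) ||
      unlikely(flush_io_cache(&log_file) != 0))
  {
    sql_print_error("Failed to write rotate event to the temporary binlog "
                    "file used for large transactions; falling back to "
                    "binlogging the transaction normally");
    /* Clean up the temporary file, then binlog under LOCK_log as usual,
       or report the write error like the existing group commit code. */
    return true;
  }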
diff --git a/sql/log.h b/sql/log.h
index fc5209d1922..af2b56da802 100644
--- a/sql/log.h
+++ b/sql/log.h
@@ -19,6 +19,7 @@
+private:
+  // reserved size for format_description_log_event, gtid_list_log_event ...
+  static const my_off_t non_blocking_binlog_reserved_size= 10 * 1024 * 1024;
Oh, 10 *megabyte* default for the dummy event? That's surely excessive? But hopefully we can avoid the dummy event completely as suggested above.
+static Sys_var_ulonglong Sys_var_non_blocking_binlog_threshold(
+    "non_blocking_binlog_threshold",
+    "If the binlog size of a transaction exceeds this value, we write it to a "
+    "new temporary file and rename it to the next binlog file.",
+    GLOBAL_VAR(non_blocking_binlog_threshold), CMD_LINE(OPT_ARG),
+    VALID_RANGE(256 * 1024 * 1024, ULONGLONG_MAX), DEFAULT(ULONGLONG_MAX),
+    BLOCK_SIZE(1));
Maybe name it so that it starts with "binlog_", eg. binlog_non_blocking_threshold? I think the VALID_RANGE should be extended; there is no need to forbid users from setting it low for testing purposes etc. (even though I agree a large value would normally be strongly recommended). Suggestion for a better description:

  If the binlog size in bytes of a transaction exceeds this value, force a
  binlog rotation and write the transaction to its own binlog file. This
  can reduce stalls of other commits when binlogging very large
  transactions, at the cost of creating extra binlog files smaller than
  the configured --max-binlog-size.
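Ie. something like this (the patch's own definition with the rename, extended range and new description applied; whether 0 or some small lower bound is the right minimum is a judgment call):

static Sys_var_ulonglong Sys_var_binlog_non_blocking_threshold(
    "binlog_non_blocking_threshold",
    "If the binlog size in bytes of a transaction exceeds this value, "
    "force a binlog rotation and write the transaction to its own binlog "
    "file. This can reduce stalls of other commits when binlogging very "
    "large transactions, at the cost of creating extra binlog files "
    "smaller than the configured --max-binlog-size.",
    GLOBAL_VAR(binlog_non_blocking_threshold), CMD_LINE(OPT_ARG),
    VALID_RANGE(0, ULONGLONG_MAX), DEFAULT(ULONGLONG_MAX), BLOCK_SIZE(1));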
 - Kristian.