Here is my review of the MDEV-33487 / PR#3087 patch.
First, some general remarks.
This patch addresses a specific bottleneck when binlogging huge transactions
(eg. multi-gigabyte table updates logged in ROW mode).
The root cause is a fundamental limitation in the current binlog format that
requires all transactions to be binlogged as a single event group, using
consecutive bytes in a single binlog file. Thus, while a large transaction
is being written to the binlog during commit, all other transactions have to
wait before they can commit.
Eventually, we should move towards an enhanced binlog format that doesn't
have this limitation, so that transactions can be partially binlogged during
execution, avoiding this bottleneck. There are a number of other problems
caused by this limitation that are not addressed by this patch. Changing the
binlog format is a complex task not easily done, so this patch should be
seen as a temporary solution to one specific bottleneck, until the
underlying limitation can hopefully be removed.
Below is my detailed review of the code in the patch. This patch touches core
code of the binlog group commit mechanism, which is complex code with a
number of tricky issues to consider, and very important to keep correct and
maintainable. I have tried to give concrete suggestions on ways to write the
code, as well as some necessary extra tests. I think if MariaDB PLC wants
this patch to be merged, they need to assign a core developer to take
responsibility for it to ensure that any problems that turn up later will be
handled in a reasonable time frame (eg. prioritise time to work with the
author to debug and get fixes reviewed and merged).
> From: poempeng <poempeng(a)tencent.com>
>
> During the commit stage of a large transaction, it may cost too much
> time to write binlog cache and block all subsequent transactions for
> a long time. One possible solution of this issue is to write the binlog
> of the large transaction to a temporary file and then rename it to the
> next new binlog file.
It's not really the cost of writing (which is similar with and without the
patch), the problem is doing costly I/O under the global mutex LOCK_log that
blocks other commits. I suggest making that clear in the commit message,
for example:
Very large transactions (eg. gigabytes) can stall other transactions for a
long time because the data is written while holding LOCK_log, which blocks
other commits from binlogging. One possible solution to this issue is to
write a large transaction to its own binlog file outside of holding
LOCK_log, and then force a binlog rotation after obtaining LOCK_log,
simply renaming the new file into place.
> diff --git a/sql/log.cc b/sql/log.cc
> index 460cefea47b..9de1c6118cd 100644
> --- a/sql/log.cc
> +++ b/sql/log.cc
> @@ -74,6 +74,9 @@
> #include <utility> // pair
> #endif
>
> +#include <atomic>
> +#include <chrono>
These should not be needed after changes mentioned below.
> @@ -3727,7 +3730,10 @@ bool MYSQL_BIN_LOG::open(const char *log_name,
> enum cache_type io_cache_type_arg,
> ulong max_size_arg,
> bool null_created_arg,
> - bool need_mutex)
> + bool need_mutex,
> + const char *file_to_rename,
> + my_off_t file_size,
> + group_commit_entry *entry)
> + if (file_to_rename)
> + {
> + if (write_gtid_and_skip_event(file_size, entry))
> + goto err;
> + }
> +
I don't think you need to write these here, inside MYSQL_BIN_LOG::open().
Instead you can do it in the caller after open() returns, right? Then you
can avoid the extra arguments file_size and entry, and the conditional here.
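I.e., something like this in the caller, right after the (unchanged) call to
open() (just a sketch, error handling as in the surrounding code):

    /* Write the GTID event and the padding from the caller instead of
       inside open() itself. */
    if (write_gtid_and_skip_event(file_size, entry))
      return true;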
> @@ -6826,7 +6850,6 @@ MYSQL_BIN_LOG::write_gtid_event(THD *thd, bool standalone,
> thd->variables.server_id= global_system_variables.server_id;
> }
> #endif
> -
Avoid unrelated changes in the diff like this one (to simplify later merges
and history search).
> @@ -7849,11 +7872,18 @@ int Event_log::write_cache_raw(THD *thd, IO_CACHE *cache)
> - mysql_mutex_assert_owner(&LOCK_log);
> +
> + IO_CACHE *out_file= f;
> + if (likely(f == nullptr))
> + {
> + mysql_mutex_assert_owner(&LOCK_log);
> + out_file= get_log_file();
> + }
Don't add another conditional here. Instead just pass in the IO_CACHE to use
as a parameter in each call site, eg. make the `f` parameter mandatory, not
optional.
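I.e. make the signature something like:

    int Event_log::write_cache_raw(THD *thd, IO_CACHE *cache, IO_CACHE *out_file);

with the existing call sites passing get_log_file() explicitly (they already
hold LOCK_log), and the new code path passing the temporary file's IO_CACHE.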
> @@ -8581,7 +8612,23 @@ MYSQL_BIN_LOG::queue_for_group_commit(group_commit_entry *orig_entry)
> bool
> MYSQL_BIN_LOG::write_transaction_to_binlog_events(group_commit_entry *entry)
> {
> - int is_leader= queue_for_group_commit(entry);
> + int is_leader;
> + if (unlikely(entry->cache_mngr->trx_cache.get_byte_position() >=
> + non_blocking_binlog_threshold) ||
> + DBUG_IF("non_blocking_binlog_ignore_cache_size"))
> + {
> + if (!can_use_non_blocking_binlog(entry) ||
> + write_non_blocking_binlog(entry))
> + goto original_commit_path;
> + /* thread using non-blocking binlog is treated as a single group */
> + is_leader= 1;
> + entry->non_blocking_log= true;
> + entry->next= nullptr;
> + goto group_commit_leader;
> + }
> +original_commit_path:
> +
> + is_leader= queue_for_group_commit(entry);
Ok, so when we force a binlog rotation due to binlog_non_blocking_threshold,
we do this by itself, not as part of a group commit. I think that is
sensible.
The way this gets integrated in the existing code is not very nice, with all
these extra goto's and conditionals on entry->non_blocking_log. Can we
instead make the decision on binlog_non_blocking_threshold in the function
MYSQL_BIN_LOG::write_transaction_to_binlog()? And then call into
MYSQL_BIN_LOG::trx_group_commit_leader() directly, skipping the group commit
code with `is_leader` in write_transaction_to_binlog_events(). That should
greatly simplify the logic.
Probably the trx_group_commit_leader() needs to be refactored a bit, split
in two, where the first part is dealing with grabbing the group commit
queue, while only the second part is called in the
binlog_non_blocking_threshold case, to do the actual binlogging.
You might also consider using a separate function for the binlogging in the
binlog_non_blocking_threshold case, extracting pieces from
trx_group_commit_leader() into smaller shared functions; it depends on what
gives the cleanest and simplest code, and is best decided as part of the
refactoring of trx_group_commit_leader().
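As a rough sketch of what I mean (trx_group_commit_finish() is just a
placeholder name for the second half of the split):

    /* In MYSQL_BIN_LOG::write_transaction_to_binlog(), after setting up entry: */
    if (unlikely(entry.cache_mngr->trx_cache.get_byte_position() >=
                 binlog_non_blocking_threshold) &&
        can_use_non_blocking_binlog(&entry) &&
        !write_non_blocking_binlog(&entry))
    {
      /* This transaction forms a binlog group of its own; skip the queueing. */
      entry.next= NULL;
      return trx_group_commit_finish(&entry);
    }
    return write_transaction_to_binlog_events(&entry);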
Another thing related to this: How does the code ensure that commit
ordering (on the slave) is preserved? This is the wait_for_prior_commit()
mechanism, and is needed to ensure on the slave that the GTIDs are binlogged
in the right order. This is normally handled by queue_for_group_commit(),
which is skipped in the binlog_non_blocking_threshold case. It was not clear
to me from the code how this gets handled. This probably also needs a test
case to test that binlog order will be correct on the slave when multiple
transactions commit in parallel (eg. optimistic parallel replication), and
several of them use the binlog_non_blocking_threshold case.
> @@ -12459,6 +12519,7 @@ mysql_bin_log_commit_pos(THD *thd, ulonglong *out_pos, const char **out_file)
> }
> #endif /* INNODB_COMPATIBILITY_HOOKS */
>
> +mysql_rwlock_t binlog_checksum_rwlock;
>
> static void
> binlog_checksum_update(MYSQL_THD thd, struct st_mysql_sys_var *var,
> @@ -12468,6 +12529,7 @@ binlog_checksum_update(MYSQL_THD thd, struct st_mysql_sys_var *var,
> bool check_purge= false;
> ulong UNINIT_VAR(prev_binlog_id);
>
> + mysql_rwlock_wrlock(&binlog_checksum_rwlock);
> mysql_mutex_lock(mysql_bin_log.get_log_lock());
> if(mysql_bin_log.is_open())
> {
Ok, so you introduce a new lock to protect changing binlog_checksum_options.
This is otherwise protected by LOCK_log, but we don't want to be holding
that while writing the big transaction, obviously.
But did you check all places where binlog_checksum_options is accessed? Your
patch doesn't use the new binlog_checksum_rwlock anywhere except in
binlog_checksum_update(). There might be some places that currently use
LOCK_log to protect the access, and could instead use the new
binlog_checksum_rwlock (reducing contention on LOCK_log). Or did you check
already, and there was no such opportunity?
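For example, a place that currently reads binlog_checksum_options under
LOCK_log, but doesn't otherwise need LOCK_log, could instead do something
like:

    mysql_rwlock_rdlock(&binlog_checksum_rwlock);
    enum_binlog_checksum_alg alg=
        (enum_binlog_checksum_alg)binlog_checksum_options;
    mysql_rwlock_unlock(&binlog_checksum_rwlock);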
> +bool MYSQL_BIN_LOG::can_use_non_blocking_binlog(group_commit_entry *entry)
> +{
> + DBUG_ASSERT(entry->cache_mngr->trx_cache.get_byte_position() >=
> + non_blocking_binlog_threshold ||
> + DBUG_IF("non_blocking_binlog_ignore_cache_size"));
> + THD *thd= entry->thd;
> + if (unlikely(!is_open()) || encrypt_binlog ||
> + !entry->cache_mngr->stmt_cache.empty() ||
> + entry->cache_mngr->trx_cache.has_incident() || thd->slave_thread ||
> + thd->wait_for_commit_ptr)
Oh. So you disable this on the slave. Is that really necessary?
Ok, on the slave the following transactions are in many cases blocked from
committing before the large one anyway, so I see that this could be
acceptable. It does seem a very special-purpose restriction though. So if
this is kept disabled for slave threads, it needs to be very clearly
documented, and there should also be a comment here in the code explaining
why it is done.
> +std::string MYSQL_BIN_LOG::generate_random_file_name()
"generate_tmp_binlog_file_name" ?
> + if (unlikely(!binlog_dir_inited.load(std::memory_order_acquire)))
> + {
> + char dev[FN_REFLEN];
> + size_t dev_length;
> + mysql_mutex_lock(&LOCK_log);
> + if (!binlog_dir_inited.load(std::memory_order_relaxed) && name != nullptr)
This is not necessary. Just initialize what you need when the binlog is
initialized, just like other binlog initialization, avoiding the check for
binlog_dir_inited. See for example init_server_components() in sql/mysqld.cc
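For example something like this during normal binlog initialization
(sketch, assuming binlog_dir is a plain char buffer):

    size_t binlog_dir_len;
    dirname_part(binlog_dir, log_file_name, &binlog_dir_len);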
> + std::string temp_file_name;
Normally we don't use std::string in the server source, as it makes it
harder to control the memory allocation and/or to avoid dynamic memory
allocation. In this case we would normally use a fixed-size buffer of length
FN_REFLEN, or the class String from sql/sql_string.h with a fixed
stack-allocated initial buffer for the common case of a not-too-long name,
to avoid dynamic memory allocation.
> + temp_file_name.reserve(FN_REFLEN);
> + auto now_in_sys= std::chrono::system_clock::now().time_since_epoch();
> + auto now_in_ms=
> + std::chrono::duration_cast<std::chrono::milliseconds>(now_in_sys)
> + .count();
You don't need to add this millisecond timestamp, do you? The file name will
already be unique from temp_bin_counter?
> + auto count= temp_bin_counter.fetch_add(1);
This by default uses std::memory_order_seq_cst, which is overly restrictive.
You can just use class Atomic_counter from include/my_counter.h to get a
simple atomic counter with std::memory_order_relaxed semantics.
> + temp_file_name.append(binlog_dir);
> + temp_file_name.append("_temp_bin_");
> + temp_file_name.append(std::to_string(now_in_ms));
> + temp_file_name.push_back('_');
> + temp_file_name.append(std::to_string(count));
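Putting the points above together, the name generation could end up roughly
like this (sketch, not tested; temp_bin_counter being an
Atomic_counter<ulonglong> from include/my_counter.h):

    char tmp_name[FN_REFLEN];
    size_t len= my_snprintf(tmp_name, sizeof(tmp_name), "%s_temp_bin_%llu",
                            binlog_dir, (ulonglong)(temp_bin_counter++));
    if (len >= sizeof(tmp_name) - 1)
      return true;    /* Name too long, fall back to the normal commit path. */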
> +template <typename F= std::function<void()>> class Scoped_guard
> +{
> +public:
> + Scoped_guard(F f);
> + ~Scoped_guard();
> +
> +private:
> + F func;
> +};
Hm. Yes, this is nice, but it is not good if each new feature implements its
own specific class like this.
So either this should go somewhere shared where it can be used in the future
by other code (did you check that there isn't already something similar that
can be used, in the MariaDB source code or in the C++ standard library?),
preferably as a separate commit.
Or alternatively stay with the existing old-fashioned style of error
handling used elsewhere in sql/log.cc.
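For what it's worth, I believe include/scope.h already has a generic
scope-exit helper (make_scope_exit() or similar, please check); if so, usage
would be roughly:

    #include "scope.h"

    auto close_guard= make_scope_exit([&]() { mysql_file_close(fd, MYF(0)); });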
> + size_t skip_event_len= non_blocking_binlog_reserved_size - skip_event_start;
> + skip_event_buf=
> + (uchar *) my_malloc(PSI_INSTRUMENT_ME, skip_event_len, MYF(MY_ZEROFILL));
> + generate_skip_event(skip_event_buf, skip_event_len, skip_event_start,
> + entry->thd);
> + if (my_b_write(&log_file, skip_event_buf, skip_event_len) != 0)
> + return true;
Do we really need to generate this skip event? It seems quite hacky to have
a random dummy event in the middle of the binlog.
The size you need to reserve should be possible to pre-calculate:
- Format_description_log_event
- Binlog_checkpoint_event
- Gtid_list_log_event
- Gtid_log_event
The only issue should be the Gtid_list_log_event. This can grow in size, but
only if a new GTID domain id or server id gets added during the writing of
the temporary file. This is very unlikely to happen, and the code can check
for if this happens and in this case fall back to writing the data to the
binlog file normally under LOCK_log.
This way, we can avoid this strange skip event.
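Roughly, with all names here made up just to show the shape of it:

    /* When creating the temporary file: remember how much space was reserved
       for the Gtid_list event, based on the domain/server ids known then. */
    size_t reserved_gtid_list_len= current_gtid_list_event_size();

    /* At rotate time, under LOCK_log: if it no longer fits, give up on the
       rename and write the transaction the normal way instead. */
    if (current_gtid_list_event_size() > reserved_gtid_list_len)
      return fall_back_to_normal_commit(entry);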
> + if (temp_file_name.size() >= FN_REFLEN)
> + {
> + sql_print_warning("The name of temp file '%s' is too long!",
> + temp_file_name.c_str());
> + return true;
This will go in the server log for the user/DBA to wonder about. So add a
bit of context to the message so that it is clear that it is about the
temporary file used for binlog_non_blocking_threshold.
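For example something like:

    sql_print_warning("Could not create temporary binlog file '%s' for "
                      "binlog_non_blocking_threshold: file name is too long. "
                      "Falling back to the normal binlog write.",
                      temp_file_name.c_str());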
> + thd->push_internal_handler(&dummy_error_handler);
Hm. Why do you need to do this? If this is really needed, then it needs a
good comment explaining why.
But I suspect you might not need this at all. Is it to avoid assertions when
an error occurs in the file operations? If you don't want
mysql_file_open() and friends to set an error with my_error() (because you
want to handle the error yourself somehow), then just don't pass MY_WME as a
flag to those functions.
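Something like this (sketch, assuming the name ends up in a plain char
buffer and that key_file_binlog is an acceptable PSI key here):

    /* No MY_WME: a failure is handled by falling back to the normal commit
       path, no error needs to be sent to the client. */
    File fd= mysql_file_open(key_file_binlog, temp_file_name,
                             O_CREAT | O_WRONLY | O_BINARY, MYF(0));
    if (fd < 0)
      return true;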
> + if (unlikely(generate_new_name(new_name, name, 0)))
> + abort();
> + new_name_ptr= new_name;
> + /*
> + We log the whole file name for log file as the user may decide
> + to change base names at some point.
> + */
> + Rotate_log_event r(new_name + dirname_length(new_name), 0,
> + LOG_EVENT_OFFSET, 0);
> + if (unlikely(write_event(&r, alg)))
> + abort();
> + if (unlikely(flush_io_cache(&log_file) != 0))
> + abort();
No, don't crash the server like this.
Errors during writing binlog are indeed very tricky. But it shouldn't crash
the server. In the worst case, we may be left with a corrup binlog, but the
error should be handled and reported to the user/DBA, similar to how it is
done in other parts of the code.
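I.e. something along these lines instead of abort() (sketch only, the exact
reporting should match the other error paths in this file):

    if (unlikely(write_event(&r, alg)) ||
        unlikely(flush_io_cache(&log_file) != 0))
    {
      sql_print_error("Failed to write rotate event to binlog file '%s' "
                      "while installing a pre-written binlog file",
                      log_file_name);
      return true;    /* Let the caller handle and report it, don't abort(). */
    }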
How does this code work with respect to the binlog checkpoint mechanism?
The binlog checkpoint mechanism is a way to ensure that sufficient binlog
files are kept to ensure correct crash recovery, in an asynchronous way
that does not require syncing the InnoDB redo log at binlog rotation. It
uses the struct xid_count_per_binlog to keep track of which binlogs are
active, and uses this to write binlog checkpoint events as appropriate.
There are some tricky concurrency issues around this mechanism, so it is
important to consider carefully how this will be handled in different cases
by the code. For example, if the binlog checkpoint gets delayed and spans
multiple binlog files including one or more generated by this patch; or what
happens if RESET MASTER is run in parallel, etc.
I don't see any changes related to xid_count_per_binlog in the patch. Is
this because this is all handled correctly by MYSQL_BIN_LOG::open()? I would
like to see a comment somewhere explaining how this is working.
Also I think there need to be some test cases testing this, eg. delayed
binlog checkpointing when several binlog_non_blocking_threshold binlog
rotations are running in parallel.
> diff --git a/sql/log.h b/sql/log.h
> index fc5209d1922..af2b56da802 100644
> --- a/sql/log.h
> +++ b/sql/log.h
> @@ -19,6 +19,7 @@
> +private:
> + // reserved size for format_description_log_event, gtid_list_log_event ...
> + static const my_off_t non_blocking_binlog_reserved_size= 10 * 1024 * 1024;
Oh, 10 *megabyte* default for the dummy event? That's surely excessive? But
hopefully we can avoid the dummy event completely as suggested above.
> +static Sys_var_ulonglong Sys_var_non_blocking_binlog_threshold(
> + "non_blocking_binlog_threshold",
> + "If the binlog size of a transaction exceeds this value, we write it to a "
> + "new temporary file and rename it to the next binlog file.",
> + GLOBAL_VAR(non_blocking_binlog_threshold), CMD_LINE(OPT_ARG),
> + VALID_RANGE(256 * 1024 * 1024, ULONGLONG_MAX), DEFAULT(ULONGLONG_MAX),
> + BLOCK_SIZE(1));
Maybe name it so that it starts with "binlog_", eg. binlog_non_blocking_threshold?
I think the VALID_RANGE should be extended, there is no need to forbid users
to set it low for testing purposes etc. (even though I agree normally a
large value would be strongly recommended).
Suggestion for better description:
If the binlog size in bytes of a transaction exceeds this value, force a
binlog rotation and write the transaction to its own binlog file. This can
reduce stalls of other commits when binlogging very large transactions, at
the cost of creating extra binlog files smaller than the configured
--max-binlog-size.
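So all in all something like this (I also renamed the backing variable to
match, and just picked 0 as the lower bound; both are up to you):

    static Sys_var_ulonglong Sys_var_binlog_non_blocking_threshold(
        "binlog_non_blocking_threshold",
        "If the binlog size in bytes of a transaction exceeds this value, "
        "force a binlog rotation and write the transaction to its own binlog "
        "file. This can reduce stalls of other commits when binlogging very "
        "large transactions, at the cost of creating extra binlog files "
        "smaller than the configured --max-binlog-size.",
        GLOBAL_VAR(binlog_non_blocking_threshold), CMD_LINE(OPT_ARG),
        VALID_RANGE(0, ULONGLONG_MAX), DEFAULT(ULONGLONG_MAX), BLOCK_SIZE(1));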
- Kristian.