> From: Libing Song <anders.slb(a)alibaba-inc.com>
>
> Description
> ===========
> When a transaction commits, it copies the binlog events from
> binlog cache to binlog file. Very large transactions
> (eg. gigabytes) can stall other transactions for a long time
> because the data is copied while holding LOCK_log, which blocks
> other commits from binlogging.
Hi Libing Song,
Here is my full review of the patch for MDEV-32014 Rename binlog cache
temporary file to binlog file for large transaction.
First let me say thanks: reviewing this patch was a pleasant surprise.
The binlog code that you are integrating this new feature into is complex to
work with, but overall I think you have managed to do it in a pretty clean way.
That is great work, and I think a lot of effort must have gone into getting
it this way. I have a number of comments and requests for changes below, so
I wanted to give this overall good impression up-front.
Also thanks to Brandon for doing the first review. My apologies if I am
repeating something already discussed, but Github pull request reviews of
non-trivial patches are almost impossible to read, with reviews being split
in fragments, patches disappearing after force-pushes, etc.
The root cause for the problem being addressed by the patch is a fundamental
limitation in the current binlog format. It requires all transactions to be
binlogged as a single event group, using consecutive bytes in a single
binlog file. Thus, while a large transaction is being written to the binlog
during commit, all other transactions have to wait before they can commit.
Eventually, we should move towards an enhanced binlog format that doesn't
have this limitation, so that transactions can be partially binlogged during
execution, avoiding this bottleneck. There are a number of other problems
caused by this limitation that are not addressed by this patch. Changing the
binlog format is a complex task not easily done, so this patch should be
seen as a temporary solution to one specific bottleneck, until if/when this
underlying limitation can hopefully be removed.
First some overall comments:
As Brandon also remarked, the name "binlog_free_flush" does not describe the
feature very clearly. I would suggest to use the term "rotate" in the name,
as the concept of binlog rotation is well known in MariaDB replication, and
the patch effectively performs large commits by doing it as a binlog
rotation instead of a simple write to the existing binlog file.
So I suggest to name the option --binlog-commit-by-rotate-threshold. And I
suggest in the code (names and comments) to use the phrase
"commit_by_rotate" instead of "free_flush".
The patch enables the feature by default, with a threshold of 128MB.
We need to consider if that is appropriate, or if the feature should be
disabled by default.
Reasons for disabling by default:
- The feature causes some binlog files to be cut short of the configured
--max-binlog-size, which will be a surprise to users.
- The feature adds an overhead of 4096 pre-allocated bytes at the start of
every trx cache (though the overhead is minimal).
- A default of 128MB is somewhat arbitrary.
- Releasing this feature disabled by default is somewhat less risky wrt.
regressions for users that upgrade.
Reasons to enable by default:
- Most users will otherwise not know the feature exists, and will thus not
benefit from this work.
- It will get much less testing "in the wild".
So the decision will be a compromise between these two. If I had to make the
choice, I would default to this feature being disabled. But I'm open to
hearing opinions the other way.
The patch calls the trx_group_commit_leader() function. That is awkward, as
commit-by-rotate does not actually participate in group commit. But the
problem is with the existing function trx_group_commit_leader(), which does
too much in a single function.
So please do a small refactoring of trx_group_commit_leader() and keep only
the first part of that function that fetches the group commit queue. Move
the second part to a new function trx_group_commit_with_engines() that is
called by trx_group_commit_leader() passing the "queue" and "last_in_queue"
values. And assert in trx_group_commit_with_engines() that LOCK_log is held
on entry and comment that this function will release the lock.
Then you can from your Binlog_free_flush::commit() function call into
trx_group_commit_with_engines() directly, avoiding the need for adding a
true/false flag, and also avoiding that the trx_group_commit_leader()
function is sometimes called with LOCK_log held and sometimes not.
Same situation with MYSQL_BIN_LOG::write_transaction_to_binlog_events(),
this function is also doing too much and makes integrating your code
awkward.
Please refactor the MYSQL_BIN_LOG::write_transaction_to_binlog_events()
function something like this:
bool
MYSQL_BIN_LOG::write_transaction_to_binlog_events(group_commit_entry *entry)
{
  if (binlog_commit_by_rotate.should_commit_by_rotate(entry) &&
      binlog_commit_by_rotate.commit(entry))
  {
    /* Commit done by renaming the trx/stmt cache. */
  }
  else
  {
    if (write_transaction_with_group_commit(entry))
      return true; /* Error */
  }
  if (unlikely(entry->error))
  {
    write_transaction_handle_error(entry);
    return 1;
  }
  return 0;
}
Introducing two new functions:
bool write_transaction_with_group_commit(group_commit_entry *entry)
This function contains the code from write_transaction_to_binlog_events() up
to and including this:
if (likely(!entry->error))
return entry->thd->wait_for_prior_commit();
You don't need the wait_for_prior_commit() in the commit-by-rotate case, as
you already did this in Binlog_free_flush::commit().
You also don't need to handle the (!opt_optimize_thread_scheduling) case for
commit-by-rotate, this is old code that is being deprecated. Just add a
condition in should_free_flush() to skip if opt_optimize_thread_scheduling
is set.
Other new function:
void write_transaction_handle_error(group_commit_entry *entry)
This function contains the remainder of the
write_transaction_to_binlog_events() function starting from
"switch (entry->error) ...".
This way, the commit-by-rotate case is cleanly separated from the normal
group commit case and the code of write_transaction_to_binlog_events()
becomes much cleaner.
Now follows more detailed comments on specific parts of the patch:
> diff --git a/sql/log.cc b/sql/log.cc
> index 34f9ad745fc..3dc57b21c05 100644
> --- a/sql/log.cc
> +++ b/sql/log.cc
> @@ -163,6 +163,111 @@ static SHOW_VAR binlog_status_vars_detail[]=
> + void set_reserved_bytes(uint32 header_len)
> + {
> + // Gtid event length
> + header_len+= LOG_EVENT_HEADER_LEN + Gtid_log_event::max_data_length +
> + BINLOG_CHECKSUM_LEN;
> + header_len= header_len - (header_len % IO_SIZE) + IO_SIZE;
This has an off-by-one error: it rounds an exact multiple of IO_SIZE up to
the next multiple (eg. IO_SIZE becomes 2*IO_SIZE).
Use eg. header_len = (header_len + (IO_SIZE - 1)) & ~(IO_SIZE - 1)
You can use static_assert((IO_SIZE & (IO_SIZE - 1)) == 0, "power of two")
if you want to catch if anyone would ever change IO_SIZE to be not a power
of two.
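For concreteness, here is a standalone sketch of the two roundings (IO_SIZE is
hard-coded to its usual value of 4096 here just for illustration; the real
constant comes from the server headers):

```cpp
#include <cstdint>

// IO_SIZE assumed to be its usual value of 4096 for this sketch.
static const uint32_t IO_SIZE= 4096;
static_assert((IO_SIZE & (IO_SIZE - 1)) == 0, "IO_SIZE must be a power of two");

// The patch's rounding: an exact multiple of IO_SIZE is bumped to the next one.
static uint32_t round_up_patch(uint32_t len)
{
  return len - (len % IO_SIZE) + IO_SIZE;
}

// The suggested fix: round up, but leave exact multiples unchanged.
static uint32_t round_up_fixed(uint32_t len)
{
  return (len + (IO_SIZE - 1)) & ~(IO_SIZE - 1);
}
```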
> +private:
> + Binlog_free_flush &operator=(const Binlog_free_flush &);
> + Binlog_free_flush(const Binlog_free_flush &);
Add before this a comment "singleton object, prevent copying by mistake".
> + char m_cache_dir[FN_REFLEN];
This member appears to be unused?
> + /** The cache_data which will be renamed to binlog. */
> + binlog_cache_data *m_cache_data{nullptr};
Also appears unused.
Maybe add a comment that m_entry and m_replaced are protected by LOCK_log?
> +static Binlog_free_flush binlog_free_flush;
> +ulonglong opt_binlog_free_flush_threshold= 10 * 1024 * 1024;
This initializes the value to 10 MB, but the default value in sys_vars.cc is
128MB, so that seems inconsistent.
> @@ -4126,8 +4238,7 @@ bool MYSQL_BIN_LOG::open(const char *log_name,
> /* Notify the io thread that binlog is rotated to a new file */
> if (is_relay_log)
> signal_relay_log_update();
> - else
> - update_binlog_end_pos();
> +
Why do you remove update_binlog_end_pos() here? Don't the binlog dump
threads need this update?
> @@ -8703,7 +8821,16 @@ MYSQL_BIN_LOG::queue_for_group_commit(group_commit_entry *orig_entry)
> bool
> MYSQL_BIN_LOG::write_transaction_to_binlog_events(group_commit_entry *entry)
> {
> - int is_leader= queue_for_group_commit(entry);
> + int is_leader;
> +
> + if (binlog_free_flush.should_free_flush(entry) &&
> + binlog_free_flush.commit(entry))
> + {
> + is_leader= 1;
> + goto commit;
> + }
> +
> + is_leader= queue_for_group_commit(entry);
As described above, re-factor the write_transaction_to_binlog_events()
function to avoid this assignment of is_leader and the goto.
> @@ -8863,6 +8992,16 @@ MYSQL_BIN_LOG::trx_group_commit_leader(group_commit_entry *leader)
> uint64 commit_id;
> DBUG_ENTER("MYSQL_BIN_LOG::trx_group_commit_leader");
>
> + /*
> + When move a binlog cache to a binlog file, the transaction itself is
> + a group.
> + */
> + if (unlikely(is_free_flush))
> + {
> + last_in_queue= leader;
> + queue= leader;
> + }
> + else
As described above, re-factor trx_group_commit_leader() to avoid this check
in trx_group_commit_leader(), instead calling directly into the relevant
code now in trx_group_commit_with_engines().
> +inline bool Binlog_free_flush::should_free_flush(
> + const MYSQL_BIN_LOG::group_commit_entry *entry) const
> +{
> + /* Do not free flush if total_bytes smaller than limit size. */
> + if (likely(cache_data->get_byte_position() <=
> + opt_binlog_free_flush_threshold))
> + return false;
Put this check as the first check in the function. So the common case of a
small transaction does not need to run any of the other checks.
And as suggested above, put a check:
if (unlikely(!opt_optimize_thread_scheduling))
return false;
(This is not critical, but just to make things a bit easier; there is no
reason for you/us to worry about making commit-by-rotate work with disabled
optimize_thread_scheduling, this option isn't used any more. Otherwise you'd
need to test that your patch works correctly with optimize_thread_scheduling
disabled, which would just be a waste of time.)
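To make the suggested check ordering concrete, here is a standalone sketch
(all types, members, and option names here are simplified stand-ins for the
real server code, not the actual API):

```cpp
#include <cstdint>

// Simplified stand-ins for the real server state:
struct group_commit_entry_sketch { uint64_t cache_bytes; };

static uint64_t opt_commit_by_rotate_threshold= 128ULL * 1024 * 1024;
static bool opt_optimize_thread_scheduling= true;

static bool should_commit_by_rotate(const group_commit_entry_sketch *entry)
{
  /* Cheapest check first: the common small transaction returns immediately. */
  if (entry->cache_bytes <= opt_commit_by_rotate_threshold)
    return false;
  /* Skip the deprecated !opt_optimize_thread_scheduling commit path. */
  if (!opt_optimize_thread_scheduling)
    return false;
  /* ... the remaining checks from the patch would follow here ... */
  return true;
}
```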
> +bool Binlog_free_flush::commit(MYSQL_BIN_LOG::group_commit_entry *entry)
> +{
> + // It will be released by trx_group_commit_leader
> + mysql_mutex_lock(&mysql_bin_log.LOCK_log);
After suggested changes, comment should read "It will be released by
trx_group_commit_with_engines()."
> + mysql_bin_log.trx_group_commit_leader(entry, true /* is_free_flush */);
As discussed above, change this to call a new function
mysql_bin_log.trx_group_commit_with_engines(), which contains the second
part of trx_group_commit_leader() that you need to run. Then there is no
longer a need to pass a `true` flag here to skip the first part.
Also, before the call, this is the place to set entry->next to nullptr, to
convert the stand-alone entry into a (singleton) list (instead of
initializing it in MYSQL_BIN_LOG::write_transaction_to_binlog().)
> @@ -8301,6 +8418,7 @@ MYSQL_BIN_LOG::write_transaction_to_binlog(THD *thd,
> DBUG_RETURN(0);
> }
>
> + entry.next= nullptr;
> entry.thd= thd;
> entry.cache_mngr= cache_mngr;
> entry.error= 0;
Do you need to initialize entry->next here, and if so why?
At this point, the entry is not a list, it is a stand-alone struct that is
not yet linked into any list, the next pointer should be set appropriately
when/if the struct is later put into a list.
More important than the minor cost of the initialization, redundantly
initializing here hides bugs from tools like Valgrind/msan/asan if some code
would incorrectly read the next pointer.
> +bool Binlog_free_flush::replace_binlog_file()
> +{
> + size_t binlog_size= my_b_tell(&mysql_bin_log.log_file);
> + size_t required_size= binlog_size;
> + // space for Gtid_log_event
> + required_size+= LOG_EVENT_HEADER_LEN + Gtid_log_event::max_data_length +
> + BINLOG_CHECKSUM_LEN;
Just a remark: here you actually know the real required size of the GTID
event, so you don't need to conservatively estimate
Gtid_log_event::max_data_length. But I think doing it this way is fine, it
is simpler and in most cases we are rounding up to IO_SIZE anyway, so I'm
fine with keeping it this way.
> + sql_print_information("Could not rename binlog cache to binlog, "
> + "require %llu bytes but only %llu bytes reserved.",
> + required_size, m_cache_data->file_reserved_bytes());
This log message will perhaps appear a bit out-of-context in the error log.
Suggested clarification:
"Could not rename binlog cache to binlog (as requested by
--binlog-commit-by-rotate-threshold). Required ..."
> + // Set the cache file as binlog file.
> + mysql_file_close(mysql_bin_log.log_file.file, MYF(MY_WME));
Do not pass MY_WME here (instead pass MYF(0)). Since you are ignoring the
return value of the function, we don't want mysql_file_close() to call
my_error(), leaving an error code set even though the operation succeeds.
> +size_t Binlog_free_flush::get_gtid_event_pad_size()
> +{
> + size_t begin_pos= my_b_tell(&mysql_bin_log.log_file);
> + size_t pad_to_size=
> + m_cache_data->file_reserved_bytes() - begin_pos - LOG_EVENT_HEADER_LEN;
> +
> + if (binlog_checksum_options != BINLOG_CHECKSUM_ALG_OFF)
> + pad_to_size-= BINLOG_CHECKSUM_LEN;
Adjusting with BINLOG_CHECKSUM_LEN is to account for the 4 bytes of checksum
at the end of the GTID_EVENT, right?
But why do you subtract LOG_EVENT_HEADER_LEN when calculating pad_to_size?
(I mean, it is probably correct, otherwise nothing would work, I think. I
just did not understand it; maybe you can explain in a short comment. Or
maybe Gtid_log_event::write() could be modified to do the adjusting for
LOG_EVENT_HEADER_LEN and BINLOG_CHECKSUM_LEN itself.)
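If my reading is right, the accounting is that Gtid_log_event::write() itself
emits the common event header (and the checksum, when enabled), so
pad_to_size must cover only the event body for the event to end exactly at
file_reserved_bytes(). A standalone sketch of that arithmetic (constants are
the usual values, copied here just for illustration):

```cpp
#include <cstddef>

// Usual values of the server constants, copied here for illustration:
static const size_t LOG_EVENT_HEADER_LEN= 19;
static const size_t BINLOG_CHECKSUM_LEN= 4;

// The GTID event occupies [begin_pos, reserved_bytes) in the file:
// header + body (+ checksum). Only the body gets padded, so:
static size_t gtid_event_pad_size(size_t reserved_bytes, size_t begin_pos,
                                  bool checksum_enabled)
{
  size_t pad_to_size= reserved_bytes - begin_pos - LOG_EVENT_HEADER_LEN;
  if (checksum_enabled)
    pad_to_size-= BINLOG_CHECKSUM_LEN;
  return pad_to_size;
}
```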
> + // offset must be saved before replace_binlog_file(), it will update the pos
> + my_off_t offset= my_b_tell(&log_file);
Suggest to clarify comment to "..., it will update the file position".
> @@ -756,18 +759,20 @@ class MYSQL_BIN_LOG: public TC_LOG, private Event_log
> new_file() is locking. new_file_without_locking() does not acquire
> LOCK_log.
> */
> - int new_file_impl();
> + int new_file_impl(bool is_free_flush= false);
> void do_checkpoint_request(ulong binlog_id);
> - int write_transaction_or_stmt(group_commit_entry *entry, uint64 commit_id);
> + int write_transaction_or_stmt(group_commit_entry *entry, uint64 commit_id,
> + bool is_free_flush= false);
> int queue_for_group_commit(group_commit_entry *entry);
> bool write_transaction_to_binlog_events(group_commit_entry *entry);
> - void trx_group_commit_leader(group_commit_entry *leader);
> + void trx_group_commit_leader(group_commit_entry *leader,
> + bool is_free_flush= false);
> bool is_xidlist_idle_nolock();
> void update_gtid_index(uint32 offset, rpl_gtid gtid);
>
> public:
> void purge(bool all);
> - int new_file_without_locking();
> + int new_file_without_locking(bool is_free_flush= false);
> /*
> A list of struct xid_count_per_binlog is used to keep track of how many
> XIDs are in prepared, but not committed, state in each binlog. And how
> @@ -997,7 +1002,8 @@ class MYSQL_BIN_LOG: public TC_LOG, private Event_log
> enum cache_type io_cache_type_arg,
> ulong max_size,
> bool null_created,
> - bool need_mutex);
> + bool need_mutex,
> + bool is_free_flush = false);
> bool open_index_file(const char *index_file_name_arg,
> const char *log_name, bool need_mutex);
> /* Use this to start writing a new log file */
> @@ -1037,7 +1043,7 @@ class MYSQL_BIN_LOG: public TC_LOG, private Event_log
> bool is_active(const char* log_file_name);
> bool can_purge_log(const char *log_file_name, bool interactive);
> int update_log_index(LOG_INFO* linfo, bool need_update_threads);
> - int rotate(bool force_rotate, bool* check_purge);
> + int rotate(bool force_rotate, bool* check_purge, bool is_free_flush= false);
> void checkpoint_and_purge(ulong binlog_id);
> int rotate_and_purge(bool force_rotate, DYNAMIC_ARRAY* drop_gtid_domain= NULL);
> /**
Style preference: please add these new boolean parameters without a default
value, and instead change existing callers to pass false explicitly.
(And the parameter will be "commit_by_rotate" after the name change.)
> +
> +extern ulonglong opt_binlog_free_flush_threshold;
> +static Sys_var_ulonglong Sys_binlog_free_flush_threshold(
> + "binlog_free_flush_threshold",
> + "Try to rename the binlog cache temporary file of the commiting "
> + "transaction to a binlog file when its binlog cache size "
> + "is bigger than the value of this variable",
> + GLOBAL_VAR(opt_binlog_free_flush_threshold),
> + CMD_LINE(REQUIRED_ARG), VALID_RANGE(10 * 1024 * 1024, ULLONG_MAX),
> + DEFAULT(128 * 1024 * 1024), BLOCK_SIZE(1));
As described above:
- Suggested name: "binlog_commit_by_rotate_threshold".
- Suggested text: "For transactions that exceed this size in the binlog,
commit by a more efficient method that forces a binlog rotation and
re-uses the pre-existing temporary file as the next binlog file."
- Let me hear opinions on whether this should default to 128 MB or should
default to eg. ULONGLONG_MAX to disable the feature by default.
> diff --git a/mysql-test/main/tmp_space_usage.test b/mysql-test/main/tmp_space_usage.test
> index af7b295f343..1685dbbc450 100644
> --- a/mysql-test/main/tmp_space_usage.test
> +++ b/mysql-test/main/tmp_space_usage.test
> @@ -215,7 +215,8 @@ select count(distinct concat(seq,repeat('x',1000))) from seq_1_to_1000;
>
> set @save_max_tmp_total_space_usage=@@global.max_tmp_total_space_usage;
> set @@global.max_tmp_total_space_usage=64*1024*1024;
> -set @@max_tmp_session_space_usage=1179648;
> +# Binlog cache reserve 4096 bytes at the begin of the temporary file.
> +set @@max_tmp_session_space_usage=1179648+65536;
> select @@max_tmp_session_space_usage;
> set @save_aria_repair_threads=@@aria_repair_threads;
> set @@aria_repair_threads=2;
> @@ -224,6 +225,7 @@ set @@max_heap_table_size=16777216;
>
> CREATE TABLE t1 (a CHAR(255),b INT,INDEX (b));
> INSERT INTO t1 SELECT SEQ,SEQ FROM seq_1_to_100000;
> +set @@max_tmp_session_space_usage=1179648;
> --error 200
> SELECT * FROM t1 UNION SELECT * FROM t1;
> DROP TABLE t1;
> @@ -266,11 +268,15 @@ SELECT MIN(VARIABLE_VALUE) OVER (), NTILE(1) OVER (), MAX(VARIABLE_NAME) OVER ()
>
> connect(c1, localhost, root,,);
> set @@binlog_format=row;
> -CREATE OR REPLACE TABLE t1 (a DATETIME) ENGINE=MyISAM;
> +CREATE OR REPLACE TABLE t1 (a DATETIME) ENGINE=InnoDB;
> +# Use the transaction to keep binlog cache temporary file large enough
> +BEGIN;
> INSERT INTO t1 SELECT NOW() FROM seq_1_to_6000;
> +
> SET max_tmp_session_space_usage = 64*1024;
> --error 200
> SELECT * FROM information_schema.ALL_PLUGINS LIMIT 2;
> +ROLLBACK;
> drop table t1;
Can you explain why you need these changes to the test case? I was not able
to understand that from the patch.
With these changes, I approve of this patch / pull-request to merge into the
next MariaDB release.
- Kristian.