Re: [PATCH] MDEV-32014 Rename binlog cache temporary file to binlog file for large transaction
From: Libing Song <anders.slb@alibaba-inc.com>
Description
===========
When a transaction commits, it copies the binlog events from the binlog cache to the binlog file. Very large transactions (e.g. gigabytes) can stall other transactions for a long time because the data is copied while holding LOCK_log, which blocks other commits from binlogging.
Hi Libing Song,

Here is my full review of the patch for MDEV-32014 Rename binlog cache temporary file to binlog file for large transaction.

First let me say thanks; reviewing this patch was a pleasant surprise. The binlog code that you are integrating this new feature into is complex to work with, but overall I think you have managed to do it in a pretty clean way. That is great work, and I think a lot of effort must have gone into getting it this way. I have a number of comments and requests for changes below, so I wanted to give this overall good impression up-front.

Also thanks to Brandon for doing the first review. My apologies if I am repeating something already discussed, but Github pull request reviews of non-trivial patches are almost impossible to read, with reviews being split in fragments, patches disappearing after force-pushes, etc.

The root cause of the problem being addressed by the patch is a fundamental limitation in the current binlog format. It requires all transactions to be binlogged as a single event group, using consecutive bytes in a single binlog file. Thus, while a large transaction is being written to the binlog during commit, all other transaction commits will have to wait before they can commit. Eventually, we should move towards an enhanced binlog format that doesn't have this limitation, so that transactions can be partially binlogged during execution, avoiding this bottleneck. There are a number of other problems caused by this limitation that are not addressed by this patch. Changing the binlog format is a complex task not easily done, so this patch should be seen as a temporary solution to one specific bottleneck, until if/when this underlying limitation can hopefully be removed.

First some overall comments:

As Brandon also remarked, the name "binlog_free_flush" does not describe the feature very clearly.
I would suggest using the term "rotate" in the name, as the concept of binlog rotation is well known in MariaDB replication, and the patch effectively performs large commits by doing them as a binlog rotation instead of a simple write to the existing binlog file. So I suggest naming the option --binlog-commit-by-rotate-threshold, and using the phrase "commit_by_rotate" instead of "free_flush" in the code (names and comments).

The patch enables the feature by default, with a threshold of 128MB. We need to consider if that is appropriate, or if the feature should be disabled by default.

Reasons for disabling by default:

- The feature causes some binlog files to be cut short of the configured --max-binlog-size, which will be a surprise to users.
- The feature adds an overhead of 4096 pre-allocated bytes at the start of every trx cache (though the overhead is minimal).
- A default of 128MB is somewhat arbitrary.
- Releasing this feature disabled by default is somewhat less risky wrt. regressions for users that upgrade.

Reasons to enable by default:

- Most users will otherwise not know the feature exists, and will thus not benefit from this work.
- It will get much less testing "in the wild".

So the decision will be a compromise between these two. If I had to make the choice, I would default to this feature being disabled. But I'm open to hearing opinions the other way.

The patch calls the trx_group_commit_leader() function. That is awkward, as commit-by-rotate does not actually participate in group commit. But the problem is with the existing function trx_group_commit_leader(), which does too much in a single function. So please do a small refactoring of trx_group_commit_leader() and keep only the first part of that function that fetches the group commit queue. Move the second part to a new function trx_group_commit_with_engines() that is called by trx_group_commit_leader() passing the "queue" and "last_in_queue" values.
And assert in trx_group_commit_with_engines() that LOCK_log is held on entry, and comment that this function will release the lock. Then you can call trx_group_commit_with_engines() directly from your Binlog_free_flush::commit() function, avoiding the need for adding a true/false flag, and also avoiding that the trx_group_commit_leader() function is sometimes called with LOCK_log held and sometimes not.

Same situation with MYSQL_BIN_LOG::write_transaction_to_binlog_events(); this function is also doing too much and makes integrating your code awkward. Please refactor the MYSQL_BIN_LOG::write_transaction_to_binlog_events() function something like this:

  bool
  MYSQL_BIN_LOG::write_transaction_to_binlog_events(group_commit_entry *entry)
  {
    if (binlog_commit_by_rotate.should_commit_by_rotate(entry) &&
        binlog_commit_by_rotate.commit(entry))
    {
      /* Commit done by renaming the trx/stmt cache. */
    }
    else
    {
      if (write_transaction_with_group_commit(entry))
        return true;  /* Error */
    }

    if (unlikely(entry->error))
    {
      write_transaction_handle_error(entry);
      return 1;
    }
    return 0;
  }

introducing two new functions:

  bool write_transaction_with_group_commit(group_commit_entry *entry)

This function contains the code from write_transaction_to_binlog_events() up to and including this:

  if (likely(!entry->error))
    return entry->thd->wait_for_prior_commit();

You don't need the wait_for_prior_commit() in the commit-by-rotate case, as you already did this in Binlog_free_flush::commit(). You also don't need to handle the (!opt_optimize_thread_scheduling) case for commit-by-rotate; this is old code that is being deprecated. Just add a condition in should_free_flush() to skip if opt_optimize_thread_scheduling is set.

And the other new function:

  void write_transaction_handle_error(group_commit_entry *entry)

This function contains the remainder of the write_transaction_to_binlog_events() function, starting from "switch (entry->error) ...".
This way, the commit-by-rotate case is cleanly separated from the normal group commit case, and the code of write_transaction_to_binlog_events() becomes much cleaner.

Now follow more detailed comments on specific parts of the patch:
diff --git a/sql/log.cc b/sql/log.cc
index 34f9ad745fc..3dc57b21c05 100644
--- a/sql/log.cc
+++ b/sql/log.cc
@@ -163,6 +163,111 @@ static SHOW_VAR binlog_status_vars_detail[]=
+  void set_reserved_bytes(uint32 header_len)
+  {
+    // Gtid event length
+    header_len+= LOG_EVENT_HEADER_LEN + Gtid_log_event::max_data_length +
+      BINLOG_CHECKSUM_LEN;
+    header_len= header_len - (header_len % IO_SIZE) + IO_SIZE;
This has an off-by-one error: it will round IO_SIZE up to 2*IO_SIZE. Use e.g.

  header_len= (header_len + (IO_SIZE - 1)) & ~(IO_SIZE - 1)

You can use static_assert((IO_SIZE & (IO_SIZE - 1)) == 0, "power of two") if you want to catch if anyone would ever change IO_SIZE to not be a power of two.

+private:
+  Binlog_free_flush &operator=(const Binlog_free_flush &);
+  Binlog_free_flush(const Binlog_free_flush &);

Add before this a comment "singleton object, prevent copying by mistake".
+ char m_cache_dir[FN_REFLEN];
This member appears to be unused?
+  /** The cache_data which will be renamed to binlog. */
+  binlog_cache_data *m_cache_data{nullptr};
Also appears unused. Maybe add a comment that m_entry and m_replaced are protected by LOCK_log?
+static Binlog_free_flush binlog_free_flush;
+ulonglong opt_binlog_free_flush_threshold= 10 * 1024 * 1024;
This initializes the value to 10 MB, but the default value in sys_vars.cc is 128MB, so that seems inconsistent.
@@ -4126,8 +4238,7 @@ bool MYSQL_BIN_LOG::open(const char *log_name,
   /* Notify the io thread that binlog is rotated to a new file */
   if (is_relay_log)
     signal_relay_log_update();
-  else
-    update_binlog_end_pos();
+
Why do you remove update_binlog_end_pos() here? Don't the binlog dump threads need this update?
@@ -8703,7 +8821,16 @@ MYSQL_BIN_LOG::queue_for_group_commit(group_commit_entry *orig_entry)
 bool
 MYSQL_BIN_LOG::write_transaction_to_binlog_events(group_commit_entry *entry)
 {
-  int is_leader= queue_for_group_commit(entry);
+  int is_leader;
+
+  if (binlog_free_flush.should_free_flush(entry) &&
+      binlog_free_flush.commit(entry))
+  {
+    is_leader= 1;
+    goto commit;
+  }
+
+  is_leader= queue_for_group_commit(entry);
As described above, re-factor the write_transaction_to_binlog_events() function to avoid this assignment of is_leader and the goto.
@@ -8863,6 +8992,16 @@ MYSQL_BIN_LOG::trx_group_commit_leader(group_commit_entry *leader)
   uint64 commit_id;
   DBUG_ENTER("MYSQL_BIN_LOG::trx_group_commit_leader");
+  /*
+    When move a binlog cache to a binlog file, the transaction itself is
+    a group.
+  */
+  if (unlikely(is_free_flush))
+  {
+    last_in_queue= leader;
+    queue= leader;
+  }
+  else
As described above, re-factor trx_group_commit_leader() to avoid this check in trx_group_commit_leader(), instead calling directly into the relevant code now in trx_group_commit_with_engines().
+inline bool Binlog_free_flush::should_free_flush(
+    const MYSQL_BIN_LOG::group_commit_entry *entry) const
+{
+  /* Do not free flush if total_bytes smaller than limit size. */
+  if (likely(cache_data->get_byte_position() <=
+             opt_binlog_free_flush_threshold))
+    return false;
Put this check as the first check in the function, so the common case of a small transaction does not need to run any of the other checks.

And as suggested above, add a check:

  if (unlikely(!opt_optimize_thread_scheduling))
    return false;

(This is not critical, but just to make things a bit easier; there is no reason for you/us to worry about making commit-by-rotate work with disabled optimize_thread_scheduling, as this option isn't used any more. Otherwise you'd need to test that your patch works correctly with optimize_thread_scheduling disabled, which would just be a waste of time.)
+bool Binlog_free_flush::commit(MYSQL_BIN_LOG::group_commit_entry *entry)
+{
+  // It will be released by trx_group_commit_leader
+  mysql_mutex_lock(&mysql_bin_log.LOCK_log);
After the suggested changes, the comment should read "It will be released by trx_group_commit_with_engines()."
+ mysql_bin_log.trx_group_commit_leader(entry, true /* is_free_flush */);
As discussed above, change this to call a new function mysql_bin_log.trx_group_commit_with_engines(), which contains the second part of trx_group_commit_leader() that you need to run. Then there is no longer a need to pass a `true` flag here to skip the first part. Also, before the call, this is the place to set entry->next to nullptr, to convert the stand-alone entry into a (singleton) list (instead of initializing it in MYSQL_BIN_LOG::write_transaction_to_binlog().)
@@ -8301,6 +8418,7 @@ MYSQL_BIN_LOG::write_transaction_to_binlog(THD *thd,
     DBUG_RETURN(0);
   }
+  entry.next= nullptr;
   entry.thd= thd;
   entry.cache_mngr= cache_mngr;
   entry.error= 0;
Do you need to initialize entry->next here, and if so, why? At this point, the entry is not a list; it is a stand-alone struct that is not yet linked into any list. The next pointer should be set appropriately when/if the struct is later put into a list. More important than the minor cost of initializing, by redundantly initializing here we are hiding bugs (e.g. from Valgrind/msan/asan) if some code would incorrectly read the next pointer.
+bool Binlog_free_flush::replace_binlog_file()
+{
+  size_t binlog_size= my_b_tell(&mysql_bin_log.log_file);
+  size_t required_size= binlog_size;
+  // space for Gtid_log_event
+  required_size+= LOG_EVENT_HEADER_LEN + Gtid_log_event::max_data_length +
+    BINLOG_CHECKSUM_LEN;
Just a remark: here you actually know the real required size of the GTID event, so you don't need the conservative estimate Gtid_log_event::max_data_length. But I think doing it this way is fine; it is simpler, and in most cases we are rounding up to IO_SIZE anyway, so I'm fine with keeping it this way.
+    sql_print_information("Could not rename binlog cache to binlog, "
+                          "require %llu bytes but only %llu bytes reserved.",
+                          required_size, m_cache_data->file_reserved_bytes());
This log message will perhaps appear a bit out-of-context in the error log. Suggested clarification: "Could not rename binlog cache to binlog (as requested by --binlog-commit-by-rotate-threshold). Required ..."
+  // Set the cache file as binlog file.
+  mysql_file_close(mysql_bin_log.log_file.file, MYF(MY_WME));
Do not pass MY_WME here (pass MYF(0) instead). Since you are ignoring the return value of the function, we don't want mysql_file_close() to call my_error(), leaving an error code set even though the operation succeeds.
+size_t Binlog_free_flush::get_gtid_event_pad_size()
+{
+  size_t begin_pos= my_b_tell(&mysql_bin_log.log_file);
+  size_t pad_to_size=
+      m_cache_data->file_reserved_bytes() - begin_pos - LOG_EVENT_HEADER_LEN;
+
+  if (binlog_checksum_options != BINLOG_CHECKSUM_ALG_OFF)
+    pad_to_size-= BINLOG_CHECKSUM_LEN;
Adjusting with BINLOG_CHECKSUM_LEN is to account for the 4 bytes of checksum at the end of the GTID event, right? But why do you subtract LOG_EVENT_HEADER_LEN when calculating pad_to_size? (I mean, it is probably correct, otherwise nothing would work, I think. I just did not understand it; maybe you can explain in a short comment. Or maybe Gtid_log_event::write() could be modified to do the adjusting for LOG_EVENT_HEADER_LEN and BINLOG_CHECKSUM_LEN.)
+  // offset must be saved before replace_binlog_file(), it will update the pos
+  my_off_t offset= my_b_tell(&log_file);
Suggest to clarify comment to "..., it will update the file position".
@@ -756,18 +759,20 @@ class MYSQL_BIN_LOG: public TC_LOG, private Event_log
     new_file() is locking. new_file_without_locking() does not acquire
     LOCK_log.
   */
-  int new_file_impl();
+  int new_file_impl(bool is_free_flush= false);
   void do_checkpoint_request(ulong binlog_id);
-  int write_transaction_or_stmt(group_commit_entry *entry, uint64 commit_id);
+  int write_transaction_or_stmt(group_commit_entry *entry, uint64 commit_id,
+                                bool is_free_flush= false);
   int queue_for_group_commit(group_commit_entry *entry);
   bool write_transaction_to_binlog_events(group_commit_entry *entry);
-  void trx_group_commit_leader(group_commit_entry *leader);
+  void trx_group_commit_leader(group_commit_entry *leader,
+                               bool is_free_flush= false);
   bool is_xidlist_idle_nolock();
   void update_gtid_index(uint32 offset, rpl_gtid gtid);
 public:
   void purge(bool all);
-  int new_file_without_locking();
+  int new_file_without_locking(bool is_free_flush= false);
   /*
     A list of struct xid_count_per_binlog is used to keep track of how many
     XIDs are in prepared, but not committed, state in each binlog. And how
@@ -997,7 +1002,8 @@ class MYSQL_BIN_LOG: public TC_LOG, private Event_log
                     enum cache_type io_cache_type_arg,
                     ulong max_size,
                     bool null_created,
-                    bool need_mutex);
+                    bool need_mutex,
+                    bool is_free_flush = false);
   bool open_index_file(const char *index_file_name_arg, const char *log_name,
                        bool need_mutex);
   /* Use this to start writing a new log file */
@@ -1037,7 +1043,7 @@ class MYSQL_BIN_LOG: public TC_LOG, private Event_log
   bool is_active(const char* log_file_name);
   bool can_purge_log(const char *log_file_name, bool interactive);
   int update_log_index(LOG_INFO* linfo, bool need_update_threads);
-  int rotate(bool force_rotate, bool* check_purge);
+  int rotate(bool force_rotate, bool* check_purge, bool is_free_flush= false);
   void checkpoint_and_purge(ulong binlog_id);
   int rotate_and_purge(bool force_rotate, DYNAMIC_ARRAY* drop_gtid_domain= NULL);
   /**
Style preference: please add these new boolean parameters without a default value, and instead change existing callers to pass false explicitly. (And the parameter will be "commit_by_rotate" after the name change.)
+
+extern ulonglong opt_binlog_free_flush_threshold;
+static Sys_var_ulonglong Sys_binlog_free_flush_threshold(
+    "binlog_free_flush_threshold",
+    "Try to rename the binlog cache temporary file of the commiting "
+    "transaction to a binlog file when its binlog cache size "
+    "is bigger than the value of this variable",
+    GLOBAL_VAR(opt_binlog_free_flush_threshold),
+    CMD_LINE(REQUIRED_ARG), VALID_RANGE(10 * 1024 * 1024, ULLONG_MAX),
+    DEFAULT(128 * 1024 * 1024), BLOCK_SIZE(1));
As described above:

- Suggested name: "binlog_commit_by_rotate_threshold".
- Suggested text: "For transactions that exceed this size in the binlog, commit by a more efficient method that forces a binlog rotation and re-uses the pre-existing temporary file as the next binlog file."
- Let me hear opinions on whether this should default to 128 MB or should default to e.g. ULONGLONG_MAX to disable the feature by default.
diff --git a/mysql-test/main/tmp_space_usage.test b/mysql-test/main/tmp_space_usage.test
index af7b295f343..1685dbbc450 100644
--- a/mysql-test/main/tmp_space_usage.test
+++ b/mysql-test/main/tmp_space_usage.test
@@ -215,7 +215,8 @@ select count(distinct concat(seq,repeat('x',1000))) from seq_1_to_1000;
 set @save_max_tmp_total_space_usage=@@global.max_tmp_total_space_usage;
 set @@global.max_tmp_total_space_usage=64*1024*1024;
-set @@max_tmp_session_space_usage=1179648;
+# Binlog cache reserve 4096 bytes at the begin of the temporary file.
+set @@max_tmp_session_space_usage=1179648+65536;
 select @@max_tmp_session_space_usage;
 set @save_aria_repair_threads=@@aria_repair_threads;
 set @@aria_repair_threads=2;
@@ -224,6 +225,7 @@ set @@max_heap_table_size=16777216;
 CREATE TABLE t1 (a CHAR(255),b INT,INDEX (b));
 INSERT INTO t1 SELECT SEQ,SEQ FROM seq_1_to_100000;
+set @@max_tmp_session_space_usage=1179648;
 --error 200
 SELECT * FROM t1 UNION SELECT * FROM t1;
 DROP TABLE t1;
@@ -266,11 +268,15 @@ SELECT MIN(VARIABLE_VALUE) OVER (), NTILE(1) OVER (), MAX(VARIABLE_NAME) OVER ()
 connect(c1, localhost, root,,);
 set @@binlog_format=row;
-CREATE OR REPLACE TABLE t1 (a DATETIME) ENGINE=MyISAM;
+CREATE OR REPLACE TABLE t1 (a DATETIME) ENGINE=InnoDB;
+# Use the transaction to keep binlog cache temporary file large enough
+BEGIN;
 INSERT INTO t1 SELECT NOW() FROM seq_1_to_6000;
+
 SET max_tmp_session_space_usage = 64*1024;
 --error 200
 SELECT * FROM information_schema.ALL_PLUGINS LIMIT 2;
+ROLLBACK;
 drop table t1;
Can you explain why you need these changes to the test case? I was not able to understand that from the patch.

With these changes, I approve of this patch / pull-request to merge into the next MariaDB release.

 - Kristian.
Hi Kristian,

Thanks for the detailed review.

First of all, about the name --binlog-commit-by-rotate-threshold: it is better than free flush, but IMHO "rotate" does not describe the feature directly; the rotation is just an effect of the feature. So how about naming it --binlog-commit-by-rename-threshold? I already used 'rename' in comments and function names.
The patch enables the feature by default, with a threshold of 128MB. We need to consider if that is appropriate, or if the feature should be disabled by default.
Reasons for disabling by default:
- The feature causes some binlog files to be cut short of the configured --max-binlog-size, which will be a surprise to users.
- The feature adds an overhead of 4096 pre-allocated bytes at the start of every trx cache (though the overhead is minimal).
- A default of 128MB is somewhat arbitrary.
That is according to Alibaba's cloud SSD throughput, which is 350MB/s. So the copy may take 1/3s if the data is still in the page cache; otherwise it will take 2/3s. I hope a binlog commit should not take more than 1/2s. It could take less time on a fast local SSD, but a rename is still much faster than a copy.
- Releasing this feature disabled by default is somewhat less risky wrt. regressions for users that upgrade.
If there is any problem, users can disable it with a huge value. I originally added a separate variable to enable/disable the feature, but removed it because binlog_commit_by_rotate_threshold can serve the same purpose.
The patch calls the trx_group_commit_leader() function. That is awkward, as commit-by-rotate does not actually participate in group commit. But the problem is with the existing function trx_group_commit_leader(), which does too much in a single function.

This is updated in the latest patch; the rename is now called from trx_group_commit_leader(), so it can be in a commit queue together with other transactions.
@@ -4126,8 +4238,7 @@ bool MYSQL_BIN_LOG::open(const char *log_name,
   /* Notify the io thread that binlog is rotated to a new file */
   if (is_relay_log)
     signal_relay_log_update();
-  else
-    update_binlog_end_pos();
+

Why do you remove update_binlog_end_pos() here? Don't the binlog dump threads need this update?

reset_binlog_end_pos() already did the same thing.

...

+size_t Binlog_free_flush::get_gtid_event_pad_size()
+{
+  size_t begin_pos= my_b_tell(&mysql_bin_log.log_file);
+  size_t pad_to_size=
+      m_cache_data->file_reserved_bytes() - begin_pos - LOG_EVENT_HEADER_LEN;
+
+  if (binlog_checksum_options != BINLOG_CHECKSUM_ALG_OFF)
+    pad_to_size-= BINLOG_CHECKSUM_LEN;

Adjusting with BINLOG_CHECKSUM_LEN is to account for the 4 bytes of checksum at the end of the GTID_EVENT, right?
But why do you subtract LOG_EVENT_HEADER_LEN when calculating pad_to_size?

pad_size is the data size of the Gtid_event, so it doesn't include the size of the header. Maybe it is better to call it get_gtid_event_pad_data_size(). I also renamed pad_to_size to pad_data_to_size.
-CREATE OR REPLACE TABLE t1 (a DATETIME) ENGINE=MyISAM;
+CREATE OR REPLACE TABLE t1 (a DATETIME) ENGINE=InnoDB;
+# Use the transaction to keep binlog cache temporary file large enough
+BEGIN;
 INSERT INTO t1 SELECT NOW() FROM seq_1_to_6000;
+
 SET max_tmp_session_space_usage = 64*1024;
 --error 200
 SELECT * FROM information_schema.ALL_PLUGINS LIMIT 2;
+ROLLBACK;
 drop table t1;

Can you explain why you need these changes to the test case? I was not able to understand that from the patch.
With the reserved size, the cache exceeds CACHE_FILE_TRUNC_SIZE, and thus the binlog cache file is truncated after commit. So the transaction is kept open to keep the binlog cache temporary file large enough.
"slb.songlibing--- via developers" <developers@lists.mariadb.org> writes:
Hi Kristian, Thanks for the detailed review.
First of all, about the name --binlog-commit-by-rotate-threshold: it is better than free flush, but IMHO "rotate" does not describe the feature directly; the rotation is just an effect of the feature. So how about naming it --binlog-commit-by-rename-threshold? I already used 'rename' in comments and function names.
I would go along with --binlog-fast-rotate-threshold. "Rename" or "copy" in the name hardly makes sense for a plain user, and "fast" sounds to me like something exceptional to

  --max-binlog-size=#
      Binary log will be rotated automatically when the size exceeds this
      value.

which clearly needs a mention (in sql/sys_vars.cc) of the exception the current work introduces.
Hi Libing Song, Thanks for the explanations, all clear now. "slb.songlibing--- via developers" <developers@lists.mariadb.org> writes:
The patch enables the feature by default, with a threshold of 128MB. We need to consider if that is appropriate, or if the feature should be disabled by default.
That is according to Alibaba's cloud SSD throughput, which is 350MB/s. So the copy may take 1/3s if the data is still in the page cache; otherwise it will take 2/3s. I hope a binlog commit should not take more than 1/2s. It could take less time on a fast local SSD, but a rename is still much faster than a copy.
Ah, thanks for the explanation. It sounds like a good rule of thumb, set it to make commits not take more than approximately 0.5 seconds.
If there is any problem, users can disable it with a huge value. I
Ok, I'm fine with this for now. The 4kB will always be reserved at the start of the trx cache, if I understand correctly, but the commit by rotate/rename is configurable, as you say.
Why do you remove update_binlog_end_pos() here? Don't the binlog dump threads need this update?
reset_binlog_end_pos() already did the same thing.
Aha, I see, thanks for the explanation.
But why do you subtract LOG_EVENT_HEADER_LEN when calculating pad_to_size?
pad_size is data size of Gtid_event, so it doesn't include the size of header. Maybe it is
Aha, got it, the length argument to Log_event::write_header() is exclusive the LOG_EVENT_HEADER_LEN bytes, thanks for the explanation.
-CREATE OR REPLACE TABLE t1 (a DATETIME) ENGINE=MyISAM;
+CREATE OR REPLACE TABLE t1 (a DATETIME) ENGINE=InnoDB;
+# Use the transaction to keep binlog cache temporary file large enough
+BEGIN;
 INSERT INTO t1 SELECT NOW() FROM seq_1_to_6000;
+
 SET max_tmp_session_space_usage = 64*1024;
 --error 200
 SELECT * FROM information_schema.ALL_PLUGINS LIMIT 2;
+ROLLBACK;
 drop table t1;

Can you explain why you need these changes to the test case? I was not able to understand that from the patch.
With the reserved size, the cache exceeds CACHE_FILE_TRUNC_SIZE, and thus the binlog cache file is truncated after commit. So the transaction is kept open to keep the binlog cache temporary file large enough.
Oh, I see. Agree, this is better; it seems somewhat accidental that the original test case worked. Nice find! :-)

 - Kristian.
participants (3):
- andrei.elkin@pp.inet.fi
- Kristian Nielsen
- slb.songlibing@gmail.com