Hi Kristian, Thanks for the detail review. First of all, about the name --binlog-commit-by-rotate-threshold, It is better than free flush. but IMHO, rotate cannot show the feature directly. Rotate is just an effect of the feature. So how about to name it --binlog-commit-by-rename-threshold? I already used 'rename' in comments and function names.
The patch enables the feature by default, with a threshold of 128MB. We need to consider if that is appropriate, or if the feature should be disabled by default.
Reasons for disabling by default:
- The feature causes some binlog files to be cut short of the configured --max-binlog-size, which will be a surprise to users.
- The feature adds an overhead of 4096 pre-allocated bytes at the start of every trx cache (though the overhead is minimal).
- A default of 128MB is somewhat arbitrary.
That is according to Alibaba's cloud ssd's throughput which is 350MB/s. So the copy may take 1/3s if the data is still in the page cache, otherwise it will take 2/3s. I hope binlog commit should not take more than 1/2s. It could take less time on fast local ssd, but still rename is much faster than copy.
- Releasing this feature disabled by default is somewhat less risky wrt. regressions for users that upgrade.
If there is any problem, users can disable it with a huge value. I originally added separated variable to enable/disable the feature. I removed it because binlog_commit_by_rotate_threshold can serve the same purpose.
The patch calls the trx_group_commit_leader() function. That is ackward, as commit-by-rotate does not actually participate in group commit. But the problem is with the existing function trx_group_commit_leader(), which does too much in a single function. It is updated in the latest patch, now rename is called in trx_group_commit_leader(), it could be in a commit queue with other transactions together.
@@ -4126,8 +4238,7 @@ bool MYSQL_BIN_LOG::open(const char *log_name, /* Notify the io thread that binlog is rotated to a new file */ if (is_relay_log) signal_relay_log_update(); - else - update_binlog_end_pos(); + Why do you remove update_binlog_end_pos() here? Doesn't the binlog dump threads need this update? reset_binlog_end_pos() already did the same thing. ... +size_t Binlog_free_flush::get_gtid_event_pad_size() +{ + size_t begin_pos= my_b_tell(&mysql_bin_log.log_file); + size_t pad_to_size= + m_cache_data->file_reserved_bytes() - begin_pos - LOG_EVENT_HEADER_LEN; + + if (binlog_checksum_options != BINLOG_CHECKSUM_ALG_OFF) + pad_to_size-= BINLOG_CHECKSUM_LEN; Adjusting with BINLOG_CHECKSUM_LEN is to account for the 4 bytes of checksum at the end of the GTID_EVENT, right?
But why do you subtract LOG_EVENT_HEADER_LEN when calculating pad_to_size? pad_size is data size of Gtid_event, so it doesn't include the size of header. Maybe it is better to call it get_gtid_event_pad_data_size(). I also renamed pad_to_size to pad_data_to_size.
-CREATE OR REPLACE TABLE t1 (a DATETIME) ENGINE=MyISAM; +CREATE OR REPLACE TABLE t1 (a DATETIME) ENGINE=InnoDB; +# Use the transaction to keep binlog cache temporary file large enough +BEGIN; INSERT INTO t1 SELECT NOW() FROM seq_1_to_6000; + SET max_tmp_session_space_usage = 64*1024; --error 200 SELECT * FROM information_schema.ALL_PLUGINS LIMIT 2; +ROLLBACK; drop table t1; Can you explain why you need these changes to the test case? I was not able to understand that from the patch.
With reserved size, it exceeds CACHE_FILE_TRUNC_SIZE, thus binlog cache is truncated after commit. so keep the transaction to keep binlog cache large enough