Re: [PATCH 4/4] MDEV-31273: Precompute binlog checksums
Hi!

Review of MDEV-31273: Precompute binlog checksums

On Fri, Aug 25, 2023 at 10:16 AM Kristian Nielsen <knielsen@knielsen-hq.org> wrote:
Compute binlog checksums (when enabled) already when writing events into the statement or transaction caches, where before it was done when the caches are copied to the real binlog file. This moves the checksum computation outside of holding LOCK_log, improving scalability.
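To illustrate the core idea with a standalone sketch (this is not the patch itself; the cache type below is a stand-in, though the binlog CRC32 is the ISO-3309 CRC-32 that zlib's crc32() computes):

    #include <zlib.h>    // crc32()
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Stand-in for the statement/transaction cache: each event is appended
    // together with its already-computed checksum footer, so copying the
    // cache into the binlog file later needs no CRC work under LOCK_log.
    struct event_cache
    {
      std::vector<unsigned char> buf;

      void append_event(const unsigned char *ev, size_t len)
      {
        // Checksum computed at cache-write time, outside LOCK_log.
        uint32_t crc= (uint32_t)crc32(crc32(0L, Z_NULL, 0), ev, (uInt)len);
        buf.insert(buf.end(), ev, ev + len);
        // Append the 4-byte footer little-endian, as int4store() would.
        for (int i= 0; i < 4; i++)
          buf.push_back((unsigned char)(crc >> (8 * i)));
      }
    };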
  DBUG_RETURN(my_b_copy_to_cache(from_cache, to_cache, SIZE_T_MAX));

You could use from_cache->end_of_file instead of SIZE_T_MAX.

<cut>

  uchar checksum_opt;

Wouldn't it be better to have this as an "enum_binlog_checksum_alg" to avoid some casts?
  }
  else if (data)
    return (enum enum_binlog_checksum_alg)data->checksum_opt;
  else
    return BINLOG_CHECKSUM_ALG_OFF;
}
You can remove the else's.

<cut>

    ulong *param_ptr_binlog_cache_disk_use,
    bool precompute_checksums)
    : stmt_cache(precompute_checksums), trx_cache(precompute_checksums),
      last_commit_pos_offset(0), using_xa(FALSE), xa_xid(0)
  {
    stmt_cache.set_binlog_cache_info(param_max_binlog_stmt_cache_size,
                                     param_ptr_binlog_stmt_cache_use,

This code was a bit confusing, as we first initialize stmt_cache and trx_cache with the checksum and then initialize them again in the next two lines. I understand why, and this does not have to be changed, but it does look a bit strange. The other option would be to have set_binlog_cache_info() take the checksum as an argument (see the sketch after this hunk).

<cut>

This one confused me a bit:

  Log_event_writer writer(file, 0, checksum_alg, &crypto);
  Log_event_writer writer(file, cache_data, checksum_alg, &crypto);

Why this change? It may at least affect calls to:

  void Log_event_writer::add_status(enum_logged_status status)
  {
    if (likely(cache_data))

Fix (after discussions on slack): Remove the following lines in MYSQL_BIN_LOG::write_event(Log_event *ev, binlog_cache_data *cache_data,

    if (cache_data)
      cache_data->add_status(ev->logged_status());

<cut>
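For reference, a rough sketch of that set_binlog_cache_info() variant (hypothetical, only to illustrate the suggestion; the remaining arguments are elided as in the quote above):

    stmt_cache.set_binlog_cache_info(param_max_binlog_stmt_cache_size,
                                     param_ptr_binlog_stmt_cache_use,
                                     ...,
                                     precompute_checksums);

The checksum arguments to the binlog_cache_data constructors could then presumably be dropped, avoiding the double initialization.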
Don't attempt to precompute checksums if:
 - Disabled by user request, --binlog-legacy-event-pos
 - Binlog is encrypted, cannot use precomputed checksums
 - WSREP/Galera.
Why Galera? Would be good to explain this in the commit comment.

<cut>
  /*
    If possible, just copy the cache over byte-by-byte with pre-computed
    checksums.
  */
  if (likely(binlog_checksum_options == cache_data->checksum_opt) &&
      likely(!crypto.scheme) &&
      likely(!opt_binlog_legacy_event_pos))
  {
    int res= my_b_copy_all_to_cache(cache, &log_file);
    status_var_add(thd->status_var.binlog_bytes_written, my_b_tell(cache));
    DBUG_RETURN(res ? ER_ERROR_ON_WRITE : 0);
  }
I was just wondering if this could sometimes be optimized to write directly to the real binlog file without another cache in between.

<cut>

  bool precomputed_checksums=
    (cache_data->checksum_opt != BINLOG_CHECKSUM_ALG_OFF);
  uint old_checksum_len= precomputed_checksums ? BINLOG_CHECKSUM_LEN : 0;

Why two variables when one can as easily do:

  uint old_checksum_len= ((cache_data->checksum_opt !=
                           BINLOG_CHECKSUM_ALG_OFF) ?
                          BINLOG_CHECKSUM_LEN : 0);

In the current write_cache() code, 'group' is a bad name. Please rename 'group' to 'log_file_pos'. By the way, great that you removed the 'end_log_file_pos' variable, which also was a very confusing name given how it was used!
  /*
    Any old precomputed checksum must _not_ be written here. Instead, it
    must be discarded; the new checksum, if needed, is written by
    writer.write_footer().
  */
  if (ev_len > old_checksum_len)
  {
    uint bytes_to_skip=
      old_checksum_len - std::min(old_checksum_len, ev_len - chunk);
    if (writer.write_data(cache->read_pos, chunk - bytes_to_skip))
      goto error_in_write;
  }
The above is likely wrong, as for long events we would execute the inner part multiple times, while there is only one checksum. As the checksum is last in the event, why not just do 'ev_len-= old_checksum_len' before the loop that copies the event, and then disregard 'old_checksum_len' bytes from the cache at the end of the loop?

Anyway, to find bugs like this, we need to have a test case with events that are bigger than the IO_CACHE size for the cache. Setting binlog_file_cache_size to 8192 (min value) should make it easy to test this.

<cut>
  binlog_checksum_options= value;

became:

  my_atomic_storeul_explicit(&binlog_checksum_options, value,
                             MY_MEMORY_ORDER_RELAXED);

Atomics are not needed, as we have a lock on the binlog.
......

Something totally different. I noticed in MYSQL_BIN_LOG::write_cache():

    group= (size_t)my_b_tell(&log_file);
    val= uint4korr(header + LOG_POS_OFFSET) + group + end_log_pos_inc;
    int4store(header + LOG_POS_OFFSET, val);

The first log entry is probably ok, as we have done a rotate() event before. However, the cache can have a lot of log events (as part of a transaction). The end_log_pos after the first event can be wrong, as it may not fit into 4 bytes. My understanding is that this is not a problem anymore, as we are not using end_log_pos anymore. However, I still wonder if rotate() should not only consider the current log file size but also the size of all events we plan to write to the log, and do a rotate if the total new log file size is > 4G.

Regards,
Monty
Michael Widenius via developers <developers@lists.mariadb.org> writes:
DBUG_RETURN(my_b_copy_to_cache(from_cache, to_cache, SIZE_T_MAX));
You could use from_cache->end_of_file instead of SIZE_T_MAX
Fixed.
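Presumably the call now reads:

    DBUG_RETURN(my_b_copy_to_cache(from_cache, to_cache,
                                   from_cache->end_of_file));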
uchar checksum_opt;
Wouldn't it be better to have this as an "enum_binlog_checksum_alg" to avoid some casts?
Fixed.
  else if (data)
    return (enum enum_binlog_checksum_alg)data->checksum_opt;
  else
    return BINLOG_CHECKSUM_ALG_OFF;
}
You can remove the else's
Fixed.
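With checksum_opt now an enum_binlog_checksum_alg (per the fix above), the tail of the function presumably reduces to:

    if (data)
      return data->checksum_opt;
    return BINLOG_CHECKSUM_ALG_OFF;
  }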
Fix (after discussions on slack): Remove the following lines in MYSQL_BIN_LOG::write_event(Log_event *ev, binlog_cache_data *cache_data,
    if (cache_data)
      cache_data->add_status(ev->logged_status());
Done, good catch!
Don't attempt to precompute checksums if:
 - Disabled by user request, --binlog-legacy-event-pos
 - Binlog is encrypted, cannot use precomputed checksums
 - WSREP/Galera.
Why Galera? Would be good to explain this in the commit comment.
Commit message extended.
In the current write_cache() code, 'group' is a bad name. Please rename 'group' to 'log_file_pos'.
Agree, fixed.
  if (ev_len > old_checksum_len)
  {
    uint bytes_to_skip=
      old_checksum_len - std::min(old_checksum_len, ev_len - chunk);
    if (writer.write_data(cache->read_pos, chunk - bytes_to_skip))
      goto error_in_write;
  }
As the checksum is last in the event, why not just do 'ev_len-= old_checksum_len' before the loop that copies the event, and then disregard 'old_checksum_len' bytes from the cache at the end of the loop?
Agree, that's *much* better, thanks for the suggestion. I removed this complex logic and just did a my_b_read(cache, buf, old_checksum_len) at the end.
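In outline, the fixed path is then roughly (a sketch, not the committed code; the copy loop itself is elided):

    ev_len-= old_checksum_len;   /* the checksum is the last bytes of the event */
    /* ... loop copying ev_len bytes of event data through writer ... */
    if (old_checksum_len)
    {
      uchar buf[BINLOG_CHECKSUM_LEN];
      if (my_b_read(cache, buf, old_checksum_len))
        goto error_in_write;     /* consume and discard the stale checksum */
    }

The new checksum, when needed, is still written by writer.write_footer() as before.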
The above is likely wrong, as for long events we would execute the inner part multiple times, while there is only one checksum.
The complex std::min() expression will be equal to 0 (old_checksum_len - old_checksum_len) except at the end of the loop, where ev_len - chunk drops below old_checksum_len and the trailing checksum bytes get skipped. But this is absolutely not clear, and I'm much happier with the fixed code.
Anyway, to find bugs like this, we need to have a test case with events that are bigger than the IO_CACHE size for the cache. Setting binlog_file_cache_size to 8192 (min value) should make it easy to test this.
There is a test for this: mysql-test/suite/rpl/t/rpl_checksum_cache.test. It sets --binlog-cache-size to 4096 and tests with large events and such.
  binlog_checksum_options= value;

became:

  my_atomic_storeul_explicit(&binlog_checksum_options, value,
                             MY_MEMORY_ORDER_RELAXED);

Atomics are not needed, as we have a lock on the binlog.
As we discussed, I removed the atomics changes for binlog_checksum_options. The option is accessed only under LOCK_log, except to get the value of how to pre-compute the checksums. And that value is re-checked (now under LOCK_log) in MYSQL_BIN_LOG::write_cache().
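The resulting shape is the classic optimistic read revalidated under the lock. A generic standalone illustration (all names invented; std::atomic is used here only to keep the demo free of data races, it is not what the server does):

    #include <atomic>
    #include <mutex>

    static std::mutex lock_log;                       // plays the role of LOCK_log
    static std::atomic<unsigned> checksum_option{1};  // plays binlog_checksum_options

    // Unlocked read: just a hint for deciding how to precompute checksums.
    unsigned precompute_hint()
    {
      return checksum_option.load(std::memory_order_relaxed);
    }

    // Under the lock, compare the authoritative value with what the cache
    // was precomputed with; on mismatch, fall back to recomputing.
    bool can_copy_precomputed(unsigned cache_checksum_opt)
    {
      std::lock_guard<std::mutex> guard(lock_log);
      return cache_checksum_opt ==
             checksum_option.load(std::memory_order_relaxed);
    }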
I noticed in MYSQL_BIN_LOG::write_cache():
    group= (size_t)my_b_tell(&log_file);
    val= uint4korr(header + LOG_POS_OFFSET) + group + end_log_pos_inc;
    int4store(header + LOG_POS_OFFSET, val);
The first log entry is probably ok, as we have done a rotate() event before. However, the cache can have a lot of log events (as part of a transaction). The end_log_pos after the first event can be wrong, as it may not fit into 4 bytes. My understanding is that this is not a problem anymore, as we are not using end_log_pos anymore. However, I still wonder if rotate() should not only consider the current log file size but also the size of all events we plan to write to the log, and do a rotate if the total new log file size is > 4G.
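To make the 4-byte wrap concrete (standalone illustration, numbers invented):

    #include <cstdint>
    #include <cstdio>

    int main()
    {
      uint64_t log_file_pos= 4294960000ULL;      // current position, just below 2^32
      uint64_t event_end= log_file_pos + 10000;  // a large transaction easily goes past 4GiB
      // int4store() keeps only the low 32 bits, so the stored end_log_pos wraps:
      uint32_t end_log_pos= (uint32_t)event_end;
      std::printf("%u\n", end_log_pos);          // prints 2704, not 4294970000
      return 0;
    }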
I agree, this is a valid concern, and I'm not sure exactly what will happen. Something that would be good to look into, maybe by someone in the mariadb.com test team?

 - Kristian.