Hi Monty,
Here is my review of the MDEV-33856 patch:
> commit cc516f4d9913d732065c7abef983df42e5a8bab3 (origin/bb-11.6-MDEV-33856-seconds_behind_master)
> Author: Monty <monty(a)mariadb.org>
> Date: Tue May 14 23:47:59 2024 +0300
>
> MDEV-33856 New definition for Seconds_Behind_Master
If I understand the patch correctly, in introduces a new
`Master_slave_time_diff` value as an alternative to the traditional
Seconds_behind_master, with a clearer semantics. And does not change how
Seconds_behind_master works.
I think this is fine, and can help to preserve backwards compatibility. Just
maybe mention this in the commit message, and/or change the bug title to
"Alternative definition for Seconds_Behind_Master".
> diff --git a/mysql-test/suite/perfschema/t/relaylog.test b/mysql-test/suite/perfschema/t/relaylog.test
> index 12fc96a8b27..16f72efae98 100644
> --- a/mysql-test/suite/perfschema/t/relaylog.test
> +++ b/mysql-test/suite/perfschema/t/relaylog.test
> @@ -43,7 +43,7 @@ select
> if (sum_number_of_bytes_read > 0, "MANY", "NONE") as SUM_NUMBER_OF_BYTES_READ,
> if (sum_number_of_bytes_write > 0, "MANY", "NONE") as SUM_NUMBER_OF_BYTES_WRITE
> from performance_schema.file_summary_by_instance
> - where file_name like "%master-%" order by file_name;
> + where file_name like "%master-%" and file_name not like "%sql/share%" order by file_name;
>
> select * from performance_schema.file_summary_by_instance
> where file_name like "%slave-%" order by file_name;
> @@ -96,7 +96,7 @@ sync_slave_with_master;
> -- echo "============ Performance schema on slave ============"
>
> select * from performance_schema.file_summary_by_instance
> - where file_name like "%master-%" order by file_name;
> + where file_name like "%master-%" and file_name not like "%sql/share%" order by file_name;
>
> select
> substring(file_name, locate("slave-", file_name)) as FILE_NAME,
What is the reason for these changes (adding match on "%sql/share%" ?
I don't understand the relation to the subject of the patch.
> diff --git a/mysql-test/suite/rpl/t/master_last_event_time.inc b/mysql-test/suite/rpl/t/master_last_event_time.inc
> new file mode 100644
> index 00000000000..0790d557a7b
> --- /dev/null
> +++ b/mysql-test/suite/rpl/t/master_last_event_time.inc
>
> +--connection server_3
> +--sleep 1
> +--source include/start_slave.inc
> +--source include/sync_with_master_gtid.inc
What is the purpose of --sleep 1 here?
Usually it's wrong to have sleeps in the tests. Sometimes it is correct, but
then there should be a comment explaining why a sleep is necessary (and
sufficient) here. This sleep looks wrong, since there's a proper wait
immediately after for the slave to sync up.
> diff --git a/mysql-test/suite/rpl/t/master_last_event_time_stmt.test b/mysql-test/suite/rpl/t/master_last_event_time_stmt.test
> new file mode 100644
> index 00000000000..fe0f16a3db3
> --- /dev/null
> +++ b/mysql-test/suite/rpl/t/master_last_event_time_stmt.test
> @@ -0,0 +1,84 @@
> +#
> +# Statement specific tests for master_last_event_time
> +#
> +--source include/have_binlog_format_statement.inc
>
> +# Slave will replicate the above with a row event which will be very fast
> +# compared to the master event.
I don't understand this comment. Why will the slave replicate using a row
event, when the test case sets binlog_format=statement?
> diff --git a/sql/log.cc b/sql/log.cc
> index 1cc0fa77f08..63d9c756911 100644
> --- a/sql/log.cc
> +++ b/sql/log.cc
> @@ -1898,10 +1898,20 @@ static inline int
> binlog_commit_flush_xid_caches(THD *thd, binlog_cache_mngr *cache_mngr,
> bool all, my_xid xid)
> Xid_log_event end_evt(thd, xid, TRUE);
> - return (binlog_flush_cache(thd, cache_mngr, &end_evt, all, TRUE, TRUE));
> + if (!thd->slave_thread && ! thd->user_time.val)
> + {
> + /*
> + Ensure that on the master the event time is the time of commit,
> + not the start of statement time.
> + */
> + my_hrtime_t hrtime= my_hrtime();
> + end_evt.when= hrtime_to_my_time(hrtime);
> + }
> + DBUG_RETURN(binlog_flush_cache(thd, cache_mngr, &end_evt, all, TRUE, TRUE));
Ok, so correct me if I'm wrong, but I think the purpose here is: You want to
have the time when the transaction ended (in case of long-running query).
But for auto-committed query, there is no explicit COMMIT query, so the
thd->start_time is the start of the transaction. So here you ensure that the
XID_EVENT will have the time at start of commit in either case. If this is
correct, probably could use a comment explaining this.
Some comments:
1. It is better to check for thd->rgi_slave being non-NULL than for
thd->slave_thread. The thd->rgi_slave is associated with the replication of
some event or transaction. The thd->slave_thread just identifies the
thread/connection. (This is mostly style/code clarity, as in current code
the two values would track each other.)
2. Why the condition on thd->user_time.val? What is the situation where we
are not replicating an event, but user_time is set and we do not want to put
the current time in the XID_EVENT?
3. This handles an XID end event, which is for InnoDB transactions. But
MyISAM/Aria use a different end event, a COMMIT query, and there are are few
other cases as well. Don't you want consistently to have the end of the
query as the time in all end events? Then this code should probably instead
be in binlog_flush_cache() or perhaps write_transaction_to_binlog()?
4. Even given (3), there are still cases (eg. DDL) where there is no end
event, and thus there is no event containing the time for the end of the
operation on the master. Won't this cause some inconsistency in how the
slave determines and displays its timings with this patch? If not, why not?
> for (; has_xid && !entry.need_unlog && ha_info; ha_info= ha_info->next())
> {
> if (ha_info->is_started() && ha_info->ht() != binlog_hton &&
> !ha_info->ht()->commit_checkpoint_request)
> + {
> entry.need_unlog= true;
> + break;
> + }
> }
Agree, this code always irritated me ;-)
You could also move the condition (!entry.need_unlog && ha_info) out of the
loop in an outer if () to not have to check it on every loop iteration (just
an observation, entirely up to you, it's not something your patch
introduced).
> diff --git a/sql/log_event.cc b/sql/log_event.cc
> index 3802f76efe0..706a4903eb0 100644
> --- a/sql/log_event.cc
> +++ b/sql/log_event.cc
> @@ -1104,10 +1116,12 @@ Log_event* Log_event::read_log_event(const uchar *buf, uint event_len,
> switch(event_type) {
> case QUERY_EVENT:
> ev= new Query_log_event(buf, event_len, fdle, QUERY_EVENT);
> + set_orig_exec_time_in_thd(((Query_log_event*) ev)->exec_time);
> break;
> case QUERY_COMPRESSED_EVENT:
> ev= new Query_compressed_log_event(buf, event_len, fdle,
> - QUERY_COMPRESSED_EVENT);
> + QUERY_COMPRESSED_EVENT);
> + set_orig_exec_time_in_thd(((Query_compressed_log_event*) ev)->exec_time);
> @@ -964,6 +964,18 @@ Log_event* Log_event::read_log_event(IO_CACHE* file,
> DBUG_RETURN(res);
> }
>
> +/*
> + Update thd->orig_exec_time
> +*/
> +
> +inline void set_orig_exec_time_in_thd(my_time_t exec_time)
> +{
> +#if !defined(MYSQL_CLIENT) && defined(HAVE_REPLICATION)
> + THD *thd= current_thd;
> + if (likely(thd))
> + thd->orig_exec_time= exec_time;
> +#endif
> +}
Using current_thd doesn't seem right here, and in fact it's not, as seen by
this part of the patch where thd->orig_exec_time has to be set again for
parallel replication:
> diff --git a/sql/rpl_parallel.cc b/sql/rpl_parallel.cc
> index 51ca3b183ec..d279761767a 100644
> --- a/sql/rpl_parallel.cc
> +++ b/sql/rpl_parallel.cc
> @@ -60,7 +60,10 @@ rpt_handle_event(rpl_parallel_thread::queued_event *qev,
> rgi->future_event_relay_log_pos= qev->future_event_relay_log_pos;
> strcpy(rgi->future_event_master_log_name, qev->future_event_master_log_name);
> if (event_can_update_last_master_timestamp(ev))
> + {
> rgi->last_master_timestamp= ev->when + ev->exec_time;
> + thd->orig_exec_time= ev->exec_time;
> + }
So let's not set thd->orig_exec_time from Log_event::read_log_event(), it's
redundant (for parallel replication), and logically setting it should be
done when applying the event, not when reading the event.
Instead, set the thd->orig_exec_time in exec_relay_log_event(), so we retain
similarity with how replication is done in the parallel and non-parallel
case. There is already a similar if (event_can_update_last_master_timestamp(ev))
in exec_relay_log_event().
(Also see comments below about moving this from THD to rpl_group_info to not
increase THD just for this very specific case).
> +
> +/*
> + Get the time when the event had been executed on the master.
> + This works for both query events and load data events.
> +*/
> +
> +#if Q_EXEC_TIME_OFFSET != L_EXEC_TIME_OFFSET
> +#error "Q_EXEC_TIME_OFFSET is not same as L_EXEC_TIME_OFFSET"
> +#endif
> +
> +time_t query_event_get_time(const uchar *buf,
> + const Format_description_log_event
> + *description_event)
> +{
> + time_t when= uint4korr(buf);
> + buf+= description_event->common_header_len;
> + return when + uint4korr(buf + Q_EXEC_TIME_OFFSET);
Suggestion: Add an assert here:
DBUG_ASSERT(LOG_EVENT_IS_QUERY(buf[EVENT_TYPE_OFFSET]) ||
LOG_EVENT_IS_LOAD_DATA(buf[EVENT_TYPE_OFFSET]));
just to avoid someone mistakenly adding code later to call this for a wrong
event type.
> diff --git a/sql/sql_class.h b/sql/sql_class.h
> index 2d671139a6c..05c9c4e8012 100644
> --- a/sql/sql_class.h
> +++ b/sql/sql_class.h
> @@ -3168,6 +3168,7 @@ class THD: public THD_count, /* this must be first */
> ulonglong start_utime, utime_after_lock, utime_after_query;
> /* This can be used by handlers to send signals to the SQL level */
> ulonglong replication_flags;
> + my_time_t orig_exec_time; // Exec time for last read binlog event
> @@ -1348,8 +1346,14 @@ Query_log_event::Query_log_event(THD* thd_arg, const char* query_arg,
> memset(&host, 0, sizeof(host));
> error_code= errcode;
>
> - end_time= my_time(0);
> - exec_time = (ulong) (end_time - thd_arg->start_time);
> + /*
> + For slave threads, remember the original master exec time.
> + This is needed to be able to calculate the master commit time.
> + */
> + exec_time= ((thd->slave_thread) ?
> + thd->orig_exec_time :
> + (my_time(0) - thd_arg->start_time));
> +
I think we can here avoid increasing the THD size, which is sad for a
special case like this. Instead put the orig_exec_time into THD::rgi_slave,
extending the struct rpl_group_info.
Then the assignment above will be to rgi->orig_exec_time instead of
thd->orig_exec_time, and the check here will be for thd->rgi_slave instead
of thd->slave_thread. The rgi is the appropriate place for orig_exec_time,
as it is a property of the transaction/event group being replicated.
@@ -3171,146 +2856,170 @@ static const LEX_CSTRING msg_ignored= { STRING_WITH_LEN("Ignored") };
#endif
> -static bool send_show_master_info_data(THD *thd, Master_info *mi, bool full,
> - String *gtid_pos)
> +inline void store_string_or_null(Field **field, const char *str)
> +{
>
> - mysql_mutex_unlock(&mi->run_lock);
> +void store_master_info(THD *thd, Master_info *mi, TABLE *table,
>
> - protocol->store_string_or_null(mi->host, &my_charset_bin);
> - protocol->store_string_or_null(mi->user, &my_charset_bin);
> - protocol->store((uint32) mi->port);
> - protocol->store((uint32) mi->connect_retry);
> - protocol->store(mi->master_log_name, strlen(mi->master_log_name),
> + store_string_or_null(field++, mi->host);
> + store_string_or_null(field++, mi->user);
> + (*field++)->store((uint32) mi->port);
> + (*field++)->store((uint32) mi->connect_retry);
> + (*field++)->store(mi->master_log_name, strlen(mi->master_log_name),
So this patch refactors the code for SHOW SLAVE STATUS to use a shared code
path with information_schema tables? That sounds like a good change.
But for the future, please do such refactoring as a separate patch, you can
see my patch series of pre-computing binlog checksums (MDEV-31273) for an
example. Doing it together makes it very hard to review the actual code
changes and see what is refactoring and what is new logic for
master_slave_time_diff.
> + if (unlikely(!rli->slave_timestamp) && Log_event::is_group_event(typ))
> + {
> + /*
> + First event for this slave. Assume that all the slave was up to date
> + with the master just before the current event.
> + */
> + rli->slave_timestamp= (time_t) ev->when + (time_t) ev->exec_time-1;
I think the comment is meant to explain why there is the "-1" / minus-one in
this expression, but it's not really clear exactly what is meant, so perhaps
elaborate it a bit.
Does it mean that master_slave_time_diff will start out as "1 second" just
after the slave restarts?
One way or the other, I do not have a strong opinion. But I want to give a
general remark, from my experience with the troubles over the years with
Seconds_Behind_Master. The problem with Seconds_Behind_Master was always
that it didn't quite do what someone expected in some particular case, and
then a new special case was added to try to accomodate. And after a number
of iterations of this, we ended up with a semantics for
Seconds_Behind_Master that no-one _really_ understands, and which still
doesn't quite always do what people expect (which is in any case impossible
if different people have different expectations).
This "-1" here is clearly on the same road towards adding special cases. So
my strong advice would be that, going forward, be _extremely_ conservative
about modifying the semantics of master_slave_time_diff. The one good thing
about the new status is that it has a (hopefully) clear semantics that can
be documented (now including this "-1") and completely understood by users.
Let's stick to that no matter who comes and wants just another small tweak,
unless such tweaks actually make the semantics and documentation _simpler_.
> + /*
> + replay_log.description_event_for_exec can be null if the slave thread
> + is getting killed
> + */
> + if (LOG_EVENT_IS_QUERY((Log_event_type) buf[EVENT_TYPE_OFFSET]) ||
> + LOG_EVENT_IS_LOAD_DATA((Log_event_type) buf[EVENT_TYPE_OFFSET]))
> + {
> + time_t exec_time= query_event_get_time(buf, rli->relay_log.
> + description_event_for_queue);
> + set_if_bigger(rli->newest_master_timestamp, exec_time);
> + }
This confused me, because in Query_log_event, the exec_time is the duration
of the query (a time interval), not an absolute time.
So please use the name "end_time" for the variable instead of "exec_time",
and consider renaming query_event_get_time() to query_event_get_end_time().
>
> +static int make_slave_status_old_format(THD *thd, ST_SCHEMA_TABLE *schema_table)
> +{
> + ST_FIELD_INFO *field_info= schema_table->fields_info;
> + Name_resolution_context *context= &thd->lex->first_select_lex()->context;
> + DBUG_ASSERT(thd->lex->sql_command == SQLCOM_SHOW_SLAVE_STAT);
> + bool all_slaves= thd->lex->mi.show_all_slaves;
> + ulonglong used_fields= ~0ULL;
> +
> + if (!all_slaves)
> + {
> + /* Remove 2 first fields and all fields above and including field 55 */
> + used_fields&= ~((1ULL << 0) | (1ULL << 1));
> + used_fields&= ((1ULL << 56)-1);
Should there be some constant used for the magic numbers 55/56?
And/or an assertion that this value is correct?
Also, is the intention here that we will from here on never add more fields
to legacy SHOW SLAVE STATUS? If so, there should probably be a comment about
this; also see the test case
mysql-test/suite/rpl/t/rpl_show_slave_status.test which has comments that
might need updating.
- Kristian