Hi Kristian,

Thanks for the review! I'll be taking over for Monty to address/incorporate your comments, see my responses below. 

On Mon, Jul 22, 2024 at 8:02 AM Kristian Nielsen via developers <developers@lists.mariadb.org> wrote:
Hi Monty,

Here is my review of the MDEV-33856 patch:

> commit cc516f4d9913d732065c7abef983df42e5a8bab3 (origin/bb-11.6-MDEV-33856-seconds_behind_master)
> Author: Monty <monty@mariadb.org>
> Date:   Tue May 14 23:47:59 2024 +0300
>
>     MDEV-33856 New definition for Seconds_Behind_Master

If I understand the patch correctly, in introduces a new
`Master_slave_time_diff` value as an alternative to the traditional
Seconds_behind_master, with a clearer semantics. And does not change how
Seconds_behind_master works.

I think this is fine, and can help to preserve backwards compatibility. Just
maybe mention this in the commit message, and/or change the bug title to
"Alternative definition for Seconds_Behind_Master".


Sure, will do.
 

> diff --git a/mysql-test/suite/perfschema/t/relaylog.test b/mysql-test/suite/perfschema/t/relaylog.test
> index 12fc96a8b27..16f72efae98 100644
> --- a/mysql-test/suite/perfschema/t/relaylog.test
> +++ b/mysql-test/suite/perfschema/t/relaylog.test
> @@ -43,7 +43,7 @@ select
>    if (sum_number_of_bytes_read > 0, "MANY", "NONE") as SUM_NUMBER_OF_BYTES_READ,
>    if (sum_number_of_bytes_write > 0, "MANY", "NONE") as SUM_NUMBER_OF_BYTES_WRITE
>    from performance_schema.file_summary_by_instance
> -  where file_name like "%master-%" order by file_name;
> +  where file_name like "%master-%" and file_name not like "%sql/share%" order by file_name;

>  select * from performance_schema.file_summary_by_instance
>    where file_name like "%slave-%" order by file_name;
> @@ -96,7 +96,7 @@ sync_slave_with_master;
>  -- echo "============ Performance schema on slave ============"

>  select * from performance_schema.file_summary_by_instance
> -  where file_name like "%master-%" order by file_name;
> +  where file_name like "%master-%" and file_name not like "%sql/share%" order by file_name;

>  select
>    substring(file_name, locate("slave-", file_name)) as FILE_NAME,

What is the reason for these changes (adding match on "%sql/share%" ?
I don't understand the relation to the subject of the patch.


I'm not immediately sure about this one. I'll investigate, and remove if it is proven
unnecessary.
 

> diff --git a/mysql-test/suite/rpl/t/master_last_event_time.inc b/mysql-test/suite/rpl/t/master_last_event_time.inc
> new file mode 100644
> index 00000000000..0790d557a7b
> --- /dev/null
> +++ b/mysql-test/suite/rpl/t/master_last_event_time.inc
>
> +--connection server_3
> +--sleep 1
> +--source include/start_slave.inc
> +--source include/sync_with_master_gtid.inc

What is the purpose of --sleep 1 here?

Usually it's wrong to have sleeps in the tests. Sometimes it is correct, but
then there should be a comment explaining why a sleep is necessary (and
sufficient) here. This sleep looks wrong, since there's a proper wait
immediately after for the slave to sync up.


Indeed this is probably left-over from something prior, I'll remove it. 
 

> diff --git a/mysql-test/suite/rpl/t/master_last_event_time_stmt.test b/mysql-test/suite/rpl/t/master_last_event_time_stmt.test
> new file mode 100644
> index 00000000000..fe0f16a3db3
> --- /dev/null
> +++ b/mysql-test/suite/rpl/t/master_last_event_time_stmt.test
> @@ -0,0 +1,84 @@
> +#
> +# Statement specific tests for master_last_event_time
> +#
> +--source include/have_binlog_format_statement.inc
>
> +# Slave will replicate the above with a row event which will be very fast
> +# compared to the master event.

I don't understand this comment. Why will the slave replicate using a row
event, when the test case sets binlog_format=statement?


This whole check should go in the *_row.test version. I'll move it.
 

> diff --git a/sql/log.cc b/sql/log.cc
> index 1cc0fa77f08..63d9c756911 100644
> --- a/sql/log.cc
> +++ b/sql/log.cc
> @@ -1898,10 +1898,20 @@ static inline int
>  binlog_commit_flush_xid_caches(THD *thd, binlog_cache_mngr *cache_mngr,
>                                 bool all, my_xid xid)

>    Xid_log_event end_evt(thd, xid, TRUE);
> -  return (binlog_flush_cache(thd, cache_mngr, &end_evt, all, TRUE, TRUE));
> +  if (!thd->slave_thread && ! thd->user_time.val)
> +  {
> +    /*
> +      Ensure that on the master the event time is the time of commit,
> +      not the start of statement time.
> +    */
> +    my_hrtime_t hrtime= my_hrtime();
> +    end_evt.when= hrtime_to_my_time(hrtime);
> +  }
> +  DBUG_RETURN(binlog_flush_cache(thd, cache_mngr, &end_evt, all, TRUE, TRUE));

Ok, so correct me if I'm wrong, but I think the purpose here is: You want to
have the time when the transaction ended (in case of long-running query).
But for auto-committed query, there is no explicit COMMIT query, so the
thd->start_time is the start of the transaction. So here you ensure that the
XID_EVENT will have the time at start of commit in either case. If this is
correct, probably could use a comment explaining this.

Right, will do. 
 

Some comments:

1. It is better to check for thd->rgi_slave being non-NULL than for
thd->slave_thread. The thd->rgi_slave is associated with the replication of
some event or transaction. The thd->slave_thread just identifies the
thread/connection. (This is mostly style/code clarity, as in current code
the two values would track each other.)

Sure. 
 

2. Why the condition on thd->user_time.val? What is the situation where we
are not replicating an event, but user_time is set and we do not want to put
the current time in the XID_EVENT?
 
When SET TIMESTAMP is set, the XID_EVENT should use the user value,
rather than the current time (as all other events of the transaction will use the
SET TIMESTAMP value).
 

3. This handles an XID end event, which is for InnoDB transactions. But
MyISAM/Aria use a different end event, a COMMIT query, and there are are few
other cases as well. Don't you want consistently to have the end of the
query as the time in all end events? Then this code should probably instead
be in binlog_flush_cache() or perhaps write_transaction_to_binlog()?

Hmm, but I think that would break your point of (4) where there are 
Query_log_events without end events. So if we tag their `when` to be at
binlog time, then when we calculate the end of the Query_log_event on
the slave, it would be when + exec_time, which could put us significantly
past the real end time of the event. So I think the XID and Query_log events
should be treated differently.
 

4. Even given (3), there are still cases (eg. DDL) where there is no end
event, and thus there is no event containing the time for the end of the
operation on the master. Won't this cause some inconsistency in how the
slave determines and displays its timings with this patch? If not, why not?

And to elaborate on the response of (3) for consistency, the
Query_log_event's `when` uses thd->start_time, even for the end event. 
So the end Query_log_event for say a MyISAM/Aria transaction will be
consistent with the single Query_log_event of an event without an explicit
ending transaction.
 


>    for (; has_xid && !entry.need_unlog && ha_info; ha_info= ha_info->next())
>    {
>      if (ha_info->is_started() && ha_info->ht() != binlog_hton &&
>          !ha_info->ht()->commit_checkpoint_request)
> +    {
>        entry.need_unlog= true;
> +      break;
> +    }
>    }

Agree, this code always irritated me ;-)

You could also move the condition (!entry.need_unlog && ha_info) out of the
loop in an outer if () to not have to check it on every loop iteration (just
an observation, entirely up to you, it's not something your patch
introduced).


> diff --git a/sql/log_event.cc b/sql/log_event.cc
> index 3802f76efe0..706a4903eb0 100644
> --- a/sql/log_event.cc
> +++ b/sql/log_event.cc

> @@ -1104,10 +1116,12 @@ Log_event* Log_event::read_log_event(const uchar *buf, uint event_len,
>      switch(event_type) {
>      case QUERY_EVENT:
>        ev= new Query_log_event(buf, event_len, fdle, QUERY_EVENT);
> +      set_orig_exec_time_in_thd(((Query_log_event*) ev)->exec_time);
>        break;
>      case QUERY_COMPRESSED_EVENT:
>        ev= new Query_compressed_log_event(buf, event_len, fdle,
> -                                          QUERY_COMPRESSED_EVENT);
> +                                         QUERY_COMPRESSED_EVENT);
> +      set_orig_exec_time_in_thd(((Query_compressed_log_event*) ev)->exec_time);

> @@ -964,6 +964,18 @@ Log_event* Log_event::read_log_event(IO_CACHE* file,
>    DBUG_RETURN(res);
>  }

> +/*
> +  Update thd->orig_exec_time
> +*/
> +
> +inline void set_orig_exec_time_in_thd(my_time_t exec_time)
> +{
> +#if !defined(MYSQL_CLIENT) && defined(HAVE_REPLICATION)
> +  THD *thd= current_thd;
> +  if (likely(thd))
> +    thd->orig_exec_time= exec_time;
> +#endif
> +}

Using current_thd doesn't seem right here, and in fact it's not, as seen by
this part of the patch where thd->orig_exec_time has to be set again for
parallel replication:

> diff --git a/sql/rpl_parallel.cc b/sql/rpl_parallel.cc
> index 51ca3b183ec..d279761767a 100644
> --- a/sql/rpl_parallel.cc
> +++ b/sql/rpl_parallel.cc
> @@ -60,7 +60,10 @@ rpt_handle_event(rpl_parallel_thread::queued_event *qev,
>    rgi->future_event_relay_log_pos= qev->future_event_relay_log_pos;
>    strcpy(rgi->future_event_master_log_name, qev->future_event_master_log_name);
>    if (event_can_update_last_master_timestamp(ev))
> +  {
>      rgi->last_master_timestamp= ev->when + ev->exec_time;
> +    thd->orig_exec_time= ev->exec_time;
> +  }

So let's not set thd->orig_exec_time from Log_event::read_log_event(), it's
redundant (for parallel replication), and logically setting it should be
done when applying the event, not when reading the event.

Instead, set the thd->orig_exec_time in exec_relay_log_event(), so we retain
similarity with how replication is done in the parallel and non-parallel
case. There is already a similar if (event_can_update_last_master_timestamp(ev))
in exec_relay_log_event().

(Also see comments below about moving this from THD to rpl_group_info to not
increase THD just for this very specific case).

Sure, makes sense (to also use RGI).
 


> +
> +/*
> +  Get the time when the event had been executed on the master.
> +  This works for both query events and load data events.
> +*/
> +
> +#if Q_EXEC_TIME_OFFSET != L_EXEC_TIME_OFFSET
> +#error "Q_EXEC_TIME_OFFSET is not same as L_EXEC_TIME_OFFSET"
> +#endif
> +
> +time_t query_event_get_time(const uchar *buf,
> +                            const Format_description_log_event
> +                            *description_event)
> +{
> +  time_t when= uint4korr(buf);
> +  buf+= description_event->common_header_len;
> +  return when + uint4korr(buf + Q_EXEC_TIME_OFFSET);

Suggestion: Add an assert here:

  DBUG_ASSERT(LOG_EVENT_IS_QUERY(buf[EVENT_TYPE_OFFSET]) ||
              LOG_EVENT_IS_LOAD_DATA(buf[EVENT_TYPE_OFFSET]));

just to avoid someone mistakenly adding code later to call this for a wrong
event type.

Sure, good idea.
 


> diff --git a/sql/sql_class.h b/sql/sql_class.h
> index 2d671139a6c..05c9c4e8012 100644
> --- a/sql/sql_class.h
> +++ b/sql/sql_class.h
> @@ -3168,6 +3168,7 @@ class THD: public THD_count, /* this must be first */
>    ulonglong  start_utime, utime_after_lock, utime_after_query;
>    /* This can be used by handlers to send signals to the SQL level */
>    ulonglong  replication_flags;
> +  my_time_t orig_exec_time;         // Exec time for last read binlog event

> @@ -1348,8 +1346,14 @@ Query_log_event::Query_log_event(THD* thd_arg, const char* query_arg,
>    memset(&host, 0, sizeof(host));
>    error_code= errcode;

> -  end_time= my_time(0);
> -  exec_time = (ulong) (end_time  - thd_arg->start_time);
> +  /*
> +    For slave threads, remember the original master exec time.
> +    This is needed to be able to calculate the master commit time.
> +  */
> +  exec_time= ((thd->slave_thread) ?
> +              thd->orig_exec_time :
> +              (my_time(0) - thd_arg->start_time));
> +

I think we can here avoid increasing the THD size, which is sad for a
special case like this. Instead put the orig_exec_time into THD::rgi_slave,
extending the struct rpl_group_info.

Then the assignment above will be to rgi->orig_exec_time instead of
thd->orig_exec_time, and the check here will be for thd->rgi_slave instead
of thd->slave_thread. The rgi is the appropriate place for orig_exec_time,
as it is a property of the transaction/event group being replicated.


@@ -3171,146 +2856,170 @@ static const LEX_CSTRING msg_ignored=  { STRING_WITH_LEN("Ignored") };
 #endif


> -static bool send_show_master_info_data(THD *thd, Master_info *mi, bool full,
> -                                       String *gtid_pos)
> +inline void store_string_or_null(Field **field, const char *str)
> +{
>
> -    mysql_mutex_unlock(&mi->run_lock);
> +void store_master_info(THD *thd, Master_info *mi, TABLE *table,
>
> -    protocol->store_string_or_null(mi->host, &my_charset_bin);
> -    protocol->store_string_or_null(mi->user, &my_charset_bin);
> -    protocol->store((uint32) mi->port);
> -    protocol->store((uint32) mi->connect_retry);
> -    protocol->store(mi->master_log_name, strlen(mi->master_log_name),
> +  store_string_or_null(field++, mi->host);
> +  store_string_or_null(field++, mi->user);
> +  (*field++)->store((uint32) mi->port);
> +  (*field++)->store((uint32) mi->connect_retry);
> +  (*field++)->store(mi->master_log_name, strlen(mi->master_log_name),

So this patch refactors the code for SHOW SLAVE STATUS to use a shared code
path with information_schema tables? That sounds like a good change.
But for the future, please do such refactoring as a separate patch, you can
see my patch series of pre-computing binlog checksums (MDEV-31273) for an
example. Doing it together makes it very hard to review the actual code
changes and see what is refactoring and what is new logic for
master_slave_time_diff.


> +      if (unlikely(!rli->slave_timestamp) && Log_event::is_group_event(typ))
> +      {
> +        /*
> +          First event for this slave. Assume that all the slave was up to date
> +          with the master just before the current event.
> +        */
> +        rli->slave_timestamp= (time_t) ev->when + (time_t) ev->exec_time-1;

I think the comment is meant to explain why there is the "-1" / minus-one in
this expression, but it's not really clear exactly what is meant, so perhaps
elaborate it a bit.

Does it mean that master_slave_time_diff will start out as "1 second" just
after the slave restarts?

One way or the other, I do not have a strong opinion. But I want to give a
general remark, from my experience with the troubles over the years with
Seconds_Behind_Master. The problem with Seconds_Behind_Master was always
that it didn't quite do what someone expected in some particular case, and
then a new special case was added to try to accomodate. And after a number
of iterations of this, we ended up with a semantics for
Seconds_Behind_Master that no-one _really_ understands, and which still
doesn't quite always do what people expect (which is in any case impossible
if different people have different expectations).

This "-1" here is clearly on the same road towards adding special cases. So
my strong advice would be that, going forward, be _extremely_ conservative
about modifying the semantics of master_slave_time_diff. The one good thing
about the new status is that it has a (hopefully) clear semantics that can
be documented (now including this "-1") and completely understood by users.
Let's stick to that no matter who comes and wants just another small tweak,
unless such tweaks actually make the semantics and documentation _simpler_.


> +    /*
> +      replay_log.description_event_for_exec can be null if the slave thread
> +      is getting killed
> +    */
> +    if (LOG_EVENT_IS_QUERY((Log_event_type) buf[EVENT_TYPE_OFFSET]) ||
> +        LOG_EVENT_IS_LOAD_DATA((Log_event_type) buf[EVENT_TYPE_OFFSET]))
> +    {
> +      time_t exec_time= query_event_get_time(buf, rli->relay_log.
> +                                             description_event_for_queue);
> +      set_if_bigger(rli->newest_master_timestamp, exec_time);
> +    }

This confused me, because in Query_log_event, the exec_time is the duration
of the query (a time interval), not an absolute time.

So please use the name "end_time" for the variable instead of "exec_time",
and consider renaming query_event_get_time() to query_event_get_end_time().

Right, makes sense.
 

>
> +static int make_slave_status_old_format(THD *thd, ST_SCHEMA_TABLE *schema_table)
> +{
> +  ST_FIELD_INFO *field_info= schema_table->fields_info;
> +  Name_resolution_context *context= &thd->lex->first_select_lex()->context;
> +  DBUG_ASSERT(thd->lex->sql_command == SQLCOM_SHOW_SLAVE_STAT);
> +  bool all_slaves= thd->lex->mi.show_all_slaves;
> +  ulonglong used_fields= ~0ULL;
> +
> +  if (!all_slaves)
> +  {
> +    /* Remove 2 first fields and all fields above and including field 55 */
> +    used_fields&= ~((1ULL << 0) | (1ULL << 1));
> +    used_fields&= ((1ULL << 56)-1);

Should there be some constant used for the magic numbers 55/56?
And/or an assertion that this value is correct?

That'd make sense, sure.
 

Also, is the intention here that we will from here on never add more fields
to legacy SHOW SLAVE STATUS? If so, there should probably be a comment about
this; also see the test case
mysql-test/suite/rpl/t/rpl_show_slave_status.test which has comments that
might need updating.

Not that I know of, why wouldn't we want to keep them in-sync? 
 

 - Kristian
_______________________________________________
developers mailing list -- developers@lists.mariadb.org
To unsubscribe send an email to developers-leave@lists.mariadb.org