Hi Monty, Here is my review of the MDEV-33856 patch:
commit cc516f4d9913d732065c7abef983df42e5a8bab3 (origin/bb-11.6-MDEV-33856-seconds_behind_master) Author: Monty <monty@mariadb.org> Date: Tue May 14 23:47:59 2024 +0300
MDEV-33856 New definition for Seconds_Behind_Master
If I understand the patch correctly, in introduces a new `Master_slave_time_diff` value as an alternative to the traditional Seconds_behind_master, with a clearer semantics. And does not change how Seconds_behind_master works. I think this is fine, and can help to preserve backwards compatibility. Just maybe mention this in the commit message, and/or change the bug title to "Alternative definition for Seconds_Behind_Master".
diff --git a/mysql-test/suite/perfschema/t/relaylog.test b/mysql-test/suite/perfschema/t/relaylog.test index 12fc96a8b27..16f72efae98 100644 --- a/mysql-test/suite/perfschema/t/relaylog.test +++ b/mysql-test/suite/perfschema/t/relaylog.test @@ -43,7 +43,7 @@ select if (sum_number_of_bytes_read > 0, "MANY", "NONE") as SUM_NUMBER_OF_BYTES_READ, if (sum_number_of_bytes_write > 0, "MANY", "NONE") as SUM_NUMBER_OF_BYTES_WRITE from performance_schema.file_summary_by_instance - where file_name like "%master-%" order by file_name; + where file_name like "%master-%" and file_name not like "%sql/share%" order by file_name;
select * from performance_schema.file_summary_by_instance where file_name like "%slave-%" order by file_name; @@ -96,7 +96,7 @@ sync_slave_with_master; -- echo "============ Performance schema on slave ============"
select * from performance_schema.file_summary_by_instance - where file_name like "%master-%" order by file_name; + where file_name like "%master-%" and file_name not like "%sql/share%" order by file_name;
select substring(file_name, locate("slave-", file_name)) as FILE_NAME,
What is the reason for these changes (adding match on "%sql/share%" ? I don't understand the relation to the subject of the patch.
diff --git a/mysql-test/suite/rpl/t/master_last_event_time.inc b/mysql-test/suite/rpl/t/master_last_event_time.inc new file mode 100644 index 00000000000..0790d557a7b --- /dev/null +++ b/mysql-test/suite/rpl/t/master_last_event_time.inc
+--connection server_3 +--sleep 1 +--source include/start_slave.inc +--source include/sync_with_master_gtid.inc
What is the purpose of --sleep 1 here? Usually it's wrong to have sleeps in the tests. Sometimes it is correct, but then there should be a comment explaining why a sleep is necessary (and sufficient) here. This sleep looks wrong, since there's a proper wait immediately after for the slave to sync up.
diff --git a/mysql-test/suite/rpl/t/master_last_event_time_stmt.test b/mysql-test/suite/rpl/t/master_last_event_time_stmt.test new file mode 100644 index 00000000000..fe0f16a3db3 --- /dev/null +++ b/mysql-test/suite/rpl/t/master_last_event_time_stmt.test @@ -0,0 +1,84 @@ +# +# Statement specific tests for master_last_event_time +# +--source include/have_binlog_format_statement.inc
+# Slave will replicate the above with a row event which will be very fast +# compared to the master event.
I don't understand this comment. Why will the slave replicate using a row event, when the test case sets binlog_format=statement?
diff --git a/sql/log.cc b/sql/log.cc index 1cc0fa77f08..63d9c756911 100644 --- a/sql/log.cc +++ b/sql/log.cc @@ -1898,10 +1898,20 @@ static inline int binlog_commit_flush_xid_caches(THD *thd, binlog_cache_mngr *cache_mngr, bool all, my_xid xid)
Xid_log_event end_evt(thd, xid, TRUE); - return (binlog_flush_cache(thd, cache_mngr, &end_evt, all, TRUE, TRUE)); + if (!thd->slave_thread && ! thd->user_time.val) + { + /* + Ensure that on the master the event time is the time of commit, + not the start of statement time. + */ + my_hrtime_t hrtime= my_hrtime(); + end_evt.when= hrtime_to_my_time(hrtime); + } + DBUG_RETURN(binlog_flush_cache(thd, cache_mngr, &end_evt, all, TRUE, TRUE));
Ok, so correct me if I'm wrong, but I think the purpose here is: You want to have the time when the transaction ended (in case of long-running query). But for auto-committed query, there is no explicit COMMIT query, so the thd->start_time is the start of the transaction. So here you ensure that the XID_EVENT will have the time at start of commit in either case. If this is correct, probably could use a comment explaining this. Some comments: 1. It is better to check for thd->rgi_slave being non-NULL than for thd->slave_thread. The thd->rgi_slave is associated with the replication of some event or transaction. The thd->slave_thread just identifies the thread/connection. (This is mostly style/code clarity, as in current code the two values would track each other.) 2. Why the condition on thd->user_time.val? What is the situation where we are not replicating an event, but user_time is set and we do not want to put the current time in the XID_EVENT? 3. This handles an XID end event, which is for InnoDB transactions. But MyISAM/Aria use a different end event, a COMMIT query, and there are are few other cases as well. Don't you want consistently to have the end of the query as the time in all end events? Then this code should probably instead be in binlog_flush_cache() or perhaps write_transaction_to_binlog()? 4. Even given (3), there are still cases (eg. DDL) where there is no end event, and thus there is no event containing the time for the end of the operation on the master. Won't this cause some inconsistency in how the slave determines and displays its timings with this patch? If not, why not?
for (; has_xid && !entry.need_unlog && ha_info; ha_info= ha_info->next()) { if (ha_info->is_started() && ha_info->ht() != binlog_hton && !ha_info->ht()->commit_checkpoint_request) + { entry.need_unlog= true; + break; + } }
Agree, this code always irritated me ;-) You could also move the condition (!entry.need_unlog && ha_info) out of the loop in an outer if () to not have to check it on every loop iteration (just an observation, entirely up to you, it's not something your patch introduced).
diff --git a/sql/log_event.cc b/sql/log_event.cc index 3802f76efe0..706a4903eb0 100644 --- a/sql/log_event.cc +++ b/sql/log_event.cc
@@ -1104,10 +1116,12 @@ Log_event* Log_event::read_log_event(const uchar *buf, uint event_len, switch(event_type) { case QUERY_EVENT: ev= new Query_log_event(buf, event_len, fdle, QUERY_EVENT); + set_orig_exec_time_in_thd(((Query_log_event*) ev)->exec_time); break; case QUERY_COMPRESSED_EVENT: ev= new Query_compressed_log_event(buf, event_len, fdle, - QUERY_COMPRESSED_EVENT); + QUERY_COMPRESSED_EVENT); + set_orig_exec_time_in_thd(((Query_compressed_log_event*) ev)->exec_time);
@@ -964,6 +964,18 @@ Log_event* Log_event::read_log_event(IO_CACHE* file, DBUG_RETURN(res); }
+/* + Update thd->orig_exec_time +*/ + +inline void set_orig_exec_time_in_thd(my_time_t exec_time) +{ +#if !defined(MYSQL_CLIENT) && defined(HAVE_REPLICATION) + THD *thd= current_thd; + if (likely(thd)) + thd->orig_exec_time= exec_time; +#endif +}
Using current_thd doesn't seem right here, and in fact it's not, as seen by this part of the patch where thd->orig_exec_time has to be set again for parallel replication:
diff --git a/sql/rpl_parallel.cc b/sql/rpl_parallel.cc index 51ca3b183ec..d279761767a 100644 --- a/sql/rpl_parallel.cc +++ b/sql/rpl_parallel.cc @@ -60,7 +60,10 @@ rpt_handle_event(rpl_parallel_thread::queued_event *qev, rgi->future_event_relay_log_pos= qev->future_event_relay_log_pos; strcpy(rgi->future_event_master_log_name, qev->future_event_master_log_name); if (event_can_update_last_master_timestamp(ev)) + { rgi->last_master_timestamp= ev->when + ev->exec_time; + thd->orig_exec_time= ev->exec_time; + }
So let's not set thd->orig_exec_time from Log_event::read_log_event(), it's redundant (for parallel replication), and logically setting it should be done when applying the event, not when reading the event. Instead, set the thd->orig_exec_time in exec_relay_log_event(), so we retain similarity with how replication is done in the parallel and non-parallel case. There is already a similar if (event_can_update_last_master_timestamp(ev)) in exec_relay_log_event(). (Also see comments below about moving this from THD to rpl_group_info to not increase THD just for this very specific case).
+ +/* + Get the time when the event had been executed on the master. + This works for both query events and load data events. +*/ + +#if Q_EXEC_TIME_OFFSET != L_EXEC_TIME_OFFSET +#error "Q_EXEC_TIME_OFFSET is not same as L_EXEC_TIME_OFFSET" +#endif + +time_t query_event_get_time(const uchar *buf, + const Format_description_log_event + *description_event) +{ + time_t when= uint4korr(buf); + buf+= description_event->common_header_len; + return when + uint4korr(buf + Q_EXEC_TIME_OFFSET);
Suggestion: Add an assert here: DBUG_ASSERT(LOG_EVENT_IS_QUERY(buf[EVENT_TYPE_OFFSET]) || LOG_EVENT_IS_LOAD_DATA(buf[EVENT_TYPE_OFFSET])); just to avoid someone mistakenly adding code later to call this for a wrong event type.
diff --git a/sql/sql_class.h b/sql/sql_class.h index 2d671139a6c..05c9c4e8012 100644 --- a/sql/sql_class.h +++ b/sql/sql_class.h @@ -3168,6 +3168,7 @@ class THD: public THD_count, /* this must be first */ ulonglong start_utime, utime_after_lock, utime_after_query; /* This can be used by handlers to send signals to the SQL level */ ulonglong replication_flags; + my_time_t orig_exec_time; // Exec time for last read binlog event
@@ -1348,8 +1346,14 @@ Query_log_event::Query_log_event(THD* thd_arg, const char* query_arg, memset(&host, 0, sizeof(host)); error_code= errcode;
- end_time= my_time(0); - exec_time = (ulong) (end_time - thd_arg->start_time); + /* + For slave threads, remember the original master exec time. + This is needed to be able to calculate the master commit time. + */ + exec_time= ((thd->slave_thread) ? + thd->orig_exec_time : + (my_time(0) - thd_arg->start_time)); +
I think we can here avoid increasing the THD size, which is sad for a special case like this. Instead put the orig_exec_time into THD::rgi_slave, extending the struct rpl_group_info. Then the assignment above will be to rgi->orig_exec_time instead of thd->orig_exec_time, and the check here will be for thd->rgi_slave instead of thd->slave_thread. The rgi is the appropriate place for orig_exec_time, as it is a property of the transaction/event group being replicated. @@ -3171,146 +2856,170 @@ static const LEX_CSTRING msg_ignored= { STRING_WITH_LEN("Ignored") }; #endif
-static bool send_show_master_info_data(THD *thd, Master_info *mi, bool full, - String *gtid_pos) +inline void store_string_or_null(Field **field, const char *str) +{
- mysql_mutex_unlock(&mi->run_lock); +void store_master_info(THD *thd, Master_info *mi, TABLE *table,
- protocol->store_string_or_null(mi->host, &my_charset_bin); - protocol->store_string_or_null(mi->user, &my_charset_bin); - protocol->store((uint32) mi->port); - protocol->store((uint32) mi->connect_retry); - protocol->store(mi->master_log_name, strlen(mi->master_log_name), + store_string_or_null(field++, mi->host); + store_string_or_null(field++, mi->user); + (*field++)->store((uint32) mi->port); + (*field++)->store((uint32) mi->connect_retry); + (*field++)->store(mi->master_log_name, strlen(mi->master_log_name),
So this patch refactors the code for SHOW SLAVE STATUS to use a shared code path with information_schema tables? That sounds like a good change. But for the future, please do such refactoring as a separate patch, you can see my patch series of pre-computing binlog checksums (MDEV-31273) for an example. Doing it together makes it very hard to review the actual code changes and see what is refactoring and what is new logic for master_slave_time_diff.
+ if (unlikely(!rli->slave_timestamp) && Log_event::is_group_event(typ)) + { + /* + First event for this slave. Assume that all the slave was up to date + with the master just before the current event. + */ + rli->slave_timestamp= (time_t) ev->when + (time_t) ev->exec_time-1;
I think the comment is meant to explain why there is the "-1" / minus-one in this expression, but it's not really clear exactly what is meant, so perhaps elaborate it a bit. Does it mean that master_slave_time_diff will start out as "1 second" just after the slave restarts? One way or the other, I do not have a strong opinion. But I want to give a general remark, from my experience with the troubles over the years with Seconds_Behind_Master. The problem with Seconds_Behind_Master was always that it didn't quite do what someone expected in some particular case, and then a new special case was added to try to accomodate. And after a number of iterations of this, we ended up with a semantics for Seconds_Behind_Master that no-one _really_ understands, and which still doesn't quite always do what people expect (which is in any case impossible if different people have different expectations). This "-1" here is clearly on the same road towards adding special cases. So my strong advice would be that, going forward, be _extremely_ conservative about modifying the semantics of master_slave_time_diff. The one good thing about the new status is that it has a (hopefully) clear semantics that can be documented (now including this "-1") and completely understood by users. Let's stick to that no matter who comes and wants just another small tweak, unless such tweaks actually make the semantics and documentation _simpler_.
+ /* + replay_log.description_event_for_exec can be null if the slave thread + is getting killed + */ + if (LOG_EVENT_IS_QUERY((Log_event_type) buf[EVENT_TYPE_OFFSET]) || + LOG_EVENT_IS_LOAD_DATA((Log_event_type) buf[EVENT_TYPE_OFFSET])) + { + time_t exec_time= query_event_get_time(buf, rli->relay_log. + description_event_for_queue); + set_if_bigger(rli->newest_master_timestamp, exec_time); + }
This confused me, because in Query_log_event, the exec_time is the duration of the query (a time interval), not an absolute time. So please use the name "end_time" for the variable instead of "exec_time", and consider renaming query_event_get_time() to query_event_get_end_time().
+static int make_slave_status_old_format(THD *thd, ST_SCHEMA_TABLE *schema_table) +{ + ST_FIELD_INFO *field_info= schema_table->fields_info; + Name_resolution_context *context= &thd->lex->first_select_lex()->context; + DBUG_ASSERT(thd->lex->sql_command == SQLCOM_SHOW_SLAVE_STAT); + bool all_slaves= thd->lex->mi.show_all_slaves; + ulonglong used_fields= ~0ULL; + + if (!all_slaves) + { + /* Remove 2 first fields and all fields above and including field 55 */ + used_fields&= ~((1ULL << 0) | (1ULL << 1)); + used_fields&= ((1ULL << 56)-1);
Should there be some constant used for the magic numbers 55/56? And/or an assertion that this value is correct? Also, is the intention here that we will from here on never add more fields to legacy SHOW SLAVE STATUS? If so, there should probably be a comment about this; also see the test case mysql-test/suite/rpl/t/rpl_show_slave_status.test which has comments that might need updating. - Kristian