Re: [MariaDB commits] [PATCH] MDEV-33856 New definition for Seconds_Behind_Master
Hi Monty, Here is my review of the MDEV-33856 patch:
commit cc516f4d9913d732065c7abef983df42e5a8bab3 (origin/bb-11.6-MDEV-33856-seconds_behind_master) Author: Monty <monty@mariadb.org> Date: Tue May 14 23:47:59 2024 +0300
MDEV-33856 New definition for Seconds_Behind_Master
If I understand the patch correctly, in introduces a new `Master_slave_time_diff` value as an alternative to the traditional Seconds_behind_master, with a clearer semantics. And does not change how Seconds_behind_master works. I think this is fine, and can help to preserve backwards compatibility. Just maybe mention this in the commit message, and/or change the bug title to "Alternative definition for Seconds_Behind_Master".
diff --git a/mysql-test/suite/perfschema/t/relaylog.test b/mysql-test/suite/perfschema/t/relaylog.test index 12fc96a8b27..16f72efae98 100644 --- a/mysql-test/suite/perfschema/t/relaylog.test +++ b/mysql-test/suite/perfschema/t/relaylog.test @@ -43,7 +43,7 @@ select if (sum_number_of_bytes_read > 0, "MANY", "NONE") as SUM_NUMBER_OF_BYTES_READ, if (sum_number_of_bytes_write > 0, "MANY", "NONE") as SUM_NUMBER_OF_BYTES_WRITE from performance_schema.file_summary_by_instance - where file_name like "%master-%" order by file_name; + where file_name like "%master-%" and file_name not like "%sql/share%" order by file_name;
select * from performance_schema.file_summary_by_instance where file_name like "%slave-%" order by file_name; @@ -96,7 +96,7 @@ sync_slave_with_master; -- echo "============ Performance schema on slave ============"
select * from performance_schema.file_summary_by_instance - where file_name like "%master-%" order by file_name; + where file_name like "%master-%" and file_name not like "%sql/share%" order by file_name;
select substring(file_name, locate("slave-", file_name)) as FILE_NAME,
What is the reason for these changes (adding match on "%sql/share%" ? I don't understand the relation to the subject of the patch.
diff --git a/mysql-test/suite/rpl/t/master_last_event_time.inc b/mysql-test/suite/rpl/t/master_last_event_time.inc new file mode 100644 index 00000000000..0790d557a7b --- /dev/null +++ b/mysql-test/suite/rpl/t/master_last_event_time.inc
+--connection server_3 +--sleep 1 +--source include/start_slave.inc +--source include/sync_with_master_gtid.inc
What is the purpose of --sleep 1 here? Usually it's wrong to have sleeps in the tests. Sometimes it is correct, but then there should be a comment explaining why a sleep is necessary (and sufficient) here. This sleep looks wrong, since there's a proper wait immediately after for the slave to sync up.
diff --git a/mysql-test/suite/rpl/t/master_last_event_time_stmt.test b/mysql-test/suite/rpl/t/master_last_event_time_stmt.test new file mode 100644 index 00000000000..fe0f16a3db3 --- /dev/null +++ b/mysql-test/suite/rpl/t/master_last_event_time_stmt.test @@ -0,0 +1,84 @@ +# +# Statement specific tests for master_last_event_time +# +--source include/have_binlog_format_statement.inc
+# Slave will replicate the above with a row event which will be very fast +# compared to the master event.
I don't understand this comment. Why will the slave replicate using a row event, when the test case sets binlog_format=statement?
diff --git a/sql/log.cc b/sql/log.cc index 1cc0fa77f08..63d9c756911 100644 --- a/sql/log.cc +++ b/sql/log.cc @@ -1898,10 +1898,20 @@ static inline int binlog_commit_flush_xid_caches(THD *thd, binlog_cache_mngr *cache_mngr, bool all, my_xid xid)
Xid_log_event end_evt(thd, xid, TRUE); - return (binlog_flush_cache(thd, cache_mngr, &end_evt, all, TRUE, TRUE)); + if (!thd->slave_thread && ! thd->user_time.val) + { + /* + Ensure that on the master the event time is the time of commit, + not the start of statement time. + */ + my_hrtime_t hrtime= my_hrtime(); + end_evt.when= hrtime_to_my_time(hrtime); + } + DBUG_RETURN(binlog_flush_cache(thd, cache_mngr, &end_evt, all, TRUE, TRUE));
Ok, so correct me if I'm wrong, but I think the purpose here is: You want to have the time when the transaction ended (in case of long-running query). But for auto-committed query, there is no explicit COMMIT query, so the thd->start_time is the start of the transaction. So here you ensure that the XID_EVENT will have the time at start of commit in either case. If this is correct, probably could use a comment explaining this. Some comments: 1. It is better to check for thd->rgi_slave being non-NULL than for thd->slave_thread. The thd->rgi_slave is associated with the replication of some event or transaction. The thd->slave_thread just identifies the thread/connection. (This is mostly style/code clarity, as in current code the two values would track each other.) 2. Why the condition on thd->user_time.val? What is the situation where we are not replicating an event, but user_time is set and we do not want to put the current time in the XID_EVENT? 3. This handles an XID end event, which is for InnoDB transactions. But MyISAM/Aria use a different end event, a COMMIT query, and there are are few other cases as well. Don't you want consistently to have the end of the query as the time in all end events? Then this code should probably instead be in binlog_flush_cache() or perhaps write_transaction_to_binlog()? 4. Even given (3), there are still cases (eg. DDL) where there is no end event, and thus there is no event containing the time for the end of the operation on the master. Won't this cause some inconsistency in how the slave determines and displays its timings with this patch? If not, why not?
for (; has_xid && !entry.need_unlog && ha_info; ha_info= ha_info->next()) { if (ha_info->is_started() && ha_info->ht() != binlog_hton && !ha_info->ht()->commit_checkpoint_request) + { entry.need_unlog= true; + break; + } }
Agree, this code always irritated me ;-) You could also move the condition (!entry.need_unlog && ha_info) out of the loop in an outer if () to not have to check it on every loop iteration (just an observation, entirely up to you, it's not something your patch introduced).
diff --git a/sql/log_event.cc b/sql/log_event.cc index 3802f76efe0..706a4903eb0 100644 --- a/sql/log_event.cc +++ b/sql/log_event.cc
@@ -1104,10 +1116,12 @@ Log_event* Log_event::read_log_event(const uchar *buf, uint event_len, switch(event_type) { case QUERY_EVENT: ev= new Query_log_event(buf, event_len, fdle, QUERY_EVENT); + set_orig_exec_time_in_thd(((Query_log_event*) ev)->exec_time); break; case QUERY_COMPRESSED_EVENT: ev= new Query_compressed_log_event(buf, event_len, fdle, - QUERY_COMPRESSED_EVENT); + QUERY_COMPRESSED_EVENT); + set_orig_exec_time_in_thd(((Query_compressed_log_event*) ev)->exec_time);
@@ -964,6 +964,18 @@ Log_event* Log_event::read_log_event(IO_CACHE* file, DBUG_RETURN(res); }
+/* + Update thd->orig_exec_time +*/ + +inline void set_orig_exec_time_in_thd(my_time_t exec_time) +{ +#if !defined(MYSQL_CLIENT) && defined(HAVE_REPLICATION) + THD *thd= current_thd; + if (likely(thd)) + thd->orig_exec_time= exec_time; +#endif +}
Using current_thd doesn't seem right here, and in fact it's not, as seen by this part of the patch where thd->orig_exec_time has to be set again for parallel replication:
diff --git a/sql/rpl_parallel.cc b/sql/rpl_parallel.cc index 51ca3b183ec..d279761767a 100644 --- a/sql/rpl_parallel.cc +++ b/sql/rpl_parallel.cc @@ -60,7 +60,10 @@ rpt_handle_event(rpl_parallel_thread::queued_event *qev, rgi->future_event_relay_log_pos= qev->future_event_relay_log_pos; strcpy(rgi->future_event_master_log_name, qev->future_event_master_log_name); if (event_can_update_last_master_timestamp(ev)) + { rgi->last_master_timestamp= ev->when + ev->exec_time; + thd->orig_exec_time= ev->exec_time; + }
So let's not set thd->orig_exec_time from Log_event::read_log_event(), it's redundant (for parallel replication), and logically setting it should be done when applying the event, not when reading the event. Instead, set the thd->orig_exec_time in exec_relay_log_event(), so we retain similarity with how replication is done in the parallel and non-parallel case. There is already a similar if (event_can_update_last_master_timestamp(ev)) in exec_relay_log_event(). (Also see comments below about moving this from THD to rpl_group_info to not increase THD just for this very specific case).
+ +/* + Get the time when the event had been executed on the master. + This works for both query events and load data events. +*/ + +#if Q_EXEC_TIME_OFFSET != L_EXEC_TIME_OFFSET +#error "Q_EXEC_TIME_OFFSET is not same as L_EXEC_TIME_OFFSET" +#endif + +time_t query_event_get_time(const uchar *buf, + const Format_description_log_event + *description_event) +{ + time_t when= uint4korr(buf); + buf+= description_event->common_header_len; + return when + uint4korr(buf + Q_EXEC_TIME_OFFSET);
Suggestion: Add an assert here: DBUG_ASSERT(LOG_EVENT_IS_QUERY(buf[EVENT_TYPE_OFFSET]) || LOG_EVENT_IS_LOAD_DATA(buf[EVENT_TYPE_OFFSET])); just to avoid someone mistakenly adding code later to call this for a wrong event type.
diff --git a/sql/sql_class.h b/sql/sql_class.h index 2d671139a6c..05c9c4e8012 100644 --- a/sql/sql_class.h +++ b/sql/sql_class.h @@ -3168,6 +3168,7 @@ class THD: public THD_count, /* this must be first */ ulonglong start_utime, utime_after_lock, utime_after_query; /* This can be used by handlers to send signals to the SQL level */ ulonglong replication_flags; + my_time_t orig_exec_time; // Exec time for last read binlog event
@@ -1348,8 +1346,14 @@ Query_log_event::Query_log_event(THD* thd_arg, const char* query_arg, memset(&host, 0, sizeof(host)); error_code= errcode;
- end_time= my_time(0); - exec_time = (ulong) (end_time - thd_arg->start_time); + /* + For slave threads, remember the original master exec time. + This is needed to be able to calculate the master commit time. + */ + exec_time= ((thd->slave_thread) ? + thd->orig_exec_time : + (my_time(0) - thd_arg->start_time)); +
I think we can here avoid increasing the THD size, which is sad for a special case like this. Instead put the orig_exec_time into THD::rgi_slave, extending the struct rpl_group_info. Then the assignment above will be to rgi->orig_exec_time instead of thd->orig_exec_time, and the check here will be for thd->rgi_slave instead of thd->slave_thread. The rgi is the appropriate place for orig_exec_time, as it is a property of the transaction/event group being replicated. @@ -3171,146 +2856,170 @@ static const LEX_CSTRING msg_ignored= { STRING_WITH_LEN("Ignored") }; #endif
-static bool send_show_master_info_data(THD *thd, Master_info *mi, bool full, - String *gtid_pos) +inline void store_string_or_null(Field **field, const char *str) +{
- mysql_mutex_unlock(&mi->run_lock); +void store_master_info(THD *thd, Master_info *mi, TABLE *table,
- protocol->store_string_or_null(mi->host, &my_charset_bin); - protocol->store_string_or_null(mi->user, &my_charset_bin); - protocol->store((uint32) mi->port); - protocol->store((uint32) mi->connect_retry); - protocol->store(mi->master_log_name, strlen(mi->master_log_name), + store_string_or_null(field++, mi->host); + store_string_or_null(field++, mi->user); + (*field++)->store((uint32) mi->port); + (*field++)->store((uint32) mi->connect_retry); + (*field++)->store(mi->master_log_name, strlen(mi->master_log_name),
So this patch refactors the code for SHOW SLAVE STATUS to use a shared code path with information_schema tables? That sounds like a good change. But for the future, please do such refactoring as a separate patch, you can see my patch series of pre-computing binlog checksums (MDEV-31273) for an example. Doing it together makes it very hard to review the actual code changes and see what is refactoring and what is new logic for master_slave_time_diff.
+ if (unlikely(!rli->slave_timestamp) && Log_event::is_group_event(typ)) + { + /* + First event for this slave. Assume that all the slave was up to date + with the master just before the current event. + */ + rli->slave_timestamp= (time_t) ev->when + (time_t) ev->exec_time-1;
I think the comment is meant to explain why there is the "-1" / minus-one in this expression, but it's not really clear exactly what is meant, so perhaps elaborate it a bit. Does it mean that master_slave_time_diff will start out as "1 second" just after the slave restarts? One way or the other, I do not have a strong opinion. But I want to give a general remark, from my experience with the troubles over the years with Seconds_Behind_Master. The problem with Seconds_Behind_Master was always that it didn't quite do what someone expected in some particular case, and then a new special case was added to try to accomodate. And after a number of iterations of this, we ended up with a semantics for Seconds_Behind_Master that no-one _really_ understands, and which still doesn't quite always do what people expect (which is in any case impossible if different people have different expectations). This "-1" here is clearly on the same road towards adding special cases. So my strong advice would be that, going forward, be _extremely_ conservative about modifying the semantics of master_slave_time_diff. The one good thing about the new status is that it has a (hopefully) clear semantics that can be documented (now including this "-1") and completely understood by users. Let's stick to that no matter who comes and wants just another small tweak, unless such tweaks actually make the semantics and documentation _simpler_.
+ /* + replay_log.description_event_for_exec can be null if the slave thread + is getting killed + */ + if (LOG_EVENT_IS_QUERY((Log_event_type) buf[EVENT_TYPE_OFFSET]) || + LOG_EVENT_IS_LOAD_DATA((Log_event_type) buf[EVENT_TYPE_OFFSET])) + { + time_t exec_time= query_event_get_time(buf, rli->relay_log. + description_event_for_queue); + set_if_bigger(rli->newest_master_timestamp, exec_time); + }
This confused me, because in Query_log_event, the exec_time is the duration of the query (a time interval), not an absolute time. So please use the name "end_time" for the variable instead of "exec_time", and consider renaming query_event_get_time() to query_event_get_end_time().
+static int make_slave_status_old_format(THD *thd, ST_SCHEMA_TABLE *schema_table) +{ + ST_FIELD_INFO *field_info= schema_table->fields_info; + Name_resolution_context *context= &thd->lex->first_select_lex()->context; + DBUG_ASSERT(thd->lex->sql_command == SQLCOM_SHOW_SLAVE_STAT); + bool all_slaves= thd->lex->mi.show_all_slaves; + ulonglong used_fields= ~0ULL; + + if (!all_slaves) + { + /* Remove 2 first fields and all fields above and including field 55 */ + used_fields&= ~((1ULL << 0) | (1ULL << 1)); + used_fields&= ((1ULL << 56)-1);
Should there be some constant used for the magic numbers 55/56? And/or an assertion that this value is correct? Also, is the intention here that we will from here on never add more fields to legacy SHOW SLAVE STATUS? If so, there should probably be a comment about this; also see the test case mysql-test/suite/rpl/t/rpl_show_slave_status.test which has comments that might need updating. - Kristian
Hi Kristian, Thanks for the review! I'll be taking over for Monty to address/incorporate your comments, see my responses below. On Mon, Jul 22, 2024 at 8:02 AM Kristian Nielsen via developers < developers@lists.mariadb.org> wrote:
Hi Monty,
Here is my review of the MDEV-33856 patch:
commit cc516f4d9913d732065c7abef983df42e5a8bab3 (origin/bb-11.6-MDEV-33856-seconds_behind_master) Author: Monty <monty@mariadb.org> Date: Tue May 14 23:47:59 2024 +0300
MDEV-33856 New definition for Seconds_Behind_Master
If I understand the patch correctly, in introduces a new `Master_slave_time_diff` value as an alternative to the traditional Seconds_behind_master, with a clearer semantics. And does not change how Seconds_behind_master works.
I think this is fine, and can help to preserve backwards compatibility. Just maybe mention this in the commit message, and/or change the bug title to "Alternative definition for Seconds_Behind_Master".
Sure, will do.
diff --git a/mysql-test/suite/perfschema/t/relaylog.test b/mysql-test/suite/perfschema/t/relaylog.test index 12fc96a8b27..16f72efae98 100644 --- a/mysql-test/suite/perfschema/t/relaylog.test +++ b/mysql-test/suite/perfschema/t/relaylog.test @@ -43,7 +43,7 @@ select if (sum_number_of_bytes_read > 0, "MANY", "NONE") as SUM_NUMBER_OF_BYTES_READ, if (sum_number_of_bytes_write > 0, "MANY", "NONE") as SUM_NUMBER_OF_BYTES_WRITE from performance_schema.file_summary_by_instance - where file_name like "%master-%" order by file_name; + where file_name like "%master-%" and file_name not like "%sql/share%" order by file_name;
select * from performance_schema.file_summary_by_instance where file_name like "%slave-%" order by file_name; @@ -96,7 +96,7 @@ sync_slave_with_master; -- echo "============ Performance schema on slave ============"
select * from performance_schema.file_summary_by_instance - where file_name like "%master-%" order by file_name; + where file_name like "%master-%" and file_name not like "%sql/share%" order by file_name;
select substring(file_name, locate("slave-", file_name)) as FILE_NAME,
What is the reason for these changes (adding match on "%sql/share%" ? I don't understand the relation to the subject of the patch.
I'm not immediately sure about this one. I'll investigate, and remove if it is proven unnecessary.
diff --git a/mysql-test/suite/rpl/t/master_last_event_time.inc b/mysql-test/suite/rpl/t/master_last_event_time.inc new file mode 100644 index 00000000000..0790d557a7b --- /dev/null +++ b/mysql-test/suite/rpl/t/master_last_event_time.inc
+--connection server_3 +--sleep 1 +--source include/start_slave.inc +--source include/sync_with_master_gtid.inc
What is the purpose of --sleep 1 here?
Usually it's wrong to have sleeps in the tests. Sometimes it is correct, but then there should be a comment explaining why a sleep is necessary (and sufficient) here. This sleep looks wrong, since there's a proper wait immediately after for the slave to sync up.
Indeed this is probably left-over from something prior, I'll remove it.
diff --git a/mysql-test/suite/rpl/t/master_last_event_time_stmt.test b/mysql-test/suite/rpl/t/master_last_event_time_stmt.test new file mode 100644 index 00000000000..fe0f16a3db3 --- /dev/null +++ b/mysql-test/suite/rpl/t/master_last_event_time_stmt.test @@ -0,0 +1,84 @@ +# +# Statement specific tests for master_last_event_time +# +--source include/have_binlog_format_statement.inc
+# Slave will replicate the above with a row event which will be very fast +# compared to the master event.
I don't understand this comment. Why will the slave replicate using a row event, when the test case sets binlog_format=statement?
This whole check should go in the *_row.test version. I'll move it.
diff --git a/sql/log.cc b/sql/log.cc index 1cc0fa77f08..63d9c756911 100644 --- a/sql/log.cc +++ b/sql/log.cc @@ -1898,10 +1898,20 @@ static inline int binlog_commit_flush_xid_caches(THD *thd, binlog_cache_mngr *cache_mngr, bool all, my_xid xid)
Xid_log_event end_evt(thd, xid, TRUE); - return (binlog_flush_cache(thd, cache_mngr, &end_evt, all, TRUE, TRUE)); + if (!thd->slave_thread && ! thd->user_time.val) + { + /* + Ensure that on the master the event time is the time of commit, + not the start of statement time. + */ + my_hrtime_t hrtime= my_hrtime(); + end_evt.when= hrtime_to_my_time(hrtime); + } + DBUG_RETURN(binlog_flush_cache(thd, cache_mngr, &end_evt, all, TRUE, TRUE));
Ok, so correct me if I'm wrong, but I think the purpose here is: You want to have the time when the transaction ended (in case of long-running query). But for auto-committed query, there is no explicit COMMIT query, so the thd->start_time is the start of the transaction. So here you ensure that the XID_EVENT will have the time at start of commit in either case. If this is correct, probably could use a comment explaining this.
Right, will do.
Some comments:
1. It is better to check for thd->rgi_slave being non-NULL than for thd->slave_thread. The thd->rgi_slave is associated with the replication of some event or transaction. The thd->slave_thread just identifies the thread/connection. (This is mostly style/code clarity, as in current code the two values would track each other.)
Sure.
2. Why the condition on thd->user_time.val? What is the situation where we are not replicating an event, but user_time is set and we do not want to put the current time in the XID_EVENT?
When SET TIMESTAMP is set, the XID_EVENT should use the user value, rather than the current time (as all other events of the transaction will use the SET TIMESTAMP value).
3. This handles an XID end event, which is for InnoDB transactions. But MyISAM/Aria use a different end event, a COMMIT query, and there are are few other cases as well. Don't you want consistently to have the end of the query as the time in all end events? Then this code should probably instead be in binlog_flush_cache() or perhaps write_transaction_to_binlog()?
Hmm, but I think that would break your point of (4) where there are Query_log_events without end events. So if we tag their `when` to be at binlog time, then when we calculate the end of the Query_log_event on the slave, it would be when + exec_time, which could put us significantly past the real end time of the event. So I think the XID and Query_log events should be treated differently.
4. Even given (3), there are still cases (eg. DDL) where there is no end event, and thus there is no event containing the time for the end of the operation on the master. Won't this cause some inconsistency in how the slave determines and displays its timings with this patch? If not, why not?
And to elaborate on the response of (3) for consistency, the Query_log_event's `when` uses thd->start_time, even for the end event. So the end Query_log_event for say a MyISAM/Aria transaction will be consistent with the single Query_log_event of an event without an explicit ending transaction.
for (; has_xid && !entry.need_unlog && ha_info; ha_info= ha_info->next()) { if (ha_info->is_started() && ha_info->ht() != binlog_hton && !ha_info->ht()->commit_checkpoint_request) + { entry.need_unlog= true; + break; + } }
Agree, this code always irritated me ;-)
You could also move the condition (!entry.need_unlog && ha_info) out of the loop in an outer if () to not have to check it on every loop iteration (just an observation, entirely up to you, it's not something your patch introduced).
diff --git a/sql/log_event.cc b/sql/log_event.cc index 3802f76efe0..706a4903eb0 100644 --- a/sql/log_event.cc +++ b/sql/log_event.cc
@@ -1104,10 +1116,12 @@ Log_event* Log_event::read_log_event(const uchar *buf, uint event_len, switch(event_type) { case QUERY_EVENT: ev= new Query_log_event(buf, event_len, fdle, QUERY_EVENT); + set_orig_exec_time_in_thd(((Query_log_event*) ev)->exec_time); break; case QUERY_COMPRESSED_EVENT: ev= new Query_compressed_log_event(buf, event_len, fdle, - QUERY_COMPRESSED_EVENT); + QUERY_COMPRESSED_EVENT); + set_orig_exec_time_in_thd(((Query_compressed_log_event*) ev)->exec_time);
@@ -964,6 +964,18 @@ Log_event* Log_event::read_log_event(IO_CACHE* file, DBUG_RETURN(res); }
+/* + Update thd->orig_exec_time +*/ + +inline void set_orig_exec_time_in_thd(my_time_t exec_time) +{ +#if !defined(MYSQL_CLIENT) && defined(HAVE_REPLICATION) + THD *thd= current_thd; + if (likely(thd)) + thd->orig_exec_time= exec_time; +#endif +}
Using current_thd doesn't seem right here, and in fact it's not, as seen by this part of the patch where thd->orig_exec_time has to be set again for parallel replication:
diff --git a/sql/rpl_parallel.cc b/sql/rpl_parallel.cc index 51ca3b183ec..d279761767a 100644 --- a/sql/rpl_parallel.cc +++ b/sql/rpl_parallel.cc @@ -60,7 +60,10 @@ rpt_handle_event(rpl_parallel_thread::queued_event *qev, rgi->future_event_relay_log_pos= qev->future_event_relay_log_pos; strcpy(rgi->future_event_master_log_name, qev->future_event_master_log_name); if (event_can_update_last_master_timestamp(ev)) + { rgi->last_master_timestamp= ev->when + ev->exec_time; + thd->orig_exec_time= ev->exec_time; + }
So let's not set thd->orig_exec_time from Log_event::read_log_event(), it's redundant (for parallel replication), and logically setting it should be done when applying the event, not when reading the event.
Instead, set the thd->orig_exec_time in exec_relay_log_event(), so we retain similarity with how replication is done in the parallel and non-parallel case. There is already a similar if (event_can_update_last_master_timestamp(ev)) in exec_relay_log_event().
(Also see comments below about moving this from THD to rpl_group_info to not increase THD just for this very specific case).
Sure, makes sense (to also use RGI).
+ +/* + Get the time when the event had been executed on the master. + This works for both query events and load data events. +*/ + +#if Q_EXEC_TIME_OFFSET != L_EXEC_TIME_OFFSET +#error "Q_EXEC_TIME_OFFSET is not same as L_EXEC_TIME_OFFSET" +#endif + +time_t query_event_get_time(const uchar *buf, + const Format_description_log_event + *description_event) +{ + time_t when= uint4korr(buf); + buf+= description_event->common_header_len; + return when + uint4korr(buf + Q_EXEC_TIME_OFFSET);
Suggestion: Add an assert here:
DBUG_ASSERT(LOG_EVENT_IS_QUERY(buf[EVENT_TYPE_OFFSET]) || LOG_EVENT_IS_LOAD_DATA(buf[EVENT_TYPE_OFFSET]));
just to avoid someone mistakenly adding code later to call this for a wrong event type.
Sure, good idea.
diff --git a/sql/sql_class.h b/sql/sql_class.h index 2d671139a6c..05c9c4e8012 100644 --- a/sql/sql_class.h +++ b/sql/sql_class.h @@ -3168,6 +3168,7 @@ class THD: public THD_count, /* this must be first */ ulonglong start_utime, utime_after_lock, utime_after_query; /* This can be used by handlers to send signals to the SQL level */ ulonglong replication_flags; + my_time_t orig_exec_time; // Exec time for last read binlog event
@@ -1348,8 +1346,14 @@ Query_log_event::Query_log_event(THD* thd_arg, const char* query_arg, memset(&host, 0, sizeof(host)); error_code= errcode;
- end_time= my_time(0); - exec_time = (ulong) (end_time - thd_arg->start_time); + /* + For slave threads, remember the original master exec time. + This is needed to be able to calculate the master commit time. + */ + exec_time= ((thd->slave_thread) ? + thd->orig_exec_time : + (my_time(0) - thd_arg->start_time)); +
I think we can here avoid increasing the THD size, which is sad for a special case like this. Instead put the orig_exec_time into THD::rgi_slave, extending the struct rpl_group_info.
Then the assignment above will be to rgi->orig_exec_time instead of thd->orig_exec_time, and the check here will be for thd->rgi_slave instead of thd->slave_thread. The rgi is the appropriate place for orig_exec_time, as it is a property of the transaction/event group being replicated.
@@ -3171,146 +2856,170 @@ static const LEX_CSTRING msg_ignored= { STRING_WITH_LEN("Ignored") }; #endif
-static bool send_show_master_info_data(THD *thd, Master_info *mi, bool full, - String *gtid_pos) +inline void store_string_or_null(Field **field, const char *str) +{
- mysql_mutex_unlock(&mi->run_lock); +void store_master_info(THD *thd, Master_info *mi, TABLE *table,
- protocol->store_string_or_null(mi->host, &my_charset_bin); - protocol->store_string_or_null(mi->user, &my_charset_bin); - protocol->store((uint32) mi->port); - protocol->store((uint32) mi->connect_retry); - protocol->store(mi->master_log_name, strlen(mi->master_log_name), + store_string_or_null(field++, mi->host); + store_string_or_null(field++, mi->user); + (*field++)->store((uint32) mi->port); + (*field++)->store((uint32) mi->connect_retry); + (*field++)->store(mi->master_log_name, strlen(mi->master_log_name),
So this patch refactors the code for SHOW SLAVE STATUS to use a shared code path with information_schema tables? That sounds like a good change. But for the future, please do such refactoring as a separate patch, you can see my patch series of pre-computing binlog checksums (MDEV-31273) for an example. Doing it together makes it very hard to review the actual code changes and see what is refactoring and what is new logic for master_slave_time_diff.
+ if (unlikely(!rli->slave_timestamp) && Log_event::is_group_event(typ)) + { + /* + First event for this slave. Assume that all the slave was up to date + with the master just before the current event. + */ + rli->slave_timestamp= (time_t) ev->when + (time_t) ev->exec_time-1;
I think the comment is meant to explain why there is the "-1" / minus-one in this expression, but it's not really clear exactly what is meant, so perhaps elaborate it a bit.
Does it mean that master_slave_time_diff will start out as "1 second" just after the slave restarts?
One way or the other, I do not have a strong opinion. But I want to give a general remark, from my experience with the troubles over the years with Seconds_Behind_Master. The problem with Seconds_Behind_Master was always that it didn't quite do what someone expected in some particular case, and then a new special case was added to try to accomodate. And after a number of iterations of this, we ended up with a semantics for Seconds_Behind_Master that no-one _really_ understands, and which still doesn't quite always do what people expect (which is in any case impossible if different people have different expectations).
This "-1" here is clearly on the same road towards adding special cases. So my strong advice would be that, going forward, be _extremely_ conservative about modifying the semantics of master_slave_time_diff. The one good thing about the new status is that it has a (hopefully) clear semantics that can be documented (now including this "-1") and completely understood by users. Let's stick to that no matter who comes and wants just another small tweak, unless such tweaks actually make the semantics and documentation _simpler_.
+ /* + replay_log.description_event_for_exec can be null if the slave thread + is getting killed + */ + if (LOG_EVENT_IS_QUERY((Log_event_type) buf[EVENT_TYPE_OFFSET]) || + LOG_EVENT_IS_LOAD_DATA((Log_event_type) buf[EVENT_TYPE_OFFSET])) + { + time_t exec_time= query_event_get_time(buf, rli->relay_log. + description_event_for_queue); + set_if_bigger(rli->newest_master_timestamp, exec_time); + }
This confused me, because in Query_log_event, the exec_time is the duration of the query (a time interval), not an absolute time.
So please use the name "end_time" for the variable instead of "exec_time", and consider renaming query_event_get_time() to query_event_get_end_time().
Right, makes sense.
+static int make_slave_status_old_format(THD *thd, ST_SCHEMA_TABLE
*schema_table)
+{ + ST_FIELD_INFO *field_info= schema_table->fields_info; + Name_resolution_context *context= &thd->lex->first_select_lex()->context; + DBUG_ASSERT(thd->lex->sql_command == SQLCOM_SHOW_SLAVE_STAT); + bool all_slaves= thd->lex->mi.show_all_slaves; + ulonglong used_fields= ~0ULL; + + if (!all_slaves) + { + /* Remove 2 first fields and all fields above and including field 55 */ + used_fields&= ~((1ULL << 0) | (1ULL << 1)); + used_fields&= ((1ULL << 56)-1);
Should there be some constant used for the magic numbers 55/56? And/or an assertion that this value is correct?
That'd make sense, sure.
Also, is the intention here that we will from here on never add more fields to legacy SHOW SLAVE STATUS? If so, there should probably be a comment about this; also see the test case mysql-test/suite/rpl/t/rpl_show_slave_status.test which has comments that might need updating.
Not that I know of, why wouldn't we want to keep them in-sync?
- Kristian _______________________________________________ developers mailing list -- developers@lists.mariadb.org To unsubscribe send an email to developers-leave@lists.mariadb.org
Brandon Nesterenko via developers <developers@lists.mariadb.org> writes:
2. Why the condition on thd->user_time.val? What is the situation where we are not replicating an event, but user_time is set and we do not want to put the current time in the XID_EVENT?
When SET TIMESTAMP is set, the XID_EVENT should use the user value, rather than the current time (as all other events of the transaction will use the SET TIMESTAMP value).
Hm, ok. I don't 100% recall what SET TIMESTAMP is used for, but I guess that makes sense.
3. This handles an XID end event, which is for InnoDB transactions. But MyISAM/Aria use a different end event, a COMMIT query, and there are are few other cases as well. Don't you want consistently to have the end of the query as the time in all end events? Then this code should probably instead be in binlog_flush_cache() or perhaps write_transaction_to_binlog()?
Hmm, but I think that would break your point of (4) where there are Query_log_events without end events. So if we tag their `when` to be at binlog time, then when we calculate the end of the Query_log_event on the slave, it would be when + exec_time, which could put us significantly past the real end time of the event. So I think the XID and Query_log events should be treated differently.
Aha, so what I missed is that the end_event will have exec_time set to the duration of the prior query in the event group. So it looks like this, where the last event is a COMMIT end_event with exec_time=2: #240725 11:13:19 server id 1 end_log_pos 927 CRC32 0x2ba8f136 GTID 0-1-4 #240725 11:13:19 server id 1 end_log_pos 0 CRC32 0x327ace12 Query thread_id=8 exec_time=2 error_code=0 xid=0 #240725 11:13:19 server id 1 end_log_pos 1102 CRC32 0x4e855e67 Query thread_id=8 exec_time=2 error_code=0 xid=0 Whereas for an autocommit with XID_log_event as end_event, there is no exec_time: #240725 11:13:21 server id 1 end_log_pos 1144 CRC32 0xc8677301 GTID 0-1-5 trans #240725 11:13:21 server id 1 end_log_pos 0 CRC32 0x54ccf6ee Query thread_id=8 exec_time=2 error_code=0 xid=0 #240725 11:13:21 server id 1 end_log_pos 1277 CRC32 0x51a85907 Xid = 136 So this is the case being fixed here, set the Xid=136 event to have timestamp equal to the (when + exec_time) of the prior query. So I guess an alternative would be to have the "COMMIT" end_event have exec_time=0 and timestamp set to the time of commit, just like the Xid_log_event? Anyway, then this sounds ok. For a multi-statement transaction, the GTID event has the time of the end of the transaction, not the start: #240725 11:13:29 server id 1 end_log_pos 1970 CRC32 0x6dbb2f78 GTID 0-1-9 trans #240725 11:13:26 server id 1 end_log_pos 0 CRC32 0xdd94a6da Query thread_id=8 exec_time=1 error_code=0 xid=0 #240725 11:13:27 server id 1 end_log_pos 0 CRC32 0xc3c9da02 Query thread_id=8 exec_time=1 error_code=0 xid=0 #240725 11:13:28 server id 1 end_log_pos 0 CRC32 0x778bdcb3 Query thread_id=8 exec_time=1 error_code=0 xid=0 #240725 11:13:29 server id 1 end_log_pos 2307 CRC32 0x5750cba9 Xid = 147 This is inconsistent with an autocommit transaction where the the GTID has timestamp of transaction start. Does this need to be fixed as well? Or doesn't it matter, because the master_last_event_time will just be set ahead slightly earlier by the IO thread, and the SQL threads only use the end_event to update the slave_last_event_time? Is that correctly understood?
4. Even given (3), there are still cases (eg. DDL) where there is no end event, and thus there is no event containing the time for the end of the operation on the master. Won't this cause some inconsistency in how the slave determines and displays its timings with this patch? If not, why not?
And to elaborate on the response of (3) for consistency, the Query_log_event's `when` uses thd->start_time, even for the end event. So the end Query_log_event for say a MyISAM/Aria transaction will be consistent with the single Query_log_event of an event without an explicit ending transaction.
Right, but this is because the time used for slave_last_event_time is set to the (when + exec_time) which is effectively the end time of the event. And the "COMMIT" query end event happens to get the when/exec_time values from the prior query execution, so it ends up consistent in the end. Right? The thing that confused me here was missing that the exec_time gets set for the end_event here and thus effectively gives the correct end time.
+ /* Remove 2 first fields and all fields above and including field 55 */ + used_fields&= ~((1ULL << 0) | (1ULL << 1)); + used_fields&= ((1ULL << 56)-1);
Should there be some constant used for the magic numbers 55/56? And/or an assertion that this value is correct?
That'd make sense, sure.
Also, is the intention here that we will from here on never add more fields to legacy SHOW SLAVE STATUS? If so, there should probably be a comment about this; also see the test case mysql-test/suite/rpl/t/rpl_show_slave_status.test which has comments that might need updating.
Not that I know of, why wouldn't we want to keep them in-sync?
Agree, it seems fine to me to continue to keep them in sync. My comment was referring to the fixed 55/56 constants in the code, which suggested to me that those were not expected to change any more. But with some constant and/or assert for that it should be fine, probably. - Kristian.
participants (2)
-
Brandon Nesterenko
-
Kristian Nielsen