Re: [PATCH] MDEV-36099 Ensure that creation and usage of temporary tables in replication is predictable

Hi Monty,
From: Monty <monty@mariadb.org>
The purpose of this commit is to ensure that creation and changes of temporary tables are properly and predicable logged to the binary log. It also fixes some bugs where ROW logging was used in MIXED mode, when STATEMENT would be a better (and expected) choice.
Here is my review of the patch for MDEV-36099 predictable binlog format for temporary tables: Overall this seems like solid work, thanks! As Elena also remarked, this does significant changes to a very tricky and bug-ridden area of replication, that of temporary tables, this is scary stuff. But the patch does look good and relatively simple. A main change with this patch is that now all use of temporary tables will by default be binlogged in row mode. I think this is a reasonable change. There are still remaining problems with temporary tables in statement-based mode; and it also helps with parallel replication, which cannot run in parallel statement-logged temporary tables. And you provided --create-temporary-table-binlog-formats as a way to enable backwards-compatible behaviour. I think it is important to mention this prominently in the release notes. There will invariably be some users that manipulate a large number of rows with temporary tables, eg. BEGIN; CREATE TEMPORARY TABLE tmp AS SELECT a, b, c FROM large_table1 JOIN large_table2 ON ... INSERT INTO other_table SELECT b, c FROM tmp WHERE a <100; DROP TEMPORARY TABLE tmp; COMMIT; This will change from binlogging a few hundred bytes of Query events to potentially many megabytes of row events, which could cause problems in some cases. I think the change is still justified from the benefits it gives and the relative few cases where this will cause serious problems, but it is still important to try to document it as well as possible. I read the code part of the patch through in detail, but I did not spot anything in it that looks wrong. There may be things missing from the patch that are hard to spot in a review, I think that's unavoidable in a change like this. But I did try to think of possible issues without finding anything, and the patch does look quite simple compared to what it is doing. So looks good I think. Just a few questions on details I did not understand, and some minor suggestions for changes. Questions:
diff --git a/sql/sql_parse.cc b/sql/sql_parse.cc index c2f842aea56..3d1df5ac42f 100644 --- a/sql/sql_parse.cc +++ b/sql/sql_parse.cc @@ -531,6 +531,7 @@ void init_update_queries(void) CF_SCHEMA_CHANGE; sql_command_flags[SQLCOM_CREATE_SEQUENCE]= (CF_CHANGES_DATA | CF_REEXECUTION_FRAGILE | + CF_FORCE_ORIGINAL_BINLOG_FORMAT | CF_AUTO_COMMIT_TRANS | CF_SCHEMA_CHANGE);
I did not get the reason for this change. I assume some bug fix found during test case writing?
--- a/sql/sql_class.h +++ b/sql/sql_class.h @@ -817,6 +817,7 @@ typedef struct system_variables /* Flags for slow log filtering */ ulong log_slow_rate_limit; ulong binlog_format; ///< binlog format for this thd (see enum_binlog_format) + ulong create_temporary_table_binlog_formats;
Is there any way to make this take less space than 8 bytes (on 64-bit)? It goes into every THD, and only two bits are used.
+ + Temporary tables will be logged only on CREATE in STMT format + or on INSERT if all changes to the table is in the binlog. */ if ((WSREP_EMULATE_BINLOG(thd) || mysql_bin_log.is_open()) && - (likely(!error) || thd->transaction->stmt.modified_non_trans_table || - thd->log_current_statement())) + (table->s->using_binlog() || + ((in_create_table && + (!table->s->tmp_table || thd->binlog_create_tmp_table())))) && + (likely(!error) || + (!in_create_table && + (thd->transaction->stmt.modified_non_trans_table || + thd->log_current_statement()))))
This conditional is now very hard to understand. I think I managed to understand all parts of it, except the part "thd->log_current_statement()", but that is unchanged with your patch. This if () would now benefit from being split into multiple parts, each of which explained in a comment.
diff --git a/mysql-test/main/statistics.result b/mysql-test/main/statistics.result index bb6469c46da..ebd9cb7f36e 100644 --- a/mysql-test/main/statistics.result +++ b/mysql-test/main/statistics.result @@ -1278,7 +1278,7 @@ Table Op Msg_type Msg_text mysql.column_stats analyze error Invalid argument ANALYZE TABLE mysql.column_stats; Table Op Msg_type Msg_text -mysql.column_stats analyze status OK +mysql.column_stats analyze status Table is already up to date SELECT * FROM mysql.table_stats; db_name table_name cardinality SELECT * FROM mysql.column_stats;
I did not understand why this .result change happens after your patch? Minor things:
diff --git a/mysql-test/suite/rpl/t/create_or_replace.inc b/mysql-test/suite/rpl/t/create_or_replace.inc index e8fa95cb94a..b4e91d7b412 100644 --- a/mysql-test/suite/rpl/t/create_or_replace.inc +++ b/mysql-test/suite/rpl/t/create_or_replace.inc @@ -4,7 +4,24 @@ --let $rpl_topology=1->2 --source include/rpl_init.inc
-# Create help tables +select @@binlog_format, @@create_temporary_table_binlog_formats; + +# Copy create_temporary_table_binlog_formats from master to slave +# This is done to get same results as with older versions of MariaDB. +# The slave will work even if this is not done. However in mixed +# format on slave temporary tables would not be logged to in the +# slaves binary log +let $format=`select @@create_temporary_table_binlog_formats`; +connection server_2; +--eval set @@global.create_temporary_table_binlog_formats='$format' +# Ensure that the slave threads uses the new values. +stop slave; +start slave;
Here, use --source include/stop_slave.inc and --source include/start_slave.inc. For consistent with other test cases.
- Rename of not binlogged temporary tables where binlogged to binary log which caused replication to stop.
s/where/were/
+ /* + Mark if query was logged as statement. Used to check it tmp table changes + are logged. + */ + bool query_binlogged_as_stmt;
s/check it table/check if table/ + /* + This is set for all normal tables and for temporary tables that has + the CREATE binary logged. + + 0 table table is not in binary log (not logged temporary table) + 1 table create was logged (normal table or logged temp table) + 2 table create was logged but not all changes are in the binary log. + ROW LOGGING will be used for the table. s/that has the CREATE/that have the CREATE/ s/table table is not/table is not/
@@ -794,6 +821,20 @@ void THD::mark_tmp_table_as_free_for_reuse(TABLE *table)
DBUG_ASSERT(table->s->tmp_table);
+ /* + Ensure that table changes where either binary logged or the table + is marked as not up to date. + */
s/where/were/ Thanks, Monty! Reviewed-by: Kristian Nielsen <knielsen@knielsen-hq.org> - Kristian.

Hi! On Fri, Feb 21, 2025 at 3:44 PM Kristian Nielsen <knielsen@knielsen-hq.org> wrote:
Hi Monty,
From: Monty <monty@mariadb.org>
The purpose of this commit is to ensure that creation and changes of temporary tables are properly and predicable logged to the binary log. It also fixes some bugs where ROW logging was used in MIXED mode, when STATEMENT would be a better (and expected) choice.
Here is my review of the patch for MDEV-36099 predictable binlog format for temporary tables:
<cut>
I think it is important to mention this prominently in the release notes. There will invariably be some users that manipulate a large number of rows with temporary tables, eg.
BEGIN; CREATE TEMPORARY TABLE tmp AS SELECT a, b, c FROM large_table1 JOIN large_table2 ON ... INSERT INTO other_table SELECT b, c FROM tmp WHERE a <100; DROP TEMPORARY TABLE tmp; COMMIT;
This will change from binlogging a few hundred bytes of Query events to potentially many megabytes of row events, which could cause problems in some cases. I think the change is still justified from the benefits it gives and the relative few cases where this will cause serious problems, but it is still important to try to document it as well as possible.
I added the following last to the commit message: The consequence of the above is that manipulations of a lot of rows through temporary tables will by default be be slower in mixed mode. For example: BEGIN; CREATE TEMPORARY TABLE tmp AS SELECT a, b, c FROM large_table1 JOIN large_table2 ON ... INSERT INTO other_table SELECT b, c FROM tmp WHERE a <100; DROP TEMPORARY TABLE tmp; COMMIT; By default this will create a huge entry in the binary log, compared to just a few hundred bytes in statement mode. However the change in this commit will make usage of temporary tables more reliable and predicable and is thus worth it. Using statement mode or setting create_temporary_table_binlog_formats can be used to avoid this issue. <cut>
Questions:
diff --git a/sql/sql_parse.cc b/sql/sql_parse.cc index c2f842aea56..3d1df5ac42f 100644 --- a/sql/sql_parse.cc +++ b/sql/sql_parse.cc @@ -531,6 +531,7 @@ void init_update_queries(void) CF_SCHEMA_CHANGE; sql_command_flags[SQLCOM_CREATE_SEQUENCE]= (CF_CHANGES_DATA | CF_REEXECUTION_FRAGILE | + CF_FORCE_ORIGINAL_BINLOG_FORMAT | CF_AUTO_COMMIT_TRANS | CF_SCHEMA_CHANGE);
I did not get the reason for this change. I assume some bug fix found during test case writing?
Without the above change, the create of temporary sequences will be binary logged. If you remove the above, the the test binlog.binlog_empty_xa_prepared,mix will fail with: @@ -21,6 +21,11 @@ DROP TABLE tmp_1; # Proof of correct logging incl empty XA-PREPARE include/show_binlog_events.inc +Log_name Pos Event_type Server_id End_log_pos Info +master-bin.000001 # Gtid # # GTID #-#-# +master-bin.000001 # Query # # use `test`; CREATE TEMPORARY SEQUENCE seq_1 +master-bin.000001 # Gtid # # GTID #-#-# +master-bin.000001 # Query # # DROP TEMPORARY SEQUENCE IF EXISTS `test`.`seq_1` /* generated by server */ The reason is that without the flag, using CREATE TEMPORARY SEQUENCE will be forced to be executed in statement mode (default for anything that does not change row data) and thus binlogged. CREATE TABLE has the flag CF_CAN_GENERATE_ROW_EVENTS set, which does the same thing as CF_FORCE_ORIGINAL_BINLOG_FORMAT.
--- a/sql/sql_class.h +++ b/sql/sql_class.h @@ -817,6 +817,7 @@ typedef struct system_variables /* Flags for slow log filtering */ ulong log_slow_rate_limit; ulong binlog_format; ///< binlog format for this thd (see enum_binlog_format) + ulong create_temporary_table_binlog_formats;
Is there any way to make this take less space than 8 bytes (on 64-bit)? It goes into every THD, and only two bits are used.
Unfortunately the interface of used for integer variables that can be set from the command line only supports long.
+ + Temporary tables will be logged only on CREATE in STMT format + or on INSERT if all changes to the table is in the binlog. */ if ((WSREP_EMULATE_BINLOG(thd) || mysql_bin_log.is_open()) && - (likely(!error) || thd->transaction->stmt.modified_non_trans_table || - thd->log_current_statement())) + (table->s->using_binlog() || + ((in_create_table && + (!table->s->tmp_table || thd->binlog_create_tmp_table())))) && + (likely(!error) || + (!in_create_table && + (thd->transaction->stmt.modified_non_trans_table || + thd->log_current_statement()))))
This conditional is now very hard to understand. I think I managed to understand all parts of it, except the part "thd->log_current_statement()", but that is unchanged with your patch. This if () would now benefit from being split into multiple parts, each of which explained in a comment.
Agree that the above is complex an could be documented better. However the above will be totally rewritten by the next commit where the create part is not done by the above function, so there is no need to comment this. thd->log_current_statement() is used by some commands to force things to be binlogged even if there was an error. For example a CREATE TABLE ... SELECT has to be logged in some cases if the original table was dropped (in case of replace). This will be fixed for most cases in the atomic create or replace patch, which depends on this patch. The new code will look like this: if (!create_info && (WSREP_EMULATE_BINLOG(thd) || mysql_bin_log.is_open()) && table->s->using_binlog() && (likely(!error) || thd->transaction->stmt.modified_non_trans_table || thd->log_current_statement())) {
diff --git a/mysql-test/main/statistics.result b/mysql-test/main/statistics.result index bb6469c46da..ebd9cb7f36e 100644 --- a/mysql-test/main/statistics.result +++ b/mysql-test/main/statistics.result @@ -1278,7 +1278,7 @@ Table Op Msg_type Msg_text mysql.column_stats analyze error Invalid argument ANALYZE TABLE mysql.column_stats; Table Op Msg_type Msg_text -mysql.column_stats analyze status OK +mysql.column_stats analyze status Table is already up to date SELECT * FROM mysql.table_stats; db_name table_name cardinality SELECT * FROM mysql.column_stats;
I did not understand why this .result change happens after your patch?
This is because just before we executed 'DELETE FROM TABLE mysql.table_stats' Because of a 'bug in the old code' that switches everything to row mode, the DELETE was done in row mode which does a delete row by row. Now the DELETE is executed in statement mode and the DELETE is optimized by running truncate internally. Truncate resets all status flags and the table is 'up to date' from analyze tables point of view.
Minor things:
diff --git a/mysql-test/suite/rpl/t/create_or_replace.inc b/mysql-test/suite/rpl/t/create_or_replace.inc index e8fa95cb94a..b4e91d7b412 100644 --- a/mysql-test/suite/rpl/t/create_or_replace.inc +++ b/mysql-test/suite/rpl/t/create_or_replace.inc @@ -4,7 +4,24 @@ --let $rpl_topology=1->2 --source include/rpl_init.inc
-# Create help tables +select @@binlog_format, @@create_temporary_table_binlog_formats; + +# Copy create_temporary_table_binlog_formats from master to slave +# This is done to get same results as with older versions of MariaDB. +# The slave will work even if this is not done. However in mixed +# format on slave temporary tables would not be logged to in the +# slaves binary log +let $format=`select @@create_temporary_table_binlog_formats`; +connection server_2; +--eval set @@global.create_temporary_table_binlog_formats='$format' +# Ensure that the slave threads uses the new values. +stop slave; +start slave;
Here, use --source include/stop_slave.inc and --source include/start_slave.inc. For consistent with other test cases.
Changed.
- Rename of not binlogged temporary tables where binlogged to binary log which caused replication to stop.
s/where/were/
Changed
+ /* + Mark if query was logged as statement. Used to check it tmp table changes + are logged. + */ + bool query_binlogged_as_stmt;
s/check it table/check if table/
Fixed
+ /* + This is set for all normal tables and for temporary tables that has + the CREATE binary logged. + + 0 table table is not in binary log (not logged temporary table) + 1 table create was logged (normal table or logged temp table) + 2 table create was logged but not all changes are in the binary log. + ROW LOGGING will be used for the table.
s/that has the CREATE/that have the CREATE/ s/table table is not/table is not/
Fixed
@@ -794,6 +821,20 @@ void THD::mark_tmp_table_as_free_for_reuse(TABLE *table)
DBUG_ASSERT(table->s->tmp_table);
+ /* + Ensure that table changes where either binary logged or the table + is marked as not up to date. + */
s/where/were/
Fixed. Thanks a lot for the review! Regards, Monty
participants (2)
-
Kristian Nielsen
-
Michael Widenius