developers - lists.mariadb.org

Re: 0f0ef93c4d4: MDEV-32444 Data from orphaned XA transaction is lost after online alter
by Sergei Golubchik 02 Nov '23


Hi, Nikita,

See below. A small suggestion for the test. And a rather larger comment
about extending xid_t, as you, likely, expected :)

On Oct 19, Nikita Malyavin wrote:
> revision-id: 0f0ef93c4d4 (mariadb-11.2.1-15-g0f0ef93c4d4)
> parent(s): 7f00853f2f1
> author: Nikita Malyavin
> committer: Nikita Malyavin
> timestamp: 2023-10-13 13:39:36 +0400
> diff --git a/mysql-test/main/alter_table_online_debug.result b/mysql-test/main/alter_table_online_debug.result
> index 1ebb19965eb..154aec003ca 100644
> --- a/mysql-test/main/alter_table_online_debug.result
> +++ b/mysql-test/main/alter_table_online_debug.result
> @@ -1563,6 +1563,103 @@ connection default;
> drop table t1, t2;
> set @@binlog_format=default;
> set debug_sync= reset;
> +# MDEV-32444 Data from orphaned XA transaction is lost after online alter
> +create table t (a int primary key) engine=innodb;
> +insert into t values (1);
> +# XA commit
> +set debug_sync= 'alter_table_online_downgraded signal downgraded wait_for go';
> +alter table t force, algorithm=copy, lock=none;
> +connection con1;
> +set debug_sync= 'now wait_for downgraded';
> +xa begin 'x1';
> +update t set a = 2 where a = 1;
> +xa end 'x1';
> +xa prepare 'x1';
> +disconnect con1;
> +connection con2;
> +xa commit 'x1';
> +set debug_sync= 'now signal go';
> +connection default;
> +select * from t;
> +a
> +2
> +# XA rollback
> +set debug_sync= 'alter_table_online_downgraded signal downgraded wait_for go';
> +alter table t force, algorithm=copy, lock=none;
> +connect con1, localhost, root,,;
> +xa begin 'x2';
> +insert into t values (53);
> +xa end 'x2';
> +xa prepare 'x2';
> +disconnect con1;
> +connection con2;
> +xa rollback 'x2';
> +set debug_sync= 'now signal go';
> +connection default;
> +select * from t;
> +a
> +2
> +# XA transaction is left uncommitted
> +# end then is rollbacked after alter fails
> +set debug_sync= 'alter_table_online_downgraded signal downgraded wait_for go';
> +set statement innodb_lock_wait_timeout=0, lock_wait_timeout= 0
> +for alter table t force, algorithm=copy, lock=none;
> +connect con1, localhost, root,,;
> +xa begin 'xuncommitted';
> +insert into t values (3);
> +xa end 'xuncommitted';
> +xa prepare 'xuncommitted';
> +set debug_sync= 'now signal go';
> +disconnect con1;
> +connection default;
> +ERROR HY000: Lock wait timeout exceeded; try restarting transaction
> +xa rollback 'xuncommitted';
> +select * from t;
> +a
> +2
> +# Same, but commit
> +set debug_sync= 'alter_table_online_downgraded signal downgraded wait_for go';
> +set statement innodb_lock_wait_timeout=0, lock_wait_timeout= 0
> +for alter table t force, algorithm=copy, lock=none;

instead of `alter table t force`, better do something that changes the
table, so that you could see whether alter was executed or aborted.
For example, you can change the column type int->bigint and back.

not only here, but everywhere.

Or add/drop column, this has an added benefit that you won't need a
separate `show create table`, your already existing `select * from t`
will show whether the extra column exists

> +connect con1, localhost, root,,;
> +xa begin 'committed_later';
> +insert into t values (3);
> +xa end 'committed_later';
> +xa prepare 'committed_later';
> +set debug_sync= 'now signal go';
> +disconnect con1;
> +connection default;
> +ERROR HY000: Lock wait timeout exceeded; try restarting transaction
> +xa commit 'committed_later';
> +select * from t;
> +a
> +2
> +3
> +# Commit, but error in statement, and there is some stmt data to rollback
> +set debug_sync= 'alter_table_online_downgraded signal downgraded wait_for go';
> +alter table t force, algorithm=copy, lock=none;
> +connect con1, localhost, root,,;
> +set debug_sync= 'now wait_for downgraded';
> +xa begin 'x1';
> +insert into t values (4), (3);
> +ERROR 23000: Duplicate entry '3' for key 'PRIMARY'
> +insert into t values (5);
> +xa end 'x1';
> +xa prepare 'x1';
> +disconnect con1;
> +connection con2;
> +xa commit 'x1';
> +set debug_sync= 'now signal go';
> +connection default;
> +select * from t;
> +a
> +2
> +3
> +5
> +connect con1, localhost, root,,;
> +connection default;
> +drop table t;
> +set debug_sync= reset;
> disconnect con1;
> disconnect con2;
> #
> diff --git a/sql/handler.h b/sql/handler.h
> index 07b392c010a..4622e49f883 100644
> --- a/sql/handler.h
> +++ b/sql/handler.h
> @@ -903,6 +905,7 @@ struct xid_t {
>    long gtrid_length;
>    long bqual_length;
>    char data[XIDDATASIZE];  // not \0-terminated !
> +  Online_alter_cache_list *online_alter_cache;

well, no, we've talked about it. Even the comment above the structure
says that you cannot change it.

make a separate structure, xid_and_online_alter_cache_t or
xid_t_internal or whatever. but xid_t should be defined identically
everywhere. For example:
https://github.com/berkeleydb/libdb/blob/master/src/dbinc/xa.h

===

okay, now I've seen how you used it and I'd say it's _such a rare use
case_ that you can store your online_alter_cache in the
XID_cache_element and look it up there. Or even in a separate data
structure, that will be empty most of the time anyway.

>
>    xid_t() = default;                       /* Remove gcc warning */
>    bool eq(struct xid_t *xid) const
> diff --git a/sql/online_alter.cc b/sql/online_alter.cc
> index 8b8c81319d1..19abe57914d 100644
> --- a/sql/online_alter.cc
> +++ b/sql/online_alter.cc
> @@ -334,14 +336,57 @@ static int online_alter_log_init(void *p)
>    online_alter_hton->savepoint_rollback_can_release_mdl=
>      [](handlerton *hton, THD *thd){ return true; };
>
> -  online_alter_hton->commit= [](handlerton *hton, THD *thd, bool all)
> -  { return online_alter_end_trans(hton, thd, all, true); };
> -  online_alter_hton->rollback= [](handlerton *hton, THD *thd, bool all)
> -  { return online_alter_end_trans(hton, thd, all, false); };
> -  online_alter_hton->commit_by_xid= [](handlerton *hton, XID *xid)
> -  { return online_alter_end_trans(hton, current_thd, true, true); };
> -  online_alter_hton->rollback_by_xid= [](handlerton *hton, XID *xid)
> -  { return online_alter_end_trans(hton, current_thd, true, false); };
> +  online_alter_hton->commit= [](handlerton *hton, THD *thd, bool all) -> int
> +  {
> +    int res= online_alter_end_trans(get_cache_list(hton, thd), thd,
> +                                    ending_trans(thd, all), true);
> +    cleanup_tables(thd);
> +    return res;
> +  };
> +  online_alter_hton->rollback= [](handlerton *hton, THD *thd, bool all) -> int
> +  {
> +    int res= online_alter_end_trans(get_cache_list(hton, thd), thd,
> +                                    ending_trans(thd, all), false);
> +    cleanup_tables(thd);
> +    return res;
> +  };
> +
> +
> +  online_alter_hton->recover= [](handlerton*, XID*, uint){ return 0; };
> +  online_alter_hton->prepare= [](handlerton *hton, THD *thd, bool all) -> int
> +  {
> +    auto &cache_list= get_cache_list(hton, thd);
> +    int res= 0;
> +    if (ending_trans(thd, all))
> +    {
> +      thd->transaction->xid_state.set_online_alter_cache(&cache_list);
> +      thd_set_ha_data(thd, hton, NULL);
> +    }
> +    else
> +    {
> +      res= online_alter_end_trans(cache_list, thd, false, true);

is that possible?

> +    }
> +
> +    cleanup_tables(thd);
> +    return res;
> +  };
> +  online_alter_hton->commit_by_xid= [](handlerton *hton, XID *xid) -> int
> +  {
> +    int res= online_alter_end_trans(*xid->online_alter_cache, current_thd,
> +                                    true, true);
> +    delete xid->online_alter_cache;
> +    xid->online_alter_cache= NULL;
> +    return res;
> +  };
> +  online_alter_hton->rollback_by_xid= [](handlerton *hton, XID *xid) -> int
> +  {
> +    int res= online_alter_end_trans(*xid->online_alter_cache, current_thd,
> +                                    true, false);
> +    delete xid->online_alter_cache;
> +    xid->online_alter_cache= NULL;
> +    return res;
> +  };
> +
>
>    online_alter_hton->drop_table= [](handlerton *, const char*) { return -1; };
>    online_alter_hton->flags= HTON_NOT_USER_SELECTABLE | HTON_HIDDEN

Regards,
Sergei
Chief Architect, MariaDB Server
and security(a)mariadb.org

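The "separate data structure" idea from the review above could look roughly like the sketch below. It is only an illustration under stated assumptions: `Xid_sketch`, `Cache_list_sketch` and `Online_alter_xid_registry` are hypothetical stand-ins, not types from the MariaDB tree, and a real server would also need locking around the map, which is omitted here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <map>
#include <memory>
#include <string>
#include <utility>

// Hypothetical stand-in for xid_t; its layout is deliberately left untouched.
constexpr std::size_t XID_DATA_SIZE_SKETCH = 128;
struct Xid_sketch {
  long formatID;
  long gtrid_length;
  long bqual_length;
  char data[XID_DATA_SIZE_SKETCH];  // not \0-terminated
};

// Hypothetical placeholder for Online_alter_cache_list.
struct Cache_list_sketch {
  int pending_changes = 0;
};

// Build a map key from the significant parts of the XID.
inline std::string xid_key(const Xid_sketch &xid) {
  std::string key = std::to_string(xid.formatID) + ':' +
                    std::to_string(xid.gtrid_length) + ':';
  key.append(xid.data,
             static_cast<std::size_t>(xid.gtrid_length + xid.bqual_length));
  return key;
}

// Side table that is empty most of the time: an entry exists only while
// a prepared XA transaction still owns an online alter cache.
class Online_alter_xid_registry {
  std::map<std::string, std::unique_ptr<Cache_list_sketch>> caches_;

public:
  // Called at XA PREPARE: hand the cache over, keyed by XID.
  void attach(const Xid_sketch &xid, std::unique_ptr<Cache_list_sketch> cache) {
    caches_[xid_key(xid)] = std::move(cache);
  }

  // Called at XA COMMIT / XA ROLLBACK by XID: take the cache back.
  // Returns nullptr if no cache was registered for this XID.
  std::unique_ptr<Cache_list_sketch> detach(const Xid_sketch &xid) {
    auto it = caches_.find(xid_key(xid));
    if (it == caches_.end())
      return nullptr;
    std::unique_ptr<Cache_list_sketch> cache = std::move(it->second);
    caches_.erase(it);
    return cache;
  }

  bool empty() const { return caches_.empty(); }
};

// Helper for the sketch only: build an XID with a gtrid and no bqual.
inline Xid_sketch make_xid(const char *gtrid) {
  Xid_sketch xid{};
  xid.formatID = 1;
  xid.gtrid_length = static_cast<long>(std::strlen(gtrid));
  assert(static_cast<std::size_t>(xid.gtrid_length) <= XID_DATA_SIZE_SKETCH);
  std::memcpy(xid.data, gtrid, static_cast<std::size_t>(xid.gtrid_length));
  return xid;
}
```

With something of this shape, the prepare hook would call `attach()` and the by-XID hooks would call `detach()`, so xid_t itself stays identical everywhere.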

Re: 2d937b62c33: MDEV-27744 InnoDB: Failing assertion: !cursor->index->is_committed() in row0ins.cc (from row_ins_sec_index_entry_by_modify) | Assertion `0' failed in row_upd_sec_index_entry (debug) | Corruption
by Sergei Golubchik 02 Nov '23


Hi, Alexander,

Few questions and suggestions below. Nothing major.

On Oct 27, Alexander Barkov wrote:
> commit 2d937b62c33
> Author: Alexander Barkov <bar(a)mariadb.com>
> Date: Mon Apr 4 14:50:21 2022 +0400
>
>     MDEV-27744 InnoDB: Failing assertion: !cursor->index->is_committed() in row0ins.cc (from row_ins_sec_index_entry_by_modify) | Assertion `0' failed in row_upd_sec_index_entry (debug) | Corruption
>

would it make sense to change the MDEV title in Jira to better describe
the problem? I sometimes do it with my bugs. Like "LPAD in vcol created
in ORACLE mode makes table corrupted in non-ORACLE" (I tried to make it
short, appropriate for a title)

and, of course, change it in the comment and in the test file to match.

>     The crash happened with an indexed virtual column whose
>     value is evaluated using a function that has a different meaning
>     in sql_mode='' vs sql_mode=ORACLE:
>
> diff --git a/mysql-test/suite/compat/oracle/r/func_concat.result b/mysql-test/suite/compat/oracle/r/func_concat.result
> index 392d579707a..17ca4be078a 100644
> --- a/mysql-test/suite/compat/oracle/r/func_concat.result
> +++ b/mysql-test/suite/compat/oracle/r/func_concat.result
> @@ -211,14 +211,14 @@ SET sql_mode=ORACLE;
> CREATE VIEW v1 AS SELECT 'foo'||NULL||'bar' AS test;
> SHOW CREATE VIEW v1;
> View Create View character_set_client collation_connection
> -v1 CREATE VIEW "v1" AS select concat_operator_oracle(concat_operator_oracle('foo',NULL),'bar') AS "test" latin1 latin1_swedish_ci
> +v1 CREATE VIEW "v1" AS select concat(concat('foo',NULL),'bar') AS "test" latin1 latin1_swedish_ci
> SELECT * FROM v1;
> test
> foobar
> SET sql_mode=DEFAULT;
> SHOW CREATE VIEW v1;
> View Create View character_set_client collation_connection
> -v1 CREATE ALGORITHM=UNDEFINED DEFINER=`root`@`localhost` SQL SECURITY DEFINER VIEW `v1` AS select concat_operator_oracle(concat_operator_oracle('foo',NULL),'bar') AS `test` latin1 latin1_swedish_ci
> +v1 CREATE ALGORITHM=UNDEFINED DEFINER=`root`@`localhost` SQL SECURITY DEFINER VIEW `v1` AS select oracle_schema.concat(oracle_schema.concat('foo',NULL),'bar') AS `test` latin1 latin1_swedish_ci

please, add a mysqldump test. Like, create a table with virtual
columns, check, defaults in the default sql mode. and create a table in
oracle mode. Also, a view in the default mode and a view in oracle
mode. and then mysqldump, to see that it dumps and restores everything
correctly

maybe stored routines/triggers/etc, if you'd like, but they aren't
directly relevant to this MDEV, as far as I understand.

> SELECT * FROM v1;
> test
> foobar
> diff --git a/mysql-test/suite/compat/oracle/r/func_decode.result b/mysql-test/suite/compat/oracle/r/func_decode.result
> index 2809e971be3..1870a1ec0d5 100644
> --- a/mysql-test/suite/compat/oracle/r/func_decode.result
> +++ b/mysql-test/suite/compat/oracle/r/func_decode.result
> @@ -1,8 +1,8 @@
> SET sql_mode=ORACLE;
> SELECT DECODE(10);
> -ERROR 42000: Incorrect parameter count in the call to native function 'DECODE'
> +ERROR 42000: Incorrect parameter count in the call to native function 'oracle_schema.DECODE'

Hmm, maybe this should say DECODE as before?
you know, falls under the case "if you don't change sql_mode back and
forth, you won't see schema-qualified names"

> SELECT DECODE(10,10);
> -ERROR 42000: Incorrect parameter count in the call to native function 'DECODE'
> +ERROR 42000: Incorrect parameter count in the call to native function 'oracle_schema.DECODE'
> SELECT DECODE(10,10,'x10');
> DECODE(10,10,'x10')
> x10
> diff --git a/mysql-test/suite/compat/oracle/r/vcol_innodb.result b/mysql-test/suite/compat/oracle/r/vcol_innodb.result
> new file mode 100644
> index 00000000000..9fa97c75c10
> --- /dev/null
> +++ b/mysql-test/suite/compat/oracle/r/vcol_innodb.result
> @@ -0,0 +1,51 @@
> +SET @table_open_cache=@@GLOBAL.table_open_cache;

why do you need to manipulate table_open_cache? to trigger a reopen?
Just do flush tables, it's explicit, more readable and more...
controllable.

> +SET sql_mode='';
> +CREATE TABLE t (d INT,b VARCHAR(1),c CHAR(1),g CHAR(1) GENERATED ALWAYS AS (SUBSTR(b,0,0)) VIRTUAL,PRIMARY KEY(b),KEY g(g)) ENGINE=InnoDB;
> +INSERT INTO t VALUES (0);
> +ERROR 21S01: Column count doesn't match value count at row 1
> +SET sql_mode='ORACLE';
> +INSERT INTO t SET c=REPEAT (1,0);
> +Warnings:
> +Warning 1364 Field 'b' doesn't have a default value
> +ALTER TABLE t CHANGE COLUMN a b INT;
> diff --git a/sql/item_func.h b/sql/item_func.h
> index 76a997c33fb..cdbefb82541 100644
> --- a/sql/item_func.h
> +++ b/sql/item_func.h
> @@ -56,8 +56,40 @@ class Item_func :public Item_func_or_sum,
>    bool check_argument_types_can_return_date(uint start, uint end) const;
>    bool check_argument_types_can_return_time(uint start, uint end) const;
>    void print_cast_temporal(String *str, enum_query_type query_type);
> +
> +  void print_schema_qualified_name(String *to,
> +                                   const LEX_CSTRING &schema_name,
> +                                   const char *function_name) const

I don't see why you'd need this helper.
is it something that was used in earlier versions of the patch?

> +  {
> +    // e.g. oracle_schema.func()
> +    to->append(schema_name);
> +    to->append('.');
> +    to->append(function_name);
> +  }
> +
> +  void print_sql_mode_qualified_name(String *to,
> +                                     enum_query_type query_type,
> +                                     const char *function_name) const
> +  {
> +    const Schema *func_schema= schema();
> +    if (!func_schema || func_schema == Schema::find_implied(current_thd))
> +      to->append(function_name);
> +    else
> +      print_schema_qualified_name(to, func_schema->name(), function_name);
> +  }
> +
> +  void print_sql_mode_qualified_name(String *to, enum_query_type query_type)
> +                                     const
> +  {
> +    return print_sql_mode_qualified_name(to, query_type, func_name());
> +  }

I don't see why you need this helper either, you never use
print_sql_mode_qualified_name with the last argument being not
func_name(). So you can remove this helper and the third argument of
print_sql_mode_qualified_name.

> +
>  public:
>
> +  // Print an error message for a builtin-schema qualified function call
> +  static void wrong_param_count_error(const LEX_CSTRING &schema_name,
> +                                      const LEX_CSTRING &func_name);
> +
>    table_map not_null_tables_cache;
>
>    enum Functype { UNKNOWN_FUNC,EQ_FUNC,EQUAL_FUNC,NE_FUNC,LT_FUNC,LE_FUNC,
> diff --git a/sql/item_strfunc.cc b/sql/item_strfunc.cc
> index ae078dbb22f..92d5e196da4 100644
> --- a/sql/item_strfunc.cc
> +++ b/sql/item_strfunc.cc
> @@ -2170,13 +2170,31 @@ bool Item_func_trim::fix_length_and_dec()
>
>  void Item_func_trim::print(String *str, enum_query_type query_type)
>  {
> +  LEX_CSTRING suffix= {STRING_WITH_LEN("_oracle")};
>    if (arg_count == 1)
>    {
> -    Item_func::print(str, query_type);
> +    if (query_type & QT_FOR_FRM)
> +    {
> +      // 10.3 downgrade compatibility for FRM
> +      str->append(func_name());
> +      if (schema() == &oracle_schema_ref)
> +        str->append(suffix);
> +    }
> +    else
> +      print_sql_mode_qualified_name(str, query_type, func_name());
> +    print_args_parenthesized(str, query_type);
>      return;
>    }
> -  str->append(Item_func_trim::func_name());
> -  str->append(func_name_ext());
> +
> +  if (query_type & QT_FOR_FRM)
> +  {
> +    // 10.3 downgrade compatibility for FRM
> +    str->append(Item_func_trim::func_name());
> +    if (schema() == &oracle_schema_ref)
> +      str->append(suffix);
> +  }
> +  else
> +    print_sql_mode_qualified_name(str, query_type, Item_func_trim::func_name());

it'd be simpler if you move the above block that prints the function
name before if (arg_count == 1)

also you won't need suffix, but can do like in all other functions

  str->append(STRING_WITH_LEN("trim_oracle"));

>    str->append('(');
>    str->append(mode_name());
>    str->append(' ');
> diff --git a/sql/sql_lex.cc b/sql/sql_lex.cc
> index 71f592a3852..bb53d1a510a 100644
> --- a/sql/sql_lex.cc
> +++ b/sql/sql_lex.cc
> @@ -2084,7 +2100,64 @@ bool Lex_input_stream::get_7bit_or_8bit_ident(THD *thd, uchar *last_char)
>  }
>
>
> -int Lex_input_stream::scan_ident_sysvar(THD *thd, Lex_ident_cli_st *str)
> +/*
> +  Resolve special SQL functions that have a qualified syntax in sql_yacc.yy.
> +  These functions are not listed in the native function registry
> +  because of a special syntax, or a reserved keyword:
> +
> +    mariadb_schema.SUBSTRING('a' FROM 1 FOR 2)   -- Special syntax

I didn't find it in Oracle's manual, by the way

> +    mariadb_schema.TRIM(BOTH ' ' FROM 'a')       -- Special syntax
> +    mariadb_schema.REPLACE('a','b','c')          -- Verb keyword
> +*/
> +
> +int Lex_input_stream::find_keyword_qualified_special_func(Lex_ident_cli_st *str,
> +                                                          uint length) const
> +{
> +  /*
> +    There are many other special functions, see the following grammar rules:
> +      function_call_keyword
> +      function_call_nonkeyword
> +    Here we resolve only those that have a qualified syntax to handle
> +    different behavior in different @@sql_mode settings.
> +
> +    Other special functions do not work in qualified context:
> +      SELECT mariadb_schema.year(now()); -- Function year is not defined
> +      SELECT mariadb_schema.now();       -- Function now is not defined
> +
> +    We don't resolve TRIM_ORACLE here, because it does not have
> +    a qualified syntax yet. Search for "trim_operands" in sql_yacc.yy
> +    to find more comments.
> +  */
> diff --git a/sql/sql_schema.h b/sql/sql_schema.h
> index 1174bc7a83f..2c52646f2ea 100644
> --- a/sql/sql_schema.h
> +++ b/sql/sql_schema.h
> @@ -77,5 +98,6 @@ class Schema
>
>
>  extern Schema mariadb_schema;
> +extern const Schema &oracle_schema_ref;

What's the difference between these two definitions?
Do you expect someone will need to change mariadb_schema?

>
>  #endif // SQL_SCHEMA_H_INCLUDED

Regards,
Sergei
Chief Architect, MariaDB Server
and security(a)mariadb.org
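For illustration, the suggestion above to drop the extra helper and the `function_name` argument could reduce to a single method along these lines. This is only a sketch with stand-in types: `Schema_sketch` and `Item_func_sketch` are hypothetical, `std::string` replaces `String`, and the implied schema is passed in explicitly instead of being taken from `Schema::find_implied(current_thd)`.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for the server's Schema class.
struct Schema_sketch {
  std::string name;
};

// Hypothetical stand-in for Item_func with just enough state to print a name.
class Item_func_sketch {
  const char *name_;
  const Schema_sketch *schema_;

public:
  Item_func_sketch(const char *name, const Schema_sketch *schema)
    : name_(name), schema_(schema) {}

  const char *func_name() const { return name_; }
  const Schema_sketch *schema() const { return schema_; }

  // One helper, no function_name argument: the name always comes from
  // func_name(), and the schema prefix is added only when the function's
  // schema differs from the one implied by the current sql_mode.
  void print_sql_mode_qualified_name(std::string &to,
                                     const Schema_sketch *implied_schema) const {
    const Schema_sketch *func_schema = schema();
    if (func_schema && func_schema != implied_schema) {
      to += func_schema->name;  // e.g. oracle_schema.func
      to += '.';
    }
    to += func_name();
  }
};
```

In this shape, a function created in ORACLE mode prints unqualified while ORACLE mode is still in effect, and as `oracle_schema.<name>` otherwise.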


Eliminating sprintf deprecation warnings (maybe others as well)
by Stein Vidar Hagfors Haugan 27 Oct '23


Hi!

First time here, so my apologies if I'm violating some spoken or
unspoken rules :)

I'm developing/maintaining two specialized storage engines(*) and I
have one issue: There are numerous uses of sprintf() in the general
mariadb codebase, and each one generates a warning during compilation
when doing a plain cmake/make build (on MacOS & RHEL7 at least).

I am a bit fanatical about hunting down and eliminating warnings if at
all possible, no matter how small and insignificant - if you become
accustomed to routinely getting warnings during compilation, you can
easily overlook an important one.

So I am considering contributing a patch that will eliminate all
sprintf() instances by converting them to snprintf(destination, N, ...).

Of course, hunting down the correct N for each instance can be a *lot*
of work (I presume this is why this hasn't been done already). So I'm
proposing to make a macro UNSAFE_SNPRINTF(destination, ...) which would
expand to snprintf(destination, 10000, ...) and replace all current
sprintf() calls with that.

Yes, 10000 is a ridiculously high number that makes it look like we're
introducing a huge security risk, but the current sprintf() calls have
an implicit N = infinity.

In some instances, the compiler sniffs out the actual size of the
buffer and complains - in that case I would change the code to an
appropriate regular snprintf() call.

Would doing this be worthwhile? I.e., is there a good chance that such
a patch would actually be accepted into the codebase? If so, I might
also do the same with other warnings (I believe there are some
type-punning warnings at least).

Sincerely,
Stein Haugan

*) The engines are for internal use so far, but I might make them
public at some point
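A minimal sketch of what the proposed macro could look like, next to the "regular snprintf()" variant for cases where the destination is a real array. `UNSAFE_SNPRINTF_LIMIT` and `ARRAY_SNPRINTF` are hypothetical names introduced here for illustration; only `UNSAFE_SNPRINTF` and the value 10000 come from the proposal.

```cpp
#include <cassert>
#include <cstdio>
#include <cstring>

// Limit used by the proposed macro. As the mail says, it is not a real
// bound on the destination; it only replaces the implicit "infinity" of
// sprintf() with a finite number.
#define UNSAFE_SNPRINTF_LIMIT 10000

// Drop-in replacement for sprintf(dst, ...). Still unsafe if dst is
// smaller than the formatted output.
#define UNSAFE_SNPRINTF(dst, ...) \
  snprintf((dst), UNSAFE_SNPRINTF_LIMIT, __VA_ARGS__)

// Variant for destinations that are true arrays (not pointers), where
// sizeof gives the actual capacity. Passing a pointer here would be a bug,
// because sizeof would then be the pointer size.
#define ARRAY_SNPRINTF(dst, ...) \
  snprintf((dst), sizeof(dst), __VA_ARGS__)
```

One caveat, stated as an assumption rather than a tested fact about the MariaDB build: with fortified libc builds the compiler may know that a destination is smaller than 10000 bytes and reject or trap the call, which is the case where the mail proposes switching to a properly sized snprintf().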


Re: [PATCH 4/4] MDEV-31273: Precompute binlog checksums
by Michael Widenius 26 Oct '23


Hi!

Review of MDEV-31273: Precompute binlog checksums

On Fri, Aug 25, 2023 at 10:16 AM Kristian Nielsen
<knielsen(a)knielsen-hq.org> wrote:
>
> Compute binlog checksums (when enabled) already when writing events
> into the statement or transaction caches, where before it was done
> when the caches are copied to the real binlog file. This moves the
> checksum computation outside of holding LOCK_log, improving
> scalability.

  DBUG_RETURN(my_b_copy_to_cache(from_cache, to_cache, SIZE_T_MAX));

You could use from_cache->end_of_file instead of SIZE_T_MAX

<cut>

  uchar checksum_opt;

Wouldn't it be better to have this as an "enum_binlog_checksum_alg" to
avoid some casts?

> }
> else if (data)
>   return (enum enum_binlog_checksum_alg)data->checksum_opt;
> else
>   return BINLOG_CHECKSUM_ALG_OFF;
> }

You can remove the else's

<cut>

                 ulong *param_ptr_binlog_cache_disk_use,
                 bool precompute_checksums)
  : stmt_cache(precompute_checksums),
    trx_cache(precompute_checksums),
    last_commit_pos_offset(0), using_xa(FALSE), xa_xid(0)
  {
    stmt_cache.set_binlog_cache_info(param_max_binlog_stmt_cache_size,
                                     param_ptr_binlog_stmt_cache_use,

This code was a bit confusing as we first initialize stmt_cache and
trx_cache with checksum and then we initialize them again in the next
two lines. I understand why and this does not have to be changed, but
it does look a bit strange.
The other option would be to have set_binlog_cache_info() take checksum
as an argument.

<cut>

This one confused me a bit:

  Log_event_writer writer(file, 0, checksum_alg, &crypto);
  Log_event_writer writer(file, cache_data, checksum_alg, &crypto);

Why this change?
It may at least affect calls to:

  void Log_event_writer::add_status(enum_logged_status status)
  {
    if (likely(cache_data))

Fix (after discussions on slack):
Remove the following lines in

  MYSQL_BIN_LOG::write_event(Log_event *ev, binlog_cache_data *cache_data,

    if (cache_data)
      cache_data->add_status(ev->logged_status());

<cut>

> Don't attempt to precompute checksums if:
> - Disabled by user request, --binlog-legacy-event-pos
> - Binlog is encrypted, cannot use precomputed checksums
> - WSREP/Galera.

Why Galera? Would be good to explain this in the commit comment.

<cut>

> /*
>   If possible, just copy the cache over byte-by-byte with pre-computed
>   checksums.
> */
> if (likely(binlog_checksum_options == cache_data->checksum_opt) &&
>     likely(!crypto.scheme) &&
>     likely(!opt_binlog_legacy_event_pos))
> {
>   int res= my_b_copy_all_to_cache(cache, &log_file);
>   status_var_add(thd->status_var.binlog_bytes_written, my_b_tell(cache));
>   DBUG_RETURN(res ? ER_ERROR_ON_WRITE : 0);
> }

I was just wondering if this could sometimes be optimized to write
directly to the real binlog file without another cache in between.

<cut>

  bool precomputed_checksums=
    (cache_data->checksum_opt != BINLOG_CHECKSUM_ALG_OFF);
  uint old_checksum_len= precomputed_checksums ? BINLOG_CHECKSUM_LEN : 0;

Why two variables when one can as easily do:

  uint old_checksum_len= ((cache_data->checksum_opt !=
                           BINLOG_CHECKSUM_ALG_OFF) ?
                          BINLOG_CHECKSUM_LEN : 0);

In current write_cache() code, group is a bad name.
Please rename 'group' to 'log_file_pos'.
By the way, great that you removed the 'end_log_file_pos' variable,
which also was a very confusing name given how it was used!

> /*
>   Any old precomputed checksum must _not_ be written here. Instead, it
>   must be discarded; the new checksum, if needed, is written by
>   writer.write_footer().
> */
> if (ev_len > old_checksum_len)
> {
>   uint bytes_to_skip=
>     old_checksum_len - std::min(old_checksum_len, ev_len - chunk);
>   if (writer.write_data(cache->read_pos, chunk - bytes_to_skip))
>     goto error_in_write;
> }

The above is likely wrong as for long events we would execute the inner
part multiple times, while there is only one checksum.

As the checksum is last in the event, why not just do
'ev_len-= old_checksum_len' before the loop to copy the event and then
disregard 'old_checksum_len' bytes from the cache at the end of the
loop?

Anyway, to find bugs like this, we need to have a test case with events
that are bigger than the IO_CACHE size for cache.
Setting binlog_file_cache_size to 8192 (min value) should make it easy
to test this.

<cut>

> binlog_checksum_options= value;
> my_atomic_storeul_explicit(&binlog_checksum_options, value, MY_MEMORY_ORDER_RELAXED);

An atomic store is not needed as we have a lock on binlog.

......

Something totally different.

I noticed in MYSQL_BIN_LOG::write_cache():

  group= (size_t)my_b_tell(&log_file);
  val= uint4korr(header + LOG_POS_OFFSET) + group + end_log_pos_inc;
  int4store(header + LOG_POS_OFFSET, val);

The first log entry is probably ok, as we have done a rotate() event
before. However the cache can have a lot of log_events (as part of a
transaction). The end_log_pos after the first event can be wrong as it
may not fit into 4 bytes.

My understanding is that this is not a problem anymore as we are not
using end_log_pos anymore.
However I still wonder if rotate() should not only consider the current
log file size but also the size of all events we plan to write to the
log and do a rotate if the total new log file size > 4G.

Regards,
Monty

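The alternative suggested in the review (shrink the event length by the checksum size once, before the copy loop, then skip the stale checksum afterwards) can be sketched as below. This is a toy model, not the server code: a `std::string` stands in for the IO_CACHE contents, `io_size` models how many bytes are available per refill, and `copy_event_payload` is a hypothetical name.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>

// Copy one event of ev_len bytes that starts at `pos` in `cache`, reading
// at most io_size bytes per iteration, and drop its trailing
// old_checksum_len bytes. On return, `pos` points past the whole event.
inline std::string copy_event_payload(const std::string &cache,
                                      std::size_t &pos,
                                      std::size_t ev_len,
                                      std::size_t old_checksum_len,
                                      std::size_t io_size) {
  assert(io_size > 0);
  assert(ev_len >= old_checksum_len);
  assert(pos + ev_len <= cache.size());

  std::string out;
  // Shrink once, before the loop: the checksum is last in the event, so
  // the loop never has to reason about which chunk contains it.
  std::size_t payload = ev_len - old_checksum_len;
  while (payload > 0) {
    std::size_t chunk = std::min(payload, io_size);
    out.append(cache, pos, chunk);
    pos += chunk;
    payload -= chunk;
  }
  // Disregard the old precomputed checksum; a new one would be written
  // by the caller if needed.
  pos += old_checksum_len;
  return out;
}
```

Calling it with an `io_size` much smaller than the event exercises the multi-chunk case that the review asks a test for.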

Re: [PATCH 3/4] MDEV-31273: Refactor MYSQL_BIN_LOG::write_cache()
by Michael Widenius 26 Oct '23


Hi!

On Fri, Aug 25, 2023 at 10:16 AM Kristian Nielsen
<knielsen(a)knielsen-hq.org> wrote:
>
> Preparatory patch for pre-computing binlog checksums outside of holding
> LOCK_log.
>
> The existing code for MYSQL_BIN_LOG::write_cache() was needlessly complex
> and very hard to understand and modify for handling the new case where
> pre-computed checksums are already present in the IO_CACHE.

MDEV-31273: Refactor MYSQL_BIN_LOG::write_cache()

> if (likely(length > LOG_EVENT_HEADER_LEN))
> {
>   header= cache->read_pos;
>   cache->read_pos+= LOG_EVENT_HEADER_LEN;
>   length-= LOG_EVENT_HEADER_LEN;
> }
> else
> {
>   size_t sofar= length;
>   size_t remain= LOG_EVENT_HEADER_LEN - sofar;
>   header= &header_buf[0];
>   memcpy(header, cache->read_pos, sofar);
>   cache->read_pos+= sofar;
>
>   while (hdr_offs < length)
>     length= my_b_fill(cache);
>     if (!length)
>       goto error_in_read;
>     size_t chunk= std::min(length, remain);
>     memcpy(header + sofar, cache->read_pos, chunk);
>     sofar+= chunk;
>     remain-= chunk;
>     cache->read_pos+= chunk;
>     length-= chunk;
>   } while (unlikely(remain > 0));
> }
>

The above can be replaced with:

  if (my_b_read(cache, header, LOG_EVENT_HEADER_LEN))
    goto error_in_read;

It is almost exactly as efficient as the above (one extra if) and avoids
using cache internals.

Note that you do not have to call my_b_fill() if you use my_b_read().
my_b_read() will return 0 if it was able to read all data. In case of
end of file, my_b_read() will return 1 and info->error will be 0.

If needed, you can also find out how much data is left to read from
IO_CACHE:

  left_data_to_read= (cache->end_of_file - my_b_tell(cache))

(I can make an inline function of that if needed).

> /* Write the rest of the event. */
> while (ev_len > 0)
> {
>   if (length == 0)
>     length= my_b_fill(cache);
>   if (!length)
>     goto error_in_read;
>

->

  while (ev_len > 0)
  {
    if (length == 0)
    {
      if (!(length= my_b_fill(cache)))
        goto error_in_read;
    }

<cut>

  uint chunk= std::min(ev_len, (uint)length);

I would have preferred to have MY_MIN() used (like the rest of the
code). (not critical)

Regards,
Monty
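To make the point about my_b_read() concrete, here is a toy model of a buffered reader whose `read()` refills its buffer as needed and reports success or a short read, so the caller never touches the buffer internals. `Toy_io_cache` is a hypothetical class written for this note, not mysys code; only the return convention (0 when all data was read, 1 at end of file) follows the description in the review.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <utility>

// Toy stand-in for an IO_CACHE opened for reading.
class Toy_io_cache {
  std::string backing_;        // stands in for the file behind the cache
  std::size_t file_pos_ = 0;   // next byte of backing_ to load
  std::string buffer_;         // bytes currently buffered
  std::size_t read_pos_ = 0;   // next unread byte in buffer_
  std::size_t buffer_size_;

  // Refill the buffer from the backing data; returns the bytes loaded.
  std::size_t fill() {
    std::size_t n = std::min(buffer_size_, backing_.size() - file_pos_);
    buffer_.assign(backing_, file_pos_, n);
    file_pos_ += n;
    read_pos_ = 0;
    return n;
  }

public:
  Toy_io_cache(std::string data, std::size_t buffer_size)
    : backing_(std::move(data)), buffer_size_(buffer_size) {
    assert(buffer_size_ > 0);
  }

  // Read exactly `count` bytes into `to`, crossing buffer refills as
  // needed. Returns 0 on success and 1 if the data ran out first.
  int read(char *to, std::size_t count) {
    while (count > 0) {
      if (read_pos_ == buffer_.size() && fill() == 0)
        return 1;
      std::size_t chunk = std::min(count, buffer_.size() - read_pos_);
      std::memcpy(to, buffer_.data() + read_pos_, chunk);
      read_pos_ += chunk;
      to += chunk;
      count -= chunk;
    }
    return 0;
  }

  // Bytes not yet handed to the caller (buffered plus still on "disk").
  std::size_t left_to_read() const {
    return (backing_.size() - file_pos_) + (buffer_.size() - read_pos_);
  }
};
```

A header that straddles two or three buffer refills is then read with a single call and a single error check, which is the simplification the review is after.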


Re: ce9ce585e72: MDEV-31184 Remove parser tokens DECODE_MARIADB_SYM and DECODE_ORACLE_SYM
by Sergei Golubchik 23 Oct '23


Hi, Alexander,

On Oct 21, Alexander Barkov wrote:
> revision-id: ce9ce585e72 (mariadb-10.4.30-103-gce9ce585e72)
> parent(s): 1fde785315e
> author: Alexander Barkov
> committer: Alexander Barkov
> timestamp: 2023-08-31 13:49:19 +0400
> message:
>
> MDEV-31184 Remove parser tokens DECODE_MARIADB_SYM and DECODE_ORACLE_SYM
>
> Changing the code handling sql_mode-dependent function DECODE():
>
> - removing parser tokens DECODE_MARIADB_SYM and DECODE_ORACLE_SYM
> - removing the DECODE() related code from sql_yacc.yy/sql_yacc_ora.yy
> - adding handling of DECODE() with help of a new Create_func_func_decode

please, add a test for DECODE_MARIADB_SYM that changes from
ER_PARSE_ERROR to ER_WRONG_PARAMCOUNT_TO_NATIVE_FCT.

Why did sql_yacc.yy have rules for DECODE_ORACLE_SYM and vice versa?
Was it possible to get DECODE_ORACLE_SYM in sql_yacc.yy?

> diff --git a/mysql-test/suite/compat/oracle/r/func_decode.result b/mysql-test/suite/compat/oracle/r/func_decode.result
> index b49bad93627..2809e971be3 100644
> --- a/mysql-test/suite/compat/oracle/r/func_decode.result
> +++ b/mysql-test/suite/compat/oracle/r/func_decode.result
> @@ -1,8 +1,8 @@
> SET sql_mode=ORACLE;
> SELECT DECODE(10);
> -ERROR 42000: You have an error in your SQL syntax; check the manual that corresponds to your MariaDB server version for the right syntax to use near ')' at line 1
> +ERROR 42000: Incorrect parameter count in the call to native function 'DECODE'
> SELECT DECODE(10,10);
> diff --git a/sql/sql_yacc.yy b/sql/sql_yacc.yy
> index cdc04d93708..048741b6ca1 100644
> --- a/sql/sql_yacc.yy
> +++ b/sql/sql_yacc.yy
> @@ -11027,18 +11024,6 @@ function_call_nonkeyword:
>              if (unlikely($$ == NULL))
>                MYSQL_YYABORT;
>            }
> -        | DECODE_MARIADB_SYM '(' expr ',' expr ')'
> -          {
> -            $$= new (thd->mem_root) Item_func_decode(thd, $3, $5);
> -            if (unlikely($$ == NULL))
> -              MYSQL_YYABORT;
> -          }
> -        | DECODE_ORACLE_SYM '(' expr ',' decode_when_list_oracle ')'
> -          {
> -            $5->push_front($3, thd->mem_root);
> -            if (unlikely(!($$= new (thd->mem_root) Item_func_decode_oracle(thd, *$5))))
> -              MYSQL_YYABORT;
> -          }
>          | EXTRACT_SYM '(' interval FROM expr ')'
>            {
>              $$=new (thd->mem_root) Item_extract(thd, $3, $5);
> @@ -12209,25 +12194,6 @@ when_list_opt_else:
>            }
>          ;
>
> -decode_when_list_oracle:
> -          expr ',' expr
> -          {
> -            $$= new (thd->mem_root) List<Item>;
> -            if (unlikely($$ == NULL) ||
> -                unlikely($$->push_back($1, thd->mem_root)) ||
> -                unlikely($$->push_back($3, thd->mem_root)))
> -              MYSQL_YYABORT;
> -
> -          }
> -        | decode_when_list_oracle ',' expr
> -          {
> -            $$= $1;
> -            if (unlikely($$->push_back($3, thd->mem_root)))
> -              MYSQL_YYABORT;
> -          }
> -        ;
> -
> -
>  /* Equivalent to <table reference> in the SQL:2003 standard. */
>  /* Warning - may return NULL in case of incomplete SELECT */

Regards,
Sergei
Chief Architect, MariaDB Server
and security(a)mariadb.org


Re: 56fa1da9c67: MDEV-30048 Prefix keys for CHAR work differently for MyISAM vs InnoDB
by Sergei Golubchik 23 Oct '23


Hi, Alexander,

The actual code change (replacing strnncollsp with strnncollsp_nchars) is ok.

But I didn't like the functions you've created. The *_ft_* family is quite confusing, I would not expect to see fulltext-specific functions in my_compare.h. I would expect to see there functions which are generic and named by what they're doing, not by where they should be used.

ha_compare_char_fixed and ha_compare_char_varying are better. Names are clearer and it's kind of understandable what they do.

I suggest you remove *_ft_* functions. fulltext code should, I guess, just use charset_info->coll->strnncoll() everywhere. Not strnncollsp(), because words normally don't have trailing spaces, and if some custom pluggable parser returns words with trailing spaces, they're likely a part of the word and should be compared too.

Regards,
Sergei
Chief Architect, MariaDB Server
and security(a)mariadb.org

On Oct 17, Alexander Barkov wrote:
> revision-id: 56fa1da9c67 (mariadb-10.4.28-90-g56fa1da9c67)
> parent(s): ed2adc8c6f9
> author: Alexander Barkov
> committer: Alexander Barkov
> timestamp: 2023-04-07 14:54:17 +0400
> message:
>
> MDEV-30048 Prefix keys for CHAR work differently for MyISAM vs InnoDB
>
> Also fixes: MDEV-30050 Inconsistent results of DISTINCT with NOPAD
>
> Problem:
>
> Key segments for CHAR columns where compared using strnncollsp()
> for engines MyISAM and Aria.
>
> This did not work correct in case if the engine applyied trailing
> space compression.
>
> Fix:
>
> Replacing ha_compare_text() with a number of new functions:
>
> - ha_compare_ft_text_full()
> - ha_compare_ft_text_prefix()
> - ha_compare_ft_text()
> - ha_compare_char_varying()
> - ha_compare_char_fixed()
>
> The code branch corresponding to comparison of CHAR column keys
> (HA_KEYTYPE_TEXT segment type) now uses ha_compare_char_fixed()
> which calls strnncollsp_nchars().
>
> For the rest of the code:
> - comparison of VARCHAR/TEXT column keys
> (HA_KEYTYPE_VARTEXT1, HA_KEYTYPE_VARTEXT2 segments types)
> - comparison in the fulltext code
> this patch does not change the behaviour.
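
The distinction drawn in the reply above between strnncoll() and strnncollsp() can be illustrated with a minimal sketch. This is not MariaDB code: `compare_exact` and `compare_pad_space` are hypothetical stand-ins that compare raw bytes and ignore real collation rules. They only show why a word with a trailing space compares differently under the two approaches.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string_view>

// Hypothetical stand-in for a strnncoll()-style comparison:
// every byte counts, including trailing spaces.
int compare_exact(std::string_view a, std::string_view b)
{
  int c = a.compare(b);
  return (c > 0) - (c < 0);  // normalise to -1, 0 or 1
}

// Hypothetical stand-in for a strnncollsp()-style (PAD SPACE) comparison:
// the shorter string is treated as if it were padded with spaces.
int compare_pad_space(std::string_view a, std::string_view b)
{
  std::size_t n = std::max(a.size(), b.size());
  for (std::size_t i = 0; i < n; i++)
  {
    unsigned char ca = i < a.size() ? a[i] : ' ';
    unsigned char cb = i < b.size() ? b[i] : ' ';
    if (ca != cb)
      return ca < cb ? -1 : 1;
  }
  return 0;
}
```

With these helpers, "word" and "word " are equal under `compare_pad_space` but different under `compare_exact`, which is the behaviour the reply argues fulltext code should have when a parser returns a word that really ends in a space.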


Re: [PATCH] MDEV-31949 parallel slave xa Round-Robin distribution
by Kristian Nielsen 22 Oct '23


Kristian Nielsen <knielsen(a)knielsen-hq.org> writes:

Hi Andrei,

> commit 96bd9e6b780f3738b5008b89aae6b0b086f15943
> Author: Andrei <andrei.elkin(a)mariadb.com>
> Date: Sat Aug 19 19:49:25 2023 +0300
> XA-Prepare group of events
>
> XA START xid
> ...
> XA END xid
> XA PREPARE xid
>
> and its XA-"complete" terminator
>
> XA COMMIT or
> XA ROLLBACK
>
> are made distributed Round-Robin across slave parallel workers.
> The former hash-based policy was proven to attribute to execution
> latency through creating a big - many times larger than the size
> of the worker pool - queue of binlog-ordered transactions
> to commit.
>
> Acronyms and notations used below:
>
> XAP := XA-Prepare event or the whole prepared XA group of events
> XAC := XA-"complete", which is a solitary group of events
> |W| := the size of the slave worker pool
> Subscripts like `_k' denote order in a corresponding sequence
> (e.g binlog file).

Here is my review of this patch.

> The list is arranged as a sliding window with the size of 2*|W| to account
> a possibility of XAP_k -> XAP_k+2|W|-1 the largest (in the group-of-events
> count sense) dependency.

I spent a lot of effort trying to understand why the factor 2 in the size of 2*|W| occurs. Since each transaction must wait_for_prior_commit for the prior transaction, there should never be more than |W| transactions active. In your example with |W|=4, you say that

> Worker #4 can take on its T_8 when T_1 is yet at the beginning of its processing

But that does not seem possible. T_8 cannot start until T_1 (and 2..4) have done wakeup_subsequent_commits(), which surely shouldn't happen "at the beginning of its processing", but only at the end.

But I think I now understand. The problem is that the XA code calls wakeup_subsequent_commits _before_ updating the XID hash (in XA PREPARE) / deleting from the XID hash (in XA COMMIT/ROLLBACK). Right?

This I think is the fundamental issue to address.

The wait_for_prior_commit mechanism is to ensure that the prior transaction has completed its commit, which means the state of the transaction is "committed" in memory. Thus, the XID hash, which records information about the status of the (XA) transactions, must also be updated before wakeup_subsequent_commits() may be done.

I can see that the XA PREPARE/COMMIT/ROLLBACK already uses the trx_group_commit_leader() code. So what you should do is to arrange for the update of / delete from the XID hash to happen inside there, just before the wakeup_subsequent_commits() gets called:

    if (current->cache_mngr->using_xa && likely(!current->error) &&
        DBUG_EVALUATE_IF("skip_commit_ordered", 0, 1))
    {
      mysql_mutex_lock(&current->thd->LOCK_thd_data);
      run_commit_ordered(current->thd, current->all);
      mysql_mutex_unlock(&current->thd->LOCK_thd_data);
    }
    current->thd->wakeup_subsequent_commits(current->error);

I checked the code, the XID hash update/delete currently happens shortly after that in the code, so there should be no problems moving it in there.

[Incidentally, I also noticed that the XA PREPARE/COMMIT does not use the commit_ordered mechanism. While this seems (I think) unrelated to this patch, I think it is something you need to look into. The commit_ordered mechanism is central to ensuring that InnoDB and binlog commit in the same order, and I don't see how this is guaranteed for the current XA code (and I suspect this might also mean that binlog recovery will be broken in some cases).]

Once the XID hash update/delete is moved as described, we have the very nice property that after an XA PREPARE/COMMIT/ROLLBACK T_i has done wakeup_subsequent_commits(), it is safe for a following T_j that refers to the same XID to replicate. This means that you can now use the existing wait_for_prior_commit mechanism to handle the dependencies between XA event groups with the same XID.

Thus, there is no need to introduce a separate (and very complicated) wait mechanism of xid_cache_insert_maybe_wait() and xid_cache_search_maybe_wait(), solely for parallel replication of user XA.

Thus, in the SQL driver thread, using your sliding window (which need only be of size |W| I believe), you can mark a "T1: XA COMMIT <xid>" followed by "T2: XA PREPARE <xid>" of a duplicate XID, that T2 must do a wait_for_prior_commit(T1) before running. This is simple, it can use the existing mechanisms for that, using entry->last_committed_sub_id. Just like rgi->wait_commit_sub_id and rgi->wait_commit_group_info, we can introduce eg.:

    rgi->pre_dependency_sub_id
    rgi->pre_dependency_group_info

and then do a

    wait_for_prior_commit(&rgi->pre_dependency_group_info->commit_orderer)

if (rgi->pre_dependency_sub_id > entry->last_committed_sub_id).

Similarly, the sliding window can record for "T3: XA PREPARE" and "T4: XA COMMIT", that T4 should do a wait_for_prior_commit(T3) before running. Then T4 will be sure that the XA transaction is ready.

This will also be a good preparation for later introducing more of this kind of pre-calculated dependencies to the parallel replication scheduling, along the lines of the "balanced applier" that you have written about previously. These ideas I believe have tremendous potential, and handling this in a generic way is a very big improvement. I'm not suggesting to implement more than what is necessary now for XA, but keep it in mind that this dependency calculation in the SQL driver thread can be used for things other than user XA in the future. You can also name the newly introduced dependency fields appropriately (ie. with generic names not specific to XA when applicable).

This will greatly simplify the patch, I believe; and more importantly it will integrate the XA-required scheduling in a clean and generic way in the parallel replication code.
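
The pre-calculated dependency idea described above can be sketched roughly as follows. This is a simplified, single-threaded model and not the server's data structures: `EventGroup`, `XidDependencyTracker` and their members are hypothetical names that only echo the proposed `pre_dependency_sub_id` and `last_committed_sub_id`. The sketch keeps an unbounded map instead of a window of size |W|, and it only answers whether a worker would have to wait; the actual waiting is left out.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>

// Hypothetical model of one replicated event group.
struct EventGroup
{
  std::uint64_t sub_id = 0;                 // position in commit order
  std::uint64_t pre_dependency_sub_id = 0;  // 0 means no extra dependency
};

class XidDependencyTracker
{
public:
  // Called by the single SQL driver thread, in binlog order.
  // Records a dependency on the previous event group with the same XID.
  EventGroup schedule(const std::string &xid)
  {
    EventGroup g;
    g.sub_id = ++next_sub_id_;
    auto it = last_sub_id_for_xid_.find(xid);
    if (it != last_sub_id_for_xid_.end())
      g.pre_dependency_sub_id = it->second;
    last_sub_id_for_xid_[xid] = g.sub_id;
    return g;
  }

  // Called once an event group has fully committed (commits are in order).
  void mark_committed(std::uint64_t sub_id)
  {
    if (sub_id > last_committed_sub_id_)
      last_committed_sub_id_ = sub_id;
  }

  // The worker only needs to wait if its dependency has not committed yet.
  bool must_wait(const EventGroup &g) const
  {
    return g.pre_dependency_sub_id > last_committed_sub_id_;
  }

private:
  std::uint64_t next_sub_id_ = 0;
  std::uint64_t last_committed_sub_id_ = 0;
  std::unordered_map<std::string, std::uint64_t> last_sub_id_for_xid_;
};
```

In this model an XA COMMIT scheduled after an XA PREPARE of the same XID reports `must_wait()` until the PREPARE's sub_id has been marked committed, while event groups with unrelated XIDs never wait on it.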
Following are more detailed comments on the patch:

> - DBUG_ASSERT(
> - !(thd->rgi_slave &&
> - !thd->rgi_slave->worker_error &&
> - thd->rgi_slave->did_mark_start_commit) ||
> - (thd->transaction->xid_state.is_explicit_XA() ||
> - (thd->rgi_slave->gtid_ev_flags2 & Gtid_log_event::FL_PREPARED_XA)));
> -
> + DBUG_ASSERT(!(thd->rgi_slave &&
> + !thd->rgi_slave->worker_error &&
> + thd->rgi_slave->did_mark_start_commit) ||
> + (thd->transaction->xid_state.is_explicit_XA() ||
> + (thd->rgi_slave->gtid_ev_flags2 &
> + (Gtid_log_event::FL_PREPARED_XA |
> + Gtid_log_event::FL_COMPLETED_XA))));
> if (thd->rgi_slave &&
> !thd->rgi_slave->worker_error &&
> thd->rgi_slave->did_mark_start_commit)
> thd->rgi_slave->unmark_start_commit();

Why is it allowed to rollback here while did_mark_start_commit is true for XA PREPARE or XA COMMIT? The comment above should be extended to explain this.

And why does the following if () statement then still do an emergency unmark_start_commit(), when the condition is not caught by the assertion? It looks like something is wrong in the earlier code, the intention here is that it should never be necessary to unmark_start_commit() here, and if it is, then it is a bug flagged by the assertion.

> @@ -1751,7 +1753,8 @@ binlog_flush_cache(THD *thd, binlog_cache_mngr *cache_mngr,
>
> if ((using_stmt && !cache_mngr->stmt_cache.empty()) ||
> (using_trx && !cache_mngr->trx_cache.empty()) ||
> - thd->transaction->xid_state.is_explicit_XA())
> + (thd->transaction->xid_state.is_explicit_XA() ||
> + (thd->rgi_slave && thd->rgi_slave->is_async_xac)))

Why is this extra condition thd->rgi_slave->is_async_xac necessary? There are many of these conditions spread around the code in the patch. I think perhaps it is because the transaction is somehow not "connected" to the THD? Because it is in the XID cache?

But the situation must be similar on the master, if the XA COMMIT happens in a different transaction from XA PREPARE. So this should be done the same way on the slave, so that the XA transaction gets connected to the THD of the worker thread processing it, and so that these extra rgi_slave->is_async_xac conditions are not needed.

It is very fragile to have such conditions spread around the code, it will be impossible to avoid bugs due to forgetting such extra condition on one place or another.

> + /*
> + While xid_state.get_xid() is a robust method to access `xid`
> + it can't be used on slave by the asynchronously running XA-"complete".
> + In the latter case thd->lex->xid is safely accessible.
> + */
> + buflen= serialize_with_xid(is_async_xac? thd->lex->xid :
> + thd->transaction->xid_state.get_xid(),

Same here, it should be possible to use the same xid_state in the slave thread as on the master, so we avoid special cases all over the code for the slave threads.

> +static bool acquire_xid(THD *thd)
> +{
> + bool rc= false;
> +
> + if (thd->rgi_slave && thd->rgi_slave->is_async_xac &&
> + thd->rgi_slave->gtid_ev_flags2 & Gtid_log_event::FL_COMPLETED_XA)
> + {
> + XID_STATE &xid_state= thd->transaction->xid_state;

This function actually seems to do part of "XA transaction gets connected to the THD". But it's called from binlog_commit_by_xid() and binlog_rollback_by_xid(), which means it has to have yet another of these special-condition checks for thd->rgi_slave->is_async_xac etc.

Instead, call this from the code that applies the "XA COMMIT" event (eg. in log_event_server.cc), and make sure it fully connects the XA transaction to the THD. Then you get rid of the checks for is_async_xac in a lot of places, and also avoid polluting with replication details the code that runs the original transaction on the master.

> + rc= binlog_commit(thd, TRUE, FALSE);
> + thd->ha_data[binlog_hton->slot].ha_info[1].reset();
> + }
> + if (!rc)
> + {
> + rc= acquire_xid(thd);
> + }
> + if (thd->is_current_stmt_binlog_disabled())
> + {
> + thd->wakeup_subsequent_commits(rc);
> + }

So IIUC, here we first binlog the XA COMMIT and then wakeup_subsequent_commits(). But the engine commit only happens afterwards, in ha_commit_or_rollback_by_xid.

Why wakeup_subsequent_commits() here? It seems too early, the transaction is not yet committed in the engine, how do you avoid that the commits in the engine will happen in the wrong order? And then a mariabackup might take a backup with T2 committed but T1 not and not have a valid replication position to provision a slave.

And also, is this skipping the binlog transaction coordinator and two-phase commit with the engines? And in particular, commit_ordered()? Then again, it seems we have the problem that the engine can commit in the opposite order from the binlog. I wonder if this will also affect crash recovery if we crash with different commit order in binlog and engine?

So I think wakeup_subsequent_commits() here is wrong, should be removed.

> @@ -2185,7 +2293,9 @@ int binlog_commit(THD *thd, bool all, bool ro_1pc)
> }
>
> if (cache_mngr->trx_cache.empty() &&
> - (thd->transaction->xid_state.get_state_code() != XA_PREPARED ||
> + ((thd->transaction->xid_state.get_state_code() != XA_PREPARED &&
> + !(thd->rgi_slave && thd->rgi_slave->is_parallel_exec &&
> + thd->lex->sql_command == SQLCOM_XA_COMMIT)) ||

As explained above, this condition (and the similar one in binlog_rollback()) should be avoided.

> @@ -10510,13 +10624,20 @@ int TC_LOG_BINLOG::unlog_xa_prepare(THD *thd, bool all)
>
> binlog_cache_mngr *cache_mngr= thd->binlog_setup_trx_data();
> int cookie= 0;
> + int rc= 0;
> +
> + if (thd->rgi_slave && thd->is_current_stmt_binlog_disabled())
> + {
> + rc= thd->wait_for_prior_commit();
> + if (rc == 0)
> + thd->wakeup_subsequent_commits(rc);
> + return rc;
> + }

Why is this necessary? If it is really necessary, then put it in the code that applies the XA PREPARE event, not as a random special case in generic code.

> - bool rc= false;
> -
>
> rc= write_empty_xa_prepare(thd, cache_mngr); // normally gains need_unlog
>
> static bool write_empty_xa_prepare(THD *thd, binlog_cache_mngr *cache_mngr)
> {
> return binlog_commit_flush_xa_prepare(thd, true, cache_mngr);
> }

Smaller point: since you're fixing the type of `rc` to be int (which is correct), also fix the return type of write_empty_xa_prepare() to be int and not bool - as binlog_commit_flush_xa_prepare() as well as unlog_xa_prepare() return int, not bool.

> @@ -3314,16 +3314,22 @@ Gtid_log_event::Gtid_log_event(THD *thd_arg, uint64 seq_no_arg,
> XID_STATE &xid_state= thd->transaction->xid_state;
> if (is_transactional)
> {
> - if (xid_state.is_explicit_XA() &&
> - (thd->lex->sql_command == SQLCOM_XA_PREPARE ||
> - xid_state.get_state_code() == XA_PREPARED))
> + bool is_async_xac= false;
> + if ((xid_state.is_explicit_XA() &&
> + (thd->lex->sql_command == SQLCOM_XA_PREPARE ||
> + xid_state.get_state_code() == XA_PREPARED)) ||
> + (is_async_xac= (thd->rgi_slave && thd->rgi_slave->is_async_xac)))

Again, this is the generic GTID event constructor, it shouldn't need this kind of logic. It's very strange that a simple class constructor returns something different depending on which thread it runs in!

Hopefully this is no longer necessary after XA transaction gets properly connected to the THD. But else, it should be handled by passing in suitable flag from the caller, or by the caller setting the required modifications after construction.

> +using std::max;

Please don't. Using std::max() explicitly is not long and makes it clear which `max` implementation is used.

> @@ -760,7 +762,8 @@ convert_kill_to_deadlock_error(rpl_group_info *rgi)
> return;
> err_code= thd->get_stmt_da()->sql_errno();
> if ((rgi->speculation == rpl_group_info::SPECULATE_OPTIMISTIC &&
> - err_code != ER_PRIOR_COMMIT_FAILED) ||
> + (err_code != ER_PRIOR_COMMIT_FAILED &&
> + err_code != ER_XAER_NOTA)) ||

Why? What is special about the ER_XAER_NOTA error? Is it really safe to speculatively run the XA COMMIT in parallel, and then retry it for any other error than ER_XAER_NOTA ?

> @@ -2467,6 +2463,7 @@ free_rpl_parallel_entry(void *element)
> }
> mysql_cond_destroy(&e->COND_parallel_entry);
> mysql_mutex_destroy(&e->LOCK_parallel_entry);
> + e->concurrent_xaps_window.~Dynamic_array();
> my_free(e);

No, let's not do this, calling a destructor explicitly in a POD managed by my_malloc() / my_free(). Why not just make concurrent_xaps_window a pointer to the Dynamic_array and manage with new / delete? Then you can also allocate it lazily, so it will not need to be allocated at all for the 99.9xx% of users who are not replicating user XA transactions.

The size of the sliding window is fixed anyway (until slave_parallel_threads is changed), so why use Dynamic_array at all, why not just allocate a plain array of the right size?

> +template <>
> +struct std::hash<XID>
> +{
> + std::size_t operator()(const XID& xid) const
> + {
> + return my_hash_sort(&my_charset_bin, xid.key(), xid.key_length());

I don't understand, what is the purpose of introducing this function object? Why not just call my_hash_sort() directly (seems it's only used in one place) and avoid this code-obfuscation?
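
The suggestion above, to hold the sliding window through a pointer inside a struct that is managed with malloc()/free()-style calls and to allocate it lazily as a plain fixed-size array, could look roughly like this. It is a sketch under assumptions: `ParallelEntry`, `WindowSlot` and the three helper functions are hypothetical stand-ins for rpl_parallel_entry and its setup/teardown code, and plain `std::malloc`/`std::free` are used in place of my_malloc()/my_free().

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <utility>

using WindowSlot = std::pair<std::size_t, std::uint32_t>;

// Hypothetical stand-in for rpl_parallel_entry: it lives in malloc'ed
// memory, so no constructor or destructor ever runs on it.
struct ParallelEntry
{
  WindowSlot *xa_window;       // null until the first user XA event
  std::size_t xa_window_size;
};

ParallelEntry *entry_create()
{
  auto *e = static_cast<ParallelEntry *>(std::malloc(sizeof(ParallelEntry)));
  if (e)
  {
    e->xa_window = nullptr;    // nothing is allocated for non-XA workloads
    e->xa_window_size = 0;
  }
  return e;
}

// Allocate the fixed-size window on first use only.
WindowSlot *entry_window(ParallelEntry *e, std::size_t workers)
{
  if (!e->xa_window)
  {
    e->xa_window = new WindowSlot[workers]();  // value-initialised slots
    e->xa_window_size = workers;
  }
  return e->xa_window;
}

void entry_free(ParallelEntry *e)
{
  delete[] e->xa_window;       // delete[] on a null pointer is a no-op
  std::free(e);
}
```

The point of the design is that the struct itself stays trivially destructible, so freeing it never needs an explicit destructor call, and users who never replicate user XA transactions never pay for the window.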
> + Dynamic_array<std::pair<std::size_t, uint32>> concurrent_xaps_window;

This is dangerous as Dynamic_array has a destructor but rpl_parallel_entry is a POD where we don't call any constructor/destructor. Should instead use a pointer to an object constructed with new (or not use Dynamic_array at all as suggested above).

> + /*
> + When true indicates that the user xa transaction is going to
> + complete (with COMMIT or ROLLBACK) by the worker thread,
> + *while* another worker is still preparing it. Once the latter is done
> + the xid will be acquired and the flag gets reset.
> + */
> + bool is_async_xac;
> +

I don't understand this comment. Surely it is not possible to *complete*, with XA COMMIT, a transaction before the corresponding XA PREPARE has completed (possibly in another worker thread)?

Do you mean that it is possible for a worker thread W2 to *start* applying the XA COMMIT speculatively, before another worker W1 has completed the XA PREPARE?

But I think this is not necessary, see below the discussion that the XA COMMIT worker can just do a wait_for_prior_commit() on the event group of its XA PREPARE. Then this should not be necessary.

> @@ -137,7 +137,7 @@ template <class Elem> class Dynamic_array
> */
> Elem& at(size_t idx)
> {
> - DBUG_ASSERT(idx < array.elements);
> + DBUG_ASSERT(idx < max_size());
> return *(((Elem*)array.buffer) + idx);

No, this cannot possibly be right. This is a patch about XA replication, why would it suddenly allow accessing deleted elements of any Dynamic_array used in the server?!?

Surely you can allocate the elements you need for your sliding window. In fact, since the window is fixed size anyway, why use Dynamic_array at all?

@@ -2367,6 +2367,15 @@ struct wait_for_commit
event group is fully done.
*/
bool wakeup_blocked;
+ /*
+ The condition variable servers as a part of facilities to handle various
+ commit time additional dependency between groups of replication events, e.g
+ XA-Prepare -> XA-Commit, or XA-Prepare -> XA-Prepare all with the same xid.
+ */
+ mysql_cond_t COND_wait_commit_dep;

No, this doesn't belong in struct wait_for_commit. wait_for_commit is a low-level mechanism for ordering commits, it should not need any knowledge even of replication, certainly not of user-XA replication. The COND_wait_commit_dep isn't even used in any function related to wait_for_commit or even sql_class.cc, it's only used in sql/xa.cc.

[But as discussed at the start, we can use the existing wait_for_prior_commit instead of introducing this new mechanism just for replicating user XA.]

> -static XID_cache_element *xid_cache_search(THD *thd, XID *xid)
> +XID_cache_element *xid_cache_search(THD *thd, XID *xid)

Why remove the `static`? The function is not used outside of sql/xa.cc.

> @@ -254,16 +259,221 @@ static XID_cache_element *xid_cache_search(THD *thd, XID *xid)
> + if (thd->rgi_slave && thd->rgi_slave->is_parallel_exec)
> + {
> + DBUG_ASSERT(thd->lex->sql_command == SQLCOM_XA_COMMIT ||
> + thd->lex->sql_command == SQLCOM_XA_ROLLBACK);
> + thd->rgi_slave->is_async_xac= true;

Here again we have some logic that relates to replication being inserted somewhat randomly in low-level code, requiring this thd->rgi_slave->is_parallel_exec condition.

Instead of doing this, the caller higher up should handle this depending on the return of xid_cache_search(). This way, code paths that are not related to replication don't get affected by the replication-specific logic.

> +bool xid_cache_insert_maybe_wait(THD* thd)
> +{

This new wait mechanism just for parallel replication of user XA seems very complex. Spin loop, and extra locking and condition variables and locking on a lock-free hash.

As explained at the start, let's instead avoid introducing a new mechanism and use the existing wait_for_prior_commit, which is central to parallel replication and highly optimized for this.

IIUC, the real problem here is that after wakeup_subsequent_commits() of XA COMMIT, the XID is still in the xid_hash. This is because of this code in trans_xa_commit():

    ha_commit_or_rollback_by_xid(thd->lex->xid, !res);
    if (!res && thd->is_error())
    {
      // hton completion error retains xs/xid in the cache,
      // unless there had been already one as reflected by `res`.
      res= true;
      goto _end_external_xid;
    }
    xid_cache_delete(thd, xs);
    xid_deleted= true;

ha_commit_or_rollback_by_xid() ends up in queue_for_group_commit() which will do the wakeup_subsequent_commits. But the xid_cache_delete() happens just after that. This is too late. The wakeup_subsequent_commits() must not occur until after the transaction is fully committed in memory.

So instead, move the xid_cache_delete() call so it happens inside the group commit code. Then when wakeup_subsequent_commits() is done after XA COMMIT, the XID entry will be gone. It would go here in MYSQL_BIN_LOG::trx_group_commit_leader():

    if (current->cache_mngr->using_xa && likely(!current->error) &&
        DBUG_EVALUATE_IF("skip_commit_ordered", 0, 1))
    {
      mysql_mutex_lock(&current->thd->LOCK_thd_data);
      run_commit_ordered(current->thd, current->all);
      mysql_mutex_unlock(&current->thd->LOCK_thd_data);
    }
    current->thd->wakeup_subsequent_commits(current->error);

I noticed that commit_ordered is not done for user XA transactions in replication. The direct reason for that seems to be that they do not use log_and_order(), which sets cache_mngr->using_xa.

Can you explain why the XA PREPARE and XA COMMIT are not following the commit_ordered protocol, and what mechanism exists instead to ensure that everything still works correctly? For example that commit order in binlog and InnoDB will be the same, and that binlog recovery works?

And finally, just for the record, I still do not agree with the way replication of user XA has been changed (aka MDEV-32020), and this patch does not fix the underlying problem. But changing it to use the wait_for_prior_commit() mechanism as explained at the start will improve the integration with the parallel replication code and be a step towards solving also some of the other underlying issues.

 - Kristian.


Locking thd->LOCK_thd_data during group commit
by Kristian Nielsen 21 Oct '23


Hi Serg,

I came upon this commit by you:

> commit 6b685ea7b0776430d45b095cb4be3ef0739a3c04
> Author: Sergei Golubchik <serg(a)mariadb.org>
> Date: Wed Sep 28 18:55:15 2022 +0200
>
> correctness assert
>
> thd_get_ha_data() can be used without a lock, but only from the
> current thd thread, when calling from anoher thread it *must*
> be protected by thd->LOCK_thd_data
>
> * fix group commit code to take thd->LOCK_thd_data
> * remove innobase_close_connection() from the innodb background thread,
> it's not needed after 87775402cd0c and was failing the assert with
> current_thd==0

> @@ -8512,7 +8512,11 @@ MYSQL_BIN_LOG::trx_group_commit_leader(group_commit_entry *leader)
> ++num_commits;
> if (current->cache_mngr->using_xa && likely(!current->error) &&
> DBUG_EVALUATE_IF("skip_commit_ordered", 0, 1))
> + {
> + mysql_mutex_lock(&current->thd->LOCK_thd_data);
> run_commit_ordered(current->thd, current->all);
> + mysql_mutex_unlock(&current->thd->LOCK_thd_data);
> + }

This seems _really_ expensive :-( This code runs for every transaction commit in replication, and it runs under a global LOCK_commit_ordered which needs to be held for as short a time as possible.

The commit message doesn't mention anything about what goes wrong during the group commit code. And the patch doesn't have any test case that shows the problem, it just adds an assertion that will trigger during the group commit.

So what is the actual reason for this change? Note that at this point, the group commit leader thread _is_ effectively acting as if it was each of the participating threads, the other threads are blocked from running during commit_ordered.

If it is just so that we can have this assertion, then that needs to be rolled back. We _really_ don't want to take/release N mutexes in release builds during LOCK_commit_ordered, just to have an assertion in debug builds.

If there is some reason, then the correct fix could be to set current_thd for the duration of run_commit_ordered, which will satisfy the assertion and presumably the actual bug? Well, it depends _what_ the actual reason/bug is.

 - Kristian.
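
The alternative floated above, setting current_thd for the duration of run_commit_ordered instead of taking a mutex per participant, might be sketched as a scope guard. This is only an illustration under assumptions: `THD_stub`, `current_thd_stub`, `CurrentThdGuard`, `may_access_without_lock` and `run_commit_ordered_as` are hypothetical names, and whether such a switch is cheap or sufficient in the real server depends on what the actual bug is, which the mail leaves open.

```cpp
#include <cassert>

struct THD_stub { int id; };   // hypothetical stand-in for THD

// Models the per-thread current_thd pointer.
thread_local THD_stub *current_thd_stub = nullptr;

// Temporarily makes another THD the "current" one; restores on scope exit.
class CurrentThdGuard
{
public:
  explicit CurrentThdGuard(THD_stub *thd) : saved_(current_thd_stub)
  {
    current_thd_stub = thd;
  }
  ~CurrentThdGuard() { current_thd_stub = saved_; }
  CurrentThdGuard(const CurrentThdGuard &) = delete;
  CurrentThdGuard &operator=(const CurrentThdGuard &) = delete;

private:
  THD_stub *saved_;
};

// Models the assertion: lock-free access is only allowed for the THD
// that is current on the calling thread.
bool may_access_without_lock(const THD_stub *thd)
{
  return thd == current_thd_stub;
}

// The group commit leader runs the ordered step on behalf of a participant.
bool run_commit_ordered_as(THD_stub *participant)
{
  CurrentThdGuard guard(participant);
  return may_access_without_lock(participant);
}
```

In this model the leader passes the check for each participant without locking anything, and its own current pointer is restored as soon as the guarded scope ends.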


Re: Review of MDEV-31273: Eliminate Log_event::checksum_alg
by Kristian Nielsen 17 Oct '23


Hi Monty,

Lots of good comments, thanks! Replies inline:

Michael Widenius <michael.widenius(a)gmail.com> writes:

> enum enum_binlog_checksum_alg
>
> In c++ we can drop the first 'enum'
> You can fix that by doing:

Ack, will fix.

> event.select_checksum_alg()
>
> This is called a lot. Would it not be easier to store the checksum_alg
> in the event when it is created and access the variable directly?
> On the other hand, the number of calls to event.select_checksum_alg() may
> not change as we will only create one

Yes, this is called only once per event, when writing the event into the binlog or the event cache. So it seems unnecessary to use extra memory to store it. And we don't necessarily even have binlog_cache_data when constructing the event to pass to select_checksum_alg(), so I think it is correct to select the checksum algorithm to use only when we write the event.

> @@ -1258,11 +1258,10 @@ Log_event* Log_event::read_log_event(const
> uchar *buf, uint event_len,
>
> if (ev)
> {
> - ev->checksum_alg= alg;
> #ifdef MYSQL_CLIENT
> if (ev->checksum_alg != BINLOG_CHECKSUM_ALG_OFF &&
> ev->checksum_alg != BINLOG_CHECKSUM_ALG_UNDEF)
> ev->crc= uint4korr(buf + (event_len));
> + ev->read_checksum_alg= alg;
>
> Why move the setting of read_checksum_alg inside MYQL_CLIENT?

It is because with this patch, ev->checksum_alg is no longer needed in the server (the class member is removed). It is needed in the client (mysqlbinlog), so I added a new member read_checksum_alg, which is the algorithm that was used to read the event. This is needed so that mysqlbinlog can output the checksum (if any) in textual form. But it is not needed in the server.

> To ensure that this is not wrongly used, please add
> #ifdef MYSQL_CLIENT
>
> here:
>
> class Log_event:
>
> #ifdef MYSQL_CLIENT
> + enum enum_binlog_checksum_alg read_checksum_alg;
> #endif

Yes, agree, that is the intention - the read_checksum_alg is only present in the class in the client, not in the server (and we want to avoid using memory for it in the server).

This code is already in the #else branch of #ifdef MYSQL_SERVER:

    #ifdef MYSQL_SERVER
    ...
    #else
      /* The checksum algorithm used (if any) when the event was read. */
      enum enum_binlog_checksum_alg read_checksum_alg;

I thought this would have the same effect as your suggestion with #ifdef MYSQL_CLIENT. It probably does, but it seems somewhat complex when MYSQL_CLIENT and MYSQL_SERVER are defined. Maybe they can even both be defined, though probably not where log_event.h is included? It's a bit confusing that we have both MYSQL_CLIENT and MYSQL_SERVER used for conditional compilation in log_event.h. But probably not something that should be cleaned up in this patch.

Do you think I should still add the #ifdef MYSQL_CLIENT as you suggested (even though it should be redundant in the #else branch of #ifdef MYSQL_SERVER)? It can't hurt after all, even if we always have MYSQL_CLIENT == !MYSQL_SERVER.

> in Format_description_log_event(const uchar
>
> We have:
>
> else
> {
> checksum_alg= BINLOG_CHECKSUM_ALG_UNDEF;
> used_checksum_alg= BINLOG_CHECKSUM_ALG_OFF;
> }
>
> The used_checksum_alg is not needed as you set it at function start.

Sorry, I don't understand? At the start we set it to "undef":

    + used_checksum_alg= BINLOG_CHECKSUM_ALG_UNDEF;

This value will remain if there's an error return; else the else branch will change it to "off". Did you miss that it's set differently at the start of the function and in this else { } branch? Or did I miss something?

> sql/log_event.h
>
> enum enum_binlog_checksum_alg set_checksum_alg(enum
> enum_binlog_checksum_alg alg)
> As we are just copying and restoring checksum_length, why not just do
>
> Format_description_log_event::write(Log_event_writer *writer)
>
> org_checksum_length= checksum_length...
> ....
> checksum_length= org_checksum_length;
>
> instead of calling set_checksum_alg() again ?
> (The last set we can even do undonditionally which saves us an if)

Agree, that code looks silly, I don't know why I did it that way. I'll fix it, remove the function set_checksum_alg() and just do:

    org_checksum_length= writer->checksum_length;
    writer->checksum_length= BINLOG_CHECKSUM_LEN;
    ...
    writer->checksum_length= org_checksum_length;

> log_event_server.cc
>
> enum enum_binlog_checksum_alg Log_event::select_checksum_alg()
>
> Much simpler, but we do not check if cache_type is correct
>
> DBUG_ASSERT(((get_type_code() != ROTATE_EVENT &&
> get_type_code() != STOP_EVENT) ||
> get_type_code() != FORMAT_DESCRIPTION_EVENT) ||
> cache_type == Log_event::EVENT_NO_CACHE);
>
> Do you think this is not needed?
> (This could be added just before:
>
> return BINLOG_CHECKSUM_ALG_OFF;

Agree, I was happy to remove a lot of complexity and the need for a number of asserts, but this assertion could be kept. I'll add it back.

> Anyway, please remove the else after:
>
> if (cache_type == Log_event::EVENT_NO_CACHE)
> return (enum_binlog_checksum_alg)binlog_checksum_options;

Ack will do.

> Format_description_log_event::write(Log_event_writer *writer)
>
>
> uint8 checksum_byte= (uint8) (used_checksum_alg != BINLOG_CHECKSUM_ALG_UNDEF ?
> used_checksum_alg : BINLOG_CHECKSUM_ALG_OFF);
>
> Is it possible the alg can be UNDEF?
> You have several asserts in the code that this should not happen.

Yeah, honestly, I'm not sure. It's one of the annoying things in the code that I did not get to fix, that we have both the UNDEF and the OFF state for the checksum algorithm.

Format_description code is not performance critical (only once per binlog file), so I suggest I leave the extra check for UNDEF as defensive code, but add an assertion that it is not UNDEF and see if it passes the testsuite. Does that sound ok?

> slave.cc
>
> @@ -1870,10 +1873,8 @@ static int get_master_version_and_clock(MYSQL*
> mysql, Master_info* mi)
> until it has received a new FD_m.
> */
> mi->rli.relay_log.description_event_for_queue->checksum_alg=
> mi->rli.relay_log.relay_log_checksum_alg;
>
> Not sure why this is removed.
>
> Are these guaranteed to be the same here?
> If yes, why not add an assert for this?

This patch removes Log_event::checksum_alg and instead adds a field Format_description_log_event::used_checksum_alg which gets set in the Format_description_log_event constructor. So this assignment

    - mi->rli.relay_log.description_event_for_queue->checksum_alg=
    - mi->rli.relay_log.relay_log_checksum_alg;

is replaced by this code just above which serves the same purpose:

    - Format_description_log_event(4, mysql->server_version);
    + Format_description_log_event(4, mysql->server_version,
    + mi->rli.relay_log.relay_log_checksum_alg);

So I think adding an assert

    DBUG_ASSERT(mi->rli.relay_log.description_event_for_queue->checksum_alg ==
                mi->rli.relay_log.relay_log_checksum_alg);

is somewhat redundant, but let me know if you still think it will be useful and I'll add it.

 - Kristian.
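
The save/override/restore of the writer's checksum length agreed on above can be sketched as a small self-contained model. The names here are hypothetical: `WriterStub` stands in for Log_event_writer, `write_format_description` for Format_description_log_event::write(), and `CHECKSUM_LEN` is an assumed value standing in for BINLOG_CHECKSUM_LEN. The real function writes an actual event; this sketch only counts bytes to show that the original length is restored unconditionally.

```cpp
#include <cassert>

// Assumed stand-in for BINLOG_CHECKSUM_LEN (value chosen for illustration).
constexpr unsigned CHECKSUM_LEN = 4;

// Hypothetical stand-in for Log_event_writer.
struct WriterStub
{
  unsigned checksum_length;  // checksum bytes appended to each event
  unsigned bytes_written;
};

// Hypothetical stand-in for Format_description_log_event::write().
void write_format_description(WriterStub *writer, unsigned payload_len)
{
  unsigned org_checksum_length = writer->checksum_length;

  // Force the checksum length for this one event only.
  writer->checksum_length = CHECKSUM_LEN;
  writer->bytes_written += payload_len + writer->checksum_length;

  // Restore unconditionally; no "if" is needed around this assignment.
  writer->checksum_length = org_checksum_length;
}
```

A writer configured with a checksum length of 0 still has 0 after the call, while the bytes written for this event include the forced checksum.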
