Hi, Andrei,
On Jan 05, Andrei wrote:
> commit 1c9edc72fc2
> Author: Andrei <andrei.elkin(a)mariadb.com>
> Date: Wed Nov 1 20:32:45 2023 +0200
>
> MDEV-32830 I. refactor XA binlogging for better integration with BGC/replication/recovery
>
> This commit is the part I in the series of three that addresses MDEV-31949.
>
> This part improves upon MDEV-742 design of the XA binlogging.
> When binlog is ON, to execute XA-prepare first in engines to write
> the xa-prepare replication event last.
> MYSQL_BIN_LOG::run_ordered_commit() is made to execute XA commit and
> rollback in engines, as well as
> it completes with the XA transaction state, so that its
> XID has become available for next use, e.g by another slave parallel
> worker, that e.g could be waiting for its order to commit a being
> prepared XID transaction.
Is it intentional
that line
length in the paragraph above varies
randomly from line to
line?
And I don't understand why MYSQL_BIN_LOG::run_ordered_commit() has to do the complete
XA commit in all engines? it's not its job. And it won't make XID available sooner,
it'll only make XID available in a different function.
> Notable changes:
>
> sql/handler.cc
> - ha_prepare() now is ensured to execute binlog_hton::prepare() as
> the last branch to prepare; recovery concerns are addressed
> separately (not by MDEV-21469, but rather along the plan of bug#76233)
> - read-only cases are handled more uniformly with extending
> has_binlog_hton() that registers the binlog branch at ha_prepare()
> when necessary
> - ha_rollback_trans() executes binlog_hton::rollback() as first
> branch, as a part of the 76233 planned recovery;
of <what> planned recovery?
> - ditto to the external completion of XA via ha_commit_or_rollback_by_xid().
> sql/log.cc
> - binlog_commit,rollback() simplified/clarified on the condition of an empty
> statement/trx
> - MYSQL_BIN_LOG::run_commit_ordered() is extended to cover
> XA-{commit, rollback} respective actions in Engines
> - XID unlogging by XA is converted to employ the standard
> binlog-checkpoint mechanism (cache_mngr->using_xa= TRUE)
> sql/xa.cc
> - trans_xa_commit,rollback() are refactored to invoke a common function
> for external (connection) completion
> - adapting the above two and slave_applier_reset_xa_trans() to
> possible (normal to slave) XID release, inside run_ordered_commit().
>
> diff --git a/sql/handler.cc b/sql/handler.cc
> --- a/sql/handler.cc
> +++ b/sql/handler.cc
> @@ -1472,6 +1472,48 @@ static int prepare_or_error(handlerton *ht, THD *thd, bool all)
> return err;
> }
>
> +/**
> + Search in @c ha_info transaction list branches for binlog.
> + When the 2nd argument is not NULL and the modified engine number
> + is non-zero the binlog branch is activated.
> +
> + @param ha_info The first node in the transaction descriptor list.
> + @param thd The client thread that executes the transaction.
> +
> + @return true when the binlog branch is registered or set up for transaction,
> + false otherwise.
> +*/
> +static bool has_binlog_hton(Ha_trx_info *ha_info, THD* thd= NULL)
> +{
> + bool rc;
> + uint rw_count;
> +
> + for (rc= false, rw_count= 0; ha_info && !rc; ha_info= ha_info->next())
> + {
> + rc= ha_info->ht() == binlog_hton;
> + rw_count += (ha_info->is_trx_read_write());
> + }
> +
> + if (!rc && thd && !thd->is_current_stmt_binlog_disabled() &&
> + unlikely(rw_count > 0))
> + {
> + /*
> + In case an engine is modified but binlog is not part of xa-prepared
> + transaction (thd is non-NULL) set out binlogging of an empty xa-prepare.
> + */
when can we land here? non-trans (or non-XA) table modified?
> + DBUG_ASSERT(thd->transaction->xid_state.get_state_code() == XA_IDLE);
> +
> + THD_TRANS *ptr_trans= &thd->transaction->all;
> + thd->ha_data[binlog_hton->slot].ha_info[1].register_ha(ptr_trans,
> + binlog_hton);
> + thd->ha_data[binlog_hton->slot].ha_info[1].set_trx_read_write();
> + thd->binlog_setup_trx_data();
> +
> + rc= true;
> + }
> +
> + return rc;
> +}
>
> /**
> @retval
> @@ -1488,9 +1530,16 @@ int ha_prepare(THD *thd)
>
> if (ha_info)
> {
> + int err;
> + bool has_binlog= has_binlog_hton(ha_info, thd);
> +
> for (; ha_info; ha_info= ha_info->next())
> {
> handlerton *ht= ha_info->ht();
> +
> + if (ht == binlog_hton)
> + continue;
Does binlog need a handlerton anymore? It was an easy way to get it registered
for participating in 2pc. But if you always filter it out, then there's no need
to add it to ha_info list in the first place, is there?
> +
> if (ht->prepare)
> {
> if (unlikely(prepare_or_error(ht, thd, all)))
> diff --git a/sql/log.cc b/sql/log.cc
> --- a/sql/log.cc
> +++ b/sql/log.cc
> @@ -1895,6 +1896,11 @@ binlog_rollback_flush_trx_cache(THD *thd, bool all,
> if (thd->transaction->xid_state.get_state_code() == XA_PREPARED)
> buflen= serialize_with_xid(thd->transaction->xid_state.get_xid(),
> buf, query, q_len);
> + cache_mngr->using_xa= TRUE;
> +#ifndef DBUG_OFF
> + if (thd->killed != NOT_KILLED)
> + thd->lex->sql_command= SQLCOM_XA_ROLLBACK; // for run_commit_ord asserts
1. don't abbrev, commits are for others not for you
2. what other values can sql_command have here?
> +#endif
> }
> Query_log_event end_evt(thd, buf, buflen, TRUE, TRUE, TRUE, 0);
>
> @@ -2186,16 +2200,13 @@ int binlog_commit(THD *thd, bool all, bool ro_1pc)
> }
>
> if (cache_mngr->trx_cache.empty() &&
> - (thd->transaction->xid_state.get_state_code() != XA_PREPARED ||
> - !(thd->ha_data[binlog_hton->slot].ha_info[1].is_started() &&
> - thd->ha_data[binlog_hton->slot].ha_info[1].is_trx_read_write())))
> + (!(is_preparing_xa(thd) || is_prepared_xa(thd)) ||
> + !is_binlog_read_write(thd)))
> {
> /*
> - This is an empty transaction commit (both the regular and xa),
> - or such transaction xa-prepare or
> - either one's statement having no effect on the transactional cache
> - as any prior to it.
> - The empty xa-prepare sinks in only when binlog is read-only.
> + This is an empty transaction/statement commit (both the regular and xa),
> + or the xa transaction is running xa-prepare or xa-commit while
> + its binlog branch is not read-write.
weird. how and why do you get into binlog_commit on XA PREPARE?
> */
> cache_mngr->reset(false, true);
> THD_STAGE_INFO(thd, org_stage);
> @@ -7752,7 +7763,7 @@ MYSQL_BIN_LOG::write_transaction_to_binlog(THD *thd,
> entry.all= all;
> entry.using_stmt_cache= using_stmt_cache;
> entry.using_trx_cache= using_trx_cache;
> - entry.need_unlog= is_preparing_xa(thd);
> + entry.need_unlog= 0;
is `need_unlog` always 0 now?
> ha_info= all ? thd->transaction->all.ha_list : thd->transaction->stmt.ha_list;
> entry.ro_1pc= is_ro_1pc;
> entry.end_event= end_ev;
> @@ -9365,18 +9381,102 @@ TC_LOG::run_commit_ordered(THD *thd, bool all)
> all ? thd->transaction->all.ha_list : thd->transaction->stmt.ha_list;
>
> mysql_mutex_assert_owner(&LOCK_commit_ordered);
> +
> + if (!ha_info)
> + {
> + // slave or master's xa-"complete" external
what does it mean?
> + DBUG_ASSERT(thd->lex->sql_command == SQLCOM_XA_COMMIT ||
> + thd->lex->sql_command == SQLCOM_XA_ROLLBACK);
> +
> + XID *xid= thd->lex->xid;
> + bool commit= thd->lex->sql_command == SQLCOM_XA_COMMIT;
> + struct xahton_st xaop= { xid, 1 };
> +
> + plugin_foreach(NULL, commit ? xacommit_handlerton : xarollback_handlerton,
> + MYSQL_STORAGE_ENGINE_PLUGIN, &xaop);
is it for commit/rollback by xid?
> + xid_cache_delete(thd);
> +
> + DBUG_ASSERT(!thd->transaction->xid_state.is_explicit_XA());
> +
> + return;
> + }
> +
> + // else
> + if (is_preparing_xa(thd))
> + {
> + for (; ha_info; ha_info= ha_info->next())
> + {
> + DBUG_ASSERT(all);
> + DBUG_ASSERT(thd->lex->sql_command == SQLCOM_XA_PREPARE);
> + DBUG_ASSERT(thd->transaction->xid_state.is_explicit_XA());
> +
> + handlerton *ht= ha_info->ht();
> + if (!ht->prepare)
> + {
> + push_warning_printf(thd, Sql_condition::WARN_LEVEL_WARN,
> + ER_GET_ERRNO, ER_THD(thd, ER_GET_ERRNO),
> + HA_ERR_WRONG_COMMAND,
> + ha_resolve_storage_engine_name(ht));
> + continue;
> + }
> + else if (ht != binlog_hton)
> + {
> + /* leader only */
> + //if (thd->binlog_setup_trx_data()->last_commit_pos_offset == 0)
> + // ht->prepare(ht, thd, all);
wait, what? why do you need a loop over all engines if you don't do
anything there?
> + }
> + }
> + if (thd->rgi_slave && thd->rgi_slave->is_parallel_exec)
> + {
> + thd->transaction->xid_state.set_state_code(XA_PREPARED);
> + thd->transaction->xid_state.set_cache_element_to_recovered();
> + thd->transaction->xid_state.xid_cache_element= 0;
when are you here?
> + }
> + }
> + else if (thd->transaction->xid_state.is_explicit_XA())
> + {
> + for (; ha_info; ha_info= ha_info->next())
> + {
> + DBUG_ASSERT(all);
> + DBUG_ASSERT(thd->lex->sql_command == SQLCOM_XA_COMMIT ||
> + thd->lex->sql_command == SQLCOM_XA_ROLLBACK);
> + DBUG_ASSERT(is_prepared_xa(thd) ||
> + ((thd->lex->sql_command == SQLCOM_XA_ROLLBACK &&
> + thd->transaction->xid_state.get_state_code() == XA_IDLE) ||
> + thd->lex->xa_opt == XA_ONE_PHASE));
> + DBUG_ASSERT(!thd->transaction->xid_state.is_recovered());
> +
> + handlerton *ht= ha_info->ht();
> + if (thd->lex->sql_command == SQLCOM_XA_COMMIT)
> + {
> + if (ht->commit_ordered)
> + ht->commit_ordered(ht, thd, all);
> + }
> + else if (ht != binlog_hton)
why not?
> + {
> + ht->rollback(ht, thd, all);
> + }
> + }
> + xid_cache_delete(thd); // trx is completed now
> +
> + DBUG_ASSERT(!thd->transaction->xid_state.is_explicit_XA());
> + }
> + else
> + {
when are you here?
> for (; ha_info; ha_info= ha_info->next())
> {
> handlerton *ht= ha_info->ht();
> if (!ht->commit_ordered)
> continue;
> ht->commit_ordered(ht, thd, all);
> DBUG_EXECUTE_IF("enable_log_write_upto_crash",
> {
> DBUG_SET_INITIAL("+d,crash_after_log_write_upto");
> sleep(1000);
> });
> DEBUG_SYNC(thd, "commit_after_run_commit_ordered");
> }
> + }
> }
>
> diff --git a/sql/xa.cc b/sql/xa.cc
> --- a/sql/xa.cc
> +++ b/sql/xa.cc
> @@ -319,9 +342,16 @@ static void xid_cache_delete(THD *thd, XID_cache_element *&element)
>
> void xid_cache_delete(THD *thd, XID_STATE *xid_state)
> {
> - DBUG_ASSERT(xid_state->is_explicit_XA());
> - xid_cache_delete(thd, xid_state->xid_cache_element);
> - xid_state->xid_cache_element= 0;
> + if (xid_state->is_explicit_XA())
why it should be done here and not in the caller?
> + {
> + xid_cache_delete(thd, xid_state->xid_cache_element);
> + xid_state->xid_cache_element= 0;
> + }
> +}
> +
> +void xid_cache_delete(THD *thd)
> +{
> + xid_cache_delete(thd, &thd->transaction->xid_state);
> }
>
>
> @@ -547,7 +577,14 @@ bool trans_xa_prepare(THD *thd)
> }
> else
> {
> - thd->transaction->xid_state.xid_cache_element->xa_state= XA_PREPARED;
> + if (thd->transaction->xid_state.xid_cache_element)
> + {
> + thd->transaction->xid_state.xid_cache_element->xa_state= XA_PREPARED;
> + }
> + else
> + {
> + DBUG_ASSERT(thd->rgi_slave->is_parallel_exec);
what do you mean by that? That parallel slave doesn't add XA transactions to
the xid_cache?
> + }
> MYSQL_SET_TRANSACTION_XA_STATE(thd->m_transaction_psi, XA_PREPARED);
> res= thd->variables.pseudo_slave_mode || thd->slave_thread ?
> slave_applier_reset_xa_trans(thd) : 0;
Regards,
Sergei
Chief Architect, MariaDB Server
and security(a)mariadb.org