Kristian,
Using the chance to forwarding the reply that did not make to the
developers list, let me comment on binlog_commit() that I might hastily
rate as a big subject.
> Hi Andrei,
>
> Here is the last part of my review of the changes in branch
> bb-10.6-MDEV-31949:
>
>> --- a/sql/log_event_server.cc
>> +++ b/sql/log_event_server.cc
>> @@ -3324,6 +3324,11 @@ Gtid_log_event::Gtid_log_event(THD *thd_arg, uint64 seq_no_arg,
>> flags2|= thd->lex->sql_command == SQLCOM_XA_PREPARE ?
>> FL_PREPARED_XA : FL_COMPLETED_XA;
>> xid.set(xid_state.get_xid());
>> +
>> + // multi-engine external completion is unaware of the prepared engine #.
>> + extra_engines= thd->transaction->all.ha_list ?
>> + ha_count_rw_2pc(thd_arg,
>> + thd_arg->in_multi_stmt_transaction_mode()) - 1 : 0;
>
> I cannot understand the comment.
> I assume it's about that an XA COMMIT of a detached xid (different
> session
> that the XA PREPARE, or replication slave) - right?
Correct.
>
> But what is the consequences of not knowing extra_engines in this
> case?
> Is the extra_engines used for anything than the optional truncate of
> the
> binlog after master crash?
In the multi-engine case when the # of prepared in Engine is greater
than in the value in GTID event then the trx is partially committed, so
is decided for to be completed.
>
>> @@ -4175,7 +4173,8 @@ int XA_prepare_log_event::do_commit()
>> thd->lex->xid= &xid;
>> if (!one_phase)
>> {
>> - if ((res= thd->wait_for_prior_commit()))
>> + if (thd->is_current_stmt_binlog_disabled() &&
>> + (res= thd->wait_for_prior_commit()))
>
> Why?000
> Is it to allow XA_PREPARE_LOG_EVENT to group commit with other event
> groups?
> That makes sense, but then shouldn't the wait_for_prior_commit() be
> somewhere in handler.cc (or maybe xa.cc)?
> Look at Xid_log_event::do_commit(), it doesn't have any
> wait_for_prior_commit(), that is in ha_commit_one_phase().
Right.
> (The wait-for-prior-commit facility is designed as a server-level
> feature,
> not a replication-level, it could in principle be used outside of
> replication, even though it not currently (IIRC).)
I think I understood what you mean which is follow the ha_commit_one_phase()
pattern.
[todo]
>
>> diff --git a/sql/sql_array.h b/sql/sql_array.h
>> index 8610e971016..94c6d86ac05 100644
>> --- a/sql/sql_array.h
>> +++ b/sql/sql_array.h
>> @@ -172,6 +172,8 @@ template <class Elem> class Dynamic_array
>>
>> size_t size() const { return array.elements; }
>>
>> + size_t max_size() const { return array.max_element; }
>> +
>
> Not that it hurts that much, but I think you can avoid adding this.
Left over of a cleanup from the previous version.
[todo]
>
>> --- a/sql/log.cc
>> +++ b/sql/log.cc
>> @@ -85,6 +85,7 @@ const char *log_bin_index= 0;
>> const char *log_bin_basename= 0;
>>
>> MYSQL_BIN_LOG mysql_bin_log(&sync_binlog_period);
>> +XID_cache_element *xid_cache_search(THD *thd, XID *xid);
>
> This duplicates the same declaration in sql/xa.h
> So probably this can be removed now?
[todo]
>
>> @@ -2186,16 +2202,13 @@ int binlog_commit(THD *thd, bool all, bool
>> ro_1pc)
>> }
>>
>> if (cache_mngr->trx_cache.empty() &&
>> - (thd->transaction->xid_state.get_state_code() != XA_PREPARED ||
>> - !(thd->ha_data[binlog_hton->slot].ha_info[1].is_started() &&
>> - thd->ha_data[binlog_hton->slot].ha_info[1].is_trx_read_write())))
>> + (!(is_preparing_xa(thd) || is_prepared_xa(thd)) ||
>> + !is_binlog_read_write(thd)))
>
> This binlog_commit function started out as the "commit" method for the
> binlog handlerton. But that is not how it is used anymore, now it is
> called
> directly from several different places, including some places specific
> to
> external user XA.
>
> A large part of the function has become complex conditions like this
> that
> are basically impossible for the developer (or reviewer) to understand
> exactly why this set of conditions is the correct one. Again, the
> function
> is trying to guess what the caller wants, depending on the state of
> various
> thd properties etc.
>
> Instead, the function should be split into multiple functions. One
> that is
> called to handle XA PREPARE; one for binlog_commit_by_xid(), etc. Then
> each
> version should be able to greatly reduce the number of conditions
> needed.
>
> And if any logic is shared between the different versions, this can
> then be
> put in smaller functions that are called from each version.
[my initial response was:
Your proposal leads to a significant refactoring, which I add only one hunk.
I'd really consider that as a separate exercise
I think I've understood. Sure, binlog_prepare() that logs XA-PREPARE
can call a common function with binlog_commit() that originally intended
to log only normal trx:s.
I can look into that.
[todo]
>>
>> @@ -11050,10 +11124,30 @@ bool
>> Recovery_context::decide_or_assess(xid_recovery_member *member, int
>> round,
>> }
>> else if (member->in_engine_prepare < last_gtid_engines)
>> {
>> - DBUG_ASSERT(member->in_engine_prepare > 0);
>> + DBUG_ASSERT(member->in_engine_prepare > 0 ||
>> + /* XA event over xid that was completed in engines */
>> + (member->xid == 0 &&
>> + member->xa_binlog_state >= xid_recovery_member::XA_PREPARE));
>> + if (member->in_engine_prepare > 0 &&
>> + member->xa_binlog_state == xid_recovery_member::XA_PREPARE)
>> + {
>
> The function decide_or_assess() is hard to understand.
> I can see that it's called in 3 cases: XID_LOG_EVENT, XA PREPARE, XA
> COMMIT/ROLLBACK.
> So again here, I think this should be split into 3 different
> functions, one
> for each case (or maybe two, one for internal XID and one for external
> XA,
> whatever makes the more sense).
>
> I don't understand the comment "XA event over xid that was completed
> in
> engines". But I remember seeing something in log_event_server.cc that
> sets
> extra_engines to 0 in the external user XA case.
>
> So is that the case here, that last_gtid_engines==0 marks this is user
> XA?
> That would be much clearer if the logic for internal and external XA
> was not
> mixed in the same function.
>
>> + /*
>> + the first of the two options below can occur when round 1 turns to 2
>> + so event sequences across binlog files are like the following
>> + g_current(B_0, xid), ... g_last(B^*, xid),
>> + where `g`:s are xa event groups associated with the same member via
>> + common `xid`. As elsewhere B_0 is the 2nd round's first and,
>> + B^* the 1st round binlogs.
>> + */
>> if (reset_truncate_coord(member->binlog_coord < last_gtid_coord ?
>> member->binlog_coord.second : pos))
>
> I'm trying to understand what is going on here. I think this is refering to
> the way binlog files are scanned?
> First we scan the most recent binlog file. At the end of it, we may find
> that the latest binlog checkpoint is in an earlier file, then we continue to
> scan those older files. So for example we may scan in order:
>
> master-bin.000004, master-bin.000002, master-bin.000003
>
> This means that we may see the XA PREPARE and XA COMMIT out-of-order as we
> go from the first scan to the remaining - "round 1 turns to 2".
>
> But I don't understand that this is "the first of the two options below":
>
> member->binlog_coord < last_gtid_coord ? member->binlog_coord.second : pos
>
> If last_gtid_coord is in master-bin.000002, but member->binlog_coord is in
> master-bin.000004 ("round 1 turns to 2"), then I would think we would have
>
> member->binlog_coord > last_gtid_coord
>
> So it would be the *second* of the options below? What am I missing here?
>
> Also, I would suggest a simplification here. After we do the initial scan in
> round 1, if we find that we need to scan more files, then reset all the
> state and hashes. And then do a full scan of all necessary binlog files
> in-order. In my example, that would be:
>
> master-bin.000004
> <reset state>
> master-bin.000002
> master-bin.000003
> master-bin.000004
>
> This way, the logic becomes simpler, as we don't need logic to handle seeing
> events out-of-order.
>
> This will mean that occasionally we will need to scan one more binlog file,
> but I think that is a good trade-off. The recovery code is notoriously
> difficult to test and get correct, because it's an unusual occurence to need
> recovery in the first place, and very hard to reproduce any problems. And
> robust and correct recovery is the whole point of having XA in the first
> palce. So making the code as simple as possible is crucial.
>
> The extra cost is small:
>
> - For one, recovery is a rare event
> - Re-scanning only costs CPU, the file is already read from storage into OS
> buffers.
> - Normally, the binlog checkpoint completes shortly after rotation, so the
> last binlog file that must be rescanned will be short.
>
> So simply redo the scan from the correct point when a "round 2" is needed,
> to avoid any logic to handle seeing events out-of-order during recovery.
>
>
>> @@ -11182,7 +11311,9 @@ void Recovery_context::process_gtid(int round, Gtid_log_event *gev,
>> last_gtid_no2pc= false;
>> last_gtid_standalone=
>> (gev->flags2 & Gtid_log_event::FL_STANDALONE) ? true : false;
>> - if (do_truncate && last_gtid_standalone)
>> + /* a standalone XA-COMPLETE is truncation-safe */
>> + if (do_truncate && last_gtid_standalone
>> + && !(gev->flags2 & Gtid_log_event::FL_COMPLETED_XA))
>> update_binlog_unsafe_coord_if_needed(linfo);
>
> I don't understand the comment here.
> What is "a standalone XA-COMPLETE" ?
> What does it mean to be "truncation-safe"?
> Please reword the comment to be comprehensible.
>
>> + /*
>> + the user XA events are decided similary to the normal transaction,
>> + difference is just in the timing which is Gtid event read now.
>> + */
>
> I don't understand this comment. Please reformulate to make it
> understandable. What does normal transactions do, and what does user
> XA do?
> How does the timing differ, and why?
>
>> + sql_print_warning("GTID %u-%u-%s XA transaction state % "
>> + "in binlog file:%s pos:%llu is inconsistent "
>> + "with the former same state; "
>> + "consider filtered binary loggig.",
>> + ctx.last_gtid.domain_id, ctx.last_gtid.server_id,
>> + buf, is_xac ? "'COMPLETE'" : "'PREPARE'",
>> + linfo->log_file_name, ctx.last_gtid_coord.second);
>
> Typo: s/loggig/logging/
>
> Also, the phrase "the former same state" is not understandable (at
> least to
> me), what does it mean? Please reformulate.
>
I also have to be short now in replying to the rest, which
I'll expand on tomorrow morning.
Now just this.
>> @@ -909,14 +910,16 @@ ha_commit_checkpoint_request(void *cookie, void (*pre_hook)(void *))
>> there's no need to rollback here as all transactions must
>> be rolled back already
>> */
>> -void ha_close_connection(THD* thd)
>> +void ha_close_connection(THD* thd, bool skip_binlog)
>
> I tried to understand what the purpose of this change is.
> ha_close_connection() is now called from two new places related to
> replication of XA PREPARE. And in one place with this mysterious
> skip_binlog=true condition.
>
> My guess is that this is to detach the XA PREPARE from the current
> THD?
The reason is described in 6cbb63f592 commit name II.1.
```
OTHER CHANGES,
to
ha_close_connection()
are done to keep the binlog hton connection alive for the end of
XA-prepare's binlogging on the parallel slave.
Other than the binlog hton:s are disconnected from the slave
transaction handler in TC_LOG::run_commit_ordered() which makes
the prepared transaction engine objects available to
binlog-commit-order followers.
```
In other words XAP_k in run_commit_ordered() releases with
ha_close_connection(thd, true)
all resources
that the following XAC_k+i need. The latter does not need XAP_k
to release binlog which it still need while being is this stack of
TC_LOG::run_commit_ordered
MYSQL_BIN_LOG::trx_group_commit_leader
MYSQL_BIN_LOG::write_transaction_to_binlog_events
MYSQL_BIN_LOG::write_transaction_to_binlog
binlog_flush_cache
binlog_commit_flush_xa_prepare
binlog_commit
binlog_prepare
prepare_or_error
ha_prepare
> So
> that it becomes visible in XA RECOVER to other threads, and (maybe
> more
> important) to make the THD available to run another transaction? One
> of the
> things InnoDB does in close_connection is: trx_disconnect_prepared().
>
> But I think the purpose of the close_connection handlerton call is
> different
And you must have meant 'ha_close_connection', right?
> - it is to free things at the end of a client connection?
Well, with the ha- prefix, it's to disconnect THD from all
transaction branches/htons.
> So it can do
> more
> than detach the current external XA - which is presumably the reason
> for the
> need to hack it to not call the specific binlog_close_connection()?
>
> So this shouldn't hijack the close_connection to mean
> "detach_xa_trx". It
> looks like we need an explicit handlerton call for detaching an xa
> transaction, that can then be used instead of this hack in
> ha_close_connection().
I need to digest that in the morning.
Generally I don't see a problem to conduct hton
release flexibly.
>
>> -static my_bool xacommit_handlerton(THD *unused1, plugin_ref plugin,
>> - void *arg)
>> +my_bool xacommit_handlerton(THD *unused1, plugin_ref plugin, void *arg)
>> {
>> handlerton *hton= plugin_hton(plugin);
>> - if (hton->recover)
>> + if (hton->recover && hton != binlog_hton)
>> {
>> hton->commit_by_xid(hton, ((struct xahton_st *)arg)->xid);
>
>> -int ha_commit_or_rollback_by_xid(XID *xid, bool commit)
>> +int ha_commit_or_rollback_by_xid(XID *xid, bool commit, THD *thd)
>> {
>
>> + if (!skip_binlog)
>> + {
>> + // when binlog is ON start from its transaction branch
>> if (commit)
>> binlog_commit_by_xid(binlog_hton, xid);
>> else
>> binlog_rollback_by_xid(binlog_hton, xid);
>
> I think the purpose if this is to ensure that commit/rollback of XA
> calls
> into the binlog code first (to binlog before doing it in the engines).
>
> But as I think Serg also remarked already, this should not be
> needed. The
> binlog is the (internal) transaction coordinator when it is enabled,
> we have
> a dedicated interface for that, class TC_LOG. This is used elsewhere
> in
> handler.cc to avoid special cases for whether the binlog is enabled or
> not.
>
> So it looks like there needs to be extra virtual functions in TC_LOG
> for
> the XA commit/rollback, and these can then be called on the tc_log
> object
> and do what is required.
I need to process that tomorrow.
> Then some logic in handler.cc can be omitted/simplified.
>
> ----
>
> I think with this, I got through all of the code changes in branch
> bb-10.6-MDEV-31949 as of commit
> 1bdc9bc8077c79e55cb4dbfc7cc47a193d17d54f
> from January 20, 2024. So this concludes the current review. I have
> not
> looked at the test cases in this round of review.
Thanks again!
Cheers,
Andrei