[MariaDB developers] [Andrei Elkin] Re: Third/last part of review of XA patches in bb-10.6-MDEV-31949

12 Feb 2024

Kristian,

Using the chance to forwarding the reply that did not make to the
developers list, let me comment on binlog_commit() that I might hastily
rate as a big subject.
...
Hi Andrei,
Here is the last part of my review of the changes in branch
bb-10.6-MDEV-31949:
...
--- a/sql/log_event_server.cc
+++ b/sql/log_event_server.cc
@@ -3324,6 +3324,11 @@ Gtid_log_event::Gtid_log_event(THD *thd_arg, uint64 seq_no_arg,
       flags2|= thd->lex->sql_command == SQLCOM_XA_PREPARE ?
         FL_PREPARED_XA : FL_COMPLETED_XA;
       xid.set(xid_state.get_xid());
+
+      // multi-engine external completion is unaware of the prepared engine #.
+      extra_engines= thd->transaction->all.ha_list ?
+        ha_count_rw_2pc(thd_arg,
+                        thd_arg->in_multi_stmt_transaction_mode()) - 1 : 0;
I cannot understand the comment.
I assume it's about that an XA COMMIT of a detached xid (different
session
that the XA PREPARE, or replication slave) - right?
Correct.
...
But what is the consequences of not knowing extra_engines in this
case?
Is the extra_engines used for anything than the optional truncate of
the
binlog after master crash?
In the multi-engine case when the # of prepared in Engine is greater
   than in the value in GTID event then the trx is partially committed, so
   is decided for to be completed.
...
...
@@ -4175,7 +4173,8 @@ int XA_prepare_log_event::do_commit()
   thd->lex->xid= &xid;
   if (!one_phase)
   {
-    if ((res= thd->wait_for_prior_commit()))
+    if (thd->is_current_stmt_binlog_disabled() &&
+        (res= thd->wait_for_prior_commit()))
Why?000
Is it to allow XA_PREPARE_LOG_EVENT to group commit with other event
groups?
That makes sense, but then shouldn't the wait_for_prior_commit() be
somewhere in handler.cc (or maybe xa.cc)?
Look at Xid_log_event::do_commit(), it doesn't have any
wait_for_prior_commit(), that is in ha_commit_one_phase().
Right.
...
(The wait-for-prior-commit facility is designed as a server-level
feature,
not a replication-level, it could in principle be used outside of
replication, even though it not currently (IIRC).)
I think I understood what you mean which is follow the ha_commit_one_phase()
   pattern.
   [todo]
...
...

diff --git a/sql/sql_array.h b/sql/sql_array.h
index 8610e971016..94c6d86ac05 100644
--- a/sql/sql_array.h
+++ b/sql/sql_array.h
@@ -172,6 +172,8 @@ template <class Elem> class Dynamic_array
size_t size() const { return array.elements; }
+  size_t max_size() const { return array.max_element; }
+
Not that it hurts that much, but I think you can avoid adding this.
Left over of a cleanup from the previous version.
   [todo]
...
...
--- a/sql/log.cc
+++ b/sql/log.cc
@@ -85,6 +85,7 @@ const char *log_bin_index= 0;
 const char *log_bin_basename= 0;
MYSQL_BIN_LOG mysql_bin_log(&sync_binlog_period);
+XID_cache_element *xid_cache_search(THD *thd, XID *xid);
This duplicates the same declaration in sql/xa.h
So probably this can be removed now?
[todo]
...
...
@@ -2186,16 +2202,13 @@ int binlog_commit(THD *thd, bool all, bool
ro_1pc)
   }
if (cache_mngr->trx_cache.empty() &&
- (thd->transaction->xid_state.get_state_code() != XA_PREPARED ||
- !(thd->ha_data[binlog_hton->slot].ha_info[1].is_started() &&
- thd->ha_data[binlog_hton->slot].ha_info[1].is_trx_read_write())))
+      (!(is_preparing_xa(thd) || is_prepared_xa(thd)) ||
+       !is_binlog_read_write(thd)))
This binlog_commit function started out as the "commit" method for the
binlog handlerton. But that is not how it is used anymore, now it is
called
directly from several different places, including some places specific
to
external user XA.
A large part of the function has become complex conditions like this
that
are basically impossible for the developer (or reviewer) to understand
exactly why this set of conditions is the correct one. Again, the
function
is trying to guess what the caller wants, depending on the state of
various
thd properties etc.
Instead, the function should be split into multiple functions. One
that is
called to handle XA PREPARE; one for binlog_commit_by_xid(), etc. Then
each
version should be able to greatly reduce the number of conditions
needed.
And if any logic is shared between the different versions, this can
then be
put in smaller functions that are called from each version.
[my initial response was:
   Your proposal leads to a significant refactoring, which I add only one hunk.
   I'd really consider that as a separate exercise

I think I've understood. Sure, binlog_prepare() that logs XA-PREPARE
can call a common function with binlog_commit() that originally intended
to log only normal trx:s.
I can look into that.

[todo]
...
...
@@ -11050,10 +11124,30 @@ bool
Recovery_context::decide_or_assess(xid_recovery_member *member, int
round,
     }
     else if (member->in_engine_prepare < last_gtid_engines)
     {
-      DBUG_ASSERT(member->in_engine_prepare > 0);
+      DBUG_ASSERT(member->in_engine_prepare > 0 ||
+ /* XA event over xid that was completed in engines */
+                  (member->xid == 0 &&
+ member->xa_binlog_state >= xid_recovery_member::XA_PREPARE));
+      if (member->in_engine_prepare > 0 &&
+ member->xa_binlog_state == xid_recovery_member::XA_PREPARE)
+      {
The function decide_or_assess() is hard to understand.
I can see that it's called in 3 cases: XID_LOG_EVENT, XA PREPARE, XA
COMMIT/ROLLBACK.
So again here, I think this should be split into 3 different
functions, one
for each case (or maybe two, one for internal XID and one for external
XA,
whatever makes the more sense).
I don't understand the comment "XA event over xid that was completed
in
engines". But I remember seeing something in log_event_server.cc that
sets
extra_engines to 0 in the external user XA case.
So is that the case here, that last_gtid_engines==0 marks this is user
XA?
That would be much clearer if the logic for internal and external XA
was not
mixed in the same function.
...
+            /*
+              the first of the two options below can occur when round 1 turns to 2
+              so event sequences across binlog files are like the following
+                    g_current(B_0, xid), ... g_last(B^*, xid),
+              where `g`:s are xa event groups associated with the same member via
+              common `xid`. As elsewhere B_0 is the 2nd round's first and,
+              B^* the 1st round binlogs.
+            */
            if (reset_truncate_coord(member->binlog_coord < last_gtid_coord ?
                                     member->binlog_coord.second : pos))
I'm trying to understand what is going on here. I think this is refering to
the way binlog files are scanned?
First we scan the most recent binlog file. At the end of it, we may find
that the latest binlog checkpoint is in an earlier file, then we continue to
scan those older files. So for example we may scan in order:
master-bin.000004, master-bin.000002, master-bin.000003
This means that we may see the XA PREPARE and XA COMMIT out-of-order as we
go from the first scan to the remaining - "round 1 turns to 2".
But I don't understand that this is "the first of the two options below":
member->binlog_coord < last_gtid_coord ? member->binlog_coord.second : pos
If last_gtid_coord is in master-bin.000002, but member->binlog_coord is in
master-bin.000004 ("round 1 turns to 2"), then I would think we would have
member->binlog_coord > last_gtid_coord
So it would be the *second* of the options below? What am I missing here?
Also, I would suggest a simplification here. After we do the initial scan in
round 1, if we find that we need to scan more files, then reset all the
state and hashes. And then do a full scan of all necessary binlog files
in-order. In my example, that would be:
master-bin.000004
  <reset state>
  master-bin.000002
  master-bin.000003
  master-bin.000004
This way, the logic becomes simpler, as we don't need logic to handle seeing
events out-of-order.
This will mean that occasionally we will need to scan one more binlog file,
but I think that is a good trade-off. The recovery code is notoriously
difficult to test and get correct, because it's an unusual occurence to need
recovery in the first place, and very hard to reproduce any problems. And
robust and correct recovery is the whole point of having XA in the first
palce. So making the code as simple as possible is crucial.
The extra cost is small:
- For one, recovery is a rare event
 - Re-scanning only costs CPU, the file is already read from storage into OS
   buffers.
 - Normally, the binlog checkpoint completes shortly after rotation, so the
   last binlog file that must be rescanned will be short.
So simply redo the scan from the correct point when a "round 2" is needed,
to avoid any logic to handle seeing events out-of-order during recovery.
...
@@ -11182,7 +11311,9 @@ void Recovery_context::process_gtid(int round, Gtid_log_event *gev,
     last_gtid_no2pc= false;
     last_gtid_standalone=
       (gev->flags2 & Gtid_log_event::FL_STANDALONE) ? true : false;
-    if (do_truncate && last_gtid_standalone)
+    /* a standalone XA-COMPLETE is truncation-safe */
+    if (do_truncate && last_gtid_standalone
+        && !(gev->flags2 & Gtid_log_event::FL_COMPLETED_XA))
       update_binlog_unsafe_coord_if_needed(linfo);
I don't understand the comment here.
What is "a standalone XA-COMPLETE" ?
What does it mean to be "truncation-safe"?
Please reword the comment to be comprehensible.
...
+            /*
+ the user XA events are decided similary to the normal transaction,
+ difference is just in the timing which is Gtid event read now.
+            */
I don't understand this comment. Please reformulate to make it
understandable. What does normal transactions do, and what does user
XA do?
How does the timing differ, and why?
...
+ sql_print_warning("GTID %u-%u-%s XA transaction state % "
+ "in binlog file:%s pos:%llu is inconsistent "
+ "with the former same state; "
+ "consider filtered binary loggig.",
+ ctx.last_gtid.domain_id, ctx.last_gtid.server_id,
+ buf, is_xac ? "'COMPLETE'" : "'PREPARE'",
+ linfo->log_file_name, ctx.last_gtid_coord.second);
Typo: s/loggig/logging/
Also, the phrase "the former same state" is not understandable (at
least to
me), what does it mean? Please reformulate.
I also have to be short now in replying to the rest, which
   I'll expand on tomorrow morning.

   Now just this.
...
...
@@ -909,14 +910,16 @@ ha_commit_checkpoint_request(void *cookie, void (*pre_hook)(void *))
   there's no need to rollback here as all transactions must
   be rolled back already
 */
-void ha_close_connection(THD* thd)
+void ha_close_connection(THD* thd, bool skip_binlog)
I tried to understand what the purpose of this change is.
ha_close_connection() is now called from two new places related to
replication of XA PREPARE. And in one place with this mysterious
skip_binlog=true condition.
My guess is that this is to detach the XA PREPARE from the current
THD?
The reason is described in 6cbb63f592 commit name II.1.
   ```
   OTHER CHANGES,
   to
   ha_close_connection()
   are done to keep the binlog hton connection alive for the end of
   XA-prepare's binlogging on the parallel slave.
   Other than the binlog hton:s are disconnected from the slave
   transaction handler in TC_LOG::run_commit_ordered() which makes
   the prepared transaction engine objects available to
   binlog-commit-order followers.
   ```
   In other words XAP_k in run_commit_ordered() releases with
   ha_close_connection(thd, true)
   all resources
   that the following XAC_k+i need. The latter does not need XAP_k
   to release binlog which it still need while being is this stack of

   TC_LOG::run_commit_ordered
   MYSQL_BIN_LOG::trx_group_commit_leader
   MYSQL_BIN_LOG::write_transaction_to_binlog_events
   MYSQL_BIN_LOG::write_transaction_to_binlog
   binlog_flush_cache
   binlog_commit_flush_xa_prepare
   binlog_commit
   binlog_prepare
   prepare_or_error
   ha_prepare
...
So
that it becomes visible in XA RECOVER to other threads, and (maybe
more
important) to make the THD available to run another transaction? One
of the
things InnoDB does in close_connection is: trx_disconnect_prepared().
But I think the purpose of the close_connection handlerton call is
different
And you must have meant 'ha_close_connection', right?
...
- it is to free things at the end of a client connection?
Well, with the ha- prefix, it's to disconnect THD from all
   transaction branches/htons.
...
So it can do
more
than detach the current external XA - which is presumably the reason
for the
need to hack it to not call the specific binlog_close_connection()?
So this shouldn't hijack the close_connection to mean
"detach_xa_trx". It
looks like we need an explicit handlerton call for detaching an xa
transaction, that can then be used instead of this hack in
ha_close_connection().
I need to digest that in the morning.
   Generally I don't see a problem to conduct hton
   release flexibly.
...
...
-static my_bool xacommit_handlerton(THD *unused1, plugin_ref plugin,
-                                   void *arg)
+my_bool xacommit_handlerton(THD *unused1, plugin_ref plugin, void *arg)
 {
   handlerton *hton= plugin_hton(plugin);
-  if (hton->recover)
+  if (hton->recover && hton != binlog_hton)
   {
     hton->commit_by_xid(hton, ((struct xahton_st *)arg)->xid);
...
-int ha_commit_or_rollback_by_xid(XID *xid, bool commit)
+int ha_commit_or_rollback_by_xid(XID *xid, bool commit, THD *thd)
 {
...
+  if (!skip_binlog)
+  {
+    // when binlog is ON start from its transaction branch
     if (commit)
       binlog_commit_by_xid(binlog_hton, xid);
     else
       binlog_rollback_by_xid(binlog_hton, xid);
I think the purpose if this is to ensure that commit/rollback of XA
calls
into the binlog code first (to binlog before doing it in the engines).
But as I think Serg also remarked already, this should not be
needed. The
binlog is the (internal) transaction coordinator when it is enabled,
we have
a dedicated interface for that, class TC_LOG. This is used elsewhere
in
handler.cc to avoid special cases for whether the binlog is enabled or
not.
So it looks like there needs to be extra virtual functions in TC_LOG
for
the XA commit/rollback, and these can then be called on the tc_log
object
and do what is required.
I need to process that tomorrow.
...
Then some logic in handler.cc can be omitted/simplified.
----
I think with this, I got through all of the code changes in branch
bb-10.6-MDEV-31949 as of commit
1bdc9bc8077c79e55cb4dbfc7cc47a193d17d54f
from January 20, 2024. So this concludes the current review. I have
not
looked at the test cases in this round of review.
Thanks again!

Cheers,

Andrei

    

[MariaDB developers] [Andrei Elkin] Re: Third/last part of review of XA patches in bb-10.6-MDEV-31949

andrei.elkin＠pp.inet.fi