[MariaDB developers] Re: Third/last part of review of XA patches in bb-10.6-MDEV-31949

7 Feb 2024

Hi Andrei,

Here is the last part of my review of the changes in branch bb-10.6-MDEV-31949:
...
--- a/sql/log_event_server.cc
+++ b/sql/log_event_server.cc
@@ -3324,6 +3324,11 @@ Gtid_log_event::Gtid_log_event(THD *thd_arg, uint64 seq_no_arg,
       flags2|= thd->lex->sql_command == SQLCOM_XA_PREPARE ?
         FL_PREPARED_XA : FL_COMPLETED_XA;
       xid.set(xid_state.get_xid());
+
+      // multi-engine external completion is unaware of the prepared engine #.
+      extra_engines= thd->transaction->all.ha_list ?
+        ha_count_rw_2pc(thd_arg,
+                        thd_arg->in_multi_stmt_transaction_mode()) - 1 : 0;
I cannot understand the comment.
I assume it's about that an XA COMMIT of a detached xid (different session
that the XA PREPARE, or replication slave) - right?

But what is the consequences of not knowing extra_engines in this case?
Is the extra_engines used for anything than the optional truncate of the
binlog after master crash?
...
@@ -4175,7 +4173,8 @@ int XA_prepare_log_event::do_commit()
   thd->lex->xid= &xid;
   if (!one_phase)
   {
-    if ((res= thd->wait_for_prior_commit()))
+    if (thd->is_current_stmt_binlog_disabled() &&
+        (res= thd->wait_for_prior_commit()))
Why?
Is it to allow XA_PREPARE_LOG_EVENT to group commit with other event groups?
That makes sense, but then shouldn't the wait_for_prior_commit() be
somewhere in handler.cc (or maybe xa.cc)?
Look at Xid_log_event::do_commit(), it doesn't have any
wait_for_prior_commit(), that is in ha_commit_one_phase().
(The wait-for-prior-commit facility is designed as a server-level feature,
not a replication-level, it could in principle be used outside of
replication, even though it not currently (IIRC).)
...

diff --git a/sql/sql_array.h b/sql/sql_array.h
index 8610e971016..94c6d86ac05 100644
--- a/sql/sql_array.h
+++ b/sql/sql_array.h
@@ -172,6 +172,8 @@ template <class Elem> class Dynamic_array
size_t size() const { return array.elements; }
+  size_t max_size() const { return array.max_element; }
+
Not that it hurts that much, but I think you can avoid adding this.
...
--- a/sql/log.cc
+++ b/sql/log.cc
@@ -85,6 +85,7 @@ const char *log_bin_index= 0;
 const char *log_bin_basename= 0;
MYSQL_BIN_LOG mysql_bin_log(&sync_binlog_period);
+XID_cache_element *xid_cache_search(THD *thd, XID *xid);
This duplicates the same declaration in sql/xa.h
So probably this can be removed now?
...
@@ -2186,16 +2202,13 @@ int binlog_commit(THD *thd, bool all, bool ro_1pc)
   }
if (cache_mngr->trx_cache.empty() &&
-      (thd->transaction->xid_state.get_state_code() != XA_PREPARED ||
-       !(thd->ha_data[binlog_hton->slot].ha_info[1].is_started() &&
-         thd->ha_data[binlog_hton->slot].ha_info[1].is_trx_read_write())))
+      (!(is_preparing_xa(thd) || is_prepared_xa(thd)) ||
+       !is_binlog_read_write(thd)))
This binlog_commit function started out as the "commit" method for the
binlog handlerton. But that is not how it is used anymore, now it is called
directly from several different places, including some places specific to
external user XA.

A large part of the function has become complex conditions like this that
are basically impossible for the developer (or reviewer) to understand
exactly why this set of conditions is the correct one. Again, the function
is trying to guess what the caller wants, depending on the state of various
thd properties etc.

Instead, the function should be split into multiple functions. One that is
called to handle XA PREPARE; one for binlog_commit_by_xid(), etc. Then each
version should be able to greatly reduce the number of conditions needed.

And if any logic is shared between the different versions, this can then be
put in smaller functions that are called from each version.
...
@@ -11050,10 +11124,30 @@ bool Recovery_context::decide_or_assess(xid_recovery_member *member, int round,
     }
     else if (member->in_engine_prepare < last_gtid_engines)
     {
-      DBUG_ASSERT(member->in_engine_prepare > 0);
+      DBUG_ASSERT(member->in_engine_prepare > 0 ||
+                  /* XA event over xid that was completed in engines */
+                  (member->xid == 0 &&
+                   member->xa_binlog_state >= xid_recovery_member::XA_PREPARE));
+      if (member->in_engine_prepare > 0 &&
+          member->xa_binlog_state == xid_recovery_member::XA_PREPARE)
+      {
The function decide_or_assess() is hard to understand.
I can see that it's called in 3 cases: XID_LOG_EVENT, XA PREPARE, XA
COMMIT/ROLLBACK.
So again here, I think this should be split into 3 different functions, one
for each case (or maybe two, one for internal XID and one for external XA,
whatever makes the more sense).

I don't understand the comment "XA event over xid that was completed in
engines". But I remember seeing something in log_event_server.cc that sets
extra_engines to 0 in the external user XA case.

So is that the case here, that last_gtid_engines==0 marks this is user XA?
That would be much clearer if the logic for internal and external XA was not
mixed in the same function.
...
+            /*
+              the first of the two options below can occur when round 1 turns to 2
+              so event sequences across binlog files are like the following
+                    g_current(B_0, xid), ... g_last(B^*, xid),
+              where `g`:s are xa event groups associated with the same member via
+              common `xid`. As elsewhere B_0 is the 2nd round's first and,
+              B^* the 1st round binlogs.
+            */
            if (reset_truncate_coord(member->binlog_coord < last_gtid_coord ?
                                     member->binlog_coord.second : pos))
I'm trying to understand what is going on here. I think this is refering to
the way binlog files are scanned?
First we scan the most recent binlog file. At the end of it, we may find
that the latest binlog checkpoint is in an earlier file, then we continue to
scan those older files. So for example we may scan in order:

  master-bin.000004, master-bin.000002, master-bin.000003

This means that we may see the XA PREPARE and XA COMMIT out-of-order as we
go from the first scan to the remaining - "round 1 turns to 2".

But I don't understand that this is "the first of the two options below":

  member->binlog_coord < last_gtid_coord ? member->binlog_coord.second : pos

If last_gtid_coord is in master-bin.000002, but member->binlog_coord is in
master-bin.000004 ("round 1 turns to 2"), then I would think we would have

  member->binlog_coord > last_gtid_coord

So it would be the *second* of the options below? What am I missing here?

Also, I would suggest a simplification here. After we do the initial scan in
round 1, if we find that we need to scan more files, then reset all the
state and hashes. And then do a full scan of all necessary binlog files
in-order. In my example, that would be:

  master-bin.000004
  <reset state>
  master-bin.000002
  master-bin.000003
  master-bin.000004

This way, the logic becomes simpler, as we don't need logic to handle seeing
events out-of-order.

This will mean that occasionally we will need to scan one more binlog file,
but I think that is a good trade-off. The recovery code is notoriously
difficult to test and get correct, because it's an unusual occurence to need
recovery in the first place, and very hard to reproduce any problems. And
robust and correct recovery is the whole point of having XA in the first
palce. So making the code as simple as possible is crucial.

The extra cost is small:

 - For one, recovery is a rare event
 - Re-scanning only costs CPU, the file is already read from storage into OS
   buffers.
 - Normally, the binlog checkpoint completes shortly after rotation, so the
   last binlog file that must be rescanned will be short.

So simply redo the scan from the correct point when a "round 2" is needed,
to avoid any logic to handle seeing events out-of-order during recovery.
...
@@ -11182,7 +11311,9 @@ void Recovery_context::process_gtid(int round, Gtid_log_event *gev,
     last_gtid_no2pc= false;
     last_gtid_standalone=
       (gev->flags2 & Gtid_log_event::FL_STANDALONE) ? true : false;
-    if (do_truncate && last_gtid_standalone)
+    /* a standalone XA-COMPLETE is truncation-safe */
+    if (do_truncate && last_gtid_standalone
+        && !(gev->flags2 & Gtid_log_event::FL_COMPLETED_XA))
       update_binlog_unsafe_coord_if_needed(linfo);
I don't understand the comment here.
What is "a standalone XA-COMPLETE" ?
What does it mean to be "truncation-safe"?
Please reword the comment to be comprehensible.
...
+            /*
+              the user XA events are decided similary to the normal transaction,
+              difference is just in the timing which is Gtid event read now.
+            */
I don't understand this comment. Please reformulate to make it
understandable. What does normal transactions do, and what does user XA do?
How does the timing differ, and why?
...
+              sql_print_warning("GTID %u-%u-%s XA transaction state % "
+                                "in binlog file:%s pos:%llu is inconsistent "
+                                "with the former same state; "
+                                "consider filtered binary loggig.",
+                                ctx.last_gtid.domain_id, ctx.last_gtid.server_id,
+                                buf, is_xac ? "'COMPLETE'" : "'PREPARE'",
+                                linfo->log_file_name, ctx.last_gtid_coord.second);
Typo: s/loggig/logging/

Also, the phrase "the former same state" is not understandable (at least to
me), what does it mean? Please reformulate.
...
@@ -909,14 +910,16 @@ ha_commit_checkpoint_request(void *cookie, void (*pre_hook)(void *))
   there's no need to rollback here as all transactions must
   be rolled back already
 */
-void ha_close_connection(THD* thd)
+void ha_close_connection(THD* thd, bool skip_binlog)
I tried to understand what the purpose of this change is.
ha_close_connection() is now called from two new places related to
replication of XA PREPARE. And in one place with this mysterious
skip_binlog=true condition.

My guess is that this is to detach the XA PREPARE from the current THD? So
that it becomes visible in XA RECOVER to other threads, and (maybe more
important) to make the THD available to run another transaction? One of the
things InnoDB does in close_connection is: trx_disconnect_prepared().

But I think the purpose of the close_connection handlerton call is different
- it is to free things at the end of a client connection? So it can do more
than detach the current external XA - which is presumably the reason for the
need to hack it to not call the specific binlog_close_connection()?

So this shouldn't hijack the close_connection to mean "detach_xa_trx". It
looks like we need an explicit handlerton call for detaching an xa
transaction, that can then be used instead of this hack in
ha_close_connection().
...
-static my_bool xacommit_handlerton(THD *unused1, plugin_ref plugin,
-                                   void *arg)
+my_bool xacommit_handlerton(THD *unused1, plugin_ref plugin, void *arg)
 {
   handlerton *hton= plugin_hton(plugin);
-  if (hton->recover)
+  if (hton->recover && hton != binlog_hton)
   {
     hton->commit_by_xid(hton, ((struct xahton_st *)arg)->xid);
...
-int ha_commit_or_rollback_by_xid(XID *xid, bool commit)
+int ha_commit_or_rollback_by_xid(XID *xid, bool commit, THD *thd)
 {
...
+  if (!skip_binlog)
+  {
+    // when binlog is ON start from its transaction branch
     if (commit)
       binlog_commit_by_xid(binlog_hton, xid);
     else
       binlog_rollback_by_xid(binlog_hton, xid);
I think the purpose if this is to ensure that commit/rollback of XA calls
into the binlog code first (to binlog before doing it in the engines).

But as I think Serg also remarked already, this should not be needed. The
binlog is the (internal) transaction coordinator when it is enabled, we have
a dedicated interface for that, class TC_LOG. This is used elsewhere in
handler.cc to avoid special cases for whether the binlog is enabled or not.

So it looks like there needs to be extra virtual functions in TC_LOG for
the XA commit/rollback, and these can then be called on the tc_log object
and do what is required.

Then some logic in handler.cc can be omitted/simplified.

----

I think with this, I got through all of the code changes in branch
bb-10.6-MDEV-31949 as of commit 1bdc9bc8077c79e55cb4dbfc7cc47a193d17d54f
from January 20, 2024. So this concludes the current review. I have not
looked at the test cases in this round of review.

 - Kristian.

    

[MariaDB developers] Re: Third/last part of review of XA patches in bb-10.6-MDEV-31949

Kristian Nielsen