Hi Andrei,

Here is the next part of my review. First, I want to try to really explain a general problem I see in several places in these patches. Here is one example:
-static XID_cache_element *xid_cache_search(THD *thd, XID *xid)
+XID_cache_element *xid_cache_search(THD *thd, XID *xid)
 {
   DBUG_ASSERT(thd->xid_hash_pins);
   XID_cache_element *element=
     (XID_cache_element*) lf_hash_search(&xid_cache, thd->xid_hash_pins,
                                         xid->key(), xid->key_length());
   if (element)
+  {
+    /*
+      parallel slave XA-"complete" will acquire its actual xid after
+      it's ready for ordered commit. Meanwhile the transaction's handler
+      will use a dummy prepared xid element.
+
+      TODO: lsu = zero.
+    */
+    if ((!(thd->rgi_slave && thd->rgi_slave->is_parallel_exec) ||
+         thd->transaction->xid_state.is_dummy_XA()))
We _really_ need to try as hard as possible to avoid these random conditions all over the codebase, like here:

    thd->rgi_slave && thd->rgi_slave->is_parallel_exec

This xid_cache_search() function is a simple function, with a simple purpose. It looks up an XID, and "acquires" it so it doesn't get removed by another thread in parallel. That's it.

But now you add all this complexity, of "oh, but except if it's a parallel replication thread that does the lookup, or maybe if the thread has a dummy XA state, then something I don't quite follow...". This kind of stuff makes the code impossible to understand and maintain.

The further away from the main replication code we get, the more wrong it is to have these kinds of conditions. This is a low-level function for external XA management, so checking for thd->rgi_slave->is_parallel_exec is _very_ wrong. And it looks completely unnecessary. The callers you add in your patch already check the same condition, for example:
+  if (thd->rgi_slave && thd->rgi_slave->is_parallel_exec)
+  {
+    DBUG_ASSERT(thd->transaction->xid_state.is_dummy_XA());
+
+    auto xs= xid_cache_search(thd, xid);
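So the caller already knows it is a parallel slave applier. To make this concrete, a rough sketch of how the lookup could be split out; xid_cache_search_dummy_XA() is just a name I'm making up, and error/release handling is omitted:

    /*
      XID lookup for a parallel slave applier of XA-"complete".
      The applier works on a dummy prepared xid element until it is
      ready for ordered commit; only then does it acquire the real
      element. (A sketch only.)
    */
    static XID_cache_element *
    xid_cache_search_dummy_XA(THD *thd, XID *xid)
    {
      DBUG_ASSERT(thd->rgi_slave && thd->rgi_slave->is_parallel_exec);
      if (!thd->transaction->xid_state.is_dummy_XA())
        return NULL;                      /* not ready; keep using the dummy */
      return xid_cache_search(thd, xid);  /* the simple, unchanged lookup */
    }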
So you can just add a different function that does the "dummy XA" thing, something like the sketch above, and call that instead. And then the comment on the new function can explain what it does.

*Please* try to understand this issue, and try to avoid the problem in all places in the XA code (and elsewhere). It's really important; the replication code is notoriously difficult to maintain as it is. It's absolutely my main concern about this whole XA work. The XA feature can work well or poorly, it's not a widely used feature, I can live with it. But having the replication code maintainable and understandable is critical for users going forward.

Think of going into a bakery shop. Imagine if I said to the baker: "My name is Kristian, and I'm currently renovating my living room". The baker would then have to decide (or guess) that this means I will be wanting a loaf of bread and 5 cookies. That seems very complicated and error prone if all customers were like that, doesn't it? Isn't it much preferable if each customer just says what she or he wants? Let's make the functions in our code the same way: the caller says what it wants, the function delivers.

Next, on a high level, I looked at MDEV-21777 and tried to understand the plans for fixing that the (GTID) slave is no longer crash safe with XA. I think the idea is the following. Given a user external XA xid X, from X we construct a new xid, say X_PREPARE or X_COMPLETE (for XA PREPARE and XA COMMIT/ROLLBACK, respectively). Then we replicate the XA PREPARE like this:

    XA START X
    ... apply events from master
    XA END X
    XA START X_PREPARE
    INSERT INTO mysql.gtid_slave_pos VALUES (...)
    XA END X_PREPARE
    XA PREPARE X_PREPARE    # This is not binlogged
    # Fsync the X_PREPARE trx inside InnoDB
    XA PREPARE X            # Binlogged as normal
    # Fsync the XA PREPARE X inside InnoDB
    XA COMMIT X_PREPARE

The XA COMMIT X is similar, except there are no events to apply initially from the master, and we use X_COMPLETE instead of X_PREPARE.

Then in case of slave crash, after normal crash recovery (InnoDB and, if enabled, binlog), a slave recovery takes place. For every xid X_PREPARE found prepared inside InnoDB, check if X is also found prepared. If so, commit X_PREPARE, else roll it back. And vice versa: for every xid X_COMPLETE found, roll it back if X is found prepared in InnoDB, else commit it. Is that correctly understood?

One reason I'm asking is that I'm wondering how this will affect the scheduling of XA in parallel replication. Like, in case of duplicate XID, the X_PREPARE or X_COMPLETE could conflict between two transactions, so maybe the wakeup_subsequent_commits can only be done after the XA COMMIT of the X_PREPARE and X_COMPLETE? Which I think would require changes to the current patch.

But then another thought follows from this. If we anyway need to construct new XIDs based on the external user XID, like X -> X_PREPARE, then we either need to "steal" some of the space available for users (64+64 bytes gtrid/bqual), or we could add extra space internally, not available/visible to users. And if we do this, could we append a unique identifier to each external XID (again invisible outside), just a running number? It seems this could simplify a number of cases where we have to handle re-use of the same XID. With a unique number appended, there would be no duplicate XIDs, simplifying the logic of binlog recovery, parallel slave scheduling, and possibly other cases. What do you think?
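To check my understanding of the recovery pass described above, here it is in rough pseudo-code. This is just a sketch; is_x_prepare(), is_x_complete(), base_xid_of() and found_prepared() are helper names I'm making up:

    /*
      Sketch of the slave XA recovery pass, run for each xid Y found
      in prepared state inside InnoDB after normal crash recovery.
    */
    static void xa_slave_recover_one(const XID &Y)
    {
      if (is_x_prepare(Y))
      {
        /* X also prepared: the gtid_slave_pos update can be kept. */
        if (found_prepared(base_xid_of(Y)))
          xa_commit(Y);
        else
          xa_rollback(Y);   /* crashed before XA PREPARE X; redo the group */
      }
      else if (is_x_complete(Y))
      {
        /* X still prepared: the XA COMMIT/ROLLBACK X did not complete. */
        if (found_prepared(base_xid_of(Y)))
          xa_rollback(Y);   /* roll back, so the XA "complete" is redone */
        else
          xa_commit(Y);
      }
    }

Next, my comments from reading more through the patch: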
diff --git a/sql/rpl_parallel.cc b/sql/rpl_parallel.cc
index c26263401b8..1ddf520af59 100644
--- a/sql/rpl_parallel.cc
+++ b/sql/rpl_parallel.cc
@@ -2455,6 +2451,28 @@ rpl_parallel_entry::choose_thread(rpl_group_info *rgi, bool *did_enter_cond,
   return thr;
 }
+/**
+  The class implements rpl_parallel_entry::xap_window.
+  See @c handle_xa_prepare_duplicate_xid, its main user.
+*/
+class Sliding_window
+  :public Dynamic_array<std::pair<std::size_t, uint64>>
I still don't understand why you want to use Dynamic_array. From what I can see, you never resize it; it's only allocated once with N elements. Isn't it just adding unnecessary complexity, when a simple array would do exactly the same job (see the sketch below)?
+{
+public:
+  uint32 cxap_lhs, cxap_rhs;
+  Sliding_window(PSI_memory_key psi_key, uint prealloc) :
+    Dynamic_array<std::pair<std::size_t, uint64>>(psi_key, prealloc)
+  {
+    DBUG_ASSERT(prealloc == max_size());
+
+    elements(max_size());
Here, you allocate the number of elements. Why then does the code use max_size() everywhere, and not simply elements()?
+    cxap_lhs= cxap_rhs= 0;
+    for (size_t i= 0; i < max_size(); i++)
+      at(i)= {0, 0};
+  }
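To illustrate what I mean by a simple array, a rough sketch (Xap_window is a made-up name, error handling omitted):

    /* The window as a plain once-allocated circular buffer. */
    struct Xap_window
    {
      std::pair<std::size_t, uint64> *buf;
      uint32 size;
      uint32 cxap_lhs, cxap_rhs;

      Xap_window(PSI_memory_key psi_key, uint32 n)
        : size(n), cxap_lhs(0), cxap_rhs(0)
      {
        buf= (std::pair<std::size_t, uint64> *)
          my_malloc(psi_key, n * sizeof(*buf), MYF(MY_WME));
        for (uint32 i= 0; i < n; i++)
          buf[i]= {0, 0};
      }
      ~Xap_window() { my_free(buf); }
    };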
@@ -2798,6 +2818,107 @@ abandon_worker_thread(THD *thd, rpl_parallel_thread *cur_thread,
   mysql_cond_signal(&cur_thread->COND_rpl_thread);
 }
+/**
+  Check the concurrency status of an input @c xid with ones
+  that are already in progress.
+  Any new @c xid of XA-prepare (@c is_xap is true then) is appended to
+  a sliding window. It's released "naturally" as the window is designed
+  as circular buffer.
Don't you need to also append an XA COMMIT to the window? Otherwise, consider the following sequence:

      1: XA PREPARE 'a'
      2: <unrelated transaction>
      3: ...
    ...
    100: XA COMMIT 'a'
    101: XA PREPARE 'a'

I.e. assume that the XA PREPARE is far back and has slid out of the window (assume less than 100 worker threads); don't you still need to block the XA PREPARE in (101) from starting in parallel with the XA COMMIT in (100)?
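(Roughly what I mean, as a sketch; I'm assuming the window stores (hash, sub_id) pairs as in the patch, and is_xa_complete, sub_id and window_append() are names I'm making up here:)

    /* Record XA COMMIT/ROLLBACK xids in the window too, so that a later
       XA PREPARE of the same xid is also made to wait for a recent
       XA-"complete", not only for a recent XA PREPARE. */
    if (is_xap || is_xa_complete)
      window_append(e->xap_window, xid_hash, sub_id);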
+static bool
+handle_xa_prepare_duplicate_xid(rpl_parallel_entry *e, XID *xid, bool is_xap)
<cut>
+  std::size_t xid_hash=
+    my_hash_sort(&my_charset_bin, xid->key(), xid->key_length());
Why use a hash key here, and not simply compare the actual XIDs? (I think it's fine; the occasional hash collision should not be a problem. But a short comment explaining why a hash is used would be in order, as naively it would seem simpler just to compare the XIDs.)
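Something along these lines, perhaps:

    /*
      Only a hash of the XID is stored in the window, not the full XID,
      keeping the elements small and fixed-size. A hash collision merely
      causes an unnecessary wait (two distinct XIDs treated as
      duplicates), never an incorrect result, so collisions are harmless.
    */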
@@ -3046,6 +3168,18 @@ rpl_parallel::do_event(rpl_group_info *serial_rgi, Log_event *ev,
       new_gco= true;
       force_switch_flag= 0;
       gco= e->current_gco;
+      /*
+        Take care of duplicate xids in multiple XA_prepare_log_event:s,
+        XA-"complete" should not race its XA-prepare parent either.
+        When the current transaction's xid has been seen and its
+        transaction may still be in process this event group gets flagged
+        for synchronization, i.e to wait for prior commits at its
+        execution start.
+      */
+      if ((gtid_flags & Gtid_log_event::FL_PREPARED_XA) &&
+          (is_dup_xac= handle_xa_prepare_duplicate_xid(e, &gtid_ev->xid,
+                                                       gtid_flags &
+                                                       Gtid_log_event::FL_PREPARED_XA)))
+        gtid_flags&= ~Gtid_log_event::FL_ALLOW_PARALLEL;
This won't work in "aggressive" mode, will it? Because FL_ALLOW_PARALLEL is only used in optimistic mode, not aggressive. Besides, you already have another check for is_dup_xac below:
+  rgi->speculation=
+    is_dup_xac && (gco->flags & group_commit_orderer::FORCE_SWITCH) ?
+    rpl_group_info::SPECULATE_WAIT : speculation;
So you can just have a single condition here, right?

    if (is_dup_xac)
      rgi->speculation= rpl_group_info::SPECULATE_WAIT;

But another thing: handle_xa_prepare_duplicate_xid() takes an argument is_xap, but you only ever pass it as true, so it's redundant. I'm guessing that originally you also had something that would make the XA COMMIT wait for the XA PREPARE, but this was then later changed to the use of dummy_element. If so, you should remove the parameter.

If you had used the xa_prepare_log_event() for the dependency PREPARE->COMMIT also, then the XA COMMIT could simply have registered a wait_for_prior_commit() on the precise event group of the XA PREPARE; that seems simple. I'm wondering why you thought it better to have a dummy xid for the XA COMMIT?

I'll need to study the logic in more detail to understand it, but it looks to me that what will happen is that the code will binlog the XA COMMIT (or ROLLBACK) without any checks, and then an error will occur in the storage engine if the xid is not in prepared state in the engine - is that correct? Well, except that xacommit_handlerton() silently ignores any errors from the engines. Does this mean that problems around this will be silently ignored by replication? Or is there some other place where such errors will be caught later? It's not clear to me.

For example, since the mysql.gtid_slave_pos update is not crash safe, we could have a crash lead to double apply of the XA COMMIT. Will this silently succeed, and binlog the XA COMMIT twice? What if the user (incorrectly) did a manual XA COMMIT, and the real XA COMMIT then gets replicated? This is an error case, but can we be sure we will not get an assertion or even a crash somewhere in the code?
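(Coming back to the wait_for_prior_commit() idea above, what I mean is roughly the following sketch. I'm assuming the scheduling code could keep a pointer, here called prepare_rgi, to the rpl_group_info of the matching XA PREPARE when it detects the duplicate:)

    /* Make the XA COMMIT's event group wait for the full commit of
       the event group that did the XA PREPARE (sketch only). */
    rgi->commit_orderer.register_wait_for_prior_commit(
        &prepare_rgi->commit_orderer);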
@@ -3114,7 +3249,14 @@ rpl_parallel::do_event(rpl_group_info *serial_rgi, Log_event *ev,
       if (gtid_flags & Gtid_log_event::FL_DDL)
         force_switch_flag= group_commit_orderer::FORCE_SWITCH;
     }
-    rgi->speculation= speculation;
+    /*
+      When DDL flagged XA sees an ongoing one with "duplicate" xid
+      its execution will have to wait for the xid owner full completion.
+      SPECULATE_WAIT provides that, regardless of the parallel mode.
+    */
+    rgi->speculation=
+      is_dup_xac && (gco->flags & group_commit_orderer::FORCE_SWITCH) ?
+      rpl_group_info::SPECULATE_WAIT : speculation;
This I don't understand at all. How can an XA PREPARE/COMMIT/ROLLBACK be marked as "ddl"?
@@ -3131,7 +3273,7 @@ rpl_parallel::do_event(rpl_group_info *serial_rgi, Log_event *ev,
       Remember the count that marks the end of the previous batch of event
       groups that run in parallel, and allocate a new gco.
     */
-    uint64 count= e->count_queued_event_groups;
+    auto count= e->count_queued_event_groups;
(This is a trivial thing in the big picture, but it's better to avoid unrelated changes in patches, and to keep the explicit type to make the code more readable.)

Let me send this off for now, and then next I'll try to understand the part with the dummy xid and some logic around has_binlog_hton()...

 - Kristian.