[MariaDB developers] Re2: Starting on review of XA patches in bb-10.6-MDEV-31949

9 Feb 2024

Salve Kristian,

this is the 2nd follow-up replay to your 1st review,
to cover few deferred notes in particular.
I mark with [x] any addressed [todo] note.
...
Kristian Nielsen <knielsen@knielsen-hq.org> writes:
...
Hi Andrei,
So you're in this context
1 innobase_commit_by_xid(
 2 /*===================*/
 3     handlerton*    hton,
 4     XID*        xid)    /*!< in: X/Open XA transaction 
identification */
 5 {
 6     DBUG_ASSERT(hton == innodb_hton_ptr);
 7
 8     DBUG_EXECUTE_IF("innobase_xa_fail",
 9             return XAER_RMFAIL;);
10
11     if (high_level_read_only) {
12         return(XAER_RMFAIL);
13     }
14
15     if (trx_t* trx = trx_get_trx_by_xid(xid)) {
16         THD *thd = current_thd;
17         if (thd)
18             thd_binlog_pos(thd, &trx->mysql_log_file_name,
19                        &trx->mysql_log_offset);
20         if (trx->mysql_log_offset > 0)
21             trx->flush_log_later = true;
and actually considering `trx->mysql_log_offset` as the return value?
... I could not grasp your way of thinking in this place.
Could you please expand on?
In another mail yesterday concerning with this part I suggested to 
replace
    if (trx->mysql_log_offset > 0)
with
    if (thd_binlog_format(thd) != UNSPEC)
...
...
It would be much better if this could be done similar to how normal 
commits
are done. We have the commit_ordered handlerton which does the fast 
part of
commit-to-memory. And then the slow part which does the rest in commit
handlerton.
Notice that the *slow* part of the XA commit is the same as in the 
normal
case:
trx_commit_complete_for_mysql
*innobase_commit*
commit_one_phase_2
ha_commit_one_phase
The fast parts of the two - the XA and normal - could be unified - into
innobase_commit_ordered()
but I did not think (had time for) towards that.
...
So we could have commit_by_xid_ordered() and commit_by_xid(). The 
problem is
that commit_by_xid_ordered() needs to release the xid, so it becomes 
harder
to get hold of the transaction (which the commit handlerton just finds 
from
the thd). Maybe commit_by_xid_ordered() would need to return some trx
handle.
What part of the commit is skipped by this in InnoDB? Is it _only_ 
flushing
the log up to the LSN of the commit record, or are there more things?
So the xa's fast innobase_commit_by_xid() skips a "slow"
  trx_commit_complete_for_mysql()
and the same does the normal's innobase_commit_ordered().
...
How is another storage engine supposed to handle this? This also needs 
to be
documented somewhere, eg. in comments on the handlerton calls.
This also means that the --innodb-flush-log-at-trx-commit=3 is not 
supported
for XA COMMIT when it's done from a different session that the XA 
PREPARE.
For example, with --innodb-flush-log-at-trx-commit=3 but 
--sync-binlog=0,
normal transactions are still durable in InnoDB. But external XA
transactions will not be. This could be solved if the commit_by_xid() 
was
split into a fast and slow part similar to normal commit.
[todo] Let me process that carefully later.
[x]

My 1st reply already said, just missed to underscore that, that the XA
also the slow and fast commit paths, therefore iflatc=3 is good.
The past paths are  commit_one_phase_2 and innobase_commit_by_xid
functionally equivalent. One would vote to unify them, and I'd accept 
that.
...
...
Probably you'll say that this works for now, and perhaps that's true. 
But at
least it really _really_ needs some detailed comments explaining what 
is
going on, and what should be kept in mind if modifying/improving this 
code
in the future, otherwise it will be impossible for the next developer 
to
understand.
[todo] Agreed.
...
...

diff --git a/sql/log.cc b/sql/log.cc
index e54d4087d46..7e3dbeaef29 100644
--- a/sql/log.cc
+++ b/sql/log.cc
...
@@ -9365,18 +9381,84 @@ TC_LOG::run_commit_ordered(THD *thd, bool 
all)
     all ? thd->transaction->all.ha_list : 
thd->transaction->stmt.ha_list;
mysql_mutex_assert_owner(&LOCK_commit_ordered);
-  for (; ha_info; ha_info= ha_info->next())
+
+  if (!ha_info)
Your patch puts a lot of XA/binlog logic into run_commit_ordered(). 
But
that's not appropriate. The run_commit_ordered() function has the 
purpose
solely to call the commit_ordered() handlerton in each engine, and 
it's also
called from TC_LOG_MMAP (that's why it's TC_LOG::, not 
MYSQL_BIN_LOG::).
Instead make a new function that handles the part of binlog commit 
that
needs to be done under LOCK_commit_ordered. Be sure to comment it that 
this
runs under this highly contended mutex, and needs to do as little work 
as
possible.
Agreed.
[todo]
...
...
This function can then call run_commit_ordered(). It can also be used 
to
share a few bits of code that are now duplicated between the
opt_optimize_thread_scheduling and !opt_optimize_thread_scheduling 
cases,
like:
++num_commits;
    run_commit_ordered(...)
    entry->thd->wakeup_subsequent_commits()
    if (entry->queued_by_other) { ... }
(if it makes sense to do so in the actual code, only if makes things
simpler, that's easier to decide when working with the actual code).
Will see.
[todo]
...
...
There sesems to be 3 new cases:
- XA PREPARE
 - XA COMMIT from same session as XA PREPARE
 - XA COMMIT of detached XA transaction
The logic to decide this was already done earlier, when the event 
group was
queued for group commit, eg. to decide which end_event to use. For 
example
there is no need to call xid_cache_search() here, that can be done
earlier
...
...
(and it's also wrong to call my_error() here, as we're in the wrong 
thread
context).
Indeed. The error's thread context was overlooked. Thanks for catching 
it.
...
...
So the logic about what to do should be placed higher up in the call 
stack,
ideally where we already know, eg. binlog_commit_by_xid(). And errors 
such
as ER_XAER_NOTA can be thrown already there, and the decision on what 
to do
put consisely as a flag into the struct group_commit_entry, so the 
logic
during binlog group commit becomes as simple as possible.
(It might even make sense to simply store a function pointer to the 
method
that should be called to handle this; normal, attached XA COMMIT, 
detached
XA COMMIT, XA PREPARE. But again it's easier when actually working 
with the
code to decide if this is simpler than a couple flags or case on enum
value).
[todo] Let me process the above suggestion block a bit later.
[x]
To the need of  xid_cache_search() etc, let me combine this part to
reply to the 2nd review mail's notes around a Dummy XID though ...
(read on)
...
...
...
+    if (thd->rgi_slave && thd->rgi_slave->is_parallel_exec)
+    {
+      DBUG_ASSERT(thd->transaction->xid_state.is_dummy_XA());
+
+      auto xs= xid_cache_search(thd, xid);
+      if(!xs)
+      {
+        my_error(ER_XAER_NOTA, MYF(0));
+        return;
+      }
+      else
+      {
+        thd->transaction->xid_state.xid_cache_element= xs;
+      }
+      DBUG_ASSERT(thd->transaction->xid_state.is_recovered());
+    }
+
     plugin_foreach(NULL, commit ? xacommit_handlerton : 
xarollback_handlerton,
                    MYSQL_STORAGE_ENGINE_PLUGIN, &xaop);
     xid_cache_delete(thd);
Why this check for thd->rgi_slave->is_parallel_exec here deep in 
binlog
group commit? That seems very wrong, and again no comment whatsoever
explaining what is going on, so again I have to try to guess...
I'm guessing that it's something with calling xid_cache_delete() which 
needs
to find thd->transaction->xid_state.xid_cache_element ?
Right.
...
But this could be done much better in a higher-level part of the code 
that's
specific to xa and (parallel) replication, not here deep down in 
binlog
group commit code.
And why should this be needed only for parallel replication, not for
non-parellel replication or non-replication SQL? I don't understand.
It's a kind of optimization for the slave side, in that `xid` of XAP
(Prepare) gets released for a XAC (Commit) the soonest.
... the reason's said here.
...
...
A couple remarks about the tests I pushed to 
knielsen_mdev31949_review:
...
I think the failures in rpl.rpl_recovery_xa I linked are from 
duplicate XID
in XA PREPARE ; XA COMMIT ; XA PREPARE. There comes an ambiguity 
between
whether the transaction prepared in the engine at recovery is from the 
first
XA PREPARE (so it should be committed) or the second (so it should be 
rolled
back). This happens when the second XA PREPARE is in the engine but 
not
binlogged.
One way to fix could actually be to use the binlog position stored 
inside
InnoDB.
That's a neat reminder on the Innodb durably committed binlog position!
Indeed with the duplicate xid case there exists this trouble of
identification and the Engine is able and must come to rescue.
...
...
If the XA COMMIT is before this binlog position, then we know the
first transaction became durable in InnoDB, so we know we have to roll 
back
the current prepared transaction in the engine. Otherwise we should 
commit
it. I see the recovery code already uses binlog positions to decide 
the fate
of pending XA transactions in the engines, but I did not yet fully
understand what is going on there.
Another way could be to binlog an extra event before the second XA 
PREPARE
with the duplicate XID.
Well, this does not cover the case of prior to crash
XA PREPARE only made to Engine. The above consulting with Innodb
resolves the ambiguity generally.
...
...
This event would mean "This xid is now durable
committed in the engine", so we would know not to commit any prepared 
trx in
the engine with the same xid. This event would only be needed in case 
of XA
PREPARE with a duplicate xid. This though requires the prior XA COMMIT 
to
fsync inside InnoDB first, I think.
Thanks for this piece of analysis!
[todo] I am processing this a bit later.
[x]
...
...
Another test failure in rpl.slave_provision_xa I think happens because 
the
XA PREPARE does not update the binlog position inside innodb.
Maybe the way
to solve this problem is to have the slave be idempotent, and silently
ignore any duplicate XA PREPARE if the xid is already prepared?
Indeed, the fact that XAP does not update the innodb's binlog position 
must cause
few concerns.
I still need to have a close look.

[todo]
...
[todo] I am processing this a bit later.
[x]
...
...
About the "ToDo" of using some Xid_list_log_event. I think this is a 
way
during recovery to get the state of the XA PREPARE/COMMIT's in earlier
binlog files before the last binlog checkpoint, if I understood things
correctly?
True.
...
Another way to do this is to delay the binlog checkpoints until all XA
PREPARE in a binlog file have been XA COMMIT'ted (or XA ROLLBACK'ed), 
so
that they will all be scanned during recovery. I actually implemented 
this
in my branch knielsen_mdev32020. The advantage of this is that we may 
anyway
need this to provide the XA PREPARE events to a slave set up from a 
logical
backup (like mysqldump), to avoid purging XA PREPARE events too early.
And
then it's a bit simpler to simply scan an extra file rather than both 
scan
and read Xid_list_log_event. But I think Xid_list_log_event could work 
fine
as well, just mentioning it for you to consider.
Xlle can't be generally avoided so we have to take conservatively 
first.
To add up to a separate thread discussion over this problematics,
I've grown more inclined with the first implementation basing on your
idea to defer Binlog-CheckPoint.
That might work for a majority of XA users, Xid list to be added later 
if
time permits or will be convincingly :-) requested.

Now I am turning to account the rest - 2,3 reviews you sent that only
glanced through - to reply within few hours.

Cheers,

Andrei

    

[MariaDB developers] Re2: Starting on review of XA patches in bb-10.6-MDEV-31949

andrei.elkin＠pp.inet.fi