Howdy Kristian, Monty!
Hi Monty,
As promised, here are my thoughts on the issues with the replication of external XA as implemented since MariaDB 10.5.
It's good to have Kristian's mail first so that I can respond to each item raised about the current XA replication framework.
I see at least two architectural issues with the current implementation.
One is that it splits a transaction into two separate GTIDs in the binlog (XA PREPARE and XA COMMIT). This breaks the fundamental principle that replication applies transactions one after the other in strict sequence, and that the replication position/GTID identifies the single last replicated transaction.
Let me soften this part. The rational core is that an XA transaction is represented by more than one GTID. Arguably that is a bit uncomfortable, but such a generalization is fair to call flexible, especially looking forward to implementing fragmented transaction replication, or long-running and not necessarily transactional DML or DDL statements, including ALTER TABLE. And of course fundamentally there is no violation: XA transactions are still committed in binlog order. As to Kristian's findings, they deserve full credit...
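To make the multi-GTID representation concrete, here is a minimal sketch (table, xid and gtid values are made up) of how one external XA looks under the current 10.5+ scheme:

    XA START 'x1';
    INSERT INTO t1 VALUES (1);
    XA END 'x1';
    XA PREPARE 'x1';   # binlogged with its own GTID, e.g. 0-1-10
    # ... unrelated transactions commit meanwhile: 0-1-11, 0-1-12 ...
    XA COMMIT 'x1';    # binlogged later as a separate GTID, e.g. 0-1-13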
This for example means that a mysqldump backup can no longer be used to provision a slave, since any transaction still in XA PREPAREd state at the time of the dump will be missing; a testcase rpl_xa_provision.test in MDEV-32020 demonstrates this.
The issue at hand is that mysqldump cannot represent prepared XA transactions:

    --connection one
    XA START 'x'; /* work */ XA END 'x';
    XA PREPARE 'x';                 # => gtid 0-1-1
    --connection two
    COMMIT /* trx_normal */;        # => gtid 0-1-2

    shell> mysqldump --gtid > dump.sql
    shell> grep 'gtid' dump.sql     # => gtid_slave_pos = 0-1-2

So a slave provisioned with `dump.sql` will not have the prepared xid:

    --connection slave
    XA RECOVER;                     # => empty

Notice this holds regardless of how XA is binlogged/replicated. Such a provisioned server will not be able to replace the original server at failover. In other words, rpl_xa_provision.test also relates to this more general issue. What to do with this case? My thought is to copy the binlog events of every XA-prepared transaction, like gtid 0-1-1, into `dump.sql`. If the binlog is not available, then a list of gtids of the XA-prepared transactions could be added instead, and for each of them the events would have to be retrieved from a server that still has them in its binlog.
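For illustration only, assuming the work of gtid 0-1-1 can be extracted from the master's binlog (e.g. via mysqlbinlog), completing the provisioning would conceptually amount to re-creating the prepared branch on the slave:

    --connection slave
    # re-create the work of gtid 0-1-1, leaving it prepared again
    XA START 'x';
    /* the work, extracted from the master's binlog */
    XA END 'x';
    XA PREPARE 'x';
    XA RECOVER;          # => now lists 'x', as on the original master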
Another architectural issue is that each XA PREPARE keeps row locks around on every slave until commit (or rollback). This means replication will break if _any_ transaction replicated after the XA PREPARE gets blocked on a lock.
Indeed, but there can be no non-GAP lock conflict on any unique index, and no GAP lock is held by any XA prepared on the slave - that is the grace of the MDEV-30165/MDEV-26682 efforts by Marko and Vlad Lesin. For really disastrous cases (of which we are unaware as of yet) there exists a safety measure: identify a prepared XA that got in the way of subsequent transactions, roll it back, and re-apply it like a normal transaction when its XA COMMIT finally arrives.
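For clarity, the situation that safety measure targets looks like this on the slave (a minimal sketch with made-up names, assuming the prepared 'x' had updated the row with pk = 1):

    --connection slave
    XA RECOVER;                                # => lists 'x', still prepared
    SELECT * FROM t1 WHERE pk = 1 FOR UPDATE;  # waits on the row lock that the
                                               # prepared 'x' still holds here
    # a replicated transaction needing the same lock would stall the applier
    # the same way, until XA COMMIT 'x' (or a rollback) finally arrives or
    # the lock wait times out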
This can easily happen; surely in many ways in statement-based replication, and even in row-based replication without a primary key, as demonstrated by testcase rpl_mdev32020.test in MDEV-32020.
Non-unique indexes remain vulnerable, but only in ROW format, as they can still end up locked as different subsets on master and slave. The root of the issue is not XA. XA may merely exacerbate what in the normal transaction case would lead to "just" data inconsistency (the quotes are there to hint that the current XA hang might still be the better option for the user). Monty has already offered to fix this with a table scan (consistency is then guaranteed). For my part, I would always put such a statement into the binlog in STATEMENT format.
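As a manual approximation of that last point (just a sketch; the intent would rather be for the server to pick this automatically for the risky statement; t2 stands for a hypothetical table with no primary key and only a non-unique index):

    --connection master
    SET @saved_format = @@SESSION.binlog_format;
    SET SESSION binlog_format = STATEMENT;   # log this one as SQL text
    UPDATE t2 SET b = b + 1 WHERE a = 1;     # autocommit, outside any transaction
    SET SESSION binlog_format = @saved_format;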
There are other problems; for example the update of mysql.gtid_slave_pos cannot be committed crash-safe together with the transaction for XA PREPARE (since the transaction is not committed).
For this part I have mentioned MDEV-21117 many times. It is going to bring the prepared XA and the autocommit INSERT into gtid_slave_pos together into 2PC, so that a binlog-less slave would recover as well.
I believe the root of the problem is architectural: external XA should be replicated only after they commit on the master.
But we'd lose failover.
Trying to fix individual problems one by one will not address the root problem and will lead to ever increasing complexity without ever being fully successful.
Well, there is still some work left to complete in this project, but I don't see where we are going to get stuck. And I can't help but underline the real virtue of XA replication as a pioneer of "fragmented" replication, which I tried to promote to Kristian in our face-to-face meetings.
The current implementation appears to only address a very specific and rare use-case, where enhanced semi-synchronous replication is used with row-based binlogging to try to fail over to a slave and preserve any external XA that was in PREPAREd state on the master before the failover. Mixed-mode replication, provisioning slaves with mysqldump, slaves not intended for failover, etc., seem not to have been considered and are basically broken since 10.5.
Back in August I wrote a patch that converts XAs into normal transactions at binlogging time. Could we reconcile on a server option that activates it?
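To illustrate the shape of such a switch (the option name below is purely hypothetical, not necessarily what the patch uses):

    [mariadbd]
    # hypothetical option name, for illustration only: when ON, an external
    # XA is written to the binlog as one ordinary transaction, i.e. a single
    # GTID, instead of separate XA PREPARE / XA COMMIT groups
    binlog_convert_external_xa = ON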
Here is my idea for a design that solves most of these problems.
Not to disregard your text, Kristian, and also to logically split the two subjects, let me process it in another reply tomorrow.
At XA PREPARE, we can still write the events to the binlog, but without a GTID, and we do not replicate it to slaves by default. Then at XA COMMIT we binlog a commit transaction that can be replicated normally to the slaves without problems. If necessary, the events for XA COMMIT can be read from the PREPARE earlier in the binlog, e.g. after a server crash/restart. We already have the binlog checkpoint mechanism to ensure that required binlog files are preserved until no longer needed for transaction recovery.
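To illustrate, this is roughly how the binlog would be laid out under this scheme (a sketch, not actual mysqlbinlog output):

    # [events of XA PREPARE 'x']   written without a GTID, not sent to
    #                              slaves by default
    # GTID 0-1-21                  some unrelated transaction
    # GTID 0-1-22                  another unrelated transaction
    # GTID 0-1-23                  XA COMMIT 'x': the complete transaction,
    #                              replicated like an ordinary commit; its
    #                              events can be re-read from the earlier
    #                              PREPARE if needed (e.g. after a crash)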
This way we make external XA preserved across server restart, and all normal replication features continue to work - mysqldump, mixed-mode, etc. Nice and simple.
Then optionally we can support the specific use-case of being able to recover external XA PREPAREd transactions on a slave after failover. When enabled, the slave can receive the XA PREPARE events and binlog them itself, without applying them. Then as part of failover, those XA PREPAREs in the binlog that are still pending can be applied, leaving them in PREPAREd state on the new master. This way, _only_ the few transactions that need to be failed-over need special handling; the majority can still just replicate normally.
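Sketching the failover step itself (a hypothetical procedure, just to show how little special handling is involved):

    # on the slave being promoted to master:
    # 1. scan its own binlog for XA PREPARE events whose matching
    #    XA COMMIT / XA ROLLBACK has not been binlogged yet
    # 2. apply each such transaction up to and including its XA PREPARE,
    #    leaving it in prepared state
    # 3. verify on the new master:
    XA RECOVER;    # => lists the pending xids, just as on the old master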
There are different refinements and optimizations that can be added on top of this. But the point is that this is a simple implementation that is robust, correct, and crash-safe from the start, without needing to add complexity and fixes on top.
I've done some initial proof-of-concept code for this, and continue to work on it on the branch knielsen_mdev32020 on github.
- Kristian.
Cheers, Andrei