Hi Kristian! Kristian Nielsen wrote on 2024-01-24 23:59:
andrei.elkin@pp.inet.fi writes:
Back in August I wrote a patch that converts XAs to normal transactions at binlogging time.
I don't think I saw that patch, but it sounds like exactly what I am proposing as an alternate solution... ?
Mine is just simplistic: never log any prepare part in the binlog at all. In the follow-up mail of Wed, 24 Jan 2024 13:41:15 +0200 I explained more about it. Yours is apparently better for the user, as it at least provides XA recovery on the original host.
Notice that at failover initialization, 'the few transactions' would then have to be "officially" prepared on the promoted server. However, MDEV-32020 shows that just two are enough for hanging. Therefore this may not be a solution for the failover case.
But what is your point? If the XA PREPAREs hang when applied at failover, they will also hang in the current implementation. The user who wants to use failover of XA PREPAREd transactions will have to accept severe restrictions, such as row-based primary-key only replication.
That's exactly why it must be optional and off by default, so it doesn't affect _all_ XA users (as it does currently).
We might have to resort to that, but first I'd take on analysis of anything that gets in the way, and hopefully it could be tackled, like MDEV-32020, with a menu of choices.
Kristian, this statement about row-based primary-key-only replication requires at least confirmation with a test. On the other hand, ROW may be somewhat difficult to dismiss, and I am not going to do that now, but the PK requirement is too much, as a UK already guarantees correctness; please read on. So far we only have MDEV-32020 about the non-UK ROW-format vulnerability. In our and our users' testing the current XA replication works when there is at least one unique key. And there is a theoretical background to validate that, except of course for implementation bugs. Let me narrow our context to the READ COMMITTED isolation level and the ROW format. After MDEV-30165/26682 removed GAP locks from prepared XAs, the latter cease to be potentially unilaterally slave-side conflicting with the transactions that follow them in binlog order. That's because a prepared XA can only hold conflicting X locks on index records, and its Insert-Intention locks are harmless to the GAP locks of a following (in binlog order) normal trx. In the presence of a Unique Key, therefore, XAP_1 (XAP := XA-Prepare) can *not* stop any normal Trx_2 (think of 1, 2 as gtid seq_no:s).
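To make the locking argument concrete, here is a minimal sketch (table, data and xid names are mine and purely illustrative; this is not the MDEV-32020 test):

    CREATE TABLE t (pk INT PRIMARY KEY, val INT) ENGINE=InnoDB;
    INSERT INTO t VALUES (1, 0), (2, 0);

    -- XAP_1, first in binlog order: after MDEV-30165/26682 no GAP lock
    -- survives the prepare, so it keeps an X record lock only on pk = 1.
    XA START 'xap_1';
    UPDATE t SET val = 10 WHERE pk = 1;
    XA END 'xap_1';
    XA PREPARE 'xap_1';

    -- Trx_2, next in binlog order: it locates its row through the unique key
    -- and X-locks only pk = 2, so it commits without waiting on 'xap_1'.
    BEGIN;
    UPDATE t SET val = 20 WHERE pk = 2;
    COMMIT;

    -- Possibly much later:
    XA COMMIT 'xap_1';

(My reading of why the non-UK case stays vulnerable, as in MDEV-32020: without a unique key, Trx_2's row lookup degenerates into a scan that can run into XAP_1's record lock and then has to wait for an XA COMMIT that, on a slave, is still far behind in the relay log.)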
Another architectural issue is that each XA PREPARE keeps row locks around on every slave until commit (or rollback). This means replication will break
Indeed, but there can be no non-GAP lock on any unique index.
Only by restricting replication to primary-key updates only. MariaDB is an SQL database, please don't try to turn it into a key-value store.
Let me state it this way: with at least one UK *and* GAP locks out of the picture, two binlogged transactions can conflict, if at all, only through their index X locks. A trx obviously X-locks all the records it modifies.
In really disastrous cases (of which we're unaware as of yet) there exists a safety measure: identify a prepared XA that got in the way of subsequent transactions, roll it back, and re-apply it like a normal transaction when its XA COMMIT finally arrives.
By "exist" I think you mean "exist an idea" - this is not implemented in the current code. In fact, this is exactly the point of my alternate solution, to make it possible to apply at XA COMMIT time. And once you have this, there is no point to apply it at XA PREPARE, since for most transactions, the XA COMMIT comes shortly after the XA PREPARE.
There is a point, and its name is yours, aka ... optimistic (parallel) execution :-). Why should we defer executing a day-long transaction when its events, maybe not all of them, are already around, and we can always retreat to the savepoint of its BEGIN?!
Non-unique indexes remain vulnerable but only in ROW format,
What do you mean by "vulnerable only in ROW format"? There are many cases of statement-based replication with XA that will break the slave in the current implementation. The test cases from MDEV-5914 and MDEV-5941 for example (from when CONSERVATIVE parallel replication was implemented) also cause current XA to break in statement/mixed replication mode.
The root of the issue is not XA. The latter may exacerbate what in the normal-transaction case might lead to "just" data inconsistency (the double quotes hint that the current XA hang might still be the better option for the user).
What data inconsistency?
(I don't mean the question of how the notion of consistency applies to the no-UK table case at all, do you :-?) Take for instance the MINIMAL row image. When master and slave are led to execute a trx using different indexes on a non-UK table, they can end up modifying different records.
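A rough illustration of the kind of ambiguity I mean (table and data are made up, not from any MDEV test):

    CREATE TABLE t (a INT, b INT, KEY k_a (a), KEY k_b (b)) ENGINE=InnoDB;  -- no PK, no UK
    INSERT INTO t VALUES (1, 1), (1, 1), (1, 2);

    -- Nothing here pins down one physical record: the identifying image
    -- (a = 1, b = 1) matches two rows, and which of them ends up modified
    -- depends on which index/scan order the executing server picks, which
    -- need not be the same on master and slave.
    UPDATE t SET b = 9 WHERE a = 1 AND b = 1 LIMIT 1;

(Under STATEMENT format the LIMIT makes this the classic "unsafe statement" warning for the same reason.)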
This for example means that a mysqldump backup can no longer be used to provision a slave, since any XA PREPAREd event at the time of the dump will
Notice this is regardless of how XAs are binlogged/replicated. Such a provisioned server won't be able to replace the original server at failover. In other words, rpl_xa_provision.test also relates to this general issue.
The problem is not failover to the new server; that will be possible as soon as all transactions that were XA PREPAREd at the time of the dump are committed, which is normally a fraction of a second.
Well, in your case there's the condition 'will be possible as soon as ...'. I don't have that condition: just run that sql file and a clone of the master is provisioned. It sure works like that for normal trxs, and does not for XA. To me it's a problem to resolve.
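To spell out the failing case as I see it, a sketch (the xid and the table are made up):

    CREATE TABLE orders (id INT PRIMARY KEY, status VARCHAR(16)) ENGINE=InnoDB;

    -- On the master:
    XA START 'x';
    INSERT INTO orders VALUES (1, 'pending');
    XA END 'x';
    XA PREPARE 'x';     -- binlogged now, with its own GTID (current implementation)

    -- mysqldump is taken at this moment: it records the current binlog/GTID
    -- position, but it can contain neither the still uncommitted rows of 'x'
    -- nor its prepared state.

    XA COMMIT 'x';      -- binlogged later, as the next GTID

A slave provisioned from that dump and started from the recorded position then receives the XA COMMIT 'x' event without having 'x' prepared in its engine; starting from an earlier position instead re-applies transactions whose effects are already in the dump.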
The problem is that in the current implementation, a slave setup from the dump does not have a binlog position to start from, any position will cause replication to fail.
Of course. The prepared XAs' gtids are treated as if they were those of committed trxs.
My thought was to copy the binlog events of all XA-prepared transactions, like gtid 0-1-1, into `dump.sql`.
So then you will need to extend the binlog checkpoint mechanism to preserve binlogs for all pending XA PREPAREd transactions, just as I propose.
Right. Let's take this part.
And once you do that, there's no longer a need to apply the XA PREPAREs until failover.
If the binlog is not available, then a list of gtids of XA-prepared:s
You will need the binlog, otherwise how will you preserve the list of gtids of pending XA-prepared transactions across server restart?
I am currently merging the final piece of XA recovery, which is XA_list_log_event. It's part of part IV of bb-10.6-MDEV-31949 that I need to update (ETA tomorrow). The list is needed in order to decide what to do with a prepared user XA xid at restart, when the binlog has already purged the prepared part.
The sequence

    XA PREPARE 'x';   #=> OK to the user
    *crash of master*
    *failover to slave*
    XA RECOVER;       #=> 'x' is in the prepared list

does not work in your simpler design.
That's the whole point of the "optionally we can support the specific usecase of being able to recover external XA PREPAREd transactions on a slave after failover". Of course my proposal is not implemented yet, but why do you think it cannot work?
Take the MDEV-32020 description case. Crash the hanging slave. Restart it as master, which entails executing those two prepared XAs, right? And with the same hanging effect on its own slave. I thought we got on the same page on this back in our zulip conversation.
There are other problems; for example the update of mysql.gtid_slave_pos cannot be committed crash-safe together with the transaction for XA PREPARE (since the transaction is not committed).
For this part I mentioned MDEV-21117 many times. It's going to involve
But this is still not implemented, right? (I think you meant another bug than MDEV-21117).
MDEV-21777, indeed. Thanks. Not implemented. We were four people and one tester on it at one point.. [ Let me reply to the more general subjects below later? There's hot release stuff awaiting my attention..
And I can't help underlining the real virtue of the XA replication as a pioneer of "fragmented" replication that I tried to promote to Kristian in
Fragmented replication should send events to the slave as soon as possible after starting on the master, so the slave has time to work on it in parallel. And any conflicts with the commit order of other transactions should be detected and the fragmented transaction rolled back and retried.
But the current XA implementation does exactly the opposite: It sends the events to the slave only at the end of the transaction (XA PREPARE), and it makes it _impossible_ to rollback and retry the prepare in case of conflict (by assigning a GTID to the XA PREPARE that's updated in the @@gtid_slave_pos).
The rational part is that the XA transaction is represented by more than one GTID. Arguably it's a bit uncomfortable, but such a generalization is fair to call flexible, especially looking forward to implementing fragmented transaction replication, or long-running and not necessarily transactional DML or DDL statements, including ALTER TABLE.
But I don't see any new code in the current XA implementation that could be used for more general "fragmented transaction replication" - what did I miss?
Why do you want to assign more than one GTID to a fragmented transaction, wouldn't it be better to binlog the fragments without a GTID (as in my proposed XA solution)?
... ^ ]
(I am backing up below specifically) slows things down. Big prepared XA transactions would be idling around - and apparently creating hot spots for overall execution when their commits finally arrive.
There is no slowdown from this. Running a big transaction takes the same time whether you end it with an XA PREPARE or an XA COMMIT.
The slowdown becomes apparent at failover. There's nothing good in having some operations delayed in general, especially since we must have agreed (in the past at least) on the dynamic forced (by "circumstances") rollback idea. (Personally I would add that hearing defenses like that from the author of optimistic parallel replication is as painful as blasphemy from the mouth of the local priest :-)!)
In fact, my proposal will speed things up, because only one 2-phase commit between binlog and engine is needed per transaction. While in current implementation two are needed, one for XA PREPARE and one for XA COMMIT.
And a simple sequence like this will be able to group commit together (on the slave):
XA PREPARE 't1'; XA COMMIT 't1';
XA PREPARE 't2'; XA COMMIT 't2';
XA PREPARE 't3'; XA COMMIT 't3';
I believe in the current code, it's impossible to group-commit the XA PREPARE together with the XA COMMIT on a slave?
Of course they can be in one group. That's part II, ready for review in bb-10.6-MDEV-31949.
- Kristian.
All the best, Andrei