Hi Andrei,

Good to hear from you, and to see that things are still going on with MariaDB GTID replication!

I tried looking at the MDEVs that you referred to. Here is how I understand the motivation for this, correct me if I'm wrong:

1. We're considering a setup with a "lossless semi-sync slave", where phantom reads are avoided by delaying storage engine commit until at least one slave has acknowledged a transaction.

2. If a master crashes, and comes back up after recovery, we want to prevent it from having any extra transactions that are not replicated to the rest of the replication servers (one of which will have been promoted as a new master). Such a transaction would prevent demoting the old master to a slave, as it would contain a rogue transaction not known to the rest of the replication topology.

3. To solve (2), we recover the old master with --tc-heuristic-recover=rollback. This makes it roll back (=discard) any transaction that was not fully committed to disk (does it also roll back the binary log for such transactions? I don't recall this detail).

4. Because of (3), there might be a transaction that originated on the old master (has its server_id), but which is now missing from that server (now a slave) as it was deliberately discarded.

5. To solve (4), we will use circular replication so that the missing transaction gets replicated back to the old master (now slave). And we want to have a slave replicate its own transactions (replicate_same_server_id), which is normally not done. And we want to use the server's binlog to know which transactions to replicate and which to ignore.

Is that correctly understood?

I must admit, this sounds horribly complex to me. So my question would be, do we really need to do this to solve the fundamental use case? In my belief, the main drawback with MariaDB replication and GTID is the complexity of especially the more advanced setups, which is barely feasible for even a skilled DBA to fully understand.
Does this come from trying to port over some specific MySQL replication/GTID feature to MariaDB GTID?

One important point is that MariaDB GTID was originally designed to keep the concept of master/binlog and slave/relaylog cleanly separated. A master transaction needs only be written to the binlog, there is no need to update any replication state (=overhead). Similarly, a slave transaction needs only be replicated to the storage engine, there is no need to write it to the binlog (=overhead). (IIRC, MySQL GTID replication required the binlog (and --log-slave-updates) to be enabled on a slave.) This is one of the improvements in the design of MariaDB GTID.

So I don't particularly like the idea of now involving the binlog in how a slave replicates transactions. On the contrary, the MASTER_USE_GTID=current_pos feature is one thing I regret from the original GTID design, precisely because it violates this principle. And I was happy to see work to deprecate this, which I think is the preferred direction.

So isn't there a better way to achieve the underlying use case? For example, why not just use the new MASTER_DEMOTE_TO_SLAVE option to set the correct slave position from which to replicate in the old master (now slave)? And then use the existing --gtid-ignore-duplicates option? Since this is GTID-based circular replication, there must already be a separate gtid_domain_id configured for each master?

Then you suggest changing the current --gtid-strict-mode and --gtid-ignore-duplicates options to a single option. I feel less strongly about this change; as you say, --gtid-strict-mode=OFF + --gtid-ignore-duplicates=ON may not make much sense. Still, I wonder what the need is for such a change? This would be a not-backwards-compatible change, with little real benefit, as the semantics would be the same?
Also, the --gtid-strict-mode option is highly recommended for any GTID use, and should preferably be as simple as possible to understand so that it can be widely used by DBAs. In contrast, --gtid-ignore-duplicates is a very complex option only useful for very complex replication topologies, and with many subtle pitfalls that are very hard to fully understand. So there is real benefit to keeping them separate.

As to --gtid-strict-mode=OFF: the main original motivation for this option (and its default value of "OFF") was to be able to enable GTID by default, for any server installation upgraded (was it 10.0 that introduced GTID? Long time back!). This is another big improvement (so I think) in MariaDB GTID over MySQL, that GTID is available by default, and compatible with any replication topology that works with old-style MySQL replication.

Is it really so that no-one is using old-style, non-GTID replication any more, with all its sloppy setups possible? Sure, then non-strict GTID mode can be deprecated, I suppose - but it seems doubtful that this would be the case for *all* users?

I hope I managed to explain where I didn't fully understand the motivation and need for some of these proposals? I'd be very interested to understand better any underlying necessities that I didn't catch from your mail, and maybe refine my comments.

And once again, great to see the work still going on with MariaDB replication and GTID!

Hope this helps,

 - Kristian.

Andrei Elkin <andrei.elkin@mariadb.com> writes:
Two points of concern of this email deal with observations around GTID strict mode.
1. About gtid_ignore_duplicates
Introduced by
hash: 2c2478b82260 author: unknown date: 2014-03-09 10:27:38
MDEV-5804: If same GTID is ... Before, the arrival of same GTID twice in multi-source replication would cause double-apply or in gtid strict mode an error.
with a clear motivation to overcome the gtid strict mode error policy. That is, `gtid_ignore_duplicates = ON' softens `gtid_strict_mode = ON' to ignore a duplicate gtid without a slave error/interruption. As the `gtid_strict_mode = OFF', `gtid_ignore_duplicates = ON' combination is hardly ever relevant/practical, `gtid_ignore_duplicates = ON' can then fairly be perceived as a "mild" submode of `gtid_strict_mode = ON', which suggests an enum `gtid_strict_mode' { OFF, ON, "ON_BUT_IGNORE_DUPLICATES_RATHER_THAN_ERROR" } that would obsolete `gtid_ignore_duplicates'.
Why I'd prefer this change also has to do with circular replication, where MDEV-28609 forces us to search for duplicates in the slave's binlog. MDEV-28609 presents a case where a same server-id transaction received by a semisync slave requires more sophisticated decision making than before (which was based on the same server-id alone) as to whether it is to be ignored or accepted.
Accepting even own server-id gtids (transactions) is necessary for MDEV-21117 semisync slave recovery, and the legacy `replicate_same_server_id' filtering is too coarse, as we need to consider the gtid seq_no. So under those conditions the same server-id transaction is accepted when it does not exist in the slave's binlog. Otherwise, when it does exist in the binlog, it is in effect a duplicate, arriving through a circular topology from a "source" that is the slave's local binlog, and is to be ignored, like same server-id gtids normally are. The ignore policy in this case corresponds to the proposed `gtid_strict_mode = "ON_BUT_IGNORE..."'.
It should naturally be the default, as it inherits its pre-gtid equivalent of `replicate_same_server_id = NO' and is safe (with respect to the slave's gtid state) for a real multi-source configuration as well.
`gtid_strict_mode = ON' would still make sense as a method to catch misconfigurations such as those of CHANGE MASTER's DO_DOMAIN_IDS and IGNORE_DOMAIN_IDS.
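To make the accept/ignore decision described above concrete, here is a minimal Python sketch. It is only an illustration of the rule as stated in this mail, not server code: the `Gtid` tuple, the `decide` function and the set-based "binlog" are hypothetical names invented for the example, and foreign server-id events are simply passed through.

```python
from typing import NamedTuple, Set


class Gtid(NamedTuple):
    # MariaDB GTIDs have the form domain_id-server_id-seq_no.
    domain_id: int
    server_id: int
    seq_no: int


def decide(event: Gtid, own_server_id: int, binlog: Set[Gtid]) -> str:
    """Hypothetical model of the proposed policy for an incoming event.

    Assumption: `binlog` stands in for a lookup of GTIDs already present
    in the slave's own binary log.
    """
    if event.server_id != own_server_id:
        # Not our own transaction: handled by normal replication rules,
        # which this sketch does not model.
        return "apply"
    if event in binlog:
        # Own transaction that came back around the circle: a duplicate.
        return "ignore"
    # Own transaction missing from the binlog (e.g. rolled back during
    # recovery): it has to be accepted to fill the gap.
    return "apply"


if __name__ == "__main__":
    binlog = {Gtid(0, 1, 10), Gtid(0, 1, 11)}
    print(decide(Gtid(0, 1, 11), own_server_id=1, binlog=binlog))
    print(decide(Gtid(0, 1, 12), own_server_id=1, binlog=binlog))
```

The point of the sketch is that the server-id alone no longer decides the outcome; the seq_no (via the binlog lookup) does.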
Your take?
2. But it's not clear what `gtid_strict_mode = OFF' is for.
The ON mode is pretty flexible, allowing gtid seq_no holes, that is, setting the gtid state to any value greater than the current (logical timestamp) value. Setting the gtid "clock" backward is still possible, though it requires a slave restart.
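As a rough illustration of that rule, the following Python sketch models strict mode as a per-domain check that the seq_no only moves forward, with holes allowed. It is a simplified model under that assumption, not the server's implementation; `check_strict` and the dictionary-based state are hypothetical names used only here.

```python
from typing import Dict


def check_strict(state: Dict[int, int], domain_id: int, seq_no: int) -> bool:
    """Accept a GTID only if its seq_no is greater than the last one
    seen in the same domain; update the state when accepted.

    Assumption: `state` maps domain_id to the highest seq_no so far.
    """
    last = state.get(domain_id, 0)
    if seq_no <= last:
        # Going backward (or repeating) is what strict mode rejects.
        return False
    # Any larger value is fine, so holes in the sequence are allowed.
    state[domain_id] = seq_no
    return True


if __name__ == "__main__":
    state: Dict[int, int] = {}
    print(check_strict(state, 0, 5))  # first event in domain 0
    print(check_strict(state, 0, 9))  # hole between 5 and 9
    print(check_strict(state, 0, 7))  # backward in domain 0
```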
So how would you comment on an idea to deprecate OFF?
Cheers,
Andrei