Thanks a lot Markus for the additional explanations, very useful.

Markus Mäkelä via developers <developers@lists.mariadb.org> writes:
> From what I remember (in relation to MDEV-33465), having the master roll
> back the transactions caused some problems if a quick restart happened. I
> think it was that if GTID 0-1-123 gets replicated due to AFTER_SYNC but
> then the master crashes and comes back up, it rolls back 0-1-123 due to
> --rpl-semi-sync-slave-enabled (or --init-rpl-role=SLAVE after MDEV-33465),
> but before replication starts back up, another transaction gets committed
> as GTID 0-1-123 on the master. Now when the replication asks for "GTID
> position after
Yes. _Either_ we need to be sure the master is ahead of all slaves, and we keep it as the master after crash-recovery. _Or_ we need to be sure at least one slave is ahead of the master, and we promote that slave as the new master and demote the old master to a slave after crash recovery. Otherwise the replication hierarchy cannot be reliably re-assembled after a master crash.
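To make the failure mode concrete, here is a rough illustration of the divergence (the GTID values, host name and transaction labels are made up):

  -- Before the crash:
  --   master binlog : ..., 0-1-122, 0-1-123 (trx A, acked by the slave via AFTER_SYNC)
  --   slave         : gtid_slave_pos = '0-1-123' (trx A applied)
  -- After the crash the master rolls back trx A, then commits an unrelated
  -- trx B, which is assigned the same GTID 0-1-123.

  -- When the slave reconnects, it asks for the position *after* 0-1-123:
  CHANGE MASTER TO MASTER_HOST='old-master', MASTER_USE_GTID=slave_pos;
  START SLAVE;
  -- The master skips its own 0-1-123 (trx B), so the slave never sees trx B
  -- and the two servers silently diverge from this point on.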
> From what I know and have seen, it is used when something fully automatic
> like MaxScale is used to handle failovers and rejoining of nodes to the
> cluster. Without it, I think that you would eventually have to start
> restoring the nodes from backups once enough failovers have happened.
What about the following idea?

1. Implement a BEFORE_WRITE semi-sync mode. The master will not write
   transactions to the binlog until at least one slave has acknowledged
   them.

2. This means that if the master crashes, when it comes back up it will
   have no transaction that does not exist on at least one running node
   (assuming at most a single failure at a time).

3. When the master restarts, it will go into read-only mode and wait for
   MaxScale (or another management system) to tell it what to do, similar
   to MDEV-34878.

4. If MaxScale decides to keep it as the master, it will briefly set it up
   as a slave and make sure it has replicated the latest GTID on any slave
   in the replication topology. Then it will be set read-write and continue
   as the master.

5. If MaxScale decides to promote another server as the new master, the old
   master is kept in read-only mode and configured as a slave. BEFORE_WRITE
   ensures the old master will not be ahead of the new master.

This requires the ability in MaxScale to do (4); a rough sketch of what that
step could look like is included below. I think this will be much more robust
than having a crashed server try to remove transactions already written to
the binlog, and having to configure the server to have one or another role
when it starts up. Instead, all servers in the replication topology always
wait at startup for the manager to replicate any missing transactions from
the appropriate server, and then either set it read-write as a master or
continue as a slave.

What do you think? Of course, this is all for the future, it requires
implementing BEFORE_WRITE in the server first. But I think it sounds
promising.
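A minimal sketch of step (4) in SQL, assuming the manager knows the most advanced slave and its GTID position (the host name, GTID value and exact sequence are made up, and connection credentials are omitted):

  -- On the restarted old master, still in read-only mode: temporarily
  -- replicate from the most advanced slave until caught up.
  CHANGE MASTER TO
    MASTER_HOST='most-advanced-slave',
    MASTER_USE_GTID=slave_pos;
  START SLAVE UNTIL master_gtid_pos='0-2-456';  -- latest GTID on any slave

  -- ... wait until the SQL thread has stopped at that position ...

  STOP SLAVE;
  RESET SLAVE ALL;          -- drop the temporary slave configuration
  SET GLOBAL read_only=0;   -- resume the master role, accept writes again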
> I think that implementing semi-sync in each application is probably a bit
> too much but doing it in a proxy like MaxScale does sound doable and the
> implementation would be essentially the same: delay the OK for
It sounds like the new binlog-in-engine should support semi-sync (perhaps not in the first release, but eventually). It could then support AFTER_COMMIT, which would be used when a crashed server is allowed to restart and continue by itself, as is the current default. And it could also support BEFORE_WRITE, where transactions are sent to the slave before being written to the binlog, and a crashed server comes up in read-only mode after restart. MaxScale could still implement its own version, but it is probably best if the new binlog implementation eventually supports some form of semi-sync as well.
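For reference, the wait point of the existing semi-sync implementation is already selectable; BEFORE_WRITE would be a new value of the same knob, hypothetical at this point:

  -- Existing settings on the master (semi-sync is built into MariaDB 10.3+):
  SET GLOBAL rpl_semi_sync_master_enabled = ON;
  SET GLOBAL rpl_semi_sync_master_wait_point = 'AFTER_SYNC';  -- or 'AFTER_COMMIT'

  -- Hypothetical future value discussed here, not implemented anywhere yet:
  -- SET GLOBAL rpl_semi_sync_master_wait_point = 'BEFORE_WRITE';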
> I had a vague memory of the group commit mechanism doing only one ACK per
> group but I might have remembered it wrong, I'm mostly a passive
I think it still does it for every commit, but this could be improved in the server (MDEV-33491).

 - Kristian.