Hi,

On 12/5/24 18:02, Kristian Nielsen wrote:
What about the following idea?

1. Implement BEFORE_WRITE semi-sync mode. The master will not write
   transactions to the binlog until at least one slave has acknowledged.

2. This means that if the master crashes, when it comes back up it will have
   no transaction that does not exist on at least one running node
   (assuming at most a single failure at a time).

3. When the master restarts, it will go into read-only mode and wait for
   MaxScale (or other management system) to tell it what to do, similar to
   MDEV-34878.

4. If MaxScale decides to keep it as the master, it will briefly set it up
   as a slave and make sure it has replicated the latest GTID on any slave
   in the replication topology. Then it will be set read-write and continue
   as the master.

5. If MaxScale decides to promote another server as the new master, the old
   master is kept in read-only mode and configured as a slave. The
   BEFORE_WRITE ensures the old master will not be ahead of the new master.

This requires the ability in MaxScale to do (4).
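
As a rough illustration of steps 3 to 5, the manager's decision could be sketched as below. This is only a sketch under stated assumptions: the `Node` class, the `recover_restarted_master` function and the server names are hypothetical, positions are reduced to a single sequence number in one GTID domain, and nothing here reflects how MaxScale is actually implemented.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    gtid_seq: int           # highest sequence number seen (single GTID domain assumed)
    read_only: bool = True


def recover_restarted_master(master, replicas, keep_master):
    """Return the actions a manager might take after the old master restarts."""
    # Step 3: the restarted master waits in read-only mode.
    master.read_only = True
    actions = [f"{master.name}: read-only, waiting for manager"]

    # Replica with the newest position. With BEFORE_WRITE the master
    # is assumed never to be ahead of it.
    newest = max(replicas, key=lambda n: n.gtid_seq)

    if keep_master:
        # Step 4: catch up as a slave, then resume as the master.
        if master.gtid_seq < newest.gtid_seq:
            actions.append(
                f"{master.name}: replicate from {newest.name} up to seq {newest.gtid_seq}"
            )
            master.gtid_seq = newest.gtid_seq
        master.read_only = False
        actions.append(f"{master.name}: read-write, continue as master")
    else:
        # Step 5: promote the newest replica; the old master stays read-only as a slave.
        newest.read_only = False
        actions.append(f"{newest.name}: promoted to master")
        actions.append(f"{master.name}: configured as slave of {newest.name}")
    return actions


old_master = Node("server1", gtid_seq=100)
replicas = [Node("server2", gtid_seq=102), Node("server3", gtid_seq=101)]
for line in recover_restarted_master(old_master, replicas, keep_master=True):
    print(line)
```

In both branches the restarted server starts out read-only and only changes role once the manager has compared positions, which is the point of the proposal.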

I think this will be much more robust than having a crashed server try to
remove transactions already written to the binlog, and having to configure
the server to have one or another role when it starts up.

Instead, all servers in the replication topology always wait at startup for
the manager to replicate any missing transactions from the appropriate
server, and then either set it read-write as a master or continue as a
slave.

What do you think? Of course, this is all for the future, it requires
implementing BEFORE_WRITE in the server first. But I think it sounds
promising.

I think that sounds like a good idea. In step 4, instead of briefly replicating the missing transactions and resuming writes on the same node, MaxScale could just move all writes to the node with the newest GTID and turn off read-only there, essentially performing a switchover to another node. I think it might actually already handle this case, as it can happen with AFTER_SYNC.
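
To make the "node with the newest GTID" selection concrete, here is a minimal sketch. It assumes each node reports a single MariaDB GTID of the form domain-server_id-sequence and that all nodes share one domain; `parse_gtid`, `pick_switchover_target` and the server names are hypothetical and not taken from MaxScale.

```python
def parse_gtid(gtid):
    """Split a MariaDB GTID 'domain-server_id-sequence' into three integers."""
    domain, server_id, seq = (int(part) for part in gtid.split("-"))
    return domain, server_id, seq


def pick_switchover_target(positions):
    """Return the name of the node whose GTID has the highest sequence number.

    Assumes every node reports a single GTID in the same domain.
    """
    domains = {parse_gtid(g)[0] for g in positions.values()}
    if len(domains) != 1:
        raise ValueError("this sketch only handles a single GTID domain")
    return max(positions, key=lambda name: parse_gtid(positions[name])[2])


positions = {"server1": "0-1-100", "server2": "0-1-102", "server3": "0-1-101"}
target = pick_switchover_target(positions)
print(f"switch writes to {target} and turn off read-only there")
```

The restarted master is included in `positions` on purpose, so the same selection also covers the AFTER_SYNC case where it may be ahead of the others.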

However, I'd imagine that this BEFORE_WRITE mode might not be very useful for manually managed replication: you'd always have to switch over to another node when a server crashes. All in all, BEFORE_WRITE sounds promising and we'd definitely appreciate it, but it doesn't seem very useful outside of this somewhat niche use case. I do still think semi-sync is generally useful, so this seems like something that, as you said, should eventually be implemented in the binlog-in-engine mode.

I'm looking forward to seeing more progress updates on this; it all seems very interesting.

Markus

-- 
Markus Mäkelä, Senior Software Engineer
MariaDB Corporation