Hi,

On 12/4/24 18:08, Kristian Nielsen wrote:
> Markus Mäkelä via developers <developers@lists.mariadb.org> writes:
>> On 12/4/24 13:19, Kristian Nielsen via developers wrote:
>>> 5. A more controversial thought is to drop support for semi-sync replication. I think many users use semi-sync believing it does something

>> As a (kind of) user of semi-sync replication, I believe it has a valid, albeit limited, use-case and that it's a necessary component in setups where no transactions are allowed to be lost when the primary node in a replication cluster goes down. Perhaps I'm wrong or the way [...]

> Hi Markus, thanks for taking the time to comment! Your input is very valuable.

> I would like to be explicit about what it means "no transactions are allowed to be lost". I know you Markus fully understand what it means, of course.
> Transactions can easily be lost if the server crashes up to and during the commit. What it really means is that the server will send a notification to the client at some point when a single point of failure will no longer cause the transaction to be lost. With semi-sync, this notification comes in the form of the "ok" result of the client's commit.

> I want to understand if there are other, possibly better ways to get this notification, if that is all the relevant applications need?

> I was suggesting that the application could itself use MASTER_GTID_WAIT() against a slave before accepting the commit as "ok" (or a proxy like MaxScale could do it for the application). Does the current semi-sync replication do anything more for the application than this, and if so, what?
> One benefit of this method is that each commit can decide whether it needs to wait or not. One commit that "is not allowed to be lost" will not block other transactions from committing. I think with AFTER_SYNC, all following transactions will be blocked from committing until the current commit has been acknowledged by a slave, and that with AFTER_COMMIT they will not be blocked, but I'm not 100% sure.

I had a vague memory of the group commit mechanism doing only one ACK per group, but I might have remembered it wrong; I'm mostly a passive observer to all replication-related discussion in Zulip and MDEVs. If it indeed does one ACK per commit even when there's a group of transactions, then doing it at the application level might potentially perform better, as the waits could be done in parallel.

I think that implementing semi-sync in each application is probably a bit too much, but doing it in a proxy like MaxScale does sound doable, and the implementation would be essentially the same: delay the OK for the commit until at least one replica responds to the MASTER_GTID_WAIT(). The number of roundtrips should be the same, so the only downside of this approach is that you're forced to wait for the SQL thread to apply the transaction, which introduces more latency than the existing semi-sync approach does. If a function like MASTER_GTID_WAIT_FOR_IO_THREAD() were to exist, it would probably be very close in terms of latency.

Another use-case that I think I heard about was to use semi-sync replication to slow down the rate of writes so that replication lag is avoided. While this is possible, I believe that tuning the group commit size to be larger probably has the same effect with better overall performance.
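To make that concrete, the wait would look roughly like this (only a sketch; the GTID and the one-second timeout are made-up example values):

  -- On the master, from the connection that just committed:
  COMMIT;
  SELECT @@last_gtid;                         -- e.g. '0-1-12345'

  -- On a slave, before the proxy/application reports "ok" to the caller:
  SELECT MASTER_GTID_WAIT('0-1-12345', 1.0);  -- 0 = slave has applied it, -1 = timeout

Since MASTER_GTID_WAIT() waits on gtid_slave_pos, it only returns after the SQL thread has applied the transaction; the closest observable equivalent of the hypothetical MASTER_GTID_WAIT_FOR_IO_THREAD() today would be polling the Gtid_IO_Pos column of SHOW SLAVE STATUS, which only tells you what the IO thread has received.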
>> [...] misunderstanding comes from this. The default value of rpl_semi_sync_master_wait_point should be AFTER_SYNC (lossless failover) and rpl_semi_sync_master_timeout should be set to something [...]

> I would like to understand the reason(s) AFTER_SYNC is better than AFTER_COMMIT.

> From my understanding, from the client's narrow perspective about their own commit there is little difference, either is a notification that the transaction is now robust to single point of failure (available on at least two servers).
Yes, I think you're right, and from the point of view of the client the configuration is irrelevant: if you get the OK for the commit, the transaction is "durable" on more than one server.
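For reference, these are the settings under discussion; the values below are examples only, not recommendations:

  -- On the master:
  SET GLOBAL rpl_semi_sync_master_enabled    = ON;
  SET GLOBAL rpl_semi_sync_master_wait_point = 'AFTER_SYNC';  -- or 'AFTER_COMMIT'
  SET GLOBAL rpl_semi_sync_master_timeout    = 10000;         -- ms before falling back to async

  -- On each slave:
  SET GLOBAL rpl_semi_sync_slave_enabled = ON;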
> I know of one usecase, which is when things are set up so that if the master crashes, failover to a slave is _always_ done, and the crashed master is changed to be a slave of the new master (as opposed to letting the master restart, do crash recovery, and continue its operation as a master).

> With AFTER_COMMIT, the old master might have a transaction committed that does not exist on the new master, which will prevent it from working as a slave and it will need to be discarded (possibly restored from a backup).

> With AFTER_SYNC, the old master may still (after restarting) have a transaction committed to the binlog that is not on the slave / new master. But the old master can be restarted with --rpl-semi-sync-slave-enabled that tries to truncate the binlog to discard as many transactions from it as possible, to make sure it only has transactions that are also present on the new master.

> (Interestingly, this means that the purpose of AFTER_SYNC is to ensure that transactions _are_ lost, rather than ensure that they are _not_ lost).
> Is this the (only) reason that AFTER_SYNC should be default? Or do you know of other reasons to prefer it?

I think this is the only reason. I thought that it had a more fundamental effect on things, but I must've remembered it only in relation to failed masters rejoining the cluster. Due to MDEV-33465, I think that my initial thoughts on this are probably wrong and the default value probably isn't as important as I imagined it would be.

I think this is the use-case that MDEV-21117 and MDEV-33465 relate to. From what I remember (in relation to MDEV-33465), having the master roll back the transactions caused problems if a quick restart happened. I think it was that if GTID 0-1-123 gets replicated due to AFTER_SYNC but then the master crashes and comes back up, it rolls back 0-1-123 due to --rpl-semi-sync-slave-enabled (or --init-rpl-role=SLAVE after MDEV-33465), but before replication starts back up, another transaction gets committed as GTID 0-1-123 on the master. Now when the replica asks for the GTID position after 0-1-123, instead of getting an "I have not seen that GTID" error, replication continues and the history has effectively been rewritten. I don't remember if this was the exact problem, but it was something along these lines. Looking at the description of --init-rpl-role (https://mariadb.com/kb/en/mariadbd-options/#-init-rpl-role), it seems that it can also cause replication to break.
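To illustrate the hazard: after such a restart, the GTID state alone can look identical on both servers even though their histories have diverged. A sketch with made-up values:

  -- On the restarted old master (which rolled back 0-1-123 and then reused the GTID):
  SELECT @@gtid_binlog_pos;   -- '0-1-123', but naming a different transaction

  -- On the new master (which received the original 0-1-123 via semi-sync):
  SELECT @@gtid_binlog_pos;   -- also '0-1-123'

So comparing @@gtid_binlog_pos (or @@gtid_current_pos) is not enough to detect the divergence: the same GTID can now name two different transactions, which is exactly why replication silently continues instead of erroring out.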
> Now, with the new binlog implementation, there is no longer any AFTER_SYNC. The whole point of the feature is to make the binlog commit and the InnoDB commit atomic with each other as a whole, there is no point at which a transaction is durably committed in the binlog and not committed in InnoDB. So the truncation of the binlog at old master restart with --rpl-semi-sync-slave-enabled no longer applies.

> But I would argue that this binlog truncation is anyway a misfeature. If we want to ensure that the master never commits a transaction before it has been received by a slave, then send the transaction to the slave and await slave reply _before_ writing it to the binlog. Don't first write it to the binlog, and then add complex crash recovery code to try and remove it from the binlog again.

> And doing the semi-sync handshake _before_ writing the transaction to the binlog is something that could be implemented in the new binlog implementation. It would be something like BEFORE_WRITE, instead of AFTER_SYNC (which does not exist in the new binlog implementation).

> Thus, I really want to understand:

> 1. Is the --rpl-semi-sync-slave-enabled use case, where a crashing master is always demoted to a slave, used by users in practice, to warrant implementing something like BEFORE_WRITE semisync for the new binlog format?
From what I know and have seen, it is used when something fully automatic like MaxScale handles failovers and rejoining of nodes to the cluster. Without it, I think you would eventually have to start restoring nodes from backups once enough failovers have happened.

I think the bigger problem is that, until MDEV-34878 or something similar is implemented, there's no way for the crashed master to know what its role in the cluster is, as that depends on the other nodes in the cluster. If a failover did take place, the crashed master must come back as a slave and try to rejoin the cluster. If no failover took place, the crashed master must come back as a master and continue accepting writes. Since --init-rpl-role=MASTER cannot be set at runtime, the safest thing to do is to live with the consequences and accept the fact that you can't always rejoin the crashed master back into the cluster.
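For the "a failover did take place" branch, the rejoin itself is the usual GTID-based setup; the hostname and credentials below are placeholders:

  -- On the restarted old master, demoting it to a slave of the new master:
  CHANGE MASTER TO
    MASTER_HOST     = 'new-master.example.com',
    MASTER_USER     = 'repl',
    MASTER_PASSWORD = '...',
    MASTER_USE_GTID = slave_pos;
  START SLAVE;

Whether this is safe to run is exactly the open question: it assumes the old master's binlog no longer contains anything the new master has not seen.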
> 2. Is there another reason that AFTER_SYNC is useful that I should know, and which needs to be designed into the new binlog format?

> - Kristian.
--
Markus Mäkelä, Senior Software Engineer
MariaDB Corporation