Pavel Ivanov <pivanof@google.com> writes:
We've noticed recently that the semisync_master plugin in MariaDB (which was apparently inherited wholesale from MySQL) is seriously incompatible with our understanding of the purpose of semi-sync replication. This incompatibility was apparently introduced as a fix for http://bugs.mysql.com/bug.php?id=45672. The "major no-no" that bug talks about is, in our opinion, the whole purpose of semi-sync replication: if a transaction is not replicated to at least one slave, the client shouldn't get OK, even if the transaction is committed locally on the master. Also, the master shouldn't just turn off semi-sync replication whenever it wants.
So, as I understand it, this bug is about what should happen when semisync is enabled but no slaves are connected. Apparently, before the fix for Bug#45672, an error was thrown late during COMMIT, so the transaction was committed (locally on the master) but the client still got an error back. And if I understand correctly, after the fix for Bug#45672 no error is thrown when no slave is connected.
So with "just turn off semi-sync replication whenever it wants", what are you referring to here? I seem to remember that semisync has a timeout, and that it gets disabled if that timeout triggers? My guess is that this is what you have in mind, but I wanted to ask to make sure...
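The timeout behaviour asked about here can be sketched as a toy model. This is an illustration under stated assumptions, not the plugin's code: `ToySemiSyncMaster`, `timeout_s` and `slave_caught_up()` are made-up names, and `timeout_s` merely plays the role of MySQL's `rpl_semi_sync_master_timeout` variable. The point it shows is that the client gets OK on every path, including after a timeout, which is what Pavel objects to.

```python
import threading


class ToySemiSyncMaster:
    """Hypothetical toy model of a semi-sync master; not MariaDB/MySQL code."""

    def __init__(self, timeout_s: float) -> None:
        self.timeout_s = timeout_s  # stands in for rpl_semi_sync_master_timeout
        self.semisync_on = True
        self.error_log = []

    def commit(self, ack: threading.Event) -> str:
        # Called after the transaction is already committed locally.
        if not self.semisync_on:
            return "OK (async)"  # no waiting at all while semi-sync is off
        if ack.wait(self.timeout_s):  # True if the slave ack arrived in time
            return "OK (acked)"
        # Timeout: fall back to async replication; the client still gets OK.
        self.semisync_on = False
        self.error_log.append("semi-sync ack timed out; switching to async")
        return "OK (async, after timeout)"

    def slave_caught_up(self) -> None:
        # Assumed behaviour: re-enable semi-sync once a slave catches up again.
        self.semisync_on = True


if __name__ == "__main__":
    master = ToySemiSyncMaster(timeout_s=0.05)
    acked = threading.Event()
    acked.set()
    print(master.commit(acked))              # OK (acked)
    print(master.commit(threading.Event()))  # OK (async, after timeout)
    print(master.commit(threading.Event()))  # OK (async)
```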
We will fix this problem for us, but first I wanted to understand your view of the purpose of semi-sync replication and how you think it should work. I need to know your opinion to understand how I should fix this issue...
Well, personally, I was never much interested in semi-sync. But it is my understanding that there is some interest, so I will answer with what small opinion I have.
I suppose the general idea is that when a client sees its COMMIT complete, it knows that its transaction exists in at least two places (the master binlog plus at least one slave relay log). So there is no longer any single point of failure that can cause loss of the transaction.
Another point of view is that semi-sync provides a sort of throttle on how fast the master can generate events compared to how fast the slaves can receive them:
http://www.mysqlperformanceblog.com/2012/01/19/how-does-semisynchronous-mysq...
There was also a suggestion (and a patch is floating around somewhere) for "enhanced semisync replication":
https://mariadb.atlassian.net/browse/MDEV-162
This delays not only the client acknowledgement but also the InnoDB commit until the ack from at least one slave arrives, which means that transactions are not visible to other clients until they exist on at least one slave in addition to the master.
Since this is _semi_-sync, not real two-phase-commit synchronous replication, the main problem is that there is no way to ensure consistency in the general error case. The transaction is already fully committed on the master; it cannot be rolled back. So we are left with the choice of one of two evils:
1. Report an error to the client. Most clients would then probably wrongly assume that the transaction was _not_ committed. There also does not seem to be much the client can do about the error except perhaps log an incident to the monitoring system. On the other hand, at least the problem is not silently ignored.
2. Report success to the client but complain loudly in the error log (I assume this is what happens in the current code). This leaves the client unaware that there is a problem (but presumably the monitoring system will catch the message in the error log).
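The "two evils" can be made concrete with a small sketch of the last step of a semi-sync COMMIT, after the local commit has already happened. All names here (`Policy`, `CommitOutcome`, `finish_commit`) are hypothetical and only illustrate the choice; they do not correspond to functions in the server.

```python
from dataclasses import dataclass, field
from enum import Enum


class Policy(Enum):
    REPORT_ERROR = "error"     # evil 1: tell the client something went wrong
    LOG_AND_SUCCEED = "log"    # evil 2: return OK, complain in the error log


@dataclass
class CommitOutcome:
    committed: bool            # always True here: the local commit is final
    client_status: str
    error_log: list = field(default_factory=list)


def finish_commit(slave_acked: bool, policy: Policy) -> CommitOutcome:
    """Hypothetical last step of a semi-sync COMMIT, after the local commit."""
    if slave_acked:
        return CommitOutcome(True, "OK")
    if policy is Policy.REPORT_ERROR:
        # The client may wrongly conclude the transaction was not committed.
        return CommitOutcome(True, "ERROR: no slave acknowledged the transaction")
    # The client sees OK; only the error log records the missing ack.
    return CommitOutcome(True, "OK", ["no slave acknowledged the transaction"])
```

In every branch `committed` is True; the two policies only differ in what the client is told, which is why neither option is fully satisfactory.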
From this summary, I think I can see the logic of the current behaviour:
- It preserves protection against a single point of failure. If all slaves are gone, then we already have one failure, and unless we experience a double failure (the master also failing before a slave recovers), the transaction will eventually be sent to a slave and no overall failure happens.
- If the client cannot do anything about the problem anyway except notify the monitoring system, the server may as well do the notification itself.
But the opposite point of view also has merit. The client asked for semi-sync behaviour but did not get it, and it does not even have a way to know about the problem. That is not good. Does the client currently at least get a warning for the COMMIT? I think it should (e.g. the fix for Bug#45672 should at least have turned the error into a warning, not removed the error completely).
What I think could make sense is if the client got an error during the prepare phase if no slaves are connected. In that case we _can_ roll back the transaction and give an error to the client without any consistency issue. But it still leaves a small window where the last slave can disappear between the prepare and the commit phase, leaving us with the original problem.
I hope this helps you... Maybe you can describe your use case, and how you need things to work for that case? Personally I have nothing against changing this behaviour to something more logical; I am just not sure what the most logical behaviour is...
- Kristian.
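The prepare-phase check suggested above, and the window it still leaves open, can be illustrated with a toy model. Everything in it (`ToyMaster`, `connected_slaves`, the `between_phases` hook, `last_slave_leaves`) is hypothetical and does not reflect how the server code is structured; as a simplification it assumes any slave still connected at commit time will acknowledge.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ToyMaster:
    """Hypothetical toy model; not MariaDB/MySQL code."""
    connected_slaves: int

    def prepare_ok(self) -> bool:
        # Nothing is durable yet, so refusing here is safe: roll back and error.
        return self.connected_slaves > 0

    def commit(self, between_phases: Optional[Callable[["ToyMaster"], None]] = None) -> str:
        if not self.prepare_ok():
            return "ERROR: no semi-sync slave connected; transaction rolled back"
        if between_phases is not None:
            between_phases(self)  # e.g. the last slave disconnects right here
        # From this point the local commit is final and cannot be rolled back.
        # Simplification: any still-connected slave is assumed to acknowledge.
        if self.connected_slaves == 0:
            return "COMMITTED locally, but no slave acknowledged (original dilemma)"
        return "OK"


def last_slave_leaves(master: ToyMaster) -> None:
    # Hypothetical hook simulating the race between prepare and commit.
    master.connected_slaves = 0
```

`ToyMaster(0).commit()` fails cleanly at prepare, while `ToyMaster(1).commit(last_slave_leaves)` lands back in the original dilemma, which is the small window described above.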