Re: [Maria-developers] Semisync plugin incompatibility
Pavel Ivanov <pivanof@google.com> writes:
We've noticed recently that the semisync_master plugin in MariaDB (which apparently was fully inherited from MySQL) is seriously incompatible with our understanding of the purpose of semi-sync replication. This incompatibility was apparently introduced as a fix for http://bugs.mysql.com/bug.php?id=45672. The "major no-no" that bug
So as I understand it, this bug is about what should happen when semisync is enabled, but no slaves are connected. Apparently before the fix of Bug#45672, an error was thrown late during COMMIT. So the transaction was committed (locally on the master), but the client still got an error back. And if I understand correctly, after the fix of Bug#45672, no error is thrown in the case where no slave is connected.
talks about is in our opinion the whole purpose of semi-sync replication -- if a transaction is not replicated to at least one slave, the client shouldn't get OK even if the transaction is committed locally on the master. Also, the master shouldn't just turn off semi-sync replication whenever it wants.
So with "just turn off semi-sync replication whenever it wants" - what are you refering to here? I seem to remember that semisync has a timeout, and it gets disabled if that timeout triggers? My guess is that this is what you have in mind, but I wanted to ask to make sure ...
We will fix this problem for ourselves, but first I wanted to understand your view of the purpose of semi-sync replication and how you think it should work. I need to know your opinion to understand how I should fix this issue...
Well, personally, I never was much interested in semi-sync. But it is my understanding that there is some interest, so I will answer with what small opinion I have.

I suppose the general idea is that when client sees its COMMIT complete, it can know that its transaction exists in at least two places (master binlog + at least one slave relay log). So there is no longer any single point of failure that can cause loss of the transaction.

Another point of view is that semi-sync provides some sort of throttle on how fast the master can generate events compared to how fast the slaves can receive them: http://www.mysqlperformanceblog.com/2012/01/19/how-does-semisynchronous-mysq...

There was also a suggestion (and a patch is floating around somewhere) for "enhanced semisync replication": https://mariadb.atlassian.net/browse/MDEV-162 This delays not only client acknowledge but also InnoDB commit until the ack from at least one slave, which means that transactions are not visible to other clients until they exist on at least one slave in addition to on the master.

Since this is _semi_-sync, not real two-phase commit synchronous replication, the main problem is that there is no way to ensure consistency in the general error case. The transaction is already fully committed on the master, it cannot be rolled back. So we are left with the choice of one of two evils:

1. Report an error to the client. Most clients would then probably wrongly assume that the transaction was _not_ committed. There also does not seem to be much the client can do about the error except perhaps log an incident to the monitoring system. On the other hand, then at least the problem is not silently ignored.

2. Report success to the client but complain loudly in the error log (I assume this is what happens in current code). This leaves the client unaware that there is a problem (but presumably the monitoring system will catch the message in the error log).
From this summary, I think I can see the logic of the current behaviour:
- It preserves protection against single-point-of-failure. If all slaves are gone, then we already have one failure, and unless we experience a double failure (master also failing before slave recovers), the transaction will eventually be sent to a slave and no overall failure happens.

- If the client cannot do anything about the problem anyway except notify the monitoring system, the server may as well do the notification itself.

But the opposite point of view also has merit. The client asked for semi-sync behavior, but did not get it, and it does not even have a way to know about the problem. That is not good.

Does the client currently at least get a warning for the COMMIT? I think it should (e.g. the fix for Bug#45672 should at least have been to turn the error into a warning, not remove the error completely).

What I think could make sense is if the client got an error during the prepare phase if no slaves are connected. In this case we _can_ roll back the transaction and give an error to the client without any issue of consistency. But it still leaves a small window where the last slave can disappear between the prepare and the commit phase and leave us with the original problem.

I hope this helps you ... Maybe you can describe your use-case, and how you need things to work for that case? Personally I have nothing against changing this behaviour to something more logical, I am just not sure what the most logical behaviour is ...

 - Kristian.
On Mon, Nov 11, 2013 at 2:29 AM, Kristian Nielsen <knielsen@knielsen-hq.org> wrote:
We've noticed recently that the semisync_master plugin in MariaDB (which apparently was fully inherited from MySQL) is seriously incompatible with our understanding of the purpose of semi-sync replication. This incompatibility was apparently introduced as a fix for http://bugs.mysql.com/bug.php?id=45672. The "major no-no" that bug
So as I understand it, this bug is about what should happen when semisync is enabled, but no slaves are connected.
Apparently before the fix of Bug#45672, an error was thrown late during COMMIT. So the transaction was committed (locally on the master), but the client still got an error back.
And if I understand correctly, after the fix of Bug#45672, no error is thrown in the case where no slave is connected.
No error is thrown and semi_sync_master is turned off completely.
talks about is in our opinion the whole purpose of semi-sync replication -- if a transaction is not replicated to at least one slave, the client shouldn't get OK even if the transaction is committed locally on the master. Also, the master shouldn't just turn off semi-sync replication whenever it wants.
So with "just turn off semi-sync replication whenever it wants" - what are you refering to here? I seem to remember that semisync has a timeout, and it gets disabled if that timeout triggers? My guess is that this is what you have in mind, but I wanted to ask to make sure ...
Yes, that's what I was referring to.
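(For readers following along: the disable-on-timeout behaviour discussed here is driven by a couple of plugin variables. A minimal way to inspect them, assuming the stock semisync_master plugin is loaded; the variable names are the standard MySQL/MariaDB ones, shown only for orientation:)

    -- Milliseconds the master waits for a slave ack before giving up and
    -- silently reverting to asynchronous replication:
    SHOW GLOBAL VARIABLES LIKE 'rpl_semi_sync_master_timeout';
    -- Whether the master keeps waiting (up to the timeout) even when the
    -- count of connected semi-sync slaves drops to zero:
    SHOW GLOBAL VARIABLES LIKE 'rpl_semi_sync_master_wait_no_slave';
    -- Reports ON or OFF depending on whether semi-sync is currently active:
    SHOW GLOBAL STATUS LIKE 'Rpl_semi_sync_master_status';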
We will fix this problem for ourselves, but first I wanted to understand your view of the purpose of semi-sync replication and how you think it should work. I need to know your opinion to understand how I should fix this issue...
Well, personally, I never was much interested in semi-sync. But it is my understanding that there is some interest, so I will answer with what small opinion I have.
I suppose the general idea is that when client sees its COMMIT complete, it can know that its transaction exists in at least two places (master binlog + at least one slave relay log). So there is no longer any single point of failure that can cause loss of the transaction.
Another point of view is that semi-sync provides some sort of throttle on how fast the master can generate events compared to how fast the slaves can receive them:
http://www.mysqlperformanceblog.com/2012/01/19/how-does-semisynchronous-mysq...
There was also a suggestion (and a patch is floating around somewhere) for "enhanced semisync replication":
https://mariadb.atlassian.net/browse/MDEV-162
This delays not only client acknowledge but also InnoDB commit until the ack from at least one slave, which means that transactions are not visible to other clients until they exist on at least one slave in addition to on the master.
Since this is _semi_-sync, not real two-phase commit synchronous replication, the main problem is that there is no way to ensure consistency in the general error case. The transaction is already fully committed on the master, it cannot be rolled back. So we are left with the choice of one of two evils:
1. Report an error to the client. Most clients would then probably wrongly assume that the transaction was _not_ committed. There also does not seem to be much the client can do about the error except perhaps log an incident to the monitoring system. On the other hand, then at least the problem is not silently ignored.
Well, I'd say "wrongly assume" is not quite good wording here. When client sees error it must assume that transaction is not committed, and if by the time it reconnects a new master is already elected, client indeed will see that transaction is not committed. Of course I understand that this design is somewhat brittle because with a very small semi_sync_master_timeout client will basically see error on each transaction it makes. And he will be able to check with SELECT that transaction is committed, even without re-connecting to server. So the general assumption is that semi_sync_master_timeout is very big and client will see client-side timeout and loss of connection much earlier than that.
2. Report success to the client but complain loudly in the error log (I assume this is what happens in current code). This leaves the client unaware that there is a problem (but presumably the monitoring system will catch the message in the error log).
This not only leaves the client unaware of the problem, but also allows the server to accept transactions from clients at a very high rate when no slaves are present. And if the master machine then fails, all those accepted transactions will be permanently lost. So in the situation when the master doesn't have slaves, we want to slow down clients as much as possible, even though their transactions will be committed locally and they will be able to check with SELECTs that the transactions are actually committed.
From this summary, I think I can see the logic of the current behaviour:
- It preserves protection against single-point-of-failure. If all slaves are gone, then we already have one failure, and unless we experience a double failure (master also failing before slave recovers), the transaction will eventually be sent to a slave and no overall failure happens.
- If the client cannot do anything about the problem anyway except notify the monitoring system, the server may as well do the notification itself.
But the opposite point of view also has merit. The client asked for semi-sync behavior, but did not get it, and it does not even have a way to know about the problem. That is not good.
Does the client currently at least get a warning for the COMMIT? I think it should (e.g. the fix for Bug#45672 should at least have been to turn the error into a warning, not remove the error completely).
No, there's no warning. And on the server side there's only one line in the logs showing that semi-sync replication has been turned off, and nothing else after that for a long period of time during which transactions were accepted but no slave replicated them.
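(Side note: beyond that single log line, the only built-in visibility is the plugin's status counters. A sketch of what a monitoring job could poll, again assuming the stock semisync_master plugin:)

    SHOW GLOBAL STATUS LIKE 'Rpl_semi_sync_master_status';   -- ON while semi-sync is active
    SHOW GLOBAL STATUS LIKE 'Rpl_semi_sync_master_clients';  -- number of semi-sync slaves connected
    SHOW GLOBAL STATUS LIKE 'Rpl_semi_sync_master_yes_tx';   -- commits acknowledged by a slave
    SHOW GLOBAL STATUS LIKE 'Rpl_semi_sync_master_no_tx';    -- commits that got no ack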
What I think could make sense is if the client got an error during the prepare phase if no slaves are connected. In this case we _can_ roll back the transaction and give an error to the client without any issue of consistency. But it still leaves a small window where the last slave can disappear between the prepare and the commit phase and leave us with the original problem.
I hope this helps you ... Maybe you can describe your use-case, and how you need things to work for that case? Personally I have nothing against changing this behaviour to something more logical, I am just not sure what the most logical behaviour is ...
For our use case we want clients to always see an error when no slave acked the transaction. This basically allows us to have a general rule: "Clients can rely on the durability of only those transactions for which they received the "success" result." I.e. all transactions that were committed locally but didn't receive a semi-sync ack are OK to lose later, and that won't be a serious offense on MySQL's side. Of course "enhanced semi-sync replication" will help with this a lot and we'll be really happy to have it. But without it we at least don't want semi_sync_master to ever turn itself off.

So basically my question is: if I prepare a patch that will restore the original behavior of semi-sync replication (and remove the tests added for Bug#45672) will that be acceptable for MariaDB?

Thank you,
Pavel
Pavel Ivanov <pivanof@google.com> writes:
So basically my question is: if I prepare a patch that will restore the original behavior of semi-sync replication (and remove the tests added for Bug#45672) will that be acceptable for MariaDB?
I don't have anything against it, as I said I do not have much opinion on semi-sync one way or the other. But I would like to hear at least one other opinion (Serg maybe?). And I think you should write up a full description of how semi-sync should work with respect to error handling and disconnecting slaves, so that we have a complete, logical picture into which your patch fits.
For our use case we want clients to always see an error when no slave acked the transaction. This basically allows us to have a general rule: "Clients can rely on the durability of only those transactions for which they received the "success" result." I.e. all transactions that were committed locally but didn't receive a semi-sync ack are OK to lose later, and that won't be a serious offense on MySQL's side. Of course "enhanced semi-sync replication" will help with this a lot and we'll be really happy to have it. But without it we at least don't want semi_sync_master to ever turn itself off.
I agree that the fact that semi_sync turns itself off seems stupid. And it clearly would be highly desirable for the client to be able to know about the failure of semi-sync.

The problem here is that the transaction _is_ committed locally. If we return an error, we are confusing all existing applications that expect an error return from commit to mean that the transaction is guaranteed _not_ to be committed. Did you consider this issue, and possible different ways to solve your problem that would not have this issue?

For example:

- The client could receive a warning, rather than an error. The warning could be handled by those applications that are interested.

- The master could kill the client connection rather than return the error. This matches the normal ACID expectations: If commit returns ok then transaction is durable. If it returns error then transaction is not committed. If it does not return (connection lost), then it is unknown if the transaction is committed or not.

- The master could check during the prepare phase if any slaves are connected. If not, the transaction could be rolled back and a normal error returned to the client.

- The master could crash itself, causing promotion of a new master, which then could involve checking all replication servers to find the one that is most advanced.

- The master could truncate the current binlog file to before the offending transaction and roll back the InnoDB changes. Of course, since this is not true synchronous replication, this leaves the possibility that the transaction exists on a slave but not on the master.
This not only leaves the client unaware of the problem, but also allows the server to accept transactions from clients at a very high rate when no slaves are present. And if the master machine then fails, all those accepted transactions will be permanently lost. So in the situation when the master doesn't have slaves, we want to slow down clients as much as possible, even though their transactions will be committed locally and they will be able to check with SELECTs that the transactions are actually committed.
So you expect every application to implement error handling for every update, doing some SELECTs to check whether its transaction was committed or not? That sounds very specialised, surely not something to be expected in general. (But why even do such SELECTs? The client could just check the error code: if it is "semisync error" then the transaction is committed locally, else it is not.)

I still do not understand how the client will handle the error in your scenario. I think it would clarify things if you could explain this in detail. E.g. explain the original problem you are trying to solve, rather than your proposed solution.

 - Kristian.
Hi, Kristian!

On Nov 14, Kristian Nielsen wrote:
Pavel Ivanov <pivanof@google.com> writes:
So basically my question is: if I prepare a patch that will restore the original behavior of semi-sync replication (and remove the tests added for Bug#45672) will that be acceptable for MariaDB?
I don't have anything against it, as I said I do not have much opinion on semi-sync one way or the other.
But I would like to hear at least one other opinion (Serg maybe?)
I don't have it. I think semi-sync is a pretty fragile hack that simply cannot work correctly in the general case - at best it can work for someone in special use cases. Thus I don't care much what the subset of these cases is, and I don't have anything against your proposed change.

Regards,
Sergei
On Thu, Nov 14, 2013 at 10:44 AM, Sergei Golubchik <serg@mariadb.org> wrote:
Hi, Kristian!
On Nov 14, Kristian Nielsen wrote:
Pavel Ivanov <pivanof@google.com> writes:
So basically my question is: if I prepare a patch that will restore the original behavior of semi-sync replication (and remove the tests added for Bug#45672) will that be acceptable for MariaDB?
I don't have anything against it, as I said I do not have much opinion on semi-sync one way or the other.
But I would like to hear at least one other opinion (Serg maybe?)
I don't have it. I think semi-sync is a pretty fragile hack that simply cannot work correctly in the general case - at best it can work for someone in special use cases. Thus I don't care much what the subset of these cases is, and I don't have anything against your proposed change.
Do you think "smart semi-sync replication" (https://mariadb.atlassian.net/browse/MDEV-162) will be better and will work in general case? Pavel
Kristian,

Let me try to explain and maybe answer most of your questions.

Semi-sync replication for us is a DBA tool that helps to achieve durability of transactions in a world where MySQL doesn't do any flushes to disk. As you may guess, by removing disk flushes we can achieve a very high transaction throughput. Plus, if we accept the reality that disks can fail and repairing information from them is time-consuming and expensive (if at all possible), you realize that, flush or no flush, there's no durability if the disk fails, and thus disk flushes don't make much sense. So to get durability we use semi-sync. And the definition of "durability" in this case is "if the client gets OK on the transaction, it will find this data after that". And that should stand in case of any master failures and failovers.

If we set semi_sync_master_timeout = infinity, we get something that is very close to that kind of durability. Yes, there is a problem that while one connection is waiting for the semi-sync ack, another one can already see the data committed. And if the first client doesn't ever receive "ok" for the transaction, then we can consider it non-existent and we can safely "lose" it during failover. And that will confuse the second client a lot (the data it was seeing suddenly disappears). That's a trade-off we are ready to accept.

It looks like MySQL 5.7.2 already implements another way of doing semi-sync replication, where the transaction is not visible to other connections until it's semi-sync ack'ed (http://dev.mysql.com/doc/refman/5.7/en/server-system-variables.html#sysvar_r...). We will be happy to try that. But it has another trade-off that could be hard to accept sometimes -- InnoDB releases all row locks only when the semi-sync ack is received. And that could slow down inter-dependent transactions significantly.

So that's how we look at semi-sync replication. BTW, digging through some history I've realized that the semi-sync plugins in MariaDB look very close to how the semi-sync patch looked at Google in 2008. Apparently back then it was included in MySQL, but then it evolved here and the later changes didn't make it upstream.

Now to your questions.
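(Before the point-by-point answers, and purely for concreteness, a rough sketch of the kind of configuration described above -- relaxed local flushing, with durability delegated to the semi-sync ack. The specific values are illustrative assumptions, not something stated in the thread.)

    -- Don't fsync on every commit; a transaction lost on a crashed master is
    -- acceptable as long as the client never saw "OK" for it.
    SET GLOBAL innodb_flush_log_at_trx_commit = 2;
    SET GLOBAL sync_binlog = 0;
    -- Durability instead comes from requiring a slave ack before the client
    -- gets its COMMIT result back.
    SET GLOBAL rpl_semi_sync_master_enabled = 1;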
The problem here is that the transaction _is_ committed locally. If we return an error, we are confusing all existing applications that expect an error return from commit to mean that the transaction is guaranteed _not_ to be committed. Did you consider this issue, and possible different ways to solve your problem that would not have this issue?
For example:
- The client could receive a warning, rather than an error. The warning could be handled by those applications that are interested.
As I said above, semi-sync replication is a DBA tool, so it's not up to the application to be interested in it or not. It's up to DBAs to make sure that application developers don't get the feeling that they have lost some data. DBAs should be able to guarantee durability even if it comes with some constraints on usage.
- The master could kill the client connection rather than return the error. This matches the normal ACID expectations: If commit returns ok then transaction is durable. If it returns error then transaction is not committed. If it does not return (connection lost), then it is unknown if the transaction is committed or not.
I think this makes sense. And this is actually how we use semi-sync now -- we use it only with semi_sync_master_timeout = infinity, i.e. the connection either gets the semi-sync ack or gets killed (or hits a client-side timeout).
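(There is no literal "infinity" setting; in practice this would be approximated with a very large value, for example something like:)

    -- rpl_semi_sync_master_timeout is in milliseconds; a huge value means the
    -- COMMIT blocks until a slave acks or the connection is killed.
    SET GLOBAL rpl_semi_sync_master_timeout = 4294967295;  -- roughly 49 days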
- The master could check during the prepare phase if any slaves are connected. If not, the transaction could be rolled back and a normal error returned to the client.
This is racy and basically introduces complexity into the code without eliminating the situation where the transaction is committed but the client gets an error. So overall I'm not sure it is worth it.
- The master could crash itself, causing promotion of a new master, which then could involve checking all replication servers to find the one that is most advanced.
This is the scariest proposition of all. A deliberate crash in production can lead to longer than necessary periods of service unavailability.
- The master could truncate the current binlog file to before the offending transaction and roll back the InnoDB changes. Of course, since this is not true synchronous replication, this leaves the possibility that the transaction exists on a slave but not on the master.
This is actually what https://mariadb.atlassian.net/browse/MDEV-162 (and probably the MySQL 5.7.2 implementation) is about, right?

I hope our view of how semi-sync replication should work is clear to you now.

Pavel
On 2013-11-15 07:32, Pavel Ivanov wrote:
Semi-sync replication for us is a DBA tool that helps to achieve durability of transactions in a world where MySQL doesn't do any flushes to disk. As you may guess, by removing disk flushes we can achieve a very high transaction throughput. Plus, if we accept the reality that disks can fail and repairing information from them is time-consuming and expensive (if at all possible), you realize that, flush or no flush, there's no durability if the disk fails, and thus disk flushes don't make much sense. So to get durability we use semi-sync. And the definition of "durability" in this case is "if the client gets OK on the transaction, it will find this data after that". And that should stand in case of any master failures and failovers.
Hi Pavel,

Please pardon this arrogant interruption of your discussion and shameless self-promotion, but I just could not help noticing that Galera replication was designed specifically with these goals in mind. And it does seem to achieve them better than semi-sync plugin. Have you considered Galera? What makes you prefer semi-sync over Galera, if I may ask?

Kind regards,
Alex
--
Alexey Yurchenko, Codership Oy, www.codership.com
Skype: alexey.yurchenko, Phone: +358-400-516-011
On Fri, Nov 15, 2013 at 1:28 AM, Alex Yurchenko <alexey.yurchenko@codership.com> wrote:
Please pardon this arrogant interruption of your discussion and shameless self-promotion, but I just could not help noticing that Galera replication was designed specifically with these goals in mind. And it does seem to achieve them better than semi-sync plugin. Have you considered Galera? What makes you prefer semi-sync over Galera, if I may ask?
To be honest, I had never looked at how Galera works before. I've looked at it now and I don't see how it can fit with us. The major disadvantages I immediately see:

1. Synchronous replication. That means the client must wait while the transaction is applied on all nodes, which means unacceptably high latency for each transaction. And what if there's a network blip and some node becomes inaccessible? Will all writes just freeze? I see the statement that "failed nodes automatically excluded from the cluster", but to do that the cluster must wait for some timeout in case it's indeed a network blip and the node will "quickly" reconnect. And every client must wait for the cluster to decide what happened with that one node.

2. Let's say a node fell out of the cluster for 5 minutes and then reconnected. I guess it will be treated as a "new node", it will trigger a state transfer, and the node will start downloading the whole database? And while it's trying to download, say, 500GB of data files, all other nodes (or maybe just the donor?) won't be able to change those files locally and thus will blow up their memory consumption. That means they could quickly run out of memory and the "new node" won't be able to finish its "initialization"...

3. It looks like there's a strong asymmetry in starting cluster nodes -- the first one should be started with an empty wsrep_cluster_address and all others should be started with the address of the first node. So I can't start all nodes uniformly and then issue some commands to connect them to each other. That's bad.

4. What's the transition path? How do I upgrade MySQL/MariaDB replicating using the usual replication to Galera? It looks like there's no such path and the solution is to stop the world using regular replication and restart it using Galera. Sorry, I can't do that with our production systems.

I believe these problems are severe enough for us that we can't work with Galera.

Pavel
On 2013-11-15 19:34, Pavel Ivanov wrote:
On Fri, Nov 15, 2013 at 1:28 AM, Alex Yurchenko <alexey.yurchenko@codership.com> wrote:
Please pardon this arrogant interruption of your discussion and shameless self-promotion, but I just could not help noticing that Galera replication was designed specifically with these goals in mind. And it does seem to achieve them better than semi-sync plugin. Have you considered Galera? What makes you prefer semi-sync over Galera, if I may ask?
To be honest, I had never looked at how Galera works before. I've looked at it now and I don't see how it can fit with us. The major disadvantages I immediately see:

1. Synchronous replication. That means the client must wait while the transaction is applied on all nodes, which means unacceptably high latency for each transaction. And what if there's a network blip and some node becomes inaccessible? Will all writes just freeze? I see the statement that "failed nodes automatically excluded from the cluster", but to do that the cluster must wait for some timeout in case it's indeed a network blip and the node will "quickly" reconnect. And every client must wait for the cluster to decide what happened with that one node.

2. Let's say a node fell out of the cluster for 5 minutes and then reconnected. I guess it will be treated as a "new node", it will trigger a state transfer, and the node will start downloading the whole database? And while it's trying to download, say, 500GB of data files, all other nodes (or maybe just the donor?) won't be able to change those files locally and thus will blow up their memory consumption. That means they could quickly run out of memory and the "new node" won't be able to finish its "initialization"...

3. It looks like there's a strong asymmetry in starting cluster nodes -- the first one should be started with an empty wsrep_cluster_address and all others should be started with the address of the first node. So I can't start all nodes uniformly and then issue some commands to connect them to each other. That's bad.

4. What's the transition path? How do I upgrade MySQL/MariaDB replicating using the usual replication to Galera? It looks like there's no such path and the solution is to stop the world using regular replication and restart it using Galera. Sorry, I can't do that with our production systems.
I believe these problems are severe enough for us that we can't work with Galera.
Pavel, you seem to be terribly mistaken on almost all accounts:

1. *Replication* (i.e. data buffer copying) is indeed synchronous. But nobody said that commit is. What Galera does is very similar to semi-sync, except that it does it technically better. I would not dare to suggest Galera replication if I didn't believe it to be superior to semi-sync in every respect. As an example, here's an independent comparison of Galera vs. semi-sync performance: http://linsenraum.de/erkules/2011/06/momentum-galera.html. In fact, the majority of Galera users migrated from the regular *asynchronous* MySQL replication, which I think is a testimony to Galera's performance.

2. A node reconnecting to the cluster will normally receive only the events that it missed while being disconnected.

3. You are partially right about it, but is it really much different from regular MySQL replication, where you first need to set up the master and then connect the slaves (even if you have physically launched the servers at the same time)? Yet Galera nodes can be started simultaneously and then joined together by setting wsrep_cluster_address from a mysql client connection (see the sketch below). This is not the advertised method, because in that case the state snapshot transfer can be done only by mysqldump. If you set the address in advance, rsync or xtrabackup can be used to provision the fresh node.

4. Every Galera node can work perfectly as either a master or a slave in native MySQL replication. So the migration path is quite clear.

It is very sad that you happen to have such gross misconceptions about Galera. If those were true, how would MariaDB Galera Cluster get paying customers? Maybe my reply will convince you to have a second look at it.

(In addition to the above, Galera is fully multi-master, does parallel applying, and works great over WAN.)

Kind regards,
Alex
--
Alexey Yurchenko, Codership Oy, www.codership.com
Skype: alexey.yurchenko, Phone: +358-400-516-011
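(A small sketch of the join-at-runtime method mentioned in point 3 above; the node addresses are hypothetical, and this assumes a Galera-enabled build.)

    -- Each node is started standalone, with an empty wsrep_cluster_address
    -- (or 'gcomm://') in its config. Later, from a mysql client connection
    -- on a joining node:
    SET GLOBAL wsrep_cluster_address = 'gcomm://192.168.0.10,192.168.0.11';
    -- The node leaves its one-node cluster and joins the given one; in this
    -- scenario the state snapshot transfer is done via mysqldump.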