Re: [Maria-developers] MariaDB Galera replication
I'm starting a new thread as this already doesn't have anything to do with the original topic. On Fri, Nov 15, 2013 at 10:46 AM, Alex Yurchenko <alexey.yurchenko@codership.com> wrote:
Please pardon this arrogant interruption of your discussion and shameless self-promotion, but I just could not help noticing that Galera replication was designed specifically with these goals in mind. And it does seem to achieve them better than semi-sync plugin. Have you considered Galera? What makes you prefer semi-sync over Galera, if I may ask?
To be honest I never looked at how Galera works before. I've looked at it now and I don't see how it can fit with us. The major disadvantages I immediately see:
1. Synchronous replication. That means the client must wait while the transaction is applied on all nodes, which adds unacceptably high latency to each transaction. And what if there's a network blip and some node becomes inaccessible? Will all writes just freeze? I see the statement that "failed nodes automatically excluded from the cluster", but to do that the cluster must wait for some timeout in case it's indeed a network blip and the node will "quickly" reconnect. And every client must wait for the cluster to decide what happened with that one node.
2. Let's say a node fell out of the cluster for 5 minutes and then reconnected. I guess it will be treated as a "new node", it will trigger a state transfer, and the node will start downloading the whole database? And while it's trying to download, say, 500GB of data files, all other nodes (or maybe just the donor?) won't be able to change those files locally and thus will blow up their memory consumption. That means they could quickly run out of memory and the "new node" won't be able to finish its "initialization"...
3. It looks like there's a strong asymmetry in starting cluster nodes -- the first one should be started with an empty wsrep_cluster_address and all others should be started with the address of the first node. So I can't start all nodes uniformly and then issue some commands to connect them to each other. That's bad.
4. What's the transition path? How do I upgrade MySQL/MariaDB replicating via the usual replication to Galera? It looks like there's no such path, and the solution is to stop the world using regular replication and restart it using Galera. Sorry, I can't do that with our production systems.
I believe these problems are severe enough for us that we can't work with Galera.
Pavel, you seem to be terribly mistaken on almost all accounts:
1. *Replication* (i.e. data buffer copying) is indeed synchronous. But nobody said that commit is. What Galera does is very similar to semi-sync, except that it does it technically better. I would not dare to suggest Galera replication if I didn't believe it to be superior to semi-sync in every respect.
Well, apparently we have a different understanding of what the term "synchronous replication" means. This term is all over the Galera docs, but I didn't find a detailed description of how Galera replication actually works. So I assumed that my understanding of the term (which actually seems to be in line with the definitions at http://en.wikipedia.org/wiki/Replication_(computing) ) is what was implied there. So I hope you'll be able to describe in detail how Galera replication works.
As an example here's an independent comparison of Galera vs. semi-sync performance: http://linsenraum.de/erkules/2011/06/momentum-galera.html.
This is a nice blog post written in German and posted in 2011. And while Google Translate gave me an idea of what the post was about, it would be nice to see something more recent and with a better description of the actual testing setup.
In fact, the majority of Galera users migrated from regular *asynchronous* MySQL replication, which I think is a testimony to Galera's performance.
I don't mean to troll, but this could also mean that everyone who migrated didn't care much about performance and Galera's performance was within sane boundaries... BTW, I just found this at https://mariadb.com/kb/en/mariadb-galera-cluster-known-limitations/ : "by design performance of the cluster cannot be higher than performance of the slowest node; however, even if you have only one node, its performance can be considerably lower comparing to running the same server in a standalone mode". That contradicts your words.
2. A node reconnecting to the cluster will normally receive only the events that it missed while being disconnected.
This seems to contradict the docs. Again from https://mariadb.com/kb/en/mariadb-galera-cluster-known-limitations/ : "After a temporary split, if the 'good' part of the cluster was still reachable and its state was modified, resynchronization occurs".
3. You are partially right about that, but is it much different from regular MySQL replication, where you first need to set up the master and then connect the slaves (even if you have physically launched the servers at the same time)?
Setting up a master and then connecting slaves consists mostly of executing CHANGE MASTER TO and then START SLAVE on all slaves, after all MySQL instances (including the master) were started with the same set of command line flags. This is fundamentally different from starting instances with different arguments, especially when those arguments depend on whether the replica is starting first or some other replica is already running.
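As a sketch of the workflow described above (host names, user, and binlog coordinates are placeholders, not values from this thread):

```shell
# All instances were started with identical command line flags.
# Replication topology is then established purely via client commands:
mysql -h replica1 -e "CHANGE MASTER TO MASTER_HOST='master1', \
  MASTER_USER='repl', MASTER_PASSWORD='secret', \
  MASTER_LOG_FILE='mysql-bin.000001', MASTER_LOG_POS=4;"
mysql -h replica1 -e "START SLAVE;"
```

The point being argued: the role of each server is decided at runtime by SQL statements, not by per-server startup arguments.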
Yet, Galera nodes can be started simultaneously and then joined together by setting wsrep_cluster_address from a mysql client connection. This is not an advertised method, because in that case the state snapshot transfer can only be done by mysqldump. If you set the address in advance, rsync or xtrabackup can be used to provision the fresh node.
This is of course better, because I can start all instances with the same command line arguments. But transferring a snapshot of a very big database using mysqldump, and making the node that creates the dump blow up its memory consumption in the process, is still a big problem.
4. Every Galera node can perfectly work as either master or slave to native MySQL replication. So migration path is quite clear.
Nope, not clear yet. So I'll be able to upgrade all my MySQL instances to a Galera-supporting binary while they are replicating using standard MySQL replication. That's good. Now, how is Galera replication turned on after that? What will happen if I just set wsrep_cluster_address on all replicas? What will the replicas do, and what will happen to the standard MySQL replication?
It is very sad that you happen to have such gross misconceptions about Galera. If those were true, how would MariaDB Galera Cluster get paying customers?
Care to share some numbers? Like, what's the rough number of those paying customers? What size is the biggest installation -- number of clusters, replicas, highest QPS load? I'm not asking you to share any confidential information, but a rough ballpark would be helpful.
Maybe my reply will convince you to take a second look at it. (In addition to the above, Galera is fully multi-master, does parallel applying, and works great over WAN.)
I hope your explanation of how Galera replication works will help me understand how well it works over WAN and how you could make full multi-master work without fully synchronous replication in my understanding of that term. Pavel
On 2013-11-15 23:59, Pavel Ivanov wrote:
I'm starting a new thread as this already doesn't have anything to do with the original topic.
Fair enough.
On Fri, Nov 15, 2013 at 10:46 AM, Alex Yurchenko <alexey.yurchenko@codership.com> wrote:
Please pardon this arrogant interruption of your discussion and shameless self-promotion, but I just could not help noticing that Galera replication was designed specifically with these goals in mind. And it does seem to achieve them better than semi-sync plugin. Have you considered Galera? What makes you prefer semi-sync over Galera, if I may ask?
To be honest I never looked at how Galera works before. I've looked at it now and I don't see how it can fit with us. The major disadvantages I immediately see:
1. Synchronous replication. That means the client must wait while the transaction is applied on all nodes, which adds unacceptably high latency to each transaction. And what if there's a network blip and some node becomes inaccessible? Will all writes just freeze? I see the statement that "failed nodes automatically excluded from the cluster", but to do that the cluster must wait for some timeout in case it's indeed a network blip and the node will "quickly" reconnect. And every client must wait for the cluster to decide what happened with that one node.
2. Let's say a node fell out of the cluster for 5 minutes and then reconnected. I guess it will be treated as a "new node", it will trigger a state transfer, and the node will start downloading the whole database? And while it's trying to download, say, 500GB of data files, all other nodes (or maybe just the donor?) won't be able to change those files locally and thus will blow up their memory consumption. That means they could quickly run out of memory and the "new node" won't be able to finish its "initialization"...
3. It looks like there's a strong asymmetry in starting cluster nodes -- the first one should be started with an empty wsrep_cluster_address and all others should be started with the address of the first node. So I can't start all nodes uniformly and then issue some commands to connect them to each other. That's bad.
4. What's the transition path? How do I upgrade MySQL/MariaDB replicating via the usual replication to Galera? It looks like there's no such path, and the solution is to stop the world using regular replication and restart it using Galera. Sorry, I can't do that with our production systems.
I believe these problems are severe enough for us that we can't work with Galera.
Pavel, you seem to be terribly mistaken on almost all accounts:
1. *Replication* (i.e. data buffer copying) is indeed synchronous. But nobody said that commit is. What Galera does is very similar to semi-sync, except that it does it technically better. I would not dare to suggest Galera replication if I didn't believe it to be superior to semi-sync in every respect.
Well, apparently we have a different understanding of what the term "synchronous replication" means. This term is all over the Galera docs, but I didn't find a detailed description of how Galera replication actually works. So I assumed that my understanding of the term (which actually seems to be in line with the definitions at http://en.wikipedia.org/wiki/Replication_(computing) ) is what was implied there. So I hope you'll be able to describe in detail how Galera replication works.
There can be much detail ;) I'll start with this:
1) During transaction execution Galera records the unique keys of the rows modified or referenced (foreign keys) by the transaction.
2) At prepare time it takes the keys and binlog events from the thread IO cache and wraps them into a "writeset".
3) The writeset is synchronously copied to all nodes. This is the only synchronous operation, and it can be done either over TCP or multicast UDP. All nodes, including the sender, receive writesets in exactly the same order, which defines the sequence number part of the GTID. The writeset is placed in the receive queue for further processing.
4) The writeset is picked from the queue and (in seqno order) passed through the certification algorithm, which determines whether the writeset can be applied or not and also which writesets it can be applied in parallel with.
5) If the certification verdict is positive, the master commits the transaction and sends OK to the client; a slave applies and commits the binlog events from the writeset.
6) If the certification verdict is negative, the master rolls back the transaction and sends a deadlock error to the client; a slave just discards the writeset.
In the end the transaction is either committed on all nodes (except those that fail) or on none at all. Here is a picture of the process: http://www.codership.com/wiki/doku.php?id=certification. The certification algorithm itself was proposed by Fernando Pedone in his PhD thesis. The idea is that global event ordering allows us to make consistent decisions without the need for additional communication. Note that if only one node in the cluster accepts writes, certification will always be positive.
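[Editor's sketch] The certification step described above can be illustrated with a toy model (my own illustration, not Codership code): each node keeps the key sets of recently certified writesets and rejects a new writeset whose keys intersect those of a concurrent, already-certified one.

```python
class Writeset:
    def __init__(self, seqno, last_seen, keys):
        self.seqno = seqno          # position in the global total order
        self.last_seen = last_seen  # last seqno applied when the txn started
        self.keys = frozenset(keys) # unique keys of rows touched

def certify(history, ws):
    """Deterministic conflict check: fail if any certified writeset that
    committed after ws's snapshot (ws.last_seen < w.seqno < ws.seqno)
    touches the same keys."""
    for w in history:
        if ws.last_seen < w.seqno < ws.seqno and ws.keys & w.keys:
            return False  # conflict: rollback on master, discard on slave
    history.append(ws)
    return True           # commit on every node

history = []
assert certify(history, Writeset(1, 0, {"t1:a"})) is True
assert certify(history, Writeset(2, 0, {"t1:a"})) is False  # concurrent, same key
assert certify(history, Writeset(3, 1, {"t1:a"})) is True   # started after seqno 1
```

Because every input to `certify` comes from the totally ordered writeset stream, the verdict requires no further communication between nodes.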
As an example here's an independent comparison of Galera vs. semi-sync performance: http://linsenraum.de/erkules/2011/06/momentum-galera.html.
This is a nice blog post written in German and posted in 2011. And
You don't seriously expect that something has changed in that department since then, do you? ;)
while Google Translate gave me an idea of what the post was about, it would be nice to see something more recent and with a better description of the actual testing setup.
Sure thing, but who will bother? However, here's something from 2012 and in English - but no pictures: http://www.mysqlperformanceblog.com/2012/06/14/comparing-percona-xtradb-clus... Being a WAN test it may not be directly relevant to your case, but it kinda shows that Galera replication is more efficient than semi-sync in WAN, and is likely to also be more efficient in LAN. In fact, given that semi-sync replicates one transaction at a time, it is hard to be less efficient than semi-sync. Only through deliberate sabotage.
In fact, the majority of Galera users migrated from regular *asynchronous* MySQL replication, which I think is a testimony to Galera's performance.
I don't mean to troll, but this can also mean that everyone who migrated didn't care much about performance and Galera's performance was within sane boundaries...
BTW, just found here https://mariadb.com/kb/en/mariadb-galera-cluster-known-limitations/ : "by design performance of the cluster cannot be higher than performance of the slowest node; however, even if you have only one node, its performance can be considerably lower comparing to running the same server in a standalone mode". That contradicts your words.
Replication has its overhead, and it is not inconceivable to create a load where that overhead dominates. Still, I doubt that it will be higher than that of a standalone server WITH BINLOG ENABLED. At least with real-life loads.
2. A node reconnecting to the cluster will normally receive only the events that it missed while being disconnected.
This seems to contradict the docs. Again from https://mariadb.com/kb/en/mariadb-galera-cluster-known-limitations/ : "After a temporary split, if the 'good' part of the cluster was still reachable and its state was modified, resynchronization occurs".
Yes, but it does not specify the sort of synchronization - whether it is a full state snapshot transfer or merely a catch-up with the missing transactions. Depending on the circumstances, either can occur.
3. You are partially right about that, but is it much different from regular MySQL replication, where you first need to set up the master and then connect the slaves (even if you have physically launched the servers at the same time)?
Setting up a master and then connecting slaves consists mostly of executing CHANGE MASTER TO and then START SLAVE on all slaves, after all MySQL instances (including the master) were started with the same set of command line flags. This is fundamentally different from starting instances with different arguments, especially when those arguments depend on whether the replica is starting first or some other replica is already running.
It looks like either way you have to treat master and slaves differently. However, with modern Galera this difference simply boils down to:
- you start the first node of a cluster with service mysql start --wsrep-new-cluster
- you start all other nodes with just service mysql start.
(wsrep_cluster_address can be the same on all nodes)
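[Editor's sketch] In concrete terms (the init wrapper, config path, and node names below are placeholders; they vary by distribution):

```shell
# Shared config, identical on every node (e.g. in /etc/mysql/my.cnf):
#   [mysqld]
#   wsrep_cluster_address=gcomm://node1,node2,node3

# First node bootstraps a new cluster:
service mysql start --wsrep-new-cluster

# Every other node uses the plain command; the shared
# wsrep_cluster_address tells it whom to join:
service mysql start
```

So the only per-node difference is one extra flag on the bootstrap node, not a different wsrep_cluster_address.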
Yet, Galera nodes can be started simultaneously and then joined together by setting wsrep_cluster_address from a mysql client connection. This is not an advertised method, because in that case the state snapshot transfer can only be done by mysqldump. If you set the address in advance, rsync or xtrabackup can be used to provision the fresh node.
This is of course better, because I can start all instances with the same command line arguments. But transferring a snapshot of a very big database using mysqldump, and making the node that creates the dump blow up its memory consumption in the process, is still a big problem.
How would you do this with semi-sync? Restore from backup and replay missing events? Well, you can do the same with Galera.
4. Every Galera node can perfectly work as either master or slave to native MySQL replication. So migration path is quite clear.
Nope, not clear yet. So I'll be able to upgrade all my MySQL instances to a Galera-supporting binary while they are replicating using standard MySQL replication. That's good. Now, how is Galera replication turned on after that? What will happen if I just set wsrep_cluster_address on all replicas? What will the replicas do, and what will happen to the standard MySQL replication?
Ok, I was clearly too brief there.
1) You shut down the first slave, upgrade the software, add the required configuration, restart it as a single-node cluster, and connect it back to the master as a regular slave.
2) For the rest of the slaves: shut down the slave, upgrade the software, add the required configuration, and join it to the Galera cluster. The Galera cluster functions as a single collective slave now, with only Galera replication between the nodes. Depending on how meticulous you are, you can avoid a full state snapshot if you take care to note the offset (in the number of transactions) between the moments the first node and this node were shut down. Then you can forge the Galera GTID corresponding to this node's position and just replay the missing transactions cached by the first node (make sure it is specified in wsrep_sst_donor). If the node does not know its Galera GTID, then obviously it needs a full SST.
3) When all nodes are converted, perform master failover to one of the Galera nodes like you'd normally do. Now you can stop the remaining slave.
4) Convert the former master as per 2).
If this looks dense, a quick Google search gives: http://www.severalnines.com/blog/field-live-migration-mmm-mariadb-galera-clu... https://github.com/percona/xtradb-cluster-tutorial/blob/master/instructions/...
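[Editor's sketch] The per-slave conversion in step 2 might look roughly like this (all names, paths, and the donor address are placeholders, not values from this thread):

```shell
# On each remaining slave, one at a time:
mysql -e "STOP SLAVE;"
service mysql stop

# ... upgrade packages to a Galera-supporting build here ...

# Add the wsrep configuration (library path varies by platform):
cat >> /etc/mysql/my.cnf <<'EOF'
[mysqld]
wsrep_provider=/usr/lib/galera/libgalera_smm.so
wsrep_cluster_address=gcomm://first-node
wsrep_sst_donor=first-node
EOF

# On restart the node joins the cluster and receives a state
# transfer (full SST, or IST if its Galera GTID is known):
service mysql start
```

Meanwhile the first converted node keeps replicating from the old master, so the cluster as a whole acts as one collective slave during the migration.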
It is very sad that you happen to have such gross misconceptions about Galera. If those were true, how would MariaDB Galera Cluster get paying customers?
Care to share some numbers? Like, what's the rough number of those paying customers? What size is the biggest installation -- number of clusters, replicas, highest QPS load? I'm not asking you to share any confidential information, but a rough ballpark would be helpful.
Unfortunately I'm not at liberty to discuss paying customers, especially given that many of them are customers of our partners, and I myself am not privy to the details. The point of that remark was that we are making a living, and it would be very hard to make a living on something that is no better than MySQL semi-sync, especially given the quality of our marketing materials ;) Some public material is available at our site: http://www.codership.com/user-stories. However it mostly contains no hard numbers.
Maybe my reply will convince you to take a second look at it. (In addition to the above, Galera is fully multi-master, does parallel applying, and works great over WAN.)
I hope your explanation of how Galera replication works will help me understand how well it works over WAN and how you could make full multi-master work without fully synchronous replication in my understanding of that term.
Pavel
-- Alexey Yurchenko, Codership Oy, www.codership.com Skype: alexey.yurchenko, Phone: +358-400-516-011
On Fri, Nov 15, 2013 at 5:55 PM, Alex Yurchenko <alexey.yurchenko@codership.com> wrote:
To be honest I never looked at how Galera works before. I've looked at it now and I don't see how it can fit with us. The major disadvantages I immediately see:
1. Synchronous replication. That means the client must wait while the transaction is applied on all nodes, which adds unacceptably high latency to each transaction. And what if there's a network blip and some node becomes inaccessible? Will all writes just freeze? I see the statement that "failed nodes automatically excluded from the cluster", but to do that the cluster must wait for some timeout in case it's indeed a network blip and the node will "quickly" reconnect. And every client must wait for the cluster to decide what happened with that one node.
2. Let's say a node fell out of the cluster for 5 minutes and then reconnected. I guess it will be treated as a "new node", it will trigger a state transfer, and the node will start downloading the whole database? And while it's trying to download, say, 500GB of data files, all other nodes (or maybe just the donor?) won't be able to change those files locally and thus will blow up their memory consumption. That means they could quickly run out of memory and the "new node" won't be able to finish its "initialization"...
3. It looks like there's a strong asymmetry in starting cluster nodes -- the first one should be started with an empty wsrep_cluster_address and all others should be started with the address of the first node. So I can't start all nodes uniformly and then issue some commands to connect them to each other. That's bad.
4. What's the transition path? How do I upgrade MySQL/MariaDB replicating via the usual replication to Galera? It looks like there's no such path, and the solution is to stop the world using regular replication and restart it using Galera. Sorry, I can't do that with our production systems.
I believe these problems are severe enough for us that we can't work with Galera.
Pavel, you seem to be terribly mistaken on almost all accounts:
1. *Replication* (i.e. data buffer copying) is indeed synchronous. But nobody said that commit is. What Galera does is very similar to semi-sync, except that it does it technically better. I would not dare to suggest Galera replication if I didn't believe it to be superior to semi-sync in every respect.
Well, apparently we have a different understanding of what the term "synchronous replication" means. This term is all over the Galera docs, but I didn't find a detailed description of how Galera replication actually works. So I assumed that my understanding of the term (which actually seems to be in line with the definitions at http://en.wikipedia.org/wiki/Replication_(computing) ) is what was implied there. So I hope you'll be able to describe in detail how Galera replication works.
There can be much detail ;) I'll start with this:
1) During transaction execution Galera records the unique keys of the rows modified or referenced (foreign keys) by the transaction.
2) At prepare time it takes the keys and binlog events from the thread IO cache and wraps them into a "writeset".
3) The writeset is synchronously copied to all nodes. This is the only synchronous operation, and it can be done either over TCP or multicast UDP. All nodes, including the sender, receive writesets in exactly the same order, which defines the sequence number part of the GTID. The writeset is placed in the receive queue for further processing.
4) The writeset is picked from the queue and (in seqno order) passed through the certification algorithm, which determines whether the writeset can be applied or not and also which writesets it can be applied in parallel with.
5) If the certification verdict is positive, the master commits the transaction and sends OK to the client; a slave applies and commits the binlog events from the writeset.
6) If the certification verdict is negative, the master rolls back the transaction and sends a deadlock error to the client; a slave just discards the writeset.
In the end the transaction is either committed on all nodes (except those that fail) or on none at all.
Here is a picture of the process: http://www.codership.com/wiki/doku.php?id=certification. The certification algorithm itself was proposed by Fernando Pedone in his PhD thesis. The idea is that global event ordering allows us to make consistent decisions without the need for additional communication.
Note that if only one node in the cluster accepts writes, certification will always be positive.
So the picture seems to suggest that certification happens on each server independently. I don't know how you make sure that the result of certification is the same on each server (it would be nice to know that). But anyway, it looks like you need at least one roundtrip to each node to deliver the writeset and make sure it's delivered. And I guess a single misbehaving node will freeze all transactions until that node is excluded from the cluster. Is that correct?
As an example here's an independent comparison of Galera vs. semi-sync performance: http://linsenraum.de/erkules/2011/06/momentum-galera.html.
This is a nice blog post written in German and posted in 2011. And
You don't seriously expect that something has changed in that department since then, do you? ;)
while Google Translate gave me an idea of what the post was about, it would be nice to see something more recent and with a better description of the actual testing setup.
Sure thing, but who will bother?
Are you serious with these questions? So you are telling me "cluster is much better than semi-sync", I'm asking you "give me the proof", and you answer me "who bothers to have a proof"? And you want me to treat your claims seriously?
However here's something from 2012 and in English - but no pictures: http://www.mysqlperformanceblog.com/2012/06/14/comparing-percona-xtradb-clus...
This is really ridiculous testing with really ridiculous conclusions. What kind of comparison is it if you are testing a 6-replica Percona Cluster against a 2-replica semi-sync setup? Disabling log_bin and innodb_support_xa on Percona Cluster is also very nice -- how will you recover from server crashes? And where will nodes take the last events from after a network disconnection? "I ignored quorum arbitration" also doesn't sound promising, even though I don't know what it is.
Being a WAN test it may not be directly relevant to your case, but it kinda shows that Galera replication is more efficient than semi-sync in WAN, and is likely to also be more efficient in LAN. In fact, given that semi-sync replicates one transaction at a time, it is hard to be less efficient than semi-sync. Only through deliberate sabotage.
Well, sure, as long as your only definition of "efficiency" is something like 32-threaded sysbench results. But how about single-threaded sysbench results, i.e. the average transaction latency with a single-threaded client? And how about another killer case: what is the maximum number of updates per second that you can make to a single row? When you talk about efficiency you need to talk about a wide range of different use cases.
2. Node reconnecting to cluster will normally receive only events that it missed while being disconnected.
This seems to contradict the docs. Again from https://mariadb.com/kb/en/mariadb-galera-cluster-known-limitations/ : "After a temporary split, if the 'good' part of the cluster was still reachable and its state was modified, resynchronization occurs".
Yes, but it does not specify the sort of synchronization - whether it is a full state snapshot transfer or merely a catch-up with the missing transactions. Depending on the circumstances, either can occur.
It would be nice to see what algorithm is used to choose which kind of synchronization is necessary to do.
Yet, Galera nodes can be started simultaneously and then joined together by setting wsrep_cluster_address from a mysql client connection. This is not an advertised method, because in that case the state snapshot transfer can only be done by mysqldump. If you set the address in advance, rsync or xtrabackup can be used to provision the fresh node.
This is of course better, because I can start all instances with the same command line arguments. But transferring a snapshot of a very big database using mysqldump, and making the node that creates the dump blow up its memory consumption in the process, is still a big problem.
How would you do this with semi-sync? Restore from backup and replay missing events? Well, you can do the same with Galera.
I'm sorry, but this is not mentioned anywhere in the docs. So I don't know what Galera allows one to do in this case.
4. Every Galera node can perfectly work as either master or slave to native MySQL replication. So migration path is quite clear.
Nope, not clear yet. So I'll be able to upgrade all my MySQL instances to a Galera-supporting binary while they are replicating using standard MySQL replication. That's good. Now, how is Galera replication turned on after that? What will happen if I just set wsrep_cluster_address on all replicas? What will the replicas do, and what will happen to the standard MySQL replication?
Ok, I was clearly too brief there.
1) You shut down the first slave, upgrade the software, add the required configuration, restart it as a single-node cluster, and connect it back to the master as a regular slave.
2) For the rest of the slaves: shut down the slave, upgrade the software, add the required configuration, and join it to the Galera cluster. The Galera cluster functions as a single collective slave now, with only Galera replication between the nodes. Depending on how meticulous you are, you can avoid a full state snapshot if you take care to note the offset (in the number of transactions) between the moments the first node and this node were shut down. Then you can forge the Galera GTID corresponding to this node's position and just replay the missing transactions cached by the first node (make sure it is specified in wsrep_sst_donor). If the node does not know its Galera GTID, then obviously it needs a full SST.
Hm... As Galera is not available for MariaDB 10.0, I assume the Galera GTID is not the same as MariaDB's GTID. This is confusing, and it's apparently not documented anywhere...
3) When all nodes are converted, perform master failover to one of the Galera nodes like you'd normally do. Now you can stop the remaining slave.
4) Convert the former master as per 2).
If this looks dense, a quick Google search gives: http://www.severalnines.com/blog/field-live-migration-mmm-mariadb-galera-clu... https://github.com/percona/xtradb-cluster-tutorial/blob/master/instructions/...
This is the best advice I've ever heard from a (presumable) developer of a big and complicated piece of software: if you need documentation on how to use it, go google it and you may find some blog posts by someone who uses it... OK, thanks, now I know how I can find more info on Galera Cluster. Pavel
<snip>
There can be much detail ;) I'll start with this:
1) During transaction execution Galera records the unique keys of the rows modified or referenced (foreign keys) by the transaction.
2) At prepare time it takes the keys and binlog events from the thread IO cache and wraps them into a "writeset".
3) The writeset is synchronously copied to all nodes. This is the only synchronous operation, and it can be done either over TCP or multicast UDP. All nodes, including the sender, receive writesets in exactly the same order, which defines the sequence number part of the GTID. The writeset is placed in the receive queue for further processing.
4) The writeset is picked from the queue and (in seqno order) passed through the certification algorithm, which determines whether the writeset can be applied or not and also which writesets it can be applied in parallel with.
5) If the certification verdict is positive, the master commits the transaction and sends OK to the client; a slave applies and commits the binlog events from the writeset.
6) If the certification verdict is negative, the master rolls back the transaction and sends a deadlock error to the client; a slave just discards the writeset.
In the end the transaction is either committed on all nodes (except those that fail) or on none at all.
Here is a picture of the process: http://www.codership.com/wiki/doku.php?id=certification. The certification algorithm itself was proposed by Fernando Pedone in his PhD thesis. The idea is that global event ordering allows us to make consistent decisions without the need for additional communication.
Note that if only one node in the cluster accepts writes, certification will always be positive.
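If a sketch helps, here's a toy model of that certification test (illustrative Python, not our actual code; the key format and names are made up). The point is that the verdict depends only on the writeset and the global total order, so every node computes the same answer with no extra communication:

```python
from dataclasses import dataclass

@dataclass
class WriteSet:
    seqno: int          # position in the global total order (seqno part of the GTID)
    last_seen: int      # highest seqno already applied when the txn started
    keys: frozenset     # unique keys of rows modified or referenced

def certify(ws, certified):
    """Fail the writeset iff a concurrent, already-ordered writeset
    touched any of the same keys."""
    for other in certified:
        if other.seqno > ws.last_seen and other.keys & ws.keys:
            return False  # conflict: master rolls back, slaves discard
    return True

certified = []
a = WriteSet(seqno=1, last_seen=0, keys=frozenset({"t1:pk=7"}))
assert certify(a, certified)
certified.append(a)
# b ran concurrently with a (it never saw seqno 1) and touched the same row:
b = WriteSet(seqno=2, last_seen=0, keys=frozenset({"t1:pk=7"}))
assert not certify(b, certified)  # b's client gets a deadlock error
```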
So the picture seems to suggest that certification happens on each server independently. I don't know how you make sure that the result of the certification is the same on each server (it would be nice to know that).
The certification test is deterministic provided the writesets are processed in the same order. The group communication transport makes sure that the writesets are globally totally ordered. That is basically the main Galera difference: group communication instead of unrelated TCP links.
But anyway, it looks like you need at least one round trip to each node to deliver the writeset and make sure that it's delivered. And I guess only one misbehaving node will freeze all transactions until that node is excluded from the cluster. Is that correct?
Yes, you're correct. It's kinda clusterish.
As an example here's an independent comparison of Galera vs. semi-sync performance: http://linsenraum.de/erkules/2011/06/momentum-galera.html.
This is a nice blog post written in German and posted in 2011. And
You don't seriously expect that something has changed in that department since then, do you? ;)
while Google Translate gave me an idea of what the post was about, it would be nice to see something more recent and with a better description of the actual testing setup.
Sure thing, but who will bother?
Are you serious with these questions? So you are telling me "cluster is much better than semi-sync", I'm asking you "give me the proof", and you answer me "who bothers to have a proof"? And you want me to treat your claims seriously?
That really was a rhetorical question, but if you insist...

One. Two years ago one dude decided to compare Galera and semi-sync as best he could. And it was kinda a clear case. You know how semi-sync works, and you know it's hard to do worse. Since then only Jay cared to do it over WAN, and the result was what everybody expected. Besides that, literally nobody bothered about semi-sync. Even Kristian told you that he does not.

Two. We are kinda busy developing and improving our software. And as long as we believe that there is enough evidence from the field that our software gets better, it would be irresponsible of us to spend time and money on churning out quarterly benchmark results, wouldn't it? Especially given that most of those are hardly applicable in real life and anyone can dismiss them as skewed. Or expired.

Three. I'm not trying to sell you anything. Had it been about asynchronous replication, I would not have spoken at all. However, sincerely believing that Galera covers all semi-sync use cases, I asked why you don't use it. I wanted to know why it does not work for you, why you are fixing semi-sync instead. But now we ended up here. In the public mailing list. And that kinda makes me obliged to expose your misconceptions and accept just criticism.
However here's something from 2012 and in English - but no pictures: http://www.mysqlperformanceblog.com/2012/06/14/comparing-percona-xtradb-clus...
This is really ridiculous testing with really ridiculous conclusions.
That's a debatable statement ;) I think many would disagree.
What kind of comparison is that if you are testing 6-replica Percona Cluster against 2-replica setting with semi-sync?
Well, Jay was comparing Percona Cluster with one master replicating to (eventually) 5 slaves (which is presumably more work) and MySQL semi-sync with one master replicating to one slave. And he sees that Percona Cluster does no worse than semi-sync with one client thread and WAY better with several threads. It kinda answers many of your questions about performance.
Disabling log_bin and innodb_support_xa on Percona Cluster is also very nice -- how will you recover from server crashes?
And what other nodes are for? Don't you yourself want to employ semi-sync to avoid extra flushes? And how would you recover from crashes then? Here's the quote from your original post which prompted me to ask you about Galera: "Semi-sync replication for us is a DBA tool that helps to achieve durability of transactions in the world where MySQL doesn't do any flushes to disk. As you may guess by removing disk flushes we can achieve a very high transaction throughput. Plus if we accept the reality that disks can fail and repairing information from it is time-consuming and expensive (if at all possible), with such reality you can realize that flush or no flush there's no durability if disk fails, and thus disk flushes don't make much sense." This is exactly what we stand for with Galera: durability through redundancy. Or am I missing something?
And where will nodes take last events from after network disconnection?
From the cluster. That's what it is there for.
"I ignored quorum arbitration" also doesn't sound promising even though I don't know what it is.
This really isn't a big deal. He just had two datacenters with equal number of nodes in them. Had network been broken between them there'd be a "split-brain". It is relevant to multi-master use case only. And practically irrelevant to performance benchmarking that he did.
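For what it's worth, the quorum rule itself is trivial to sketch (a toy illustration, not the actual implementation): after a network split, only a partition holding a strict majority of the nodes keeps accepting writes, and with two datacenters of equal size neither side has a majority.

```python
# Toy model of quorum arbitration: a partition stays "primary" (keeps
# accepting writes) only if it holds a strict majority of the cluster.
def has_quorum(partition_size, cluster_size):
    return 2 * partition_size > cluster_size

assert has_quorum(2, 3)      # 2 of 3 nodes: primary component, writes go on
assert not has_quorum(3, 6)  # two equal halves: neither side may write
```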
Being a WAN test it may not be directly relevant to your case, but it kinda shows that Galera replication is more efficient than semi-sync in WAN, and is likely to be also more efficient in LAN. In fact, given that semi-sync replicates one transaction at a time, it is hard to be less efficient than semi-sync. Only through deliberate sabotage.
Well, sure, as long as your only definition of "efficiency" is something like 32-threaded sysbench results. But how about single-threaded sysbench results, i.e. average transaction latency in single-threaded client mode?
That was in the first table: semi-sync: 102 ms, Percona cluster: 108 ms. Ok, this was not sysbench, it was just manual inserts.
And how about another killer case: what is the maximum number of parallel updates per second that you can make to a single row?
But of course, it is now well known, 1/RTT.
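A back-of-envelope sketch of that bound (numbers purely illustrative): updates to one row are strictly serialized, each commit waits one replication round trip, so throughput on that row cannot exceed 1/RTT.

```python
def max_row_updates_per_sec(rtt_seconds):
    # each commit on the same row must wait for one replication round trip
    return 1.0 / rtt_seconds

print(max_row_updates_per_sec(0.001))  # 1 ms LAN RTT:  ~1000 updates/s
print(max_row_updates_per_sec(0.1))    # 100 ms WAN RTT:  ~10 updates/s
```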
When you talk about efficiency you need to talk about a wide range of different use cases.
2. A node reconnecting to the cluster will normally receive only the events that it missed while being disconnected.
This seems to contradict the docs. Again from https://mariadb.com/kb/en/mariadb-galera-cluster-known-limitations/ : "After a temporary split, if the 'good' part of the cluster was still reachable and its state was modified, resynchronization occurs".
Yes, but it does not specify the sort of synchronization - whether it is a full state snapshot transfer or merely a catch-up with the missing transactions. But, depending on the circumstances, either of those can occur.
It would be nice to see what algorithm is used to choose which kind of synchronization is necessary to do.
It is rather simple: if possible (required transactions are present in donor cache) - replay missing transactions, if not - copy a full snapshot. But yes, this area is not totally without gotchas yet...
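Roughly, in pseudo-Python (function names are mine, not from the codebase; "IST" = replay missing transactions, "SST" = full snapshot):

```python
def choose_sync_method(joiner_seqno, donor_cache_range):
    """Decide how a rejoining node catches up, given the seqnos the
    donor still holds in its cache."""
    lo, hi = donor_cache_range
    if joiner_seqno is None:
        return "SST"                      # node doesn't know its position
    if lo <= joiner_seqno + 1 <= hi + 1:
        return "IST"                      # all missing txns still cached
    return "SST"                          # gap is older than the cache

assert choose_sync_method(950, (900, 1000)) == "IST"
assert choose_sync_method(100, (900, 1000)) == "SST"
assert choose_sync_method(None, (900, 1000)) == "SST"
```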
Yet, Galera nodes can be started simultaneously and then joined together by setting wsrep_cluster_address from a mysql client connection. This is not the advertised method, because in that case the state snapshot transfer can be done only by mysqldump. If you set the address in advance, rsync or xtrabackup can be used to provision the fresh node.
This is of course better because I can start all instances with the same command line arguments. But transferring a snapshot of a very big database using mysqldump, and causing the node that creates the dump to blow up its memory consumption in the process, is still a big problem.
How would you do this with semi-sync? Restore from backup and replay missing events? Well, you can do the same with Galera.
I'm sorry, but this is not mentioned anywhere in the docs. So I don't know what Galera allows to do in this case.
It is now plain to see our complete failure with documentation. And I guess that answers my initial question of why you're not using Galera.
4. Every Galera node can perfectly work as either master or slave to native MySQL replication. So migration path is quite clear.
Nope, not clear yet. So I'll be able to upgrade all my MySQL instances to a Galera-supporting binary while they are replicating using standard MySQL replication. That's good. Now, how is Galera replication turned on after that? What will happen if I just set wsrep_cluster_address on all replicas? What will the replicas do, and what will happen with the standard MySQL replication?
Ok, I was clearly too brief there.
1) You shut down the first slave, upgrade the software, add the required configuration, restart it as a single-node cluster, and connect it back to the master as a regular slave.
2) For the rest of the slaves: shut down the slave, upgrade the software, add the required configuration, and join it to the Galera cluster. The Galera cluster functions as a single collective slave now, with only Galera replication between the nodes. Depending on how meticulous you are, you can avoid a full state snapshot if you take care to note the offset (in number of transactions) between the moments the first node and this node were shut down. Then you can forge the Galera GTID corresponding to this node's position and just replay the missing transactions cached by the first node (make sure it is specified in wsrep_sst_donor). If the node does not know its Galera GTID, then obviously it needs a full SST.
Hm... As Galera is not available for MariaDB 10.0 I assume Galera GTID is not the same as MariaDB's GTID. This is confusing, and it's apparently not documented anywhere...
Yes, at the moment it is the case. We develop our patch against Oracle's sources and then it gets ported to PXC and MariaDB Cluster. Currently MariaDB Cluster is a bit behind and MariaDB GTID support may be challenging. However this will be of relevance only if you decide to heavily mix Galera and native replication (as in having two Galera clusters replicate to each other asynchronously). For migration it is probably of little importance.
3) when all nodes are converted perform master failover to one of Galera nodes like you'd normally do. Now you can stop the remaining slave. 4) Convert former master as per 2)
If this looks dense, quick Google search gives: http://www.severalnines.com/blog/field-live-migration-mmm-mariadb-galera-clu... https://github.com/percona/xtradb-cluster-tutorial/blob/master/instructions/...
This is the best advice I've ever heard from (presumably) developer of a big and complicated piece of software: if you need documentation on how to use it go google it and you may find some blog posts by someone who uses it... OK, thanks, I know now how I can find more info on Galera Cluster.
Sarcasm is good. But if you look at it realistically, these were real-world guys solving their real-world problems. How can a developer of not-so-big, but nevertheless complicated *C++* software provide you with exhaustive instructions on how to do *DBA* stuff, which, given the admitted complexity of the problem and the diversity of requirements and approaches, would take volumes? Apparently these guys didn't find it that hard to understand how Galera applies to their problem. This is not to say that our documentation doesn't suck, but how are these blog posts worse than something I would have written? Why should I not refer to 3rd-party knowledge? Anyway, as I already said above, the point is taken, even though it is beside the technical merits of Galera. Regards, Alex
Pavel
-- Alexey Yurchenko, Codership Oy, www.codership.com Skype: alexey.yurchenko, Phone: +358-400-516-011
On Sat, Nov 16, 2013 at 6:05 PM, Alex Yurchenko <alexey.yurchenko@codership.com> wrote:
Disabling log_bin and innodb_support_xa on Percona Cluster is also very nice -- how will you recover from server crashes?
And what other nodes are for? Don't you yourself want to employ semi-sync to avoid extra flushes? And how would you recover from crashes then? Here's the quote from your original post which prompted me to ask you about Galera:
"Semi-sync replication for us is a DBA tool that helps to achieve durability of transactions in the world where MySQL doesn't do any flushes to disk. As you may guess by removing disk flushes we can achieve a very high transaction throughput. Plus if we accept the reality that disks can fail and repairing information from it is time-consuming and expensive (if at all possible), with such reality you can realize that flush or no flush there's no durability if disk fails, and thus disk flushes don't make much sense."
This is exactly what we stand for with Galera: durability through redundancy. Or am I missing something?
For me "durability through redundancy" means that if I said OK to the client then the client will always find the new data later, even if immediately after the OK I pull the plug on the machine where the master runs, or mysqld crashes. But that doesn't mean that every time mysqld crashes I want to throw away my database and start copying it from another node again.
And how about another killer case: what is the maximum number of parallel updates per second that you can make to a single row?
But of course, it is now well known, 1/RTT.
I guess this is true for Galera Cluster. MySQL with semi-sync can accept much more than that (not with the new default semi-sync mode from 5.7.2 though).
Yes, but it does not specify the sort of synchronization - whether it is a full state snapshot transfer or merely a catch up with missing transactions. But, depending on the circumstances any of those can occur.
It would be nice to see what algorithm is used to choose which kind of synchronization is necessary to do.
It is rather simple: if possible (required transactions are present in donor cache) - replay missing transactions, if not - copy a full snapshot. But yes, this area is not totally without gotchas yet...
I see one more new term here -- "donor cache". I have no idea what it is, how big it is and how it works...
This is not to say that our documentation doesn't suck, but how are these blog posts worse than something I would have written? Why should not I refer to 3rd party knowledge?
I think references to 3rd-party knowledge are good as long as they can be easily discovered starting from here: https://mariadb.com/kb/en/galera/. Pavel
On 2013-11-17 04:44, Pavel Ivanov wrote:
On Sat, Nov 16, 2013 at 6:05 PM, Alex Yurchenko <alexey.yurchenko@codership.com> wrote:
Disabling log_bin and innodb_support_xa on Percona Cluster is also very nice -- how will you recover from server crashes?
And what other nodes are for? Don't you yourself want to employ semi-sync to avoid extra flushes? And how would you recover from crashes then? Here's the quote from your original post which prompted me to ask you about Galera:
"Semi-sync replication for us is a DBA tool that helps to achieve durability of transactions in the world where MySQL doesn't do any flushes to disk. As you may guess by removing disk flushes we can achieve a very high transaction throughput. Plus if we accept the reality that disks can fail and repairing information from it is time-consuming and expensive (if at all possible), with such reality you can realize that flush or no flush there's no durability if disk fails, and thus disk flushes don't make much sense."
This is exactly what we stand for with Galera: durability through redundancy. Or am I missing something?
For me "durability through redundancy" means that if I said OK to the client then the client will always find the new data later, even if immediately after the OK I pull the plug on the machine where the master runs, or mysqld crashes. But that doesn't mean that every time mysqld crashes I want to throw away my database and start copying it from another node again.
And so how does disabling XA and/or binlog imply throwing away the database? I kinda more than hinted a number of times that Galera nodes can do missing transactions replay...
And how about another killer case: what is the maximum number of parallel updates per second that you can make to a single row?
But of course, it is now well known, 1/RTT.
I guess this is true for Galera Cluster. MySQL with semi-sync can accept much more than that (not with the new default semi-sync mode from 5.7.2 though).
That's very curious. And semi-sync can do so while at the same time satisfying your requirement that the client finds the data even in the event of an immediate master crash following the OK?

What you are saying also implies that semi-sync can do more than 1/RTT transactions per second (if by "accept" we shall understand actual row modification, not mere queueing for a lock). That refutes the findings of the tests I referred to, which to my knowledge so far have not been disputed by anybody. That makes it even more curious.

However, your remark about 5.7.2 seems to undermine this whole claim, and the Oracle press release confirms that: "MySQL 5.7.2 DMR also delivers lossless semi-synchronous replication, enabling transactions to only be committed to the storage engine and externalized on the master after the slave has acknowledged receipt." This sounds like prior to 5.7.2 semi-sync isn't even as "semi-sync" as one would expect. So I'm not exactly sure how it can fit your requirements. It would be really great if you could clear up my confusion displayed above.
Yes, but it does not specify the sort of synchronization - whether it is a full state snapshot transfer or merely a catch up with missing transactions. But, depending on the circumstances any of those can occur.
It would be nice to see what algorithm is used to choose which kind of synchronization is necessary to do.
It is rather simple: if possible (required transactions are present in donor cache) - replay missing transactions, if not - copy a full snapshot. But yes, this area is not totally without gotchas yet...
I see one more new term here -- "donor cache". I have no idea what it is, how big it is and how it works...
It caches replication events. It is as big as you configure it - and even bigger when necessary. And it works as a memory-mapped ring buffer.
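If it helps, here is a toy model of that eviction behaviour (the real cache is a memory-mapped file; this in-memory sketch only mimics how the oldest writesets get overwritten once the buffer is full):

```python
from collections import OrderedDict

class DonorCache:
    """Fixed-capacity cache of replication events, oldest evicted first."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.events = OrderedDict()          # seqno -> writeset payload

    def store(self, seqno, payload):
        self.events[seqno] = payload
        while len(self.events) > self.capacity:
            self.events.popitem(last=False)  # overwrite the oldest writeset

    def range(self):
        keys = list(self.events)
        return (keys[0], keys[-1]) if keys else None

cache = DonorCache(capacity=3)
for s in range(1, 6):
    cache.store(s, "events-%d" % s)
print(cache.range())   # (3, 5): seqnos 1 and 2 were overwritten
```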
This is not to say that our documentation doesn't suck, but how are these blog posts worse than something I would have written? Why should not I refer to 3rd party knowledge?
I think references to 3rd-party knowledge are good as long as they can be easily discovered starting from here: https://mariadb.com/kb/en/galera/.
If you choose so. But that really narrows your options...
Pavel
-- Alexey Yurchenko, Codership Oy, www.codership.com Skype: alexey.yurchenko, Phone: +358-400-516-011
And how about another killer case: what is the maximum number of parallel updates per second that you can make to a single row?
But of course, it is now well known, 1/RTT.
I guess this is true for Galera Cluster. MySQL with semi-sync can accept much more than that (not with the new default semi-sync mode from 5.7.2 though).
That's very curious. And semi-sync can do so while at the same time satisfying your requirement that the client finds the data even in the event of immediate master crash following OK?
What you are saying also implies that semi-sync can do more than 1/RTT transactions per second (if by "accept" we shall understand actual row modification, not mere queueing for lock). That refutes the findings of the tests I referred to, which to my knowledge so far have not been disputed by anybody. That makes it even more curious.
However your remark about 5.7.2 seems to smear this whole claim, and Oracle press release confirms that:
"MySQL 5.7.2 DMR also delivers lossless semi-synchronous replication, enabling transactions to only be committed to the storage engine and externalized on the master after the slave has acknowledged receipt."
This sounds like prior to 5.7.2 semi-sync isn't even that "semi-sync" as one would expect. So I'm not exactly sure how it can fit your requirements.
Answering to myself: indeed this can be done if transactions are committed on master asynchronously and only OK is sent to client after ACK from slave. But then semi-sync should be capable of more than 1/RTT transactions per second, yet somehow it is not what was found in any benchmark that I know of. So this is still a controversy I'm begging you to resolve. -- Alexey Yurchenko, Codership Oy, www.codership.com Skype: alexey.yurchenko, Phone: +358-400-516-011
On Sun, Nov 17, 2013 at 4:06 AM, Alex Yurchenko <alexey.yurchenko@codership.com> wrote:
And how about another killer case: what is the maximum number of parallel updates per second that you can make to a single row?
But of course, it is now well known, 1/RTT.
I guess this is true for Galera Cluster. MySQL with semi-sync can accept much more than that (not with the new default semi-sync mode from 5.7.2 though).
That's very curious. And semi-sync can do so while at the same time satisfying your requirement that the client finds the data even in the event of immediate master crash following OK?
What you are saying also implies that semi-sync can do more than 1/RTT transactions per second (if by "accept" we shall understand actual row modification, not mere queueing for lock). That refutes the findings of the tests I referred to, which to my knowledge so far have not been disputed by anybody. That makes it even more curious.
However your remark about 5.7.2 seems to smear this whole claim, and Oracle press release confirms that:
"MySQL 5.7.2 DMR also delivers lossless semi-synchronous replication, enabling transactions to only be committed to the storage engine and externalized on the master after the slave has acknowledged receipt."
This sounds like prior to 5.7.2 semi-sync isn't even that "semi-sync" as one would expect. So I'm not exactly sure how it can fit your requirements.
Answering to myself: indeed this can be done if transactions are committed on master asynchronously and only OK is sent to client after ACK from slave. But then semi-sync should be capable of more than 1/RTT transactions per second, yet somehow it is not what was found in any benchmark that I know of. So this is still a controversy I'm begging you to resolve.
It looks like during the conversation both of us have got completely confused with terminology and talked about different meanings of performance in different places. So let me try to bring that to a more sensible description. To be short I'll consider only three different aspects of performance:

1. Maximum steady rate of completely independent transactions, measured long after the beginning of testing. For semi-sync this is limited to 1 per RTT to the closest node. The safe limit (the one where the farthest nodes won't fall behind in replication) could be lower, but it depends mostly on network throughput to the farthest node (not on RTT). I don't know what the limiting factors in Galera are for this situation, but I'm sure that with the right implementation of inter-node communication Galera can outperform semi-sync by a big margin here.

2. Maximum steady rate of dependent transactions (e.g. updates to a single row), measured long after the beginning of testing. For semi-sync this is the same as the previous one -- limited to 1 per RTT to the closest node. For Galera this is limited to 1 per RTT to the farthest node, so definitely worse than with semi-sync.

3. Sudden burst of parallel dependent transactions, i.e. what maximum number of updates to a single row one can perform in parallel in, say, 15 seconds (probably from hundreds of connections) if no transactions were performed before and none will be performed after those 15 seconds. For semi-sync with rpl_semi_sync_master_wait_point = AFTER_COMMIT this is limited only by the performance of a single node, as if there were no replication at all (just writing of binlogs). For semi-sync with rpl_semi_sync_master_wait_point = AFTER_SYNC this is still limited to 1 per RTT to the closest node. For Galera this is still limited to 1 per RTT to the farthest node.

So anyone who's reading this can make his own conclusions about what tradeoffs he prefers to make for his production systems. Pavel
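For concreteness, those three bounds can be modeled with a toy calculation (all numbers are illustrative; only the RTT relationships come from the discussion above):

```python
def limits(rtt_closest, rtt_farthest, single_node_tps):
    """Rough upper bounds on transactions/sec for the three cases above."""
    return {
        # 1. independent transactions: semi-sync waits one RTT per commit
        "semisync_independent": 1.0 / rtt_closest,
        # 2. steady dependent transactions (updates to the same row)
        "semisync_dependent": 1.0 / rtt_closest,
        "galera_dependent": 1.0 / rtt_farthest,
        # 3. burst of dependent transactions
        "semisync_burst_after_commit": single_node_tps,  # local speed only
        "semisync_burst_after_sync": 1.0 / rtt_closest,
        "galera_burst": 1.0 / rtt_farthest,
    }

# closest replica 1 ms away, farthest 50 ms, node does 20k local tps:
r = limits(rtt_closest=0.001, rtt_farthest=0.050, single_node_tps=20000)
# semi-sync caps dependent updates near 1000/s, Galera near 20/s, while
# AFTER_COMMIT bursts run at local single-node speed.
```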
On 2013-11-18 03:22, Pavel Ivanov wrote:
It looks like during the conversation both of us have got completely confused with terminology and talked about different meanings of performance in different places. So let me try to bring that to a more sensible description. To be short I'll consider only three different aspects of performance:
1. Maximum steady rate of completely independent transactions measured long after the beginning of testing. For semi-sync this is limited to 1 per RTT to the closest node. The safe limit (the one when farthest nodes won't fall behind in replication) could be lower, but it depends mostly on network throughput to the farthest node (not on RTT). I don't know what are limiting factors in Galera for this situation, but I'm sure that with the right implementation of inter-node communication Galera can outperform semi-sync by a big margin here.
2. Maximum steady rate of dependent transactions (e.g. updates to a single row) measured long after the beginning of testing. For semi-sync this is the same as for previous one -- limited to 1 per RTT to the closest node. For Galera this is limited to 1 per RTT to the farthest node, so definitely worse than with semi-sync.
3. Sudden burst of parallel dependent transactions, i.e. what maximum number of updates to a single row can one perform in parallel in say 15 seconds (probably from hundreds of connections) if he didn't perform any transactions before that and won't perform any transactions after those 15 seconds. For semi-sync with rpl_semi_sync_master_wait_point = AFTER_COMMIT this is limited only to performance of a single node as if there was no replication at all (just writing of binlogs). For semi-sync with rpl_semi_sync_master_wait_point = AFTER_SYNC this is still limited to 1 per RTT to the closest node. For Galera this is still limited to 1 per RTT to the farthest node.
Many thanks! That clears it. Regards, Alex
So anyone who's reading this can make his own conclusions what tradeoffs he prefers to make for his production systems.
Pavel
-- Alexey Yurchenko, Codership Oy, www.codership.com Skype: alexey.yurchenko, Phone: +358-400-516-011
Ahoi Pavel, On Fri, Nov 15, 2013 at 01:59:49PM -0800, Pavel Ivanov wrote: [snip]
As an example here's an independent comparison of Galera vs. semi-sync performance: http://linsenraum.de/erkules/2011/06/momentum-galera.html.
This is a nice blog post written in German and posted in 2011. And while Google Translate gave me an idea what post was about it would be nice to see something more recent and with better description of what was the actual testing set up.
It is my post and, as a matter of fact, you are right, a newer one would also be nice. As a matter of fact there is one: http://linsenraum.de/erkules/2012/03/galera-als-replikationsersatz.html (even there is a mistake: a missing setting for innodb_flush_log_at_trx_commit=0). There you also get some info about the hardware used. I think I'm going to write a new one, also with newer hardware, and there will be an English version.

The basic idea of the tests is not to rely on the 'master'. That's why I used settings like innodb_flush_log_at_trx_commit=0 and innodb_doublewrite=0. The basic idea is not to rely on any data of a crashed node. Working also in 'cloud environments' I prefer to rebuild a node instead of repairing it.

To make it short:
* Galera is always faster than semi-sync. You can compare it to async replication. It gets its speed out of parallel applying.
* With Galera you have (virtually) synchronous replication. Using semi-sync you know nothing. All you can do is monitor the semi-sync variables, but I doubt they will tell you anything about the 'last' transactions.

So you get async replication speed with synchronous data \o/

Regards Erkan -- "über den grenzen muß die freiheit wohl wolkenlos sein" ("above the borders, freedom must surely be cloudless")
participants (3):
- Alex Yurchenko
- erkan yanar
- Pavel Ivanov