Hi Ingo!

Your e-mail is totally relevant and I have almost nothing there to respond to in particular - it's all as you say, I have no essential remarks. Instead I want to respond to it as a whole, so I'll omit a lengthy quote; suffice it to say that this is a direct response.

The problem is that you cannot really design and program by use cases, unorthodox as it may sound. You cannot throw in an arbitrary bunch of use cases as input and get code as output (that is, in finite time and of finite quality). Whether you like it or not, you always program some model. By definition, a program is a description of some model. If you have not settled on a model, you're in trouble - and that's where MySQL replication is. This is a direct consequence of trying to satisfy a bunch of use cases without first putting them in the perspective of some general abstract model. I mention this not to belittle anything or anyone - everybody makes mistakes. But the subject of this thread is "Ideas for improving MariaDB/MySQL replication", so mistakes should be learned from, not repeated.

Let me offer the following analogy: suppose you want to create a transport agency. To transport stuff. You know, people, animals, cargo - stuff. There are a billion use cases. But when you get down to it, you have models to choose from (thankfully there are already models for that, you don't have to develop one). E.g. you can transport by air or by land. And each of these models has its own laws and limitations. For example, you cannot reliably transport by land faster than 200 km/h. You cannot transport a lot of cargo by air, nor can you make stops every 10 km to pick up passengers. So you gotta settle on the model that suits you most.

Now you may ask: why? Why not choose both models? Well, notice that they are still models. There are a whole lot of other use cases that you cannot satisfy with them. Next, do you know many companies that do both land and air transportation?
You can own both of them indeed, but for the sake of efficiency they'll be different companies, because aside from the load()/unload() functions their interfaces, internals and logistics are likely to be very different. This is a clumsy analogy indeed, but I hope it helps.

So now we have a proposed model based on Redundancy Sets, linearly ordered global transaction IDs and ordered commits. We pretty much understand how it will work and what sort of redundancy it will provide, and, as you agreed, it is easy to use for recovery and node joining. It satisfies a whole bunch of use cases, even those where ordering of commits is not strictly required. Perhaps we won't be able to have some optimizations that we could have had without ordering of commits, but the benefit of such optimizations is highly questionable IMO. MySQL/Galera is a practical implementation of such a model, maybe not exactly what we want to achieve here, but it gives a good estimate of performance, and performance is good.

Now this model may not fit, for instance, an NDB-like use case. What options do we have here?

1) Extend the proposed model somehow to satisfy the NDB use case. I don't see that as likely. Because, as you agreed, NDB is not really about redundancy, it is about performance. Redundancy is quite specific there. And it is not by chance that it is hard to migrate applications to use it.

2) Develop a totally different model to describe the NDB use case and have it as a different API. Which is exactly what it is right now, if I'm not mistaken. So it just falls out of the scope of today's topic.

There is one more option - just forget about the NDB use case, which may be there only because there is nothing better. There are other ways to get partitioning and replication to work together without pushing them behind the same interface. E.g. you can have a "replication cluster" of "partition clusters" - or a "partition cluster" of "replication clusters" (i.e.
each replication cluster replicating a single partition).

Disclaimer: the NDB use case was taken as an example. The bottom line - you cannot just say that sometimes we don't need total ordering of commits. You gotta put it in the model.

On Wed, 17 Mar 2010 13:03:02 +0200, Henrik Ingo <henrik.ingo@avoinelama.fi> wrote:
<skip>
>> I don't think that you need 2PC between redundancy service and the storage engines, because redundancy service never fails. Well, when it fails, you have something more important to worry about than disk flushes anyways.
>
> How does synchronous replication happen without 2PC?
It does, it does. E.g. it does so in MySQL/Galera, see my response to Kristian. Actually, how can it work otherwise? What is the meaning of prepare() in the replication step? How can the engine commit fail at this point, except in a crash?

Regards,
Alex
-- 
Alexey Yurchenko,
Codership Oy, www.codership.com
Skype: alexey.yurchenko, Phone: +358-400-516-011
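
P.S. A minimal sketch of the model discussed above (linearly ordered global transaction IDs plus ordered commits), in Python. All class and method names here are hypothetical, and this is not how MySQL/Galera is actually implemented; it only illustrates why recovery and node joining become simple once every node commits in the same global order, and why there is no separate prepare step in this picture.

```python
import itertools


class RedundancyService:
    """Hypothetical stand-in: hands out linearly ordered global transaction IDs."""

    def __init__(self):
        self._next_id = itertools.count(1)
        self.log = []  # totally ordered list of (gtid, writeset)

    def replicate(self, writeset):
        # Assigning the ID fixes the transaction's place in the global order.
        gtid = next(self._next_id)
        self.log.append((gtid, writeset))
        return gtid


class Node:
    """Hypothetical node that commits strictly in global ID order."""

    def __init__(self, name):
        self.name = name
        self.last_committed = 0
        self.data = {}

    def commit(self, gtid, writeset):
        # Ordered commit: the only acceptable next ID is last_committed + 1.
        if gtid != self.last_committed + 1:
            raise RuntimeError(f"{self.name}: out-of-order commit {gtid}")
        self.data.update(writeset)
        self.last_committed = gtid

    def catch_up(self, service):
        # Recovery or joining: replay everything after our last committed ID.
        for gtid, writeset in service.log[self.last_committed:]:
            self.commit(gtid, writeset)


service = RedundancyService()
a, b = Node("a"), Node("b")

for ws in ({"x": 1}, {"y": 2}, {"x": 3}):
    gtid = service.replicate(ws)
    a.commit(gtid, ws)  # node a commits as transactions are ordered

b.catch_up(service)  # node b joins late and replays the ordered log
```

In this sketch a node's whole replication state is a single number (its last committed global ID), which is what makes joining and recovery a matter of replaying the tail of the ordered log.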