On Wed, Mar 17, 2010 at 9:01 PM, Alex Yurchenko <alexey.yurchenko@codership.com> wrote:
The problem is that you cannot really design and program by use cases, unorthodox as it may sound. You cannot throw an arbitrary bunch of use cases in as input and get code as output (that is, in finite time and of finite quality). Whether you like it or not, you always program some model.
Uh, I'm not sure I can accept this proposition. At least it seems contradictory to MariaDB's vision of being a practical, user- and customer-driven database. As I see it, for real-world applications you should always start with use cases. But it is ok if you then come back to me and say that a subset of use cases should be discarded because they are too difficult to service, or even contradict each other. But just saying that you'd like to implement an abstract model without connection to any use cases sounds dangerous to me. I'm also a fan of abstract thinking, though. Sometimes you can get great innovations from starting with a nice abstract model and then asking yourself which real-world problems it would (and would not) solve. Either way, you end up anchoring yourself in real-world use cases.
It is by definition that a program is a description of some model. If you have not settled on a model, you're in trouble - and that's where mysql replication is. This is a direct consequence of trying to satisfy a bunch of use cases without first putting them in a perspective of some general abstract model.
Yes. It is fair to say that use cases alone, without some "umbrella" like an abstract model, will just lead to chaos.
So now we have a proposed model based on Redundancy Sets, linearly ordered global transaction IDs and ordered commits. We pretty much understand how it will work and what sort of redundancy it will provide, and, as you agreed, it is easy to use for recovery and node joining. It satisfies a whole bunch of use cases, even those where ordering of commits is not strictly required. Perhaps we won't be able to have some optimizations where we could have had them without ordering of commits, but the benefit of such optimizations is highly questionable IMO. MySQL/Galera is a practical implementation of such a model, maybe not exactly what we want to achieve here, but it gives a good estimate of performance, and performance is good.
Back on track: so the API should of course implement something with as broad an applicability as possible. This is the whole point of questioning you, since so far you have just suggested a model which happens to nicely satisfy Galera's needs :-) But another real-world argument you can make is that we don't need parallel replication for speed, because at least Galera does well without it. That should then be benchmarked by someone. The real-world requirement here is after all "speed", not "parallel replication".
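As a side note, to make "linearly ordered global transaction IDs" concrete, here is a minimal sketch in C of what such an ID could look like, assuming a (redundancy set UUID, sequence number) pair roughly in the spirit of what Galera does. The struct and function names are my own illustration, not an agreed or existing API:

    #include <stdint.h>
    #include <string.h>

    /* Global transaction ID: a position in the linear commit order of one
     * Redundancy Set.  Illustrative names only. */
    typedef struct {
        unsigned char rs_uuid[16];  /* which Redundancy Set this ID belongs to */
        uint64_t      seqno;        /* position in the linear commit order */
    } global_trx_id;

    /* IDs from the same Redundancy Set are totally ordered by seqno;
     * IDs from different Redundancy Sets are not comparable. */
    static int global_trx_id_cmp(const global_trx_id *a, const global_trx_id *b)
    {
        if (memcmp(a->rs_uuid, b->rs_uuid, sizeof(a->rs_uuid)) != 0)
            return 2;                        /* not comparable */
        if (a->seqno < b->seqno) return -1;
        if (a->seqno > b->seqno) return  1;
        return 0;
    }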
Now this model may not fit, for instance, NDB-like use case. What options do we have here?
1) Extend somehow the proposed model to satisfy the NDB use case. I don't see that as likely, because, as you agreed, NDB is not really about redundancy, it is about performance. Redundancy is quite specific there. And it is not by chance that it is hard to migrate applications to use it. <cut>
Actually, I don't think the issues with migration/performance have anything at all to do with how it does replication. (They have to do with the partitioning/sharding and plain limitations of the MySQL storage engine interface.) But we should distinguish 2 things here: how NDB does its own cluster-internal (node-to-node) replication can for our purposes be considered an engine-internal issue. Otoh MySQL Cluster also uses the standard MySQL replication and binlog. From there we can derive some interesting behavior that we should certainly support in the replication API. I.e. hypothetically MySQL Cluster could use our replication api for geographical replication, as it uses MySQL replication today, but there could also be some other engine with these same requirements. The requirements I can think of are:

1) As Kristian explained, transactions are often committed on only one pair or a few pairs of nodes, but not all nodes (partitions) in the cluster. The only cluster-global (or database-global) sync point is the epoch, which collects the transactions packed between cluster-wide heartbeats. To restore to a consistent cluster-wide state, you must choose the border between 2 epochs, not just any transaction.

-> A consequence for the mysql binlog and replication is that the whole epoch is today considered one large transaction. I don't know if this has any consequence for our discussion, other than the "transactions" (epochs) being large. A nice feature here could be support for "groups of transactions" (not to be confused with group commit) or sub-transactions, whichever way you prefer to look at it. This way an engine like NDB could send information about both the epoch and each individual transaction inside the epoch to the redundancy services. (The redundancy services then may or may not use that info, but the API could support it.)

2) Sending of transactions to the mysql binlog is asynchronous, totally decoupled from the actual commit that happens in the data nodes. The reason is that a central binlog would otherwise become a bottleneck in an otherwise distributed cluster.

-> This is ok also in our current discussion. If the engine doesn't want to include the replication api in a commit, it just doesn't do so, and there's nothing we can or need to do about it. For instance in the case of NDB it is NDB itself that gives you adequate guarantees for redundancy; the use of the mysql binlog is for other reasons (asynchronous geographical replication, and potentially playback and point-in-time restore of transactions).

3) Transactions arrive at the mysql binlog in a somewhat random order, and it is impossible to know which order they actually committed in. Due to (2), NDB does not want to sync with a central provider of global transaction IDs either.

-> When transactions arrive at the replication api, the NDB side may just act as if they are being committed, even if they have already been committed in the engine. The replication api would then happily assign global transaction IDs to the transactions. As in (2), this makes redundancy services behind this api unusable for database recovery or node recovery; the engine must guarantee that functionality (which they do today anyway, in particular NDB).

-> Transactions "committed" to the replication api become linearly ordered, even if this order does not 100% correspond to the real order in which the engine committed them originally. However, I don't see a problem with this at this point.
-> Assuming that there would be a benefit on an asynchronous slave from doing parallel replication, it would be advantageous to be able to commit transactions "out of order". For instance, if we introduce the concept of transaction groups (or sub-transactions), a slave could decide to commit transactions in random order inside a group, but would have to sync at the boundary of a transaction group. (This requirement may in fact worsen performance, since in every epoch you would still have to wait for the longest-running transaction.)

So those are the requirements I could derive from having NDB use our to-be-implemented API. My conclusion from the above is that we should consider adding to the model the concept of a transaction group, where:

-> the engine (or MariaDB server, for multi-engine transactions?) MAY provide information about which transactions were committed within the same group.

-> if such information was provided, a redundancy service MAY process transactions inside a group in parallel or out of order, but MUST make sure that all transactions in transaction group G1 are processed/committed before the first transaction in G2 is processed/committed.
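To make the transaction group idea concrete, here is a rough C sketch of what the engine-facing reporting and the applier-side MAY/MUST rules above could look like. All names here (trx_in_group, rs_report_trx, the worker-pool helpers) are made up for illustration; this is not an existing or proposed MariaDB interface, just one possible shape:

    #include <stdint.h>
    #include <stddef.h>

    /* What the engine reports: a transaction plus the group (e.g. NDB epoch)
     * it was committed in. */
    typedef struct {
        uint64_t    group_id;    /* e.g. the NDB epoch number */
        uint64_t    trx_id;      /* engine-local id inside the group */
        const void *events;      /* row events / changeset of this transaction */
        size_t      events_len;
    } trx_in_group;

    /* Engine -> redundancy service: report one transaction with its group,
     * and signal when a group is complete (a consistent restore point). */
    int rs_report_trx(void *rs, const trx_in_group *trx);
    int rs_group_complete(void *rs, uint64_t group_id);

    /* Redundancy service / applier side, honouring the rules above: inside a
     * group the order does not matter, so transactions can be handed to
     * worker threads in parallel; at the group boundary we must wait until
     * everything in the group has committed before the next group starts.
     * The worker-pool helpers are hypothetical. */
    void queue_to_worker(const trx_in_group *trx);
    void wait_for_workers_idle(void);

    void apply_group(const trx_in_group *trxs, size_t n_trxs)
    {
        size_t i;

        for (i = 0; i < n_trxs; i++)
            queue_to_worker(&trxs[i]);   /* MAY apply in parallel / out of order */

        wait_for_workers_idle();         /* MUST finish G1 before G2 starts */
    }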
2) Develop a totally different model to describe the NDB use case and have it as a different API. Which is exactly what it is right now, if I'm not mistaken, so it just falls outside the scope of today's topic.
We should not include the NDB internal replication in this discussion. Or we might, in the sense that real-world examples can give good ideas on implementation details and corner cases, but it is not a requirement. How NDB uses the MySQL row-based replication is imho an interesting topic to take into account.

henrik

-- 
email: henrik.ingo@avoinelama.fi
tel:   +358-40-5697354
www:   www.avoinelama.fi/~hingo
book:  www.openlife.cc