On Tue, Mar 16, 2010 at 7:32 AM, Alex Yurchenko <alexey.yurchenko@codership.com> wrote:
I think "a cluster that outwards presents a consistent transactional view, yet internally does not have a total ordering on transactions" is an internally contradictory concept. Suppose node1 committed T1, but not T2 yet, and node2 committed T2, but not T1 yet. Now do database dumps from those nodes. Which one will represent a consistent transactional view?
I'm not sure if this is the same as Kristian is thinking about, but I don't see it as internally contradictory. See below...
I guess you're trying to make a case for parallel transaction applying on slaves here. You can do parallel out-of-order applying alright. It is just
This is relevant to being able to do parallel slave execution, but not restricted to that. Actually, what Kristian is describing can even happen on a single-node, multi-threaded database; it is not strictly an issue with clusters at all. I'll borrow my old example from http://forge.mysql.com/worklog/task.php?id=4648

We have 2 tables, t1 and t2, and 4 transactions applied in a sequence. Assume that all 4 transactions are done "at the same time" by 4 different threads. The sequence randomly ends up being:

    /* Assume all columns are "0" to start */
    /* 0 */ (SET AUTOCOMMIT=1)
    /* 1 */ UPDATE t1 SET f1=1, f2=2, timestamp=001 WHERE id=1;
    /* 2 */ UPDATE t2 SET f1=3, f2=4, timestamp=002 WHERE id=1;
    /* 3 */ UPDATE t1 SET f1=5, f2=6, timestamp=003 WHERE id=1;
    /* 4 */ UPDATE t2 SET f1=7, f2=8, timestamp=004 WHERE id=1;

Now, depending on how the database internally commits transactions and writes a transaction log, it may or may not impose a total ordering on these transactions. If you now replay these transactions (such as apply them on a slave), what happens if you do:

    /* 1 */ UPDATE t1 SET f1=1, f2=2, timestamp=001 WHERE id=1;
    /* 3 */ UPDATE t1 SET f1=5, f2=6, timestamp=003 WHERE id=1;
    /* 4 */ UPDATE t2 SET f1=7, f2=8, timestamp=004 WHERE id=1;
    /* The following statement is discarded since NEW.timestamp < OLD.timestamp */
    /* 2 */ UPDATE t2 SET f1=3, f2=4, timestamp=002 WHERE id=1;

Depending on what you want to achieve, skipping transaction #2 is either ok or bad. The (slave) state between #3 and #4 is now something that never existed in the original (master) database. However, given the assumption that the transactions were independently executed "at the same time", the ordering was determined by chance, so it is a state that *could have existed* on the original database. Whether it actually could have existed depends on our assumptions about what the application is doing: if transactions 1-4 are really independent, it could have existed; if they are interrelated - such as tables t1 and t2 depending on each other - the application could never actually produce such a state.

I agree with Alex that if a 100% total ordering is not imposed on the transactions, it is not correct to say that the second replay (the slave) is a *replica* of the first run (the master), since it can end up in states that never existed on the master. However, the original sentence "a cluster that outwards presents a consistent transactional view, yet internally does not have a total ordering on transactions" is not contradictory in itself; it is something that can happen even on a single-node database.

The argument in favor of "allowing" behavior like the above is that we should be allowed to assume that an application always commits transactions that produce consistent end states. So if some "nearby" transactions are then replicated in a different order, each state is still meaningful to the application, even if some states may not have actually existed on the master when you look at the database as a whole.

As an example, a bank has a database that reflects how much money we have in our accounts. If I transfer 100€ to you, then the withdrawal and deposit are done in the same transaction and the database state is consistent at the end of the transaction. But if I give 100€ to charity and you win 100€ on the lottery, these are independent transactions. For practical purposes no application should care in which order these transactions were committed.
Note that other properties as well, for instance the total amount of money the bank has, would still always be correct regardless of the order in which those 2 individual transactions were committed. So if we now replicate the bank database, it is not in practice important to replicate the transactions in exactly the same sequence as they happened to execute on the master database. It is only in theory that it is a problem that the slave may pass through states (when looking at the database as a whole) that didn't actually exist on the master and therefore are not replicas. So personally I see it like this: in theory, any replication should of course reproduce exactly the same states as the "original" database; in practice, if you can get significant performance gains by not imposing a total ordering on each and every transaction, that is a practical approach that should be allowed. (At least by an option, since ultimately it is the app developer or DBA who can decide whether such replication is "safe" or not.)
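To make the discard rule from the example above concrete, here is a minimal sketch of a timestamp-guarded applier (Python, DB-API style; the function name is made up and the table layout is just the one from the example, not any real replication API):

    import sqlite3  # any DB-API connection works; sqlite3 keeps the sketch self-contained

    def apply_replicated_update(cursor, table, row_id, new_values, new_ts):
        # Guarded UPDATE: apply the incoming row version only if it is not
        # older than what the row already contains, i.e. enforce
        # NEW.timestamp >= OLD.timestamp and silently discard otherwise.
        assignments = ", ".join("%s = ?" % col for col in new_values)
        sql = ("UPDATE %s SET %s, timestamp = ? WHERE id = ? AND timestamp <= ?"
               % (table, assignments))
        cursor.execute(sql, tuple(new_values.values()) + (new_ts, row_id, new_ts))
        return cursor.rowcount == 1  # False: a newer version was already applied

    # usage, e.g.: apply_replicated_update(conn.cursor(), "t2", 1, {"f1": 3, "f2": 4}, 2)

Note that in the bank example the two independent transactions touch different rows, so a guard like this would never discard anything the application cares about.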
The point is that if you want _replication_, you gotta have a linear order. Without it you can't compare db states and therefore can't say that this db is a replica of that db. What you end up with is so-called "eventual consistency", and "eventual" may never happen.
Theoretically speaking, yes.
3. Uncoupled multiple masters, like multiple nodes in MySQL asynchronous replication. In this case there really is no total linear order, as the transactions are not coordinated between servers. At most there would be an approximate order, like a timestamp, but that would not be 100% reliable due to clock skew and such.
That's what the concept of a Replication Set was about below. In the scenario described above you effectively have multiple replication sets and multiple logical clusters, so each one of them will have its own global transaction ID sequence. If they are so uncoupled, there is no point in considering them a single cluster.
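As a minimal sketch of that idea (the names are mine, not an actual API): a global transaction ID can be thought of as a (replication set, seqno) pair, and a total order exists only within one set:

    import uuid
    from dataclasses import dataclass

    @dataclass(frozen=True)
    class GlobalTxnId:
        rep_set: uuid.UUID   # identifies the replication set / logical cluster
        seqno: int           # position in that set's linear commit order

        def precedes(self, other):
            # Ordering is defined only inside one replication set; IDs from
            # uncoupled sets are simply incomparable, which is the formal way
            # of saying "there is no single cluster-wide history".
            if self.rep_set != other.rep_set:
                raise ValueError("IDs from different replication sets are unordered")
            return self.seqno < other.seqno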
What you say above is significant. At least for myself, when speaking of "replication" I think of solving both the problem of running a cluster (what you define as tightly coupled above) and the problem of replication between "uncoupled" databases (replication sets), such as 2 datacenters across 2 continents. Trying to force a strict total ordering across 2 datacenters seemed like a bad idea; now I understand you never intended to do that.
I guess I have addressed this above already. In short, parallel applying does not need to mean out-of-order commit.
You still end up with some form of group commit. Continuing from above, there are also applications that do multi-master replication where the masters are "uncoupled" out of necessity. An example is when you want to have geographical redundancy and allow writes to all datacenters. As we have agreed above, it is not practical to force a total ordering globally across the changesets of the different datacenters, but they still replicate to each other (using conflict resolution, for instance based on a timestamp, or by trying to guarantee that one datacenter is always the single master for a certain subset of records; see the sketch below). Apparently our mobile phone networks work this way (the HLR). In such a setup you can do parallel applying all you want, since there is no "one true" ordering of the transactions anyway.
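Here is a sketch of the "single master for a certain subset of records" variant mentioned above (the datacenter names and the hash rule are purely illustrative):

    import hashlib

    DATACENTERS = ["dc-europe", "dc-america"]  # illustrative names

    def home_datacenter(record_key):
        # Every record has exactly one "home" datacenter that may accept
        # writes for it; here the home is derived from a stable hash of the
        # key, but any deterministic rule (customer country, etc.) works.
        h = int(hashlib.sha1(record_key.encode()).hexdigest(), 16)
        return DATACENTERS[h % len(DATACENTERS)]

    def accept_write(record_key, local_dc):
        # Accept the write only in the record's home datacenter; otherwise
        # it must be forwarded there. Write conflicts disappear without any
        # ordering being imposed across datacenters.
        return home_datacenter(record_key) == local_dc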
In other words, is it necessary to enforce such total order at the level of the API, instead of just introducing it at the point where it becomes needed, and if so, why?
There are at least 2 problems with not having linearly ordered commits:
1) Database states are incomparable. It is impossible (ok, impractically hard) to tell if two databases have identical contents and determine which transactions are missing. How will you join a node to a cluster in this case?
2) Causality preservation. Suppose you commit something on master and now want to select results from slave. How will you tell that slave already has committed these results?
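For contrast, a sketch of how simple both problems become once commits are linearly ordered (the node objects here are toys and the methods in causal_read are hypothetical, just to make the point concrete):

    class Node:
        # Toy in-memory node: nothing but a linearly ordered commit log.
        def __init__(self):
            self.log = []                       # log[i] = transaction with seqno i+1

        def last_committed_seqno(self):
            return len(self.log)

        def transaction_log(self, first, last):
            return self.log[first - 1:last]     # seqnos are 1-based

    def missing_transactions(joiner, donor):
        # Problem 1: with a linear order the whole state of a node is
        # summarized by one number, the seqno of its last committed
        # transaction, so joining reduces to shipping the missing suffix.
        return donor.transaction_log(joiner.last_committed_seqno() + 1,
                                     donor.last_committed_seqno())

    def causal_read(master, slave, query):
        # Problem 2: remember where the master was after our commit, and
        # make the slave catch up to at least that point before reading.
        token = master.last_committed_seqno()
        slave.wait_until_committed(token)   # hypothetical: blocks until slave >= token
        return slave.execute(query)         # hypothetical query API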
I don't exactly understand how you separate the "level of the API" from "the point where it becomes needed" - do you mean code separation, or making linear ordering an optional feature of the redundancy service?
In any case, I think we all understand quite well what database replication IS and what the redundancy guarantees are when commits are linearly ordered. I'd like to hear a more detailed concept of redundancy in the absence of linearly ordered commits. What is redundancy in that case, and what are the guarantees? How will recovery be performed, and so on?
I think you yourself also admitted that in some cases the "replicas" cannot be so tightly coupled that a globally consistent linear ordering would be possible or desirable. It is true that some operations, like joining a new node, become simple and convenient when you have a linear ordering to refer to. But there are situations where you don't have that, for one reason or another; the most obvious use case is geographical replication.
I don't think that you need 2PC between the redundancy service and the storage engines, because the redundancy service never fails. Well, when it does fail, you have something more important to worry about than disk flushes anyways.
How does synchronous replication happen without 2PC?

henrik
(choosing to take part in something interesting rather than doing my real work :-)

--
email: henrik.ingo@avoinelama.fi
tel: +358-40-5697354
www: www.avoinelama.fi/~hingo
book: www.openlife.cc