Alex Yurchenko <alexey.yurchenko@codership.com> writes:
On Mon, 15 Mar 2010 12:29:14 +0100, Kristian Nielsen <knielsen@knielsen-hq.org> wrote:
One possible implementation for that can be a (UUID, long long) pair.
How is this different from (server_id, group_id)? (I'd like to understand).
It is different in that the UUID in that proposal is a replication sequence UUID. It is one per cluster. All events generated by cluster nodes have the same UUID. And it serves to distinguish replication events from different clusters, explicitly making them incomparable (the same way that (server_id, group_id) pairs with different server_id's are incomparable). So you can compare only transaction IDs with the same UUID.
Ok, thanks for the clarification, I think I see now. "server_id" does not work, as there can be multiple different server_ids in a single cluster. We need every server in the same cluster (NDB, Galera, ...) to use the same UUID in the global transaction ID.
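To make that concrete, here is a minimal C sketch of such an ID and its (partial-order) comparison. The names and layout are mine, purely illustrative, not a proposed API:

    #include <stdint.h>
    #include <string.h>

    /* Hypothetical layout: one UUID shared by every node in the cluster,
       plus a monotonically increasing sequence number. */
    typedef struct st_global_trx_id {
      unsigned char cluster_uuid[16];  /* same for all nodes in one cluster */
      uint64_t      seqno;             /* ordered only within one UUID */
    } global_trx_id;

    /* IDs from different clusters are incomparable; return -2 for that. */
    int global_trx_id_cmp(const global_trx_id *a, const global_trx_id *b)
    {
      if (memcmp(a->cluster_uuid, b->cluster_uuid, 16) != 0)
        return -2;                        /* different clusters */
      return (a->seqno < b->seqno) ? -1 : (a->seqno > b->seqno);
    }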
I guess you're trying to make a case for parallel transaction applying on slaves here. You can do parallel out-of-order applying all right. It is just that when you have committed T2 but not T1 yet, the node does not have a T2 state. And it does not have a T0 state. It has an undefined state. Only after T1 is committed does it suddenly become T2. There can be several ways to deal with it. For example, you apply T2, but you don't commit it. And after you have applied T1, you commit both of them in one go.
Agree. Even with parallel transaction application, one would want to commit in some fixed (from the origin) order. (Following above example, even if there is no evidence on master as to whether T1 or T2 came first, we will want the order to be consistent among multiple slaves, so that there is no possibility to see T1-without-T2 on one slave and T2-without-T1 on another).
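A sketch of how a slave might enforce that, assuming each transaction carries its origin commit seqno (all names here are made up):

    #include <pthread.h>
    #include <stdint.h>

    /* Applier threads work in parallel, but each waits its turn at this
       gate before committing, so commit order matches the origin order. */
    static pthread_mutex_t gate_lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  gate_cond = PTHREAD_COND_INITIALIZER;
    static uint64_t next_commit_seqno = 1;

    void commit_in_origin_order(uint64_t my_seqno, void (*do_commit)(void))
    {
      pthread_mutex_lock(&gate_lock);
      while (my_seqno != next_commit_seqno)  /* T2 blocks until T1 is done */
        pthread_cond_wait(&gate_cond, &gate_lock);
      do_commit();
      next_commit_seqno++;
      pthread_cond_broadcast(&gate_cond);    /* wake the next waiter */
      pthread_mutex_unlock(&gate_lock);
    }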
There are at least 2 problems with not having linearly ordered commits:
1) Database states are incomparable. It is impossible (ok, impractically hard) to tell if two databases have identical contents and determine which transactions are missing. How will you join a node to a cluster in this case?
Yes. The issue of joining a cluster is also a good point. After reading your comments, I tend to agree that global transaction IDs should be totally ordered.
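With a total order, a node's whole position can be summarised by a single last-committed seqno, which is what makes joining simple in principle. A sketch, with hypothetical functions:

    #include <stdint.h>

    /* The joiner replays the missing suffix of the (totally ordered)
       transaction history from a donor node. */
    void join_cluster(uint64_t last_committed,   /* joiner's position */
                      uint64_t donor_latest,     /* donor's position  */
                      void (*fetch_and_apply)(uint64_t seqno))
    {
      for (uint64_t s = last_committed + 1; s <= donor_latest; s++)
        fetch_and_apply(s);                      /* strictly in order */
    }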
I don't exactly understand how you separate "level of the API" and "the point where it becomes needed". Do you mean code separation, or making linear ordering an optional feature of the redundancy service?
(I meant that the comparison provided by the API could be a partial order, and if it returned T1 and T2 as "equivalent", any plugin that needs a total order could just pick T1 or T2 arbitrarily as the first. But I agree now that even if the choice is arbitrary, it should be enforced in the API for consistency.)
3. However there can be more than one RS. Moreover, the same RS can end up in different clusters and undergo different changes. So, to achieve truly global unambiguity each changeset, in addition to seqno, should be marked with a RS History ID. Obviously seqnos from different histories are logically incomparable. Therefore RS History ID can be any globally unique identifier, with no need for < or > operations. This is the second component of global transaction ID.
Can you give an example of what an "RS History" would be? It was not 100% clear to me.
It is the sequence of changes that happens to an RS. Like UPDATE t1 WHERE...; INSERT INTO t2 VALUES...; etc. Perhaps you could hint at what is not clear about it?
I think about an RS as for example a set of tables in a set of schemas. And of an RS History as for example a binlog. So what is not clear to me is how the IDs get assigned to an RS, and to an RS History. Is it assigned by the DBA in server configuration? Is it assigned automatically by the server? Or assigned somehow by the redundancy service? Also, it is not clear to me when the RS ID and RS History ID will differ. So if I have a simple MySQL-style master->slave replication, will I have one or two replication sets? If I have one master and two slaves, which of the three servers will have the same RS ID, and which will have a different one? Likewise, if I have a two-level MySQL-style replication master->slave1->slave2, how many RS Histories will there be, and how many of those will have different/same RS History IDs?
What is not so obvious here is that since the global transaction ID is generated by the logging/replication service, it is that service that defines the order of commits, not vice versa. As a result, a transaction should first be passed to that service and only then committed. For one-way master-slave replication the order of operations is not so important. However, for multi-master it is crucial. Note that the actual replication/logging can still happen asynchronously, but the replication service must generate the transaction ID before the transaction is committed.
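In code the constraint might look like this. Everything below is a sketch assuming a replicate-then-commit API of some form; all names are invented:

    typedef struct st_thd THD;   /* opaque transaction context */
    typedef struct { unsigned char uuid[16]; unsigned long long seqno; } global_trx_id;

    extern int  redundancy_replicate(THD *, global_trx_id *out_id);
    extern int  engine_commit(THD *, const global_trx_id *);
    extern void rollback(THD *);

    /* The redundancy service assigns the global transaction ID, and with
       it the commit order, *before* the engine commit runs. The actual
       network replication may still complete asynchronously. */
    int commit_transaction(THD *thd)
    {
      global_trx_id id;
      if (redundancy_replicate(thd, &id))
      {
        rollback(thd);
        return 1;
      }
      return engine_commit(thd, &id);
    }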
I don't think that you need 2PC between redundancy service and the storage engines, because redundancy service never fails. Well, when it fails, you have something more important to worry about than disk flushes anyways.
Hm, I don't understand... What if the redundancy service is invoked, and it logs (or whatever) the changeset. Then the machine crashes before the engine(s) have time to commit. When the server comes back up, how do we avoid that the redundancy service (and hence the slaves) will have the change, but the master will not? The ability to implement reliable crash recovery is something I see requested a lot, and I think it is really important. In current MySQL replication, there is 2-phase commit between storage engine and binlog. I do not see how this can be avoided in the general case (without losing the ability to recover after crash).
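For what it's worth, the recovery rule I have in mind looks roughly like this; the names are hypothetical, but it is the same idea as the current binlog/engine XA recovery:

    #include <stdint.h>

    extern int  rs_has_logged(uint64_t xid);         /* redundancy service */
    extern void engine_commit_prepared(uint64_t xid);
    extern void engine_rollback_prepared(uint64_t xid);

    /* On restart: a transaction survives only if it reached the prepared
       state in the engine *and* was logged by the redundancy service;
       anything else is rolled back, so the two can never diverge. */
    void crash_recover(const uint64_t *prepared, int n)
    {
      for (int i = 0; i < n; i++)
        if (rs_has_logged(prepared[i]))
          engine_commit_prepared(prepared[i]);
        else
          engine_rollback_prepared(prepared[i]);
    }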
As for the order, the problem is that in general you never know in what order redundancy service will log/replicate transaction until you ask it. It is crucial for multi-master replication as you never know what will be the total order of transactions until you replicate and receive them. It is important for (semi)synchronous replication. It also makes sense in regular asynchronous master-slave as you can do replication concurrently with committing. So asking redundancy service first is good all around.
Of course, this means that there can only be one redundancy service plugin at a time, doesn't it?

But true, it is essential that storage engine(s) and redundancy service(s) commit in the same order. It is however not yet clear to me that asking the redundancy service first is sufficient for this, nor that it is necessary. So let me write up the steps to commit a transaction, and maybe the answer will be clearer to me.

It seems to me that to ensure the same order in participating engine(s) and redundancy service(s), we will need a server-global commit lock, and a call into each participant while that lock is taken:

    lock(commit_order_lock);
    redundancy_service->txn_fix_order(thd);
    engine->txn_fix_order(thd);
    unlock(commit_order_lock);

The idea is that in this call, each participant will do the minimum amount of work to ensure that the transaction `thd' will be the next transaction committed, *if* it is committed. Eg. insert the transaction data into the transaction log/binlog buffer (but not necessarily write it to disk yet).

Next, the transaction needs to be prepared in all participants, ie. two-phase commit for crash recovery. I think this can be done without a global lock:

    redundancy_service->txn_prepare(thd);
    engine->txn_prepare(thd);

In this step, each participant must ensure that the transaction is persistent (eg. fflush()). Also note that both this and the previous step may fail in either engine, because of some internal error, or because of a crash. Thus each participant must be prepared up to here to roll back the transaction, either immediately or during crash recovery after bringing the server back up. However, once this step is successfully completed for all participants, the transaction can no longer be rolled back.

The group commit facility comes in here. Since this step is outside of the global commit_order_lock, there could be multiple transactions waiting in txn_prepare(). And each participant is free to use a single fflush() to persist all of them at once. This is crucial for good performance. (For maximum flexibility, we might even want this:

    redundancy_service->txn_prepare_start(thd);
    engine->txn_prepare_start(thd);
    redundancy_service->txn_prepare_wait(thd);
    engine->txn_prepare_wait(thd);

This would allow both participants to run their fflush() or whatever in parallel, which seems to be desirable.)

(Note that participants would be free to implement all actions in txn_fix_order() and have txn_prepare_{start,wait} be empty, at the cost of losing group commit functionality. Likewise, txn_prepare_wait() could be empty; the API just leaves the option of having the split open.)

Finally, if all participants prepared their transactions without problems, they will be really committed:

    redundancy_service->txn_commit(thd);
    engine->txn_commit(thd);

(I am not sure if we need to split into _start and _wait here. I seem to recall that this step also needs to persist to disk (fflush()), but just now I do not see why it would need it.)

In the case of crash recovery, the server will ask each participant for the list of transactions that had successfully completed the txn_prepare_wait() step. For each transaction, if it completed in all participants it will be txn_commit()'ed, otherwise it will be rolled back.

So with this, is there a need to invoke the redundancy service first? If we do, it will be possible to pass the global transaction ID on to all the subsequent steps. So yes, that seems desirable. Any other reasons?
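Put together, the whole path could look like the sketch below; names are invented and error handling is much simplified:

    #include <pthread.h>

    typedef struct st_thd THD;

    extern int  rs_txn_fix_order(THD *);
    extern int  engine_txn_fix_order(THD *);
    extern int  rs_txn_prepare(THD *);
    extern int  engine_txn_prepare(THD *);
    extern void rs_txn_commit(THD *);
    extern void engine_txn_commit(THD *);
    extern void rollback_everywhere(THD *);

    static pthread_mutex_t commit_order_lock = PTHREAD_MUTEX_INITIALIZER;

    int commit(THD *thd)
    {
      int err;

      /* Step 1: fix the commit order, identically in all participants.
         Only this step needs the global lock. */
      pthread_mutex_lock(&commit_order_lock);
      err = rs_txn_fix_order(thd) || engine_txn_fix_order(thd);
      pthread_mutex_unlock(&commit_order_lock);
      if (err) { rollback_everywhere(thd); return 1; }

      /* Step 2: prepare (make durable). Runs outside the lock, so many
         transactions can wait here and share one fflush(): group commit. */
      if (rs_txn_prepare(thd) || engine_txn_prepare(thd))
      { rollback_everywhere(thd); return 1; }

      /* Step 3: point of no return. Crash recovery commits anything that
         prepared successfully in *all* participants. */
      rs_txn_commit(thd);
      engine_txn_commit(thd);
      return 0;
    }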
Another point: it seems to me that requiring that no gaps can be introduced in the global transaction ID sequence will be limiting in the general case. I do not see how to make group commit possible under such a requirement (following the above steps). So, any comments/objections to the above sketch of the commit part of an API?
(I am also thinking that generating the global transaction ID at commit time is too late in the general case. The current MySQL replication buffers everything until commit time, but I would like to see us support also logging events during transactions. And this will require an ID generated at transaction start. But maybe this is a different ID than "global transaction ID", which will still be needed to know commit order of course.)
This is exactly the problem we're facing now with support for LOAD DATA (and other potentially huge transactions). There clearly is the need for two IDs in that case. But as you have noticed, the semantics of these two IDs is rather different. Global transaction ID we've been talking about so far needs to be linearly ordered (according to my understanding at least) and does not need to have a server_id, while the ID generated at transaction start does not have to be linearly ordered and SHOULD have a server_id as part of it (to ensure uniqueness, manageability and simplify debugging).
Yes, I agree, these are different. Basically, at transaction start one will probably make a local transaction ID, and tag all events with this ID to know they are from the same transaction (changeset). But for the final commit event, one needs to assign the global transaction ID. The local IDs just need to be unique, the global ID needs to be totally ordered. I think I was confusing these two a bit previously.
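Spelled out as data structures, the distinction might be the following (purely illustrative; nothing here is a settled format):

    #include <stdint.h>

    /* Assigned at transaction start; stamped on every event of the
       changeset. Only has to be unique, hence the server_id. */
    typedef struct {
      uint32_t server_id;
      uint64_t local_trx_no;      /* unique per server, no order implied */
    } local_trx_id;

    /* Assigned by the redundancy service at commit; defines the total
       commit order within one RS History. */
    typedef struct {
      unsigned char history_uuid[16];
      uint64_t      seqno;
    } global_commit_id;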
I think "a cluster that outwards presents a consistent transactional view, yet internally does not have a total ordering on transactions" is an internally contradictory concept. Suppose node1 committed T1, but not T2 yet, and node2 committed T2, but not T1 yet. Now do database dumps from those nodes. Which one will represent a consistent transactional view?
I am thinking about shared-nothing storage engines like NDB, where tables are partitioned, with partitions located on different machines. If T1 and T2 both involve node1 and node2, they will need to be linearly ordered. However, if T1 involves only node1, and T2 involves only node2, and there is no communication between node1 and node2 while T1 and T2 run, then there is no way to distinguish whether T1 or T2 committed first. So we can use (engine-local) transaction IDs (T1, node1) and (T2, node2). And if T1==T2, then that would mean that no outside entity has evidence as to whether T1 or T2 was committed first, so we do not actually need a defined ordering in this case. On the other hand, if such evidence could exist, then we will need T1<T2 (or T2<T1). So in this case we avoid having to do extra communication across all nodes just to synchronise the "next transaction ID" counter when it is not needed.

... But I think you are right that in practice one would impose a total order anyway, to get a consistent view. So if T1==T2, one would arbitrarily define that T1 came before T2, since node1<node2. And maybe this is in any case rather academic.

 - Kristian.