On Wed, 17 Mar 2010 10:48:50 +0100, Kristian Nielsen <knielsen@knielsen-hq.org> wrote:
Can you give an example of what an "RS History" would be? It was not 100% clear to me.
It is the sequence of changes that happens to the RS. Like UPDATE t1 WHERE ...; INSERT INTO t2 VALUES ...; etc. Perhaps you could hint at what is not clear about it?
I think about an RS as for example a set of tables in a set of schemas. And of an RS History as for example a binlog.
So what is not clear to me is how the IDs get assigned to an RS, and to an RS History. Is it assigned by the DBA in server configuration? Is it assigned automatically by the server? Or assigned somehow by the redundancy service?
Also, it is not clear to me when the RS ID and RS History ID will differ.
Ok, that's a good question. The way I see it is:

1) RS ID. Obviously the RS is an object and as such needs some way to refer to it. Who decides what goes into the RS and that we even need one? The redundancy service user. So he creates it and he names it. This ID has to be unique only to the extent that there'd be no confusion between two RS's should they meet in the same context. Say, it can be a human-readable name. In trivial cases, when the RS is just a whole database, like in our wsrep API, it could be a cluster name. Galera nodes refuse to connect to nodes with a different cluster name. In the general case, when there could be several RS's on one physical box (e.g. one for replication and one for binlogging), we'll have to introduce a concept of a logical cluster, i.e. all processes which participate in replication of a given RS.

2) RS History ID. Suppose you have a cluster that replicates 'RS1', and currently it is at T12345. A node joins the cluster and says: "I have 'RS1' at T12344". What we can have here:

- the node was previously a member of this cluster and went down for maintenance; meanwhile the cluster committed one more trx, and all that is left is to send T12345 to this node.
- the node was never a member of this cluster, and the 'RS1' name stands there by operator mistake. This node needs a total state snapshot.
- the node was previously a member of this cluster, lost connectivity to it at T12000, but went on accepting local queries with hopes for subsequent conflict resolution (i.e. effectively spawned a new cluster of its own). There needs to be a conflict resolution phase (hopefully this redundancy plugin supports it).
- something else happened.

So in all cases but the first one we need to be able to tell that T12344 is not really comparable to T12345, because the state of that node does not really correspond to the state of the cluster at T12344: they underwent different sequences of changes. So here comes the RS History ID. It has to be globally unique and is generated by the redundancy service (because it is the one who knows what's going on), just like the global transaction sequence number. E.g. in Galera it is a time-based UUID.

So the RS ID is just an RS name. The RS History ID is a part of a global transaction ID. There does not have to be a 1-to-1 relation between the two.

So if I have a simple MySQL-style master->slave replication, will I have one or two replication sets? If I have one master and two slaves, which of the three servers will have the same RS ID, and which will have a different one?
All of them will have the same RS ID, as they participate in replication of the same RS. Moreover, they'll have the same RS History ID, as they obviously have the same history of RS changes, which is defined single-handedly by the master.
Likewise, if I have a two-level MySQL-style replication master->slave1->slave2, how many RS Histories will there be, and how many of those will have different/same RS History IDs?
Same as above. In this example - one.
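To make the distinction between the two IDs concrete, here is a minimal sketch (mine, not part of any proposed API; all names are illustrative) of what a global transaction ID could look like under this scheme:

    #include <string.h>

    /* RS ID: human-chosen name, unique only within its context. */
    typedef char rs_id[64];                /* e.g. "RS1" or a cluster name */

    /* RS History ID: globally unique, generated by the redundancy
       service; in Galera it is a time-based UUID. */
    typedef unsigned char rs_history_id[16];

    /* Global transaction ID: which history, and the position within it. */
    struct global_trx_id {
        rs_history_id      history;
        unsigned long long seqno;          /* e.g. 12345 */
    };

    /* Seqnos are comparable only within the same history: a T12344 from
       a different history says nothing about our T12345. */
    int same_history(const struct global_trx_id *a,
                     const struct global_trx_id *b)
    {
        return memcmp(a->history, b->history, sizeof(rs_history_id)) == 0;
    }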
I don't think that you need 2PC between the redundancy service and the storage engines, because the redundancy service never fails. Well, when it fails, you have something more important to worry about than disk flushes anyway.
Hm, I don't understand...
What if the redundancy service is invoked, it logs (or whatever) the changeset. Then the machine crashes before the engine(s) have time to commit. When the server comes back up, how do we avoid that the redundancy service, (and hence the slaves) will have the change, but the master will not?
The ability to implement reliable crash-recovery is something I see requested a lot, and I think it is really important.
Well, this is what the redundancy service is for: when you crash, you can recover, can't you? My understanding is that you do want the redundancy service to have the change, so that you can take it from there and commit the uncommitted. This is how Galera works, and I believe this is how semi-sync in MySQL works (there's hardly any other way for synchronous replication to work). In a sense you can see the redundancy service as a glorified transaction journal.
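For illustration, a rough sketch of that recovery idea (this is my reading, with invented types and function names, not the wsrep API): the engine remembers the last seqno it durably committed, and everything after that is replayed from the redundancy service's journal:

    /* Hypothetical interfaces, invented for illustration only. */
    typedef struct changeset changeset_t;
    typedef struct {
        unsigned long long (*last_committed_seqno)(void);
        void               (*apply)(const changeset_t *cs);
    } engine_t;
    typedef struct {
        unsigned long long (*last_logged_seqno)(void);
        const changeset_t *(*fetch_changeset)(unsigned long long seqno);
    } rs_t;

    /* After an unclean shutdown: replay what the redundancy service
       has but the engine lacks, exactly as a slave would apply it. */
    void recover(const engine_t *engine, const rs_t *rs)
    {
        unsigned long long committed = engine->last_committed_seqno();
        unsigned long long logged    = rs->last_logged_seqno();

        for (unsigned long long s = committed + 1; s <= logged; s++)
            engine->apply(rs->fetch_changeset(s));
    }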
In current MySQL replication, there is 2-phase commit between the storage engine and the binlog. I do not see how this can be avoided in the general case (without losing the ability to recover after a crash).
As for the order, the problem is that in general you never know in what order the redundancy service will log/replicate a transaction until you ask it. It is crucial for multi-master replication, as you never know what the total order of transactions will be until you replicate and receive them. It is important for (semi)synchronous replication. It also makes sense in regular asynchronous master-slave, as you can do replication concurrently with committing. So asking the redundancy service first is good all around.
Of course, this means that there can only be one redundancy service plugin at a time, doesn't it?

I might be not qualified enough to comment on this particular solution as I don't know that code; perhaps other developers can. AFAIK, the binlog in MySQL is treated as some kind of a storage engine, and perhaps this is why 2PC is used, but it's a big mistake. The binlog is not a storage engine (yes, they both write to disk, but that's where the similarities end). It needs a different interface, exactly what we're discussing right now. My thinking is that while the engines need to have a guarantee that the binlog write was successful, the binlog (or any other redundancy service) does not need the same guarantee from the engines. So instead of full-blown 2PC you need a simple if()/else clause:

    if (redundancy_service->pre_commit() == SUCCESS)
    {
        /* commit */
    }
    else
    {
        /* rollback */
    }

More on that below.
It sounds like that in the general case. But realistically, there'd be only a few plugins which have to decide on the global transaction ID; the others can be told. E.g. unlike a multi-master replication plugin, the binlog plugin can easily log events in the order it is told to. So you can effectively stack redundancy plugins, putting the one which decides on the global transaction ID on top. The others just follow the defined order. So you can have the following stack:

    multi-master synchronous plugin (for HA)
                     |
    master-slave asynchronous plugin (for fanning out)
                     |
    binlog plugin (for local recovery after crash)
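A minimal sketch of how such a stack could be wired (invented names, not a proposed interface): the topmost plugin decides the seqno, and each plugin below just follows the order it is handed:

    /* Each plugin either decides the commit order or follows it. */
    struct rdn_plugin {
        /* Returns the seqno this trx got; a follower simply honors
           seqno_in and returns it unchanged. */
        unsigned long long (*order)(struct rdn_plugin *self, void *trx,
                                    unsigned long long seqno_in);
        struct rdn_plugin *next;     /* plugin below in the stack */
    };

    unsigned long long stack_order(struct rdn_plugin *top, void *trx)
    {
        unsigned long long seqno = 0;           /* 0: not yet decided */
        for (struct rdn_plugin *p = top; p != NULL; p = p->next)
            seqno = p->order(p, trx, seqno);    /* top decides, rest follow */
        return seqno;
    }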
But true, it is essential that storage engine(s) and redundancy service(s) commit in the same order. It is however not yet clear to me that asking redundancy service first is sufficient for this, nor that it is necessary.
So let me write up the steps to commit a transaction, and maybe the answer will be clearer to me:
It seems to me that to ensure same order in participating engine(s) and redundancy service(s), we will need a server-global commit lock, and a call into each participant while that lock is taken:
    lock(commit_order_lock);
    redundancy_service->txn_fix_order(thd);
    engine->txn_fix_order(thd);
    unlock(commit_order_lock);
The idea is that in this call, each participant will do the minimum amount of work to ensure that the transaction `thd' will be the next transaction committed, *if* it is committed. E.g. insert the transaction data into the transaction log/binlog buffer (but not necessarily write it to disk yet).
Next, the transaction needs to be prepared in all participants, i.e. two-phase commit for crash recovery. I think this can be done without a global lock:
    redundancy_service->txn_prepare(thd);
    engine->txn_prepare(thd);
In this step, each participant must ensure that the transaction is persistent (e.g. fflush()). Also note that both this and the previous step may fail in either engine, because of some internal error, or because of a crash. Thus each participant must be prepared up to here to roll back the transaction, either immediately or during crash recovery after bringing the server back up.
However, once this step is successfully completed for all participants, the transaction can no longer be rolled back.
The group commit facility comes in here. Since this step is outside of the global lock commit_order_lock, there could be multiple transactions waiting in txn_prepare(). And each participant is free to use a single fflush() to persist all of them at once. This is crucial for good performance.
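To illustrate that last point, a participant's prepare step might batch flushes like this (my sketch, invented names; flush_log_to_disk() is assumed to write and sync the buffered log, it is not a real server function):

    #include <pthread.h>

    void flush_log_to_disk(void);   /* assumed: write + fsync the buffer */

    static pthread_mutex_t    prepare_mutex = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t     flushed_cond  = PTHREAD_COND_INITIALIZER;
    static int                flush_in_progress = 0;
    static unsigned long long last_queued  = 0;  /* highest seqno buffered */
    static unsigned long long last_flushed = 0;  /* highest seqno durable  */

    /* Called by each committing trx after its data is in the buffer. */
    void txn_prepare_flush(unsigned long long my_seqno)
    {
        pthread_mutex_lock(&prepare_mutex);
        if (my_seqno > last_queued)
            last_queued = my_seqno;

        while (last_flushed < my_seqno)
        {
            if (!flush_in_progress)
            {
                /* Become the leader: flush everything queued so far. */
                unsigned long long batch = last_queued;
                flush_in_progress = 1;
                pthread_mutex_unlock(&prepare_mutex);
                flush_log_to_disk();          /* one flush, many trxs */
                pthread_mutex_lock(&prepare_mutex);
                last_flushed = batch;
                flush_in_progress = 0;
                pthread_cond_broadcast(&flushed_cond);
            }
            else
            {
                /* Follower: wait for a leader's flush to cover us. */
                pthread_cond_wait(&flushed_cond, &prepare_mutex);
            }
        }
        pthread_mutex_unlock(&prepare_mutex);
    }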
(For maximum flexibility, we might even want this:

    redundancy_service->txn_prepare_start(thd);
    engine->txn_prepare_start(thd);

    redundancy_service->txn_prepare_wait(thd);
    engine->txn_prepare_wait(thd);

This would allow both participants to run their fflush() or whatever in parallel, which seems to be desirable.)
(Note that participants would be free to implement all actions in txn_fix_order() and have txn_prepare_{start,wait} be empty, at the cost of losing group commit functionality. Likewise, txn_prepare_wait() could be empty; the API just leaves the option of having the split open.)
Finally, if all participants prepared their transactions without problems, they will be really committed:

    redundancy_service->txn_commit(thd);
    engine->txn_commit(thd);
(I am not sure if we need to split into _start and _wait here. I seem to recall that this step also needs to persist to disk (fflush()), but just now I do not see why it would need it).
In the case of crash recovery, the server will ask each participant for the list of transactions that had successfully completed the txn_prepare_wait() step. For each transaction, if it completed in all participants it will be txn_commit()'ed, otherwise it will be rolled back.
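Spelled out as code, that recovery pass could look like this (a sketch with invented names, not a proposed API; the "participants" are the engine(s) plus the redundancy service(s)):

    typedef struct participant participant_t;
    typedef unsigned long long trx_id_t;

    /* Hypothetical per-participant recovery hooks. */
    int  was_prepared(const participant_t *p, trx_id_t trx);
    void txn_commit_recovered(participant_t *p, trx_id_t trx);
    void txn_rollback_recovered(participant_t *p, trx_id_t trx);

    void recover_prepared(participant_t **part, int n_part,
                          const trx_id_t *trx, int n_trx)
    {
        for (int t = 0; t < n_trx; t++)
        {
            /* Commit only if txn_prepare_wait() completed everywhere. */
            int everywhere = 1;
            for (int p = 0; p < n_part; p++)
                if (!was_prepared(part[p], trx[t]))
                    everywhere = 0;

            for (int p = 0; p < n_part; p++)
                if (everywhere)
                    txn_commit_recovered(part[p], trx[t]);
                else
                    txn_rollback_recovered(part[p], trx[t]);
        }
    }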
So with this, is there a need to invoke the redundancy service first? If we do, it will be possible to pass the global transaction ID on to all the subsequent steps. So yes, that seems desirable. Any other reasons?

As far as I understand, any (semi)synchronous replication requires that.
Another point: It seems to me that in the general case, it will limit us to requiring that no gaps in the global transaction IDs can be introduced. I do not see how to make group commit possible in this case (following the above steps)?
So, any comments/objections to the above sketch of the commit part of an API?
Hm. I'm not well versed in MySQL engine interfaces, so I might be missing something here. But why can't we have something like this instead:

    engine->txn_prepare(thd);

    if (redundancy_service->pre_commit(thd) == SUCCESS)
    {
        engine->txn_commit(thd);
        redundancy_service->post_commit(thd);
    }
    else
    {
        engine->txn_rollback(thd);
    }

The pre_commit() call locks commit_order_lock in the right order. post_commit() releases commit_order_lock and the resources that might be allocated by the transaction in the redundancy service. Or, if txn_commit() is too heavy an operation, then maybe like this:

    engine->txn_prepare(thd);

    if (redundancy_service->pre_commit(thd) == SUCCESS)
    {
        engine->txn_fix_order(thd);
        redundancy_service->post_commit(thd);
        engine->txn_commit(thd);
    }
    else
    {
        engine->txn_rollback(thd);
    }
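If I spell out that contract, pre_commit()/post_commit() bracket a critical section that pins the commit order. A minimal sketch of their internals (my invention for illustration, not actual Galera code; THD, SUCCESS/FAILURE, replicate_and_certify() and release_trx_resources() are stand-ins):

    #include <pthread.h>

    static pthread_mutex_t commit_order_lock = PTHREAD_MUTEX_INITIALIZER;

    int pre_commit(THD *thd)
    {
        /* Replicate the changeset; the service may also veto the trx
           (e.g. certification failure in multi-master). */
        if (replicate_and_certify(thd) != SUCCESS)
            return FAILURE;

        /* Return holding the lock, so engine commits happen in the
           order the service decided.  A real implementation would
           admit waiters strictly in seqno order (e.g. a ticket lock)
           rather than rely on plain mutex fairness. */
        pthread_mutex_lock(&commit_order_lock);
        return SUCCESS;
    }

    void post_commit(THD *thd)
    {
        pthread_mutex_unlock(&commit_order_lock);
        release_trx_resources(thd);   /* changeset buffers etc. */
    }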
I think "a cluster that outwards presents a consistent transactional view, yet internally does not have a total ordering on transactions" is an internally contradictory concept. Suppose node1 committed T1, but not
yet, and node2 committed T2, but not T1 yet. Now do database dumps from those nodes. Which one will represent a consistent transactional view?
I am thinking about shared-nothing storage engines like NDB, where
are partitioned, with partitions being located on different machines.
If T1 and T2 both involve node1 and node2, they will need to be linearly ordered.
However, if T1 involves only node1, and T2 involves only node2, and
is no communication between node1 and node2 while T1 and T2 runs, then
T2 tables there there
is no way to distinguish whether T1 or T2 committed first.
So we can use (engine-local) transaction IDs (T1, node1) and (T2, node2). And if T1==T2, then that would mean that no outside entity has evidence as to whether T1 or T2 was committed first, so we do not actually need a defined ordering in this case. On the other hand, if such evidence could exist, then we will need T1<T2 (or T2<T1).

So in this case we avoid having to do extra communication across all nodes just to synchronise the "next transaction ID" counter if it is not needed.
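Made concrete (my sketch, invented names), such engine-local IDs compare as a partial order:

    /* Engine-local transaction ID: (seqno, node). */
    struct local_trx_id {
        unsigned long long seqno;
        unsigned int       node;
    };

    /* -1: a ordered before b; 1: b ordered before a;
        0: equal seqnos, i.e. no outside evidence of an ordering.
       A total order (as suggested below) would break the tie by
       node number instead of returning 0. */
    int local_trx_cmp(const struct local_trx_id *a,
                      const struct local_trx_id *b)
    {
        if (a->seqno != b->seqno)
            return a->seqno < b->seqno ? -1 : 1;
        return 0;
    }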
Oh, yes, this is a corner case which is not easily described by the suggested model. But I guess it is highly unconventional and does not really qualify as redundancy or replication in the defined sense. My understanding is that one model never fits all. If we need something like NDB, it is better to develop a separate model and API for that, rather than try to fit everything in one.
... But I think you are right that in practice one would impose a total order anyway, to get a consistent view. So if T1==T2, one would arbitrarily define that T1 came before T2, since node1<node2. And maybe this is in any case rather academic.
- Kristian.
Regards,
Alex
--
Alexey Yurchenko, Codership Oy, www.codership.com
Skype: alexey.yurchenko, Phone: +358-400-516-011