On Mon, 15 Mar 2010 12:29:14 +0100, Kristian Nielsen <knielsen@knielsen-hq.org> wrote:
2) It correctly identifies that (part of) the ID should be a monotonic ordinal number.
Ok. But should it increase in order of transaction start? Or in order of transaction commit?
I think this is easy: until a transaction is committed, it does not exist. Even with MyISAM a certain time passes from query parsing to actually modifying table data so that the changes become visible to others. So this moment, when the changes become visible to others, defines the order of database modifications. Transaction start order is hardly of any interest to anybody except the very curious.
Or maybe your point is that it should be monotonic, but that the actual order of changesets is an implementation detail of the particular replication system? I'd really like to hear your thoughts on this.
3) It correctly identifies that the global transaction ID is generated by redundancy service - in that case "MYSQL_LOG::write(Log_event *) (sql/log.cc, line 1708)"
Right. So the _concept_ of global transaction ID needs to be part of the framework. But the actual _implementation_ of that global transaction ID is an implementation detail of the redundancy service. Correct?
Point 3) actually answers the questions above, but perhaps not in an obvious manner. There are two things:

- Global transaction ID format and comparison operations are defined by the API. I elaborated on this in the previous e-mail. But the values are filled in by the redundancy service. That's one of its "services".

- Commit order should follow the global transaction ID order. So the ORDER of commit operations is effectively decided by the redundancy service. And here maybe lies the biggest controversy of all.
What's bad about it:
2) It fails to address multi-master: (server_id, group_id) is not going to work - such pairs cannot be linearly ordered and, therefore, compared. And from the perspective of the node that needs to apply the changeset - does server_id really matter? It may be good for debugging, but it can't be a part of a global transaction ID.
I don't understand why it would not work for multi-master. Can you elaborate?
Suppose you have node1, node2 and node3. Node1 and node2 are acting as masters. So node1 sends (node1, 11), (node1, 12). Node2 sends (node2, 2), (node2, 3). The partial order of commits is well established, but node3 still won't be able to tell in what overall order it should apply those transactions. Does (node1, 11) go before (node2, 3)? Why?
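The incomparability Alexey describes can be sketched as a comparison function over (server_id, seqno) pairs. This is a toy illustration with invented names, not proposed code: pairs from different masters simply have no defined order.

```python
# Toy sketch: why (server_id, seqno) pairs form only a partial order.

def compare(a, b):
    """Return -1, 0 or 1 for comparable IDs, None when incomparable."""
    (srv_a, seq_a), (srv_b, seq_b) = a, b
    if srv_a != srv_b:
        return None  # no globally defined order between different masters
    return (seq_a > seq_b) - (seq_a < seq_b)

# node3 can order changesets coming from a single master...
assert compare(("node1", 11), ("node1", 12)) == -1
# ...but has no answer for "does (node1, 11) go before (node2, 3)?"
assert compare(("node1", 11), ("node2", 3)) is None
```

A slave that must produce one definite sequence of states has to break such ties somehow, which is exactly the point of contention below.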
Especially given that you later suggest yourself:
One possible implementation for that can be (UUID, long long) pair.
How is this different from (server_id, group_id)? (I'd like to understand).
It is different in that the UUID in that proposal is a replication sequence UUID. It is one per cluster. All events generated by cluster nodes have the same UUID. And it serves to distinguish replication events from different clusters, explicitly making them incomparable (the same way that (server_id, group_id) pairs with different server_id's are incomparable). So you can compare only transaction IDs with the same UUID.

My thoughts on multi-master are that there are multiple scenarios, depending on how tightly coupled the nodes of the cluster are with respect to transactions:

1. A tightly-coupled cluster that maintains a cluster-wide serialised transaction concept. NDB replication is like this (for epochs, which are based on NDB global checkpoint IDs AFAIR). In this case the entire cluster effectively is a single server wrt. replication, so it's not really multi-master for this discussion.

2. One can imagine a cluster that outwards presents a consistent transactional view, yet internally does not have a total ordering on transactions. Transaction T1 only needs to be ordered after T2 if it can be proven (in terms of communication within the system) that T1 committed before T2. If there is no evidence one way or the other of the actual order of T1 and T2 (eg. they ran in parallel against disjoint data), there is no a priori reason they should have to be linearly ordered. If on the other hand there is such evidence (eg. T2 waited for a lock released by T1's commit), then there would have to be such a linear ordering. I think NDB individual transactions are a bit like this. So this would make transaction IDs only partially ordered (but any arbitrary extension to a total order would be valid).
I think "a cluster that outwards presents a consistent transactional view, yet internally does not have a total ordering on transactions" is an internally contradictory concept. Suppose node1 has committed T1 but not T2 yet, and node2 has committed T2 but not T1 yet. Now take database dumps from those nodes. Which one represents a consistent transactional view?

I guess you're trying to make a case for parallel transaction applying on slaves here. You can do parallel, out-of-order applying alright. It is just that when you have committed T2 but not T1 yet, the node does not have a T2 state. And it does not have a T0 state. It has an undefined state. Only after T1 is committed does it suddenly become T2. There can be several ways to deal with this. For example, you apply T2 but you don't commit it, and after you have applied T1 you commit both of them in one go.

The point is that if you want _replication_, you've got to have a linear order. Without it you can't compare db states and therefore can't say that this db is a replica of that db. What you end up with is so-called "eventual consistency", and "eventual" may never happen.
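The "apply T2 in parallel, but commit only after T1" idea can be sketched as a commit gate. This is a toy model with invented names (not actual Galera or MySQL code): workers apply changesets concurrently, but commits are released strictly in seqno order, so the node never exposes a "T2 but not T1" state.

```python
# Toy model: parallel apply, strictly in-order commit.
import threading

class CommitGate:
    """Blocks each applier thread until all lower seqnos have committed."""

    def __init__(self):
        self.next_seqno = 1
        self.cond = threading.Condition()

    def commit_in_order(self, seqno, do_commit):
        with self.cond:
            # T2's worker waits here until T1 has committed.
            while seqno != self.next_seqno:
                self.cond.wait()
            do_commit()
            self.next_seqno += 1
            self.cond.notify_all()
```

Appliers may finish their (parallel) apply work in any order; the visible commit sequence is still the linear seqno order.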
3. Uncoupled multiple masters, like multiple nodes in MySQL asynchronous replication. In this case there really is no total linear order, as the transactions are not coordinated between servers. At most there would be an approximate order like a time stamp, but that would be not 100% reliable due to time skew and such.
That's what the concept of Replication Set was about below. In the scenario described above you effectively have multiple replication sets and multiple logical clusters. So each one of them will have its own global transaction ID sequence. If they are so much uncoupled there is no point in considering them a single cluster.
So I don't understand what issue you had in mind for multi-master replication here?
I'll take this opportunity to put forth some theory behind the global transaction IDs as we see it at Codership.
1. We have an abstract set of data subject to replication/logging. It can be a whole database, a schema, a table, a row. Let's call it a Replication Set (RS).
2. The RS is undergoing changes in time which can be represented as a series of atomic changes. Let's call it RS History. That it is a _series_ is trivial but important: otherwise we can't reproduce historical RS state evolution. Each RS change is represented by a changeset. Since it is a series, RS changesets can be enumerated with a sequence of natural numbers without gaps within a given RS History. Here comes the first component of a global transaction ID: the sequence number (seqno). What this gives us is an ability to unambiguously identify a database state by a global transaction ID.

Right, the series _can_ be linearly ordered. But does it have to be? Suppose we enforce an artificial linear order among some changesets that have no a priori data dependencies between them (either order of applying them would be correct). It seems to me that this needlessly discards an opportunity for parallelism. Given that lack of parallelism (on the slave end) is arguably the biggest performance bottleneck of current MySQL replication, this is something to avoid if possible.
I guess I have addressed this above already. In short, parallel applying does not need to mean out-of-order commit.
Isn't it enough that changesets are partially ordered, as long as that partial order encodes all dependencies that need to be respected?
In other words, is it necessary to enforce such total order at the level of the API, instead of just introducing it at the point where it becomes needed, and if so, why?
There are at least 2 problems with not having linearly ordered commits:

1) Database states are incomparable. It is impossible (ok, impractically hard) to tell if two databases have identical contents and to determine which transactions are missing. How will you join a node to a cluster in this case?

2) Causality preservation. Suppose you commit something on the master and now want to select the results from a slave. How will you tell that the slave has already committed those results?

I don't exactly understand how you separate the "level of the API" and "the point where it becomes needed". Do you mean code separation, or making linear ordering an optional feature of the redundancy service? In any case, I think we all understand quite well what database replication IS and what the redundancy guarantees are when commits are linearly ordered. I'd like to hear a more detailed concept of redundancy in the absence of linearly ordered commits. What is redundancy in that case, what are the guarantees? How will recovery be performed, and so on.

3. However, there can be more than one RS. Moreover, the same RS can end up in different clusters and undergo different changes. So, to achieve truly global unambiguity, each changeset, in addition to a seqno, should be marked with an RS History ID. Obviously seqnos from different histories are logically incomparable. Therefore the RS History ID can be any globally unique identifier, with no need for < or > operations. This is the second component of the global transaction ID.
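The causality-preservation problem mentioned above is commonly solved with a "wait for GTID" primitive on the slave side. A minimal sketch under assumed names (this is not a real MySQL API): a client that committed up to some seqno on the master blocks until the slave has committed at least that seqno in the same RS History.

```python
# Sketch: slave-side wait until a given (RS History ID, seqno) is committed.
import threading

class SlaveGTIDState:
    def __init__(self, history_id):
        self.history_id = history_id
        self.committed_seqno = 0  # highest seqno committed locally
        self.cond = threading.Condition()

    def on_commit(self, seqno):
        """Called by the applier after each in-order commit."""
        with self.cond:
            self.committed_seqno = seqno
            self.cond.notify_all()

    def wait_for(self, history_id, seqno, timeout=None):
        # Seqnos from different RS Histories are incomparable by design.
        if history_id != self.history_id:
            raise ValueError("GTID belongs to a different RS History")
        with self.cond:
            return self.cond.wait_for(
                lambda: self.committed_seqno >= seqno, timeout)
```

Note that this only works because commits are linearly ordered: "the slave has reached seqno N" is meaningful only if N summarizes all earlier transactions.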
Can you give an example of what an "RS History" would be? It was not 100% clear to me.
It is the sequence of changes that happens to RS. Like UPDATE t1 WHERE...; INSERT INTO t2 VALUES...; etc. Perhaps you could hint at what is not clear about it?
What is not so obvious here is that since the global transaction ID is generated by the logging/replication service, it is that service that defines the order of commits, not vice versa. As a result, a transaction should first be passed to that service and only then committed. For one-way master-slave replication the order of operations is not so important. However, for multi-master it is crucial. Note that the actual replication/logging can still happen asynchronously, but the replication service must generate the transaction ID before the transaction is committed.
Right. But is this really possible? After all, the commit may fail, so we don't know before if we will actually commit or rollback... we probably don't even know the actual order of commits in parallel threads of execution until after actual commit, unless we take a global lock around (global transaction id, commit); and then it sounds like we get into the issue with broken group commit again?
Why, it is possible alright. Galera replication works this way. There always is a point after which commit just can't fail. When it's been decided that it is committed.
So we need two-phase commit here between the replication service and the storage engines, right? And in this case, it shouldn't matter which comes first, commit or replication service, does it?
We need to be very careful with the interface here, two-phase commit can be really expensive needing multiple disk flushes for every transaction to ensure reliable crash recovery. And we really do not want to re-do the mistake of breaking group commit that we had for years after two-phase commit was introduced for the MySQL binlog.
I don't think that you need 2PC between the redundancy service and the storage engines, because the redundancy service never fails. Well, when it does fail, you have something more important to worry about than disk flushes anyway. As for the order, the problem is that in general you never know in what order the redundancy service will log/replicate a transaction until you ask it. This is crucial for multi-master replication, as you never know what the total order of transactions will be until you replicate and receive them. It is important for (semi)synchronous replication. It also makes sense in regular asynchronous master-slave, as you can do replication concurrently with committing. So asking the redundancy service first is good all around.
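The "ask the redundancy service first" commit path might look like this in toy form. All names here are invented for illustration; in reality the replication itself can be asynchronous, but the GTID assignment, and with it the commit order, happens before the engine commit.

```python
# Toy redundancy service: it assigns the GTID and thereby the commit order.
import itertools
import uuid

class RedundancyService:
    def __init__(self):
        self.history_id = str(uuid.uuid4())  # one RS History per cluster
        self._seqno = itertools.count(1)
        self.log = []

    def replicate(self, changeset):
        """Log/replicate the changeset and return its GTID. Past this
        point the transaction is decided: it can no longer fail."""
        gtid = (self.history_id, next(self._seqno))
        self.log.append((gtid, changeset))
        return gtid

def commit_transaction(service, engine_commit, changeset):
    gtid = service.replicate(changeset)  # 1. ask the redundancy service first
    engine_commit(changeset)             # 2. then commit, in GTID order
    return gtid
```

The point of the ordering is visible in the two numbered steps: the engine commit happens only after the service has fixed the transaction's place in the global sequence.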
(I am also thinking that generating the global transaction ID at commit time is too late in the general case. The current MySQL replication buffers everything until commit time, but I would like to see us support also logging events during transactions. And this will require an ID generated at transaction start. But maybe this is a different ID than "global transaction ID", which will still be needed to know commit order of course.)
This is exactly the problem we're facing now with support for LOAD DATA (and other potentially huge transactions). There clearly is a need for two IDs in that case. But as you have noticed, the semantics of these two IDs are rather different. The global transaction ID we've been talking about so far needs to be linearly ordered (according to my understanding at least) and does not need to have a server_id, while the ID generated at transaction start does not have to be linearly ordered and SHOULD have a server_id as part of it (to ensure uniqueness and manageability, and to simplify debugging).

Perhaps it is a good time to settle on terminology. Besides, 'global transaction ID' is just prohibitively long for intensive discussions and variable names. Does anyone have better suggestions on what to call these IDs?
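The contrast between the two IDs could be sketched like this (illustrative only; names and structure are assumptions, not a proposed API):

```python
# Sketch: start-time transaction ID vs commit-time global transaction ID.
import itertools
from collections import defaultdict

# Start-time ID: unique without any coordination (hence the server_id
# part), but carries no ordering across servers.
_local_counters = defaultdict(lambda: itertools.count(1))

def start_id(server_id):
    return (server_id, next(_local_counters[server_id]))

# Commit-time GTID: assigned by the redundancy service, linearly ordered
# within one RS History; no server_id needed.
_commit_seq = itertools.count(1)

def commit_gtid(history_id):
    return (history_id, next(_commit_seq))
```

A LOAD DATA transaction would stream its events under its start-time ID, and only at commit acquire the GTID that fixes its place in the history.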
- Kristian.
Regards, Alex
--
Alexey Yurchenko, Codership Oy, www.codership.com
Skype: alexey.yurchenko, Phone: +358-400-516-011