Alex Yurchenko <alexey.yurchenko@codership.com> writes:
The global transaction ID is a cornerstone concept of any replication system which aspires to be pluggable, extensible and to go beyond basic master-slave. It is hardly possible to even start designing the rest of the API without first settling on the global transaction ID. This is one of the
Some very interesting thoughts, thanks! (and sorry for the delay in continuing the discussion). Some questions and comments below.
What's good about http://code.google.com/p/google-mysql-tools/wiki/GlobalTransactionIds:
1) It introduces the concept of atomic database changesets. (Which, ironically, it calls "groups" due to its dreaded binlog heritage)
Right. So this is what I think of as "transaction", though it need not be (eg. NDB replication uses "epoch", which is a collection of multiple transactions replicated as a whole).
2) It correctly identifies that (part of) the ID should be a monotonic ordinal number.
Ok. But should it increase in order of transaction start? Or in order of transaction commit? Or maybe your point is that it should be monotonic, but that the actual order of changesets is an implementation detail of the particular replication system? I'd really like to hear your thoughts on this.
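To make the commit-order alternative concrete, here is a minimal sketch (all names hypothetical, not from any actual server code) of a redundancy service that assigns the seqno at commit time under a lock, so that ID order and commit order can never diverge:

```python
import itertools
import threading

class CommitOrderIdGenerator:
    """Hypothetical sketch: the redundancy service assigns a monotonic
    sequence number at commit time, so ID order equals commit order."""

    def __init__(self):
        self._lock = threading.Lock()
        self._counter = itertools.count(1)  # seqnos start at 1, no gaps
        self._log = []  # committed changesets, in seqno order

    def commit(self, changeset):
        # The lock makes (assign seqno, append to log) atomic, so the
        # log order and the seqno order cannot diverge.
        with self._lock:
            seqno = next(self._counter)
            self._log.append((seqno, changeset))
            return seqno

gen = CommitOrderIdGenerator()
a = gen.commit("trx that started last but committed first")
b = gen.commit("trx that started first but committed last")
```

Note the trade-off this sketch makes visible: ordering by commit requires serialising the ID assignment with the commit itself, which is exactly the locking question raised further down in the thread.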
3) It correctly identifies that the global transaction ID is generated by the redundancy service - in that case "MYSQL_LOG::write(Log_event *) (sql/log.cc, line 1708)"
Right. So the _concept_ of global transaction ID needs to be part of the framework. But the actual _implementation_ of that global transaction ID is an implementation detail of the redundancy service. Correct?
What's bad about it:
2) It fails to address multi-master: (server_id, group_id) is not going to work - such pairs cannot be linearly ordered and, therefore, compared. And from the perspective of the node that needs to apply the changeset - does server_id really matter? It may be good for debugging, but it can't be a part of a global transaction ID.
I don't understand why it would not work for multi-master. Can you elaborate? Especially given that you later suggest yourself:
One possible implementation for that can be (UUID, long long) pair.
How is this different from (server_id, group_id)? (I'd like to understand). My thoughts on multi-master are that there are multiple scenarios, depending on how tightly coupled the nodes of the cluster are with respect to transactions:

1. A tightly-coupled cluster that maintains a cluster-wide serialised transaction concept. NDB replication is like this (for epochs, which are based on NDB global checkpoint IDs AFAIR). In this case, the entire cluster effectively is a single server wrt. replication, so it's not really multi-master for this discussion.

2. One can imagine a cluster that outwards presents a consistent transactional view, yet internally does not have a total ordering on transactions. Transaction T2 only needs to be ordered after T1 if it can be proven (in terms of communication within the system) that T1 committed before T2. If there is no evidence one way or the other of the actual order of T1 and T2 (eg. they ran in parallel against disjoint data), there is no a priori reason they should have to be linearly ordered. If on the other hand there is such evidence (eg. T2 waited for a lock released by T1's commit), then there would have to be such a linear ordering. I think NDB individual transactions are a bit like this. So this would make transaction IDs only partially ordered (but any arbitrary extension to a total order would be valid).

3. Uncoupled multiple masters, like multiple nodes in MySQL asynchronous replication. In this case there really is no total linear order, as the transactions are not coordinated between servers. At most there would be an approximate order like a timestamp, but that would not be 100% reliable due to time skew and such.

So I don't understand what issue you had in mind for multi-master replication here?
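The comparability question in scenario 3 can be sketched in a few lines (a hypothetical illustration, not any existing API): IDs from the same origin server order totally by seqno, while IDs from different uncoordinated origins simply have no defined order.

```python
from typing import Optional

def compare_gtids(id1, id2) -> Optional[int]:
    """Compare two (origin, seqno) pairs.

    Returns -1/0/1 when both IDs come from the same origin, and None
    when they come from different origins: without coordination
    between the servers there is no meaningful linear order (the
    uncoupled multi-master case, scenario 3 above)."""
    origin1, seq1 = id1
    origin2, seq2 = id2
    if origin1 != origin2:
        return None  # incomparable: no evidence of relative order
    return (seq1 > seq2) - (seq1 < seq2)

# Same server: totally ordered by seqno.
assert compare_gtids(("server-A", 10), ("server-A", 11)) == -1
# Different servers: no a priori order.
assert compare_gtids(("server-A", 10), ("server-B", 11)) is None
```

Whether the second case is a fatal flaw of (server_id, group_id) or just an inherent property of uncoupled masters is, I think, exactly the point under discussion.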
I'll take this opportunity to put forth some theory behind the global transaction IDs as we see it at Codership.
1. We have an abstract set of data subject to replication/logging. It can be a whole database, a schema, a table, a row. Let's call it a Replication Set (RS).
2. RS is undergoing changes in time which can be represented as a series of atomic changes. Let's call it RS History. That it is a _series_ is trivial but important - otherwise we can't reproduce historical RS state evolution. Each RS change is represented by a changeset. Since it is a series, RS changesets can be enumerated with a sequence of natural numbers without gaps within a given RS History. Here comes the first component of a global transaction ID: sequence number (seqno).
Right, the series _can_ be linearly ordered. But does it have to be? Suppose we enforce an artificial linear order among some changesets that have no a priori data dependencies between them (either order of applying them would be correct). It seems to me that this needlessly discards an opportunity for parallelism. Given that lack of parallelism (on the slave end) is arguably the biggest performance bottleneck of current MySQL replication, this is something to avoid if possible. Isn't it enough that changesets are partially ordered, as long as that partial order encodes all dependencies that need to be respected? In other words, is it necessary to enforce such total order at the level of the API, instead of just introducing it at the point where it becomes needed, and if so, why?
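To illustrate the parallelism argument, here is a small sketch (hypothetical names, assuming each changeset carries the set of changesets it depends on, i.e. the partial order is given): independent changesets land in the same batch and could be applied in parallel on the slave, while dependent ones are held back.

```python
def parallel_apply_batches(changesets):
    """Group changesets into batches: all changesets within one batch
    are mutually independent and could be applied in parallel, while
    batches themselves are applied in order.

    `changesets` maps changeset id -> set of ids it depends on (the
    partial order).  A sketch only: a real slave scheduler would
    stream changesets rather than batch them up front."""
    remaining = {cid: set(deps) for cid, deps in changesets.items()}
    batches = []
    while remaining:
        # Everything whose dependencies are all satisfied is ready now.
        ready = {cid for cid, deps in remaining.items() if not deps}
        if not ready:
            raise ValueError("dependency cycle in changesets")
        batches.append(ready)
        for cid in ready:
            del remaining[cid]
        for deps in remaining.values():
            deps -= ready
    return batches

# T2 waited on a lock released by T1's commit; T3 touched disjoint data.
batches = parallel_apply_batches({"T1": set(), "T2": {"T1"}, "T3": set()})
```

Here T1 and T3 end up in the first batch and only T2 must wait, whereas a total order (say T1 < T2 < T3) would force T3 to wait behind T2 for no data-dependency reason.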
3. However, there can be more than one RS. Moreover, the same RS can end up in different clusters and undergo different changes. So, to achieve truly global unambiguity, each changeset, in addition to the seqno, should be marked with an RS History ID. Obviously seqnos from different histories are logically incomparable. Therefore the RS History ID can be any globally unique identifier, with no need for < or > operations. This is the second component of the global transaction ID.
Can you give an example of what an "RS History" would be? It was not 100% clear to me.
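One reading of the proposal (my interpretation, not confirmed by the text above): an RS History is one particular line of evolution of a replication set, so e.g. the same backup restored into two different clusters forks two histories whose seqnos collide numerically but are kept apart by a fresh history UUID. A sketch under that assumption:

```python
import uuid

class RSHistory:
    """Hypothetical sketch of an RS History: a UUID naming one line of
    evolution of a Replication Set, plus a gapless seqno counter.
    The UUID is only ever tested for equality, never ordered."""

    def __init__(self):
        self.history_id = uuid.uuid4()
        self.next_seqno = 1

    def next_gtid(self):
        gtid = (self.history_id, self.next_seqno)
        self.next_seqno += 1
        return gtid

# The same backup restored into two clusters begins two histories:
# their seqnos collide numerically, but the UUIDs keep them distinct.
cluster_a = RSHistory()
cluster_b = RSHistory()
assert cluster_a.next_gtid()[1] == cluster_b.next_gtid()[1] == 1
assert cluster_a.history_id != cluster_b.history_id
```

If that reading is right, the (UUID, long long) pair mentioned earlier is exactly (RS History ID, seqno).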
What is not so obvious here is that since the global transaction ID is generated by the logging/replication service, it is that service that defines the order of commits, not vice versa. As a result, a transaction should first be passed to that service and only then committed. For one-way master-slave replication the order of operations is not so important. However, for multi-master it is crucial. Note that the actual replication/logging can still happen asynchronously, but the replication service must generate the transaction ID before the transaction is committed.
Right. But is this really possible? After all, the commit may fail, so we don't know beforehand whether we will actually commit or roll back... We probably don't even know the actual order of commits in parallel threads of execution until after the actual commit, unless we take a global lock around (global transaction id, commit); and then it sounds like we get into the issue with broken group commit again?

So we need two-phase commit here between the replication service and the storage engines, right? And in that case, it shouldn't matter which comes first, commit or replication service, should it? We need to be very careful with the interface here; two-phase commit can be really expensive, needing multiple disk flushes for every transaction to ensure reliable crash recovery. And we really do not want to repeat the mistake of breaking group commit that we had for years after two-phase commit was introduced for the MySQL binlog.

(I am also thinking that generating the global transaction ID at commit time is too late in the general case. The current MySQL replication buffers everything until commit time, but I would like to see us support also logging events during transactions. And this will require an ID generated at transaction start. But maybe this is a different ID than the "global transaction ID", which will of course still be needed to know the commit order.)

 - Kristian.
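The two-phase shape being discussed can be sketched as follows (a minimal illustration with hypothetical interfaces; it deliberately ignores durability flushes, crash recovery, and the group-commit optimisation that make real 2PC hard): both participants prepare first, and the global transaction ID is assigned only at the decision point, which fixes the commit order even though either prepare may still fail.

```python
class TwoPhaseCoordinator:
    """Minimal sketch of two-phase commit between a replication
    service and a storage engine (hypothetical interfaces)."""

    def __init__(self, replication_service, storage_engine):
        self.repl = replication_service
        self.engine = storage_engine
        self.next_seqno = 1

    def commit(self, trx):
        # Phase 1: both sides must agree they are able to commit.
        if not (self.repl.prepare(trx) and self.engine.prepare(trx)):
            self.repl.rollback(trx)
            self.engine.rollback(trx)
            return None
        # Decision point: the ID is generated here, after prepare
        # succeeded, so ID order matches the decided commit order.
        gtid = self.next_seqno
        self.next_seqno += 1
        # Phase 2: both sides commit.  In a real server each phase may
        # need a disk flush, which is why group commit matters so much.
        self.repl.commit(trx, gtid)
        self.engine.commit(trx, gtid)
        return gtid

class StubParticipant:
    """Stand-in for either participant, recording what it committed."""
    def __init__(self):
        self.committed = []
    def prepare(self, trx):
        return True
    def rollback(self, trx):
        pass
    def commit(self, trx, gtid):
        self.committed.append((gtid, trx))

repl, engine = StubParticipant(), StubParticipant()
coord = TwoPhaseCoordinator(repl, engine)
```

One design point the sketch surfaces: because the ID is assigned between the two phases, it does not matter which participant prepares first, matching the "it shouldn't matter which comes first" observation above.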