[Maria-developers] Thoughts on Global Transaction ID, parallel slave, and multi-source replication
This is a writeup of some thoughts I have on getting Global Transaction ID in MariaDB, as well as some related thoughts on parallel slave and multi-source replication.

MySQL 5.6 has global transaction ID now, and I looked into it in some detail. Unfortunately, I _really_ did not like what I saw. There are two main problems, in my opinion:

1. The basic design is bad, leading to an excessively complicated implementation.

2. The implementation looks very incomplete/immature, to the point that I am amazed that anyone would push code like this into a main tree.

Point 2 means that I really do not want to maintain this piece of **** in MariaDB, though we can hope that things will improve in subsequent releases. Point 1 leads me to suggest that instead of merging the MySQL 5.6 implementation, we could do our own implementation, using a design that is both much simpler and at the same time more flexible than what is in MySQL 5.6. I will elaborate further on these points in the rest of the mail.

----

With respect to point 2, let me quote this little gem from the MySQL 5.6 code:

    /*
      Apparently this code is not being called. We need to investigate
      if this is a bug or this code is not necessary. /Alfranio
    */
    DBUG_ASSERT(0); /* NOTREACHED */

Now, certainly, there are dark corners of the MySQL replication code, and adding a comment/assert like this could be justified if a developer encounters it and finds it may be dead code. But this is _new_ code, added in a new 5.6 release! So the authors of the new feature, by their own statement, have no idea what the new code they are pushing into the main tree is actually doing. That is just insane. I have no idea what the replication team at MySQL is up to these days, pulling stunts like this; I have to confess I am disappointed.

I found many similar examples in the 5.6 code. I wrote up a more detailed analysis in TODO-171, where I elaborate more on the problems I see with the MySQL 5.6 Global Transaction ID code:

https://mariadb.atlassian.net/browse/TODO-171

When the authors do not understand the code themselves, it really does not seem attractive for us to try to support and bugfix it. I also found a lot of what I think are left-over, unused parts of an even more complicated design that was apparently abandoned (or postponed?) but left in the code, which does not make supporting this any easier.

----

To explain point 1, I first need to mention the MySQL 5.6 feature of parallel slave apply, as I believe that a major factor in the design of MySQL Global Transaction ID has been the need to interoperate with that feature. The MySQL parallel replication slave feature works so that transactions on tables in different databases can be run in parallel with each other. (Incidentally, there is some indication that the MySQL replication team considers this *the* meaning of parallel slave, which is sad, as it is just one piece of the general problem of replication slave scalability - it helps in some workloads, but it is useless for others.)

So with a parallel slave, transactions on the master can be executed in a different order on the slave. This means that the crucial notion of a slave's current position in the replication stream becomes more complicated - it is not just a single point in the master binlog; rather, every slave replication thread can be at a different point in the master binlog.

MySQL 5.6 solves this by introducing "Gtid sets". A Gtid is a MySQL 5.6 global transaction ID - a unique, 24-byte identifier for every transaction.
The state of a replication slave at any given moment is then the set of Gtids of all transactions ever executed on that slave server.

This notion of having a set of all transactions ever executed really complicates the design. There is a need to represent such a set as basically intervals of "holes" in the Gtid sequence, to avoid the memory representation of such sets growing without bound over time. This in turn introduces the requirement that there can _never_ be holes in the sequence of Gtids generated on a master - if a hole is ever introduced, slaves will wait around forever for a transaction containing the missing Gtid. I think recovering from errors with such a concept will forever be a nightmare.

I think this basic foundation of the design is the main source of my dislike for the MySQL global transaction ID. It sounds nice at first glance, but once you start to think deeper about the problem, it just does not seem to work well.

----

So I think I have a better idea of how to make Global Transaction ID work well with a parallel slave based on executing transactions in different databases out-of-order.

The MySQL Gtid is defined as a pair of server-id and sequence number. Every master server has a single server-id (despite no less than 128 bits being allocated for this - anyone planning a replication hierarchy with 2**128 servers?). The master generates a single stream of events in the binlog, though streams from other servers may be interleaved with it in the case of multi-master. It is then the task of a slave to try to split up the single master stream into multiple streams that can be executed in parallel.

I think this is the root of the problem. The slave is the wrong place to do this splitting! It is on the master that we know how to interpret transactions (ie. whether transactions in different databases are or are not independent), so doing this on the slave is harder. And two slaves can in general split differently, which makes it harder for a slave deeper in the replication hierarchy to switch to a new master (that master being itself a slave of a higher-up master), leading to the need for a complex set-of-all-executed-transactions concept.

Suppose we instead make the decision on the _master_ on how to split up the replication stream for parallel apply. Instead of a single server-id for the master, we can have multiple - let us call them "Source ID". On the _master_, we will decide that two transactions in different databases are independent, and assign them different Source IDs. The slave then receives multiple replication streams, and is free to execute each stream in parallel - or not, at the discretion of the DBA. Transactions _within_ a single stream are always executed in-order, so the slave has no need to remember any sets of all transactions ever executed - all it needs to remember is the sequence number it has seen within each stream, ie. the last sequence number for each Source ID. This is a simple and obvious generalisation of the old concept of current binlog position. Downstream slaves see the exact same binlog sequences, each with transactions in the same order (though sequences may be interleaved differently) - so changing masters is simpler; all servers in the hierarchy have the same view.

So to implement the same functionality as MySQL 5.6 parallel replication, all we need is, on the master, to generate the Source ID as a hash of the database name (just as MySQL 5.6 assigns transactions to slave threads based on such a hash of the database name).
And implement a facility on the slave such that transactions with different Source IDs can be executed in parallel by multiple slave threads (a rough sketch of this per-stream bookkeeping follows at the end of this mail).

And now, since we decide on the master, we can provide much more flexibility for the application. We can have _some_ databases db1,db2 be executed in parallel on the slave - while other databases db3,db4 are considered dependent, and not modified out-of-order on the slave. Or the application can declare that two transactions are independent within the same database, or even within the same table; for example, log record inserts into a table can often be allowed to happen in different order on slaves. Or the DBA can assign a special Source ID to a long-running ALTER TABLE, and such ALTER TABLE can then execute independently on slaves to not stall the normal load. All we need is to provide a simple facility (eg. a session variable) that applications/users can use to specify the Source ID.

And then the next step is multi-source replication, which is the much-requested feature that one slave can replicate from two or more different masters at the same time. The different masters will just supply different Source IDs, so apart from the need to maintain multiple TCP connections, one to each master, from the rest of the code's point of view multi-source replication is no different; it is just more independent replication streams.

So my suggestion is: rather than struggle to try and merge and support the MySQL 5.6 Global Transaction ID and parallel slave, let us instead implement something like the above idea, with multiple Source IDs on the master, and parallel execution of distinct replication streams on the slave. I believe this will be no more work, and the end result will be both simpler and more flexible.

Thanks,

 - Kristian.
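To make the bookkeeping concrete, here is a minimal sketch of the slave-side state this proposal implies. All names here (SlavePosition, source_id_for_db, and so on) are hypothetical illustrations rather than existing MariaDB or MySQL code; the point is only that the slave's entire replication position reduces to a map from Source ID to the last sequence number applied in that stream, plus a simple default master-side policy of hashing the database name into a Source ID.

    #include <cstdint>
    #include <functional>
    #include <map>
    #include <string>

    /* Hypothetical types, for illustration only. */
    typedef uint32_t source_id_t;
    typedef uint64_t seq_no_t;

    /*
      The slave's whole replication position under this scheme: for each
      Source ID seen so far, the sequence number of the last transaction
      applied from that stream. No set of "all transactions ever executed",
      no interval/hole bookkeeping.
    */
    struct SlavePosition {
      std::map<source_id_t, seq_no_t> last_applied;

      /*
        A transaction (source_id, seq_no) may be applied only when it is the
        next one in its own stream; transactions within a stream stay in
        order, while different streams can be applied in parallel.
      */
      bool can_apply(source_id_t source_id, seq_no_t seq_no) const {
        auto it = last_applied.find(source_id);
        seq_no_t last = (it == last_applied.end()) ? 0 : it->second;
        return seq_no == last + 1;
      }

      void mark_applied(source_id_t source_id, seq_no_t seq_no) {
        last_applied[source_id] = seq_no;
      }
    };

    /*
      Master side, default policy: derive the Source ID from a hash of the
      database name, so that transactions in different databases fall into
      different, independently applicable streams. A DBA/application
      override (eg. via a session variable) would simply bypass this.
    */
    source_id_t source_id_for_db(const std::string &db_name) {
      return static_cast<source_id_t>(std::hash<std::string>()(db_name));
    }

Changing masters then amounts to comparing these per-stream sequence numbers, since every server in the hierarchy sees the same streams with transactions in the same order.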
On Thu, Aug 09, 2012 at 11:34:44AM +0200, Kristian Nielsen wrote:
So my suggestion is: rather than struggle to try and merge and support the MySQL 5.6 Global Transaction ID and parallel slave, let us instead implement something like the above idea, with multiple Source IDs on the master, and parallel execution of distinct replication streams on the slave. I believe this will be no more work, and the end result will be both simpler and more flexible.
As far as I understood the proposal, it means that we will lose Master<->Slave compatibility between MariaDB and MySQL? (that is, it will no longer be possible to hook a MariaDB slave to a MySQL master, or vice versa). This may be a problem. Of course,

- In order for this to become a problem, MySQL's GTID replication must be widely used in production, which may not happen in the nearest future, if our assessment of the state of their code is correct.

- We can make a MariaDB slave able to handle a MySQL GTID replication master. One probably will not be able to put MariaDB servers at arbitrary points in a MySQL GTID replication hierarchy, but being able to function as a "sink" or as a "source" will solve the problem for 90%.

BR
 Sergei
--
Sergei Petrunia, Software Developer
Monty Program AB, http://askmonty.org
Blog: http://s.petrunia.net/blog
Sergei Petrunia <psergey@askmonty.org> writes:
As far as I understood the proposal, it means that we will lose Master<->Slave compatibility between MariaDB and MySQL? (that is, it will no longer be possible to hook a MariaDB slave to a MySQL master, or vice versa). This may be a problem. Of course,
Not really. MySQL GTID is an optional feature; normal replication as in MySQL <=5.5 still works and is the default. (In fact, IIRC from the MySQL 5.6 docs, to use GTID one must first set up replication in the traditional way.) I'm also fairly sure that it is possible for a slave to replicate in the old, compatible way against a replication hierarchy that is otherwise using GTID.

But if we want to fully support the MySQL variant of global transaction ID, then we really have no choice but to implement the full complexity with Gtid sets and so on, as transactions can come in any order from a master...

 - Kristian.
Kristian Nielsen <knielsen@knielsen-hq.org> writes:
this seems to be private, and I can't view it even having created a JIRA account.
Suppose we instead make the decision on the _master_ on how to split up the replication stream for parallel apply. Instead of a single server-id for the master, we can have multiple - let us call them "Source ID".
On the _master_, we will decide that two transactions in different databases are independent, and assign them different Source ID. The slave then receives multiple replication streams, and is free to execute each stream in parallel - or not, at the discretion of the DBA. Transactions _within_ a single stream are always executed in-order, so the slave has no need to remember any sets of all transactions ever executed - all it needs to remember is the sequence number within each stream it has seen - ie. last sequence number for each Source ID. This is a simple and obvious generalisation of the old concept of current binlog position. Downstream slaves see the exact same binlog sequences, each with transactions in the same order (though sequences may be interleaved differently) - so changing masters is simpler, all servers in the hierarchy have the same view.
How would this work with a cross-database transaction that depends on previous transactions in both databases?

--
Stewart Smith
Hi, Stewart!

On Aug 10, Stewart Smith wrote:
Suppose we instead make the decision on the _master_ on how to split up the replication stream for parallel apply. Instead of a single server-id for the master, we can have multiple - let us call them "Source ID".
On the _master_, we will decide that two transactions in different databases are independent, and assign them different Source ID.
How would this work with a cross-database transaction that depends on previous transactions in both databases?
Same as in MySQL - it won't, so don't use per-database splitting in this case.

You've missed the point: the main idea is about *where* to split the replication stream into parallel "independent" source streams. But *how* to split is a separate question. There are many different approaches; per-database is just one of those.

Regards,
Sergei
Sergei Golubchik <serg@askmonty.org> writes:
On Aug 10, Stewart Smith wrote:
Suppose we instead make the decision on the _master_ on how to split up the replication stream for parallel apply. Instead of a single server-id for the master, we can have multiple - let us call them "Source ID".
On the _master_, we will decide that two transactions in different databases are independent, and assign them different Source ID.
How would this work with a cross-database transaction that depends on previous transactions in both databases?
Same as in MySQL - it won't, so don't use per-database splitting in this case. You've missed the point: the main idea is about *where* to split the replication stream into parallel "independent" source streams. But *how* to split is a separate question. There are many different approaches; per-database is just one of those.
Idea: if txn touches more than one database, record as metadata the position of the other db, so the slave just has to wait for at least that id to be applied in the other database before applying this txn in this source. This seems like a relatively simple solution to the problem (unless I've missed something).

--
Stewart Smith
Stewart Smith <stewart@flamingspork.com> writes:
Kristian Nielsen <knielsen@knielsen-hq.org> writes:
this seems to be private, and I can't view even having created a JIRA account.
Oops :-( Very sorry about that. I *think* I fixed it now; at least I could open the link after logging out, but in any case it now just redirects to this, which *should* work:

https://mariadb.atlassian.net/browse/MDEV-445

Thanks a lot for pointing this out!

 - Kristian.
Kristian Nielsen <knielsen@knielsen-hq.org> writes:
Stewart Smith <stewart@flamingspork.com> writes:
Kristian Nielsen <knielsen@knielsen-hq.org> writes:
this seems to be private, and I can't view even having created a JIRA account.
Oops :-( Very sorry about that. I *think* I fixed it now; at least I could open the link after logging out, but in any case it now just redirects to this, which *should* work:
https://mariadb.atlassian.net/browse/MDEV-445
Thanks a lot for pointing this out!
It works now, thanks!

--
Stewart Smith
Stewart Smith <stewart@flamingspork.com> writes:
Idea: if txn touches more than one database, record as metadata the position of the other db, so the slave just has to wait for at least that id to be applied in the other database before applying this txn in this source. This seems like a relatively simple solution to the problem (unless I've missed something).
Yes, this sounds like an interesting idea. More generally, for a given transaction with some global transaction ID, we can optionally log as extra metadata a list of other global transaction IDs that the slave must wait for before executing this transaction (a rough sketch of such a wait-for list follows at the end of this mail).

In your example, there would be one element in the list, the global transaction ID of the last transaction touching "other db". And it could be used for lots of different other cases as well. Seems kind of nice.

But before that, I would want to look into something like this:

http://askmonty.org/worklog/Server-RawIdeaBin/?tid=184

which allows applying transactions in parallel on the slave even within the same table. This is a good supplement to the per-database parallel slave apply, which is not sufficient for all applications.

 - Kristian.
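As a rough illustration of this wait-for metadata (hypothetical structures only, not actual server code): each logged transaction could carry an optional list of (Source ID, sequence number) pairs, and a slave worker would defer applying it until its per-stream position covers all of them.

    #include <cstdint>
    #include <map>
    #include <vector>

    /* Hypothetical types, for illustration only. */
    typedef uint32_t source_id_t;
    typedef uint64_t seq_no_t;

    /* One entry of the optional dependency metadata logged with a transaction. */
    struct WaitFor {
      source_id_t source_id;
      seq_no_t seq_no;   /* must have applied at least this far in that stream */
    };

    /*
      Check run by a slave worker before applying a transaction: every listed
      dependency must already be covered by the slave's per-stream position
      (last applied sequence number for each Source ID). In the cross-database
      example, a transaction in stream A would carry a single entry naming the
      last sequence number it depends on in stream B.
    */
    bool dependencies_satisfied(
        const std::map<source_id_t, seq_no_t> &last_applied,
        const std::vector<WaitFor> &wait_for) {
      for (const WaitFor &dep : wait_for) {
        auto it = last_applied.find(dep.source_id);
        if (it == last_applied.end() || it->second < dep.seq_no)
          return false;   /* not yet applied far enough in that stream */
      }
      return true;
    }

A worker whose dependencies are not yet satisfied would simply wait until the other streams catch up, while transactions without any wait-for entries keep the plain per-stream ordering described earlier.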
Kristian Nielsen <knielsen@knielsen-hq.org> writes:
Stewart Smith <stewart@flamingspork.com> writes:
Idea: if txn touches more than one database, record as metadata the position of the other db, so the slave just has to wait for at least that id to be applied in the other database before applying this txn in this source. This seems like a relatively simple solution to the problem (unless I've missed something).
Yes, this sounds like an interesting idea. More generally, for a given transaction with some global transaction ID, we can optionally log as extra metadata a list of other global transaction IDs that the slave must wait for before executing this transaction.
This would allow the master to write information about dependencies between transactions in the same stream. This could be used to help pipeline and parallelise slave transaction execution within the same stream. Basically, you start applying a batch of transactions at once and just COMMIT in the correct order. If you know they're non-conflicting, this should all go fine. If they are conflicting, then try again, but in series. This (of course) works a whole lot better when you get rid of statement-based replication. (This is what I've been advocating for a while, and it's possible to do entirely on the slave, but the master could help a bit if it has any knowledge.)
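A loose sketch of that batch-apply-then-commit-in-order flow (Txn and its methods are placeholders, not server API; conflict handling is reduced to a simple fall-back to serial re-apply):

    #include <future>
    #include <vector>

    /* Placeholder transaction handle; a real slave worker would execute the
       replicated row events here. */
    struct Txn {
      void apply_changes() { /* may throw on conflict (lock wait, deadlock, ...) */ }
      void commit() {}
      void rollback() {}
    };

    /*
      Optimistically apply a batch of transactions in parallel, then commit
      them strictly in their original (master) order. If anything conflicts,
      roll the whole batch back and re-apply it serially, in order.
    */
    void apply_batch(std::vector<Txn> &batch) {
      std::vector<std::future<bool>> done;
      for (Txn &t : batch)
        done.push_back(std::async(std::launch::async, [&t]() {
          try { t.apply_changes(); return true; }
          catch (...) { return false; }
        }));

      bool conflict = false;
      for (std::future<bool> &f : done)
        if (!f.get())
          conflict = true;

      if (!conflict) {
        for (Txn &t : batch)      /* commit in the original order */
          t.commit();
      } else {
        for (Txn &t : batch)      /* give up on parallelism for this batch */
          t.rollback();
        for (Txn &t : batch) {    /* re-apply one at a time, in order */
          t.apply_changes();
          t.commit();
        }
      }
    }

With row-based events the conflict detection is much more tractable than with statement-based replication, which matches the point above about getting rid of statement-based replication first.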
In your example, there would be one element in the list, the global transaction ID of the last transaction touching "other db". And it could be used for lots of different other cases as well. Seems kind of nice.
Yeah... it seems like a decent idea; I've been meaning to implement something like it for a few years now... a million and one other more important/urgent patches :)

--
Stewart Smith