Robert Hodges <robert.hodges@continuent.com> writes:
I think we need to take this out to one of the email threads. I hope other people would be interested in these problems.
Agree, I've quoted liberally and Cc:'ed maria-developers@
On Aug 31, 2012, at 6:14 AM PDT, Kristian Nielsen wrote:
Robert Hodges <robert.hodges@continuent.com> writes:
1.) The server-id of which I was speaking is a very simple patch to allow us to do the following client command: 'set session server-id=X' where X is whatever we choose. Currently MySQL replication sets the server ID from an internal structure (THD as I recall), which means that applications like Tungsten cannot get the server-id into the binlog when applying updates on a slave. Server-id is essential to prevent replication loops. Monty and I discussed this feature briefly over dinner in April. I hacked in support using a UDF about a year ago but it would be better to have this become a dynamic session variable.
Yes, it does not seem too hard to do (though as always some care is needed to introduce it in a clean way).
I created an issue for this with some descriptions:
SET SESSION pseudo_server_id=236;
Then the server_id 236 would to go into the binlog. This seems like a good idea and avoids accidentally reseting the server_id, which is a pretty significant variable.
Seems there is hope we can have this in 10.0.
2.) I agree with your criticisms of the MySQL 5.6 replication design. It's surprisingly complex and looks as if it will have a lot of bugs. I felt at
Yes. They need this complexity because their parallel replication (which is similar to Tungsten parallel replication I think?) can arbitrarily reorder transactions in a slave binlog compared to the order in the master binlog.
We just use the sequential log. The slaves are complex but it does not pollute the log. In fact, everything up to the point of applying to the slave is purely sequential. I think their design got this wrong.
Well, the MySQL --log-slave-updates work by generating completely new events on the slave as the received events are executed by the SQL thread(s). Do you mean that Tungsten does not do this, but uses the original binlog with original events from the master? How is multi-master (one slave replicating from several masters) handled (if it is handled) ?
the time that the better approach on transaction IDs would be to put enough information into logs to enable applications to compute a vector clock on the stream of binlog events coming from multiple servers. This is a
Interesting, I though about a concept of "vector clock" before, but did not know the terminology.
Aha, so your point is that even though we will apply changes from multiple upstream master servers in parallel on a single slave, with a vector clock, we can still order them on a slave according to which transactions on one server where visible to which transactions on the other. And stuff like that. Interesting!
But note that to be truly "theoretically sound", all queries (ie. SELECT) as well as all applications that interact directly or indirectly with multiple servers need to be included in the vector clock. So it is not without complexity.
Are you thinking about holding locks? For practical purposes it's enough to do writes only. My point is that you have a global ID that is partially ordered by originating server. Vector clocks are the traditional way of handling that. This is the basis for most mechanisms of conflict resolution--you check to see if the clocks are ordered and if they are not (because two thing happened at the same time, hence are not comparable), it signals a conflict that must be reconciled.
Well, I was mostly just reacting to the phrase "truly theoretically sound". If readers are not considered in the vector clock, then they are able to observe different states of the database at different nodes in the cluster. But as you say, that is probably acceptable.
This does not immediately apply to the MySQL design, as they work with a single master (applying in parallel transactions in different databases), and as they arbitrarily reorder between different databases, without any attempt to maintain inter-database consistency. But of course, my point is that this design is fundamentally wrong.
theoretically sound approach to enable both synchronous and asynchronous multi-master replication. (Galera incidentally uses approximately this approach for sync replication as you may be aware.) If you give us server IDs we don't need anything else. Oops, I forgot. Making the binlog truly crash safe in every possible case is also very helpful.
I think it is getting closer to being truly crash safe - can you mention the situations where it is not? And btw., are you interested in the recent testing we (especially Elena) made for testing such crash safety, and the fixes made for it? We found some fairly interesting (and scary) stuff...
Actually your (MP's?) fix for group commit is the most important improvement. Can we turn sync_binlog back on without killing performance?
Yes, well at least performance is substantially improved with sync_binlog=1. With a parallel load you can see improvements of 10x or more, and even single-threaded load is improved in 10.0.
I have been thinking about this for some time. Basically, the correct approach is to make the binlog into a true transaction log using standard textbook techniques. Pre-allocate the log, make it page based (this will also improve performance). Do ping-pong writes at the end so that we can recover even from torn pages. Stuff like that. Would be cool to do, but so far I have not really seen much interest for something like this.
It would be a lot of work. If group commit is fixed, then most on-premise people can solve enough problems with RAID and BBU that they don't need much more.
Right, agree, I thought you had something more subtle than sync_binlog=0 in mind. (BTW, I fixed a couple binlog crash recovery bugs recently in MariaDB, which also affect MySQL).
3.) I read the email you cited. I need some time to study it in order to comment intelligently. The most immediate feedback is that I doubt you want to make assumptions about how slaves will apply transactions. The role of the master is to provide local transactions in serial order. Looking across a set of masters you then have a set of easily identifiable streams of partially ordered transactions. That in turn provides a mathematically tractable model for reasoning about things like update order, dependencies, and potential conflicts. Any metadata you can supply about dependencies, e.g., which schema(s) transactions actually touch is useful but may or may not be relevant on slaves depending on the application. It's up to slaves to make decisions on reordering.
But think about a three-level multi-master topology. Servers A and B are masters for server C, which is itself a master for D. The DBA/application developer will declare that transactions on A and B are independent, so C can apply them in any order, and in parallel. Clearly on D we will want to also apply them in parallel, and the only way we can do that is to use the different server ids A and B in the stream of events in the binlog of C.
So now consider a simple two-level topology with just master C and slave D. Why not still let C put different "server_id"s in different events, as a way to tell server D that events are independent? For example, if the DBA runs a long-running ALTER TABLE, he can ensure that no transactions touch this table while it runs, and just SET SESSION SERVER_ID to something else, and it can run in parallel on D without delaying normal transactions.
Or think about MySQL 5.6 style parallel replication, where the DBA declares that on server A, transactions against different databases are to be considered independent and applied in parallel on the slave. While on server B such transactions are not independent. Now we want to setup slave C to replicate from both A and B. Clearly, C is the wrong place to configure whether transactions in different databases are to be considered independent - this is a property of how the transactions are generated, not where they are replicated. But if we just let the server_id of transactions on master A be a hash of the affected database, then MySQL 5.6 style parallel replication just falls out of multi-master for tree, without any of their complexity.
I have not ever seen this topology arise in practice within MySQL, as people tend to keep things pretty homogeneous. It seems to arise mostly in data warehouse loading where multiple applications load data into the same store for reporting. These are more heterogeneous but arallel information from MySQL is of limited value here--data warehouses in Vertica or Hadoop don't have constraints and you can just jam data into them in parallel.
Ok.
It is just a matter of not artificially restricting transactions generated on a given server to have just one server_id value. Which is in fact exactly what you asked me about originally with SET SESSION SERVER_ID :)
On the other hand, the slave is free to try to discover more opportunities for independence / parallelism if it can, and use them. But automatically discovering opportunities for parallelism on the slave is a very hard problem in general, as I am sure you agree ...
Meanwhile, I don't like the idea of using server_id in the way you describe because it overloads the meaning. Server_id just tells us which server actually committed the transaction in the first place. The reason we need the session variable was to be able to signal this properly. I was not intending to use it more broadly.
Ok, I've noted your concern. It would be interesting to learn how you use the knowledge that one mysqld instance can have at most one server id in generated transactions?
More generally, I believe that if you want to allow programmers to declare that things are independent it would be best to add SQL-level features. For instance, Facebook has a patch that prevents transactions from spanning schemas. This is a very useful feature, because it ensures that all updates in one schema are independent from others. That in turn means you can parallelize without being afraid of deadlocks (a big problem in this space) or constraint violations.
You could generalize this into the notion of a shard, which is a scope for independent transactions and presumably also referential integrity. This would be a very useful feature for SQL databases. I don't know if you have read it but Pat Helland has a great paper on this topic called "Life beyond Distributed Transactions: an Apostate’s Opinion." It was visionary at the time (and still is) and has never been properly implemented in relational DBMS to my knowledge.
Well, independent schemas are nice, but not enough. There are many applications that need the slave to apply transactions in parallel even within a single table. - Kristian.