[Maria-developers] Ideas for improving MariaDB/MySQL replication
The three companies Continuent, Codership, and Monty Program are planning to start working on some enhancements to the replication system in MariaDB, together with anyone interested in joining in. At this stage, there are no fixed directions for the project, and to do this in as open a way as possible with the maximum community involvement and interest, we agreed to start with an email discussion on the maria-developers@ mailing list. So consider it started!

The plan so far is:

1) The parties taking this initiative, MP, Continuent, and Codership, present their own ideas in this thread on maria-developers@ (and everyone else who wants to chime in at this stage).

2) Once we have some concrete suggestions as a starting point, we use this to reach out in a broader way with some blog posts on planetmysql / planetmariadb to encourage further input and discussions for possible directions of the project.

Eventually we want to end up with a list of the most important goals and a possible roadmap for replication enhancements. (It is best to have something concrete as a basis for a broad community discussion/process.)

To start off, here are some points of interest that I collected. Everyone please chime in with your own additional points, as well as comments and further details on these.

Three areas in particular seem to be of high interest in the community currently (not excluding any other areas):

- High Availability
  * Most efforts seem to focus on building HA solutions on top of MySQL replication, eg. MMM and Tungsten.
  * For this project, the main task seems to be implementing improvements to replication that help facilitate such on-top HA solutions.
  * Tools to automate (or help automate) failover from a master.
  * Better facilities to do initial setup of a new slave without downtime, or to re-sync an old master or slave that has been outside of the replication topology for some period of time.

- Performance, especially scalability
  * Multi-threaded slave SQL thread.
  * Store the binlog inside a transactional engine (eg. InnoDB) to reduce I/O, solve problems like group commit, and simplify crash recovery.

- More pluggable replication
  * Make the replication code and APIs more suitable for people to build extra functionality into or on top of the stock MySQL replication.
  * Better documentation of the C++ APIs and binlog format.
  * Adding extra information to the binlog that can be useful for non-standard replication stuff, for example column names (for RBR) and checksums.
  * Refactoring the server code to be more modular with APIs more suitable for external usage.
  * Add support for replication plugins, types to be determined. For example binlog filtering plugins?

It is also very important to consider the work that the replication team at MySQL is doing (and has done). I found a good deal of interesting information about this here:

http://forge.mysql.com/wiki/MySQL_Replication:_Walk-through_of_the_new_5.1_a...)

This describes a number of 6.0/5.4 and preview features that we could merge and/or contribute to. Here are the highlights that I found:

- Features included in 6.0/5.4 (which are cancelled I think, but presumably this will go into a milestone release):
  * CHANGE MASTER ... IGNORE_SERVER_IDS for better support of circular replication.
  * Replication heartbeat.
  * sync_relay_log_info, sync_master_info, sync_relay_log, relay_log_recovery for crash recovery on slave.
  * Binlog Performance Optimization (lock contention improvement).
  * Semi-synchronous Replication, with Pluggable Replication Architecture.
http://forge.mysql.com/wiki/ReplicationFeatures/SemiSyncReplication

- Feature previews:
  * Parallel slave application: WL#4648 http://forge.mysql.com/wiki/ReplicationFeatures/ParallelSlave
  * Time-delayed replication: WL#344 http://forge.mysql.com/wiki/ReplicationFeatures/DelayedReplication
  * Scriptable Replication: WL#4008 http://forge.mysql.com/wiki/ReplicationFeatures/ScriptableReplication
  * Synchronous Replication.

Drizzle is also doing work on a new replication system. I read through the series of blog posts that Jay Pipes wrote on this subject. They mostly deal with how this is designed in terms of the Drizzle server code, and are low on detail about how the replication will actually work (the only thing I could really extract was that it is a form of row-based replication). If someone has links to information about this that I missed, it could be interesting.

Let the discussion begin!

 - Kristian.
Hi, Kristian, thanks for starting this discussion. I'm glad you mentioned the need to improve replication APIs. Hereby I will present some points which we at Codership found to be essential to the success of the project. These are not technical requirements, but more conceptual suggestions pertaining to replication API design. They are based on the work/research we have done so far with our MySQL/Galera cluster.

1) It is time to drop the MASTER/SLAVE mentality. This has nothing to do with replication per se. For example, a multi-master Galera cluster is turned into master-slave simply by directing all writing transactions to a single node, without a single change in the nodes' configuration, let alone our replication API. So master-slave is purely a load balancer thing - the node that receives writes IS the master even if the slaves think otherwise.

2) It is time to drop the SYNCHRONOUS/ASYNCHRONOUS mentality. Although the Galera cluster currently supports only synchronous operation, it can be turned into asynchronous with rather small changes in the code - again without any changes to the API. This is merely a quality of the replication engine.

What's bad about these mentalities is that they force one to focus on peculiarities, with a subsequent failure to achieve a general solution that can satisfy a broader range of requirements. If you look at the primary purpose of replication (and binlogging), it is to make redundant copies of data changes as soon as possible. All other goals like performance scalability, point-in-time recovery and so on depend on that primary requirement - that somewhere there is a copy of the needed information. And how exactly the copy is made should be of no concern to the server.

So when refactoring the replication code and API we suggest thinking of replication as a redundancy service, and establishing a general API for such a service that can be utilized by different implementations with different qualities of service. In other words - make the whole replication system a plugin (like storage engines are), not only some measly filters.

We realize that this may sound too broad of a goal and unattainable in the short timeframe. However, from our experience it is not that crazy - we have an example of such an API (https://launchpad.net/wsrep). It is not final yet and may not be suitable for immediate utilization in MariaDB, but still - a good (and working) example of an interface that is equally suitable for synchronous and asynchronous replication, binlogging and what not.

Thanks, Alex

-- Alexey Yurchenko, Codership Oy, www.codership.com Skype: alexey.yurchenko, Phone: +358-400-516-011
Alex Yurchenko
So when refactoring the replication code and API we suggest thinking of replication as a redundancy service, and establishing a general API for such a service that can be utilized by different implementations with different qualities of service. In other words - make the whole replication system a plugin (like storage engines are), not only some measly filters.
Serg pointed me to this page, which is an early description of a plugin API for replication that the replication team at MySQL has implemented:

http://forge.mysql.com/wiki/ReplicationFeatures/ReplicationInterface

It seems this is what they used to implement semi-synchronous replication, which (as I mentioned) is available as a feature preview:

http://forge.mysql.com/wiki/ReplicationFeatures/SemiSyncReplication

You might want to look into it. The semi-synchronous replication is in fact loaded as a plugin in the MySQL implementation; this is from the presentation:

On master
  - INSTALL PLUGIN 'rpl_semi_sync_master' SONAME 'libsemisync_master.so';
  - SET rpl_semi_sync_master_enabled=1;
  - SET rpl_semi_sync_master_timeout=1000; (1s, default 10ms)
On slave
  - INSTALL PLUGIN 'rpl_semi_sync_slave' SONAME 'libsemisync_slave.so';
  - SET rpl_semi_sync_slave_enabled=1;
  - START SLAVE;

We already have a number of links now, so I created a wiki page to collect this and other information for easy access:

http://askmonty.org/wiki/index.php/ReplicationProject

(Let me know if you need edit rights on this page; we haven't fully sorted out how to handle community access to edit on the wiki, but we will add it somehow if necessary).

 - Kristian.
On Fri, 22 Jan 2010 21:14:29 +0100, Kristian Nielsen
Serg pointed me to this page, which is an early description of a plugin API for replication that the replication team at MySQL has implemented:
http://forge.mysql.com/wiki/ReplicationFeatures/ReplicationInterface
Uhhhh. I am of a rather poor opinion of that interface. It is not breaking with bad traditions at all. I'm not sure if it is high time for criticism already, but if it is, should I direct it here? ;)

Regards, Alex

-- Alexey Yurchenko, Codership Oy, www.codership.com Skype: alexey.yurchenko, Phone: +358-400-516-011
Alex Yurchenko
On Fri, 22 Jan 2010 21:14:29 +0100, Kristian Nielsen
wrote:
http://forge.mysql.com/wiki/ReplicationFeatures/ReplicationInterface
Uhhhh. I am of a rather poor opinion of that interface. It is not breaking with bad traditions at all. I'm not sure if it is high time for criticism already, but if it is, should I direct it here? ;)
I think it would be useful if you explained what the problems are with that interface, in your opinion. It seems that this has been pushed into MySQL 5.5, which as far as I understand is the basis for the next MySQL release. So we will have to deal with the interface eventually (one way or the other).
From my initial look at the new interfaces, it seems they deal mainly with higher-level stuff like transaction start/end, binlog writing, and relay log reception. I did not see anything in the interfaces related to the actual generation of the events. The ReplicationInterface document has a class Binlogger that mentions this, but this class description is low on details (no argument specification for log_row()), and I did not find anything like this in the MySQL 5.5 code.
I am thinking that this is mainly a refactoring to expose mostly already present functionality in a clean way to new plugins (semisync replication in particular), but I will have to look deeper to know for sure. - Kristian.
On Mon, 25 Jan 2010 13:55:44 +0100, Kristian Nielsen
I think it would be useful if you explained what the problems are with that interface, in your opinion.
Let me start with that I'm not that much familiar with the current MySQL replication code and may not be qualified to judge how much this new MySQL Replication Interface improves on what we have in 5.1.x. Perhaps it does. But as a replication developer with a goal of creating a generic redundancy service API that will handle a broad range of tasks, I can say that it is a step in a dead-end direction. This interface does not seem to improve anything about how redundancy is achieved in MySQL. Moreover, it seems to cement all the bad decisions that were made over the years into an explicit interface:

- It makes an explicit distinction between binlogging and replication.
- It does not care to introduce a concept of global transaction ID.
- It exposes what should be redundancy service internal implementation details to the SQL server.

I do understand that there are perfectly good reasons why the MySQL replication API ended up being such a mess. But if we want to move further, we must recognize that it is a mess beyond repair. It cannot be inherited.

I think the main problem here is that many people by force of habit regard things that should be internal implementation details of the redundancy service as integral parts of an SQL server. Like binlog storage or the relay service. We won't get a clean, flexible, generic API before we clearly sort out what belongs where. And for that we'll need to look at the redundancy service unencumbered by existing code.

This is not a call to revolution. It is a suggestion to create a completely new parallel redundancy service API and _gradually_ reimplement the required functionality under that API. Please understand that I'm not questioning the current replication implementation. It may well be reusable. I'm questioning where and how the redundancy API line is drawn. Exposing a concrete binlog storage implementation to the SQL server is not only pointless, it is harmful.

One more reason to design the redundancy API from scratch and not start from this one is that whenever you want to change anything inside, you'll inevitably have to change this API, simply because it exposes so much of the internals.

To illustrate this somehow, on page 18 of the replication slides from UC 2009 (http://forge.mysql.com/wiki/MySQL_Replication:_Walk-through_of_the_new_5.1_a...) we can see unification of logging and replication functionality behind something called a "Logging Kernel", but it does not seem to be reflected in any way either in the aforementioned Replication Interface spec or on page 23 of the slides. Apparently, the intended plugin points are to be the various observer interfaces shown as diamonds below the delegate boxes. Well, we can do much better than that and raise the redundancy plugin boundary much higher. Specifically, everything but "SQL execution" and "Slave IO thread" on that slide must be moved behind the redundancy service plugin interface and become an implementation detail. (This is not to say that there can't be plugins to the redundancy service plugin.)
I am thinking that this is mainly a refactoring to expose mostly already present functionality in a clean way to new plugins (semisync replication in particular), but I will have to look deeper to know for sure.
It sure looks so. But notice that more than half of the APIs there are not even used by semi-sync. In fact the existence of semi-sync in no way justifies this interface. It could be implemented much more easily with the wsrep API.

Thanks, Alex

-- Alexey Yurchenko, Codership Oy, www.codership.com Skype: alexey.yurchenko, Phone: +358-400-516-011
Alex Yurchenko
On Mon, 25 Jan 2010 13:55:44 +0100, Kristian Nielsen
wrote: I think it would be useful if you explained what the problems are with that interface, in your opinion.
This interface does not seem to improve anything about how redundancy is achieved in MySQL. Moreover, it seems to cement all the bad decisions that were made in years into an explicit interface:
- It exposes what should be redundancy service internal implementation details to SQL server.
We won't get a clean flexible generic API before we clearly sort out what belongs where. And for that we'll need to look at redundancy service unencumbered by existing code. This is not a call to revolution. It is a suggestion to create a completely new parallel redundancy service API and _gradually_ reimplement required functionality under that API.
Right. So if I understand you correctly, with "internal implementation details" we do not mean just that the APIs expose internals of the SQL server which we want to shield plugins from. Rather, the way the interface is designed, it makes assumptions about how the plugin that will use the interface will be implemented, thus making it unsuitable for other plugins that have other ideas about what to do.

So more concretely, what we want is an API that does not make assumptions about the format of the binlog file, or even that there is a binlog stored in a file. And an API that does not assume that events to be applied will be read from a specific mysql connection to a master server, returning data in a particular binlog format. Like you wrote:
I think the main problem here is that many people by force of habit regard things that should be internal implementation details of redundancy service as integral parts of an SQL server. Like binlog storage or relay service.
I had a brief but inspiring discussion with Serg about this at our meeting two weeks ago. So basically, what we could aim for is to make the entire current MySQL replication into a set of plugins. These plugins would be made against a new plugin interface that would support not only the existing MySQL replication for backwards compatibility, but also things like Galera and Tungsten, and other ideas. So while the compatibility *plugins* would contain the legacy MySQL binlog storage and relay service, the plugin *interface* would not. I think this is what you had in mind?

So the basis for such an interface would be the ability to install hooks to be called with row data for every handler::write_row(), handler::update_row(), and handler::delete_row() invocation, just like the current row-based binlogging does. And similarly for SQL statement execution, like statement-based logging does now. That should be clear enough.

Then comes the need to hook into transaction start and commit and opening tables. At this point, more of the internals of the MySQL server start to appear, and some careful thought will be needed to get an interface that exposes enough that plugins can do what they need, without exposing too many internal details of how MySQL query execution is implemented. (But note that these are two different issues regarding "internal implementations". One is how the *query execution* is implemented. The other is how the *plugins* are implemented. If I understood you correctly, the interface used for semisync in MySQL fails on the latter point.)

One example of how a lot of details from query execution pop up is with regard to mixed-mode binlogging. This is where queries are logged as statements when this is safe, and as row events when this is not safe (nondeterministic queries). The concept of "mixed mode binlogging" certainly seems like something that should be an implementation detail of the plugin, not part of the interface. On the other hand, determining whether a query is safe for statement-based logging is highly complex, and exposing enough of the server for the plugin to be able to determine this by itself may be too much. (Maybe just exposing an is_safe_for_statement() function to plugins would be enough.)

Another example of hairy details is all the extra information that can go with an SQL statement into the binary log. Things like the current timestamp, random seed, user-set @variables, etc. To support a statement-based replication plugin, we probably have to expose all of this on the interface in a clean fashion.
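To make the hook idea a bit more concrete, here is a rough sketch of what such a consumer interface might look like. All names here are invented for illustration only; nothing like this exists in the server yet, and the statement-safety verdict is assumed to be computed by the server and passed down.

#include <cstdint>
#include <cstddef>

/* One packed row image; the exact layout is up to the generator. */
struct rpl_row_data {
  const unsigned char *buf;
  size_t length;
};

class rpl_consumer {
public:
  virtual ~rpl_consumer() {}
  /* Row hooks, fired from handler::write_row()/update_row()/delete_row(). */
  virtual int row_insert(uint64_t table_id, const rpl_row_data &after) = 0;
  virtual int row_update(uint64_t table_id, const rpl_row_data &before,
                         const rpl_row_data &after) = 0;
  virtual int row_delete(uint64_t table_id, const rpl_row_data &before) = 0;
  /* Statement hook; the safety verdict comes from the server, so the plugin
     never needs to re-parse the query. */
  virtual int statement(const char *query, size_t length,
                        bool safe_for_statement_logging) = 0;
};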
- It does not care to introduce a concept of global transaction ID.
Right. As I wrote earlier, this seems to be central to many of the ideas involved in this project.

What I am wondering at the moment is if the concept of global transaction ID should be a part of the new API, or if it is really an implementation detail of the redundancy service.

On the one hand, if we make it part of the API, can we make it general enough to support everything we want? For example, some plugin will need the ID to be generated at the start of a transaction. Some will need it to be generated at the end of the transaction.

On the other hand, if we make it _not_ part of the API, we run the risk of making the API overly general and just pushing the problem down for each plugin to try to solve individually.

I'll start working more deeply on these issues of the new API and global transaction ID. - Kristian.
On Mon, 15 Mar 2010 10:57:41 +0100, Kristian Nielsen
Right.
So if I understand you correctly, with "internal implementation details" we do not mean just that the APIs expose internals of the SQL server which we want to shield plugins from. Rather, the way the interface is designed, it makes assumptions about how the plugin that will use the interface will be implemented, thus making it unsuitable for other plugins that have other ideas about what to do.
Exactly. And it does not only affect plugin construction. It may simply become hard to see the "forest" behind the implementation detail "trees". So you really want to go abstract here.
I had a brief but inspiring discussion with Serg about this at our meeting two weeks ago. So basically, what we could aim for is to make the entire current MySQL replication into a set of plugins. These plugins would be made against a new plugin interface that would support not only the existing MySQL replication for backwards compatibility, but also things like Galera and Tungsten, and other ideas. So while the compatibility *plugins* would contain the legacy MySQL binlog storage and relay service, the plugin *interface* would not.
I think this is what you had in mind?
I'm afraid that the "compatibility plugins" you mention are a bit too idealistic a picture. My concern is that the current built-in replication code just cannot be mated with the new API extension points. Also it may be that too many shops have grown dependent on the old-style replication and the drawbacks that it has. So what I had in mind was just to leave the current built-in replication as it is, and then, when a better replacement plugin is made, gradually phase it out. From what I understand it is quite doable. But I think you guys have better judgment here.
- It does not care to introduce a concept of global transaction ID.
Right. As I wrote earlier, this seems to be central to many of the ideas involved in this project.
What I am wondering at the moment is if the concept of global transaction ID should be a part of the new API, or if it is really an implementation detail of the redundancy service.
On the one hand, if we make it part of the API, can we make it general enough to support everything we want? For example, some plugin will need the ID to be generated at the start of a transaction. Some will need it to be generated at the end of the transaction.
On the other hand, if we make it _not_ part of the API, we run the risk of making the API overly general and just pushing the problem down for each plugin to try to solve individually.
I'd go about it in the following way. We have an SQL server proper. And it has a state (database). And it is the state of the server that we want to be redundant (replicate, log, whatever). The particular server state is identified by a global transaction ID. From here it follows that the global transaction ID should be the same regardless of the plugin.

It is also quite clear that each plugin will be using its own ID format internally. E.g. the binlogger will obviously be using file offsets and Galera will be using 64-bit signed integers. Then plugins will just have to implement their own mapping to the ID defined in the API. Which in most cases will be trivial.

Having a unified global transaction ID is unbelievably good, especially when you have cascading replication, where each cascade can use its own plugin. It is so good that you will never ever have any troubles with it, and no troubles with the global transaction ID amounts to nirvana. ;)

Regards, Alex

-- Alexey Yurchenko, Codership Oy, www.codership.com Skype: alexey.yurchenko, Phone: +358-400-516-011
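To illustrate the mapping idea with a minimal sketch: the (cluster_id, counter) layout below is just one assumed unified format, and the binlog_offset_map class is invented for this example, not part of any existing API.

#include <cstdint>
#include <map>

/* API-level ID: gapless and monotonic within one replication set. */
struct global_trx_id {
  uint32_t cluster_id;
  uint64_t counter;
};

/* Comparable only within the same cluster_id. */
inline bool operator<(const global_trx_id &a, const global_trx_id &b) {
  return a.counter < b.counter;
}

/* A file-offset based binlogger keeps its own internal positions and only
   needs a trivial mapping from the API-level counter to those offsets. */
class binlog_offset_map {
  std::map<uint64_t, uint64_t> counter_to_offset;
public:
  void record(uint64_t counter, uint64_t file_offset) {
    counter_to_offset[counter] = file_offset;
  }
  uint64_t offset_for(uint64_t counter) const {
    return counter_to_offset.at(counter);
  }
};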
Alex Yurchenko
On Mon, 15 Mar 2010 10:57:41 +0100, Kristian Nielsen
wrote:
What I am wondering at the moment is if the concept of global transaction ID should be a part of the new API, or if it is really an implementation detail of the redundancy service.
I'd go about it in the following way. We have an SQL server proper. And it has a state (database). And it is the state of the server that we want to be redundant (replicate, log, whatever). The particular server state is identified by a global transaction ID. From here it follows that the global transaction ID should be the same regardless of the plugin.
It is also quite clear that each plugin will be using its own ID format internally. E.g. binlogger will be obviously using file offsets and Galera will be using 64-bit signed integers. Then plugins will just have to implement their own mapping to the ID defined in API. Which in most cases will be trivial.
Having a unified global transaction ID is unbelievably good, especially when you have cascading replication, where each cascade can use its own plugin. It is so good that you will never ever have any troubles with it, and no troubles with global transaction ID amounts to nirvana. ;)
Hm. So in such a cascading replication scenario, the changeset would actually keep its identity in the form of the global transaction ID?

So if on master1, the user does

  BEGIN; INSERT INTO t1 VALUES (...); COMMIT;

this might get global transaction ID (UUID_master1, 100).

This might get replicated to a slave1 with multiple masters. The slave1 might then end up with three changesets, the one from master1, another from master2, and a third made by the user directly on slave1:

  (UUID_master1, 100)
  (UUID_master2, 200)
  (UUID_slave1, 50)

So what if we now want to cascade replicate from slave1 (now as a master) to slave2? Would slave2 then see the same three global transaction IDs?

  (UUID_master1, 100)
  (UUID_master2, 200)
  (UUID_slave1, 50)

That does not seem to work, does it? Seems to me slave1 would need to assign each changeset a new global transaction ID in order for slave2 to know in which order to apply the changesets? In particular, whether to apply (UUID_slave1, 50) before or after (UUID_master1, 100).

So I think I misunderstood you here?

Or did you mean that the _format_ of the global transaction ID should be the same across all plugins, so that in a cascading replication scenario where servers are using different replication plugins, the IDs can be treated uniformly?

 - Kristian.
On Tue, 16 Mar 2010 13:20:40 +0100, Kristian Nielsen
Alex Yurchenko
writes: On Mon, 15 Mar 2010 10:57:41 +0100, Kristian Nielsen
wrote: What I am wondering at the moment is if the concept of global transaction ID should be a part of the new API, or if it is really an implementation detail of the redundancy service.
I'd go about it in the following way. We have an SQL server proper. And it has a state (database). And it is the state of the server that we want to be redundant (replicate, log, whatever). The particular server state is identified by a global transaction ID. From here it follows that the global transaction ID should be the same regardless of the plugin.
It is also quite clear that each plugin will be using its own ID format internally. E.g. binlogger will be obviously using file offsets and Galera will be using 64-bit signed integers. Then plugins will just have to implement their own mapping to the ID defined in API. Which in most cases will be trivial.
Having a unified global transaction ID is unbelievably good, especially when you have cascading replication, where each cascade can use its own plugin. It is so good that you will never ever have any troubles with it, and no troubles with global transaction ID amounts to nirvana. ;)
Hm.
So in such cascading replication scenario, the changeset would actually keep its identity in the form of the global transaction ID?
So if on master1, the user does
BEGIN; INSERT INTO t1 VALUES (...); COMMIT;
this might get global transaction ID (UUID_master1, 100)
This might get replicated to a slave1 with multiple masters. The slave1 might then end up with three changesets, the one from master1, another from master2, and a third made by the user directly on slave1:
(UUID_master1, 100) (UUID_master2, 200) (UUID_slave1, 50)
So what if we now want to cascade replicate from slave1 (now as a master) to slave2? Would slave2 then see the same three global transaction IDs?
(UUID_master1, 100) (UUID_master2, 200) (UUID_slave1, 50)
That does not seem to work, does it? Seems to me slave1 would need to assign each changeset a new global transaction id in order for slave2 to know in which order to apply the changesets? In particular, whether to apply (UUID_slave1, 50) before or after (UUID_master1, 100).
So I think I misunderstood you here?
Or did you mean that the _format_ of the global transaction ID should be the same across all plugins, so that in a cascading replication scenario where servers are using different replication plugins, the IDs can be treated uniformly?
- Kristian.
Yes, you have misunderstood me; it is the value of the global transaction ID that stays constant (and the format too, of course) ;)

First of all, your example doesn't work exactly because you have chosen your global trx ID format (source, id_on_source) to be linearly incomparable. Second, let's forget for a moment about the global transaction ID format and exact implementation, just remember that you can build a monotonic gapless sequence out of them. And suppose that (UUID_master1, 100) has ID1, (UUID_master2, 200) has ID2 and (UUID_slave1, 50) has ID3. And they are ordered ID1 < ID2 < ID3 without gaps. So slave1 has ID1, ID2, ID3. Slave2 will see the same, as will everybody else.

Suppose slave2 crashes/reboots after it applied ID1. Now it can connect to ANY node of the cluster and say "hey, I need events starting at ID2". And every node will know where to start from, because ID2 means the same trx on every node.

This was all talking about a single Replication Set. You're probably envisioning master1 and master2 modifying disjoint (or maybe even the same) sets of data independently and slave1 aggregating changes from both of them. The masters don't see each other's changes, so they can't mutually order their changesets; only slave1 can. How to go about that?

Well, the trick here is that master1 and master2 in this case are not really members of the same replication cluster. They don't replicate to each other, right? So they have their own individual RSs and their own global transaction ID sequences, which are indeed incomparable. slave1 participates in both clusters, but can we say that the db on slave1 is a replica of master1 or master2? Well, it depends. If master1 modifies db1 and master2 modifies db2, then we just have 2 independent master-slave clusters that happen to share the same physical hardware as a slave. If master1 and master2 modify the same db independently, then, strictly speaking, we don't have a case of db replication here, and slave1 will order the changesets and assign its own ID sequence to them.

To summarize, there can be various esoteric setups, and the RS (Replication Set) concept is the key to understanding the scope of the global transaction ID there.

Regards, Alex

-- Alexey Yurchenko, Codership Oy, www.codership.com Skype: alexey.yurchenko, Phone: +358-400-516-011
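A minimal sketch of the "reconnect to any node" property described above; the changeset struct and the function are purely illustrative, not existing code.

#include <cstdint>
#include <vector>

struct changeset {
  uint64_t id;      /* position in the gapless, totally ordered history */
  /* payload omitted */
};

/* Any node holding the replication set's history can resume the stream for a
   slave that only remembers the last ID it applied (ID1 in the example). */
std::vector<changeset> events_from(const std::vector<changeset> &history,
                                   uint64_t last_applied)
{
  std::vector<changeset> out;
  for (const changeset &cs : history)
    if (cs.id > last_applied)
      out.push_back(cs);
  return out;
}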
Hi Kristian,
I agree with Alex's response, and I'll pick up hopefully all the remaining questions to answer here.
Quoting Kristian Nielsen
So the basis for such an interface would be the ability to install hooks to be called with row data for every handler::write_row(), handler::update_row(), and handler::delete_row() invocation, just like the current row-based binlogging does. And similarly for SQL statement execution, like statement-based logging does now. That should be clear enough.
Then comes the need to hook into transaction start and commit and opening tables. At this point, more of the internals of the MySQL server start to appear, and some careful thought will be needed to get an interface that exposes enough that plugins can do what they need, without exposing too much internal details of how MySQL query execution is implemented.
Yes, that's a good plan.
(But note that this is two different issues regarding "internal implementations". One is how the *query execution* is implemented. The other is how the *plugins* are implemented. If I understood you correctly, the interface used for semisync in MySQL fails on the latter point).
One example of how a lot of details from query execution pop up is with regard to the mixed-mode binlogging. This is where queries are logged as statements when this is safe, and as row events when this is not safe (nondeterministic queries). The concept of "mixed mode binlogging" certainly seems like something that should be an implementation detail of the plugin, not part of the interface. On the other hand, determining whether a query is safe for statement-based logging is highly complex, and exposing enough of the server for the plugin to be able to determine this by itself may be too much. (Maybe just expose an is_safe_for_statement() function to plugins could be enough).
Mixed-mode replication (or binlog format, as it is called in the MySQL code) is something I would leave completely for the DBMS to decide. The replication plugin should offer calls for SQL- and ROW-level replication, and the DBMS just decides which one to call in each case. Note that it is next to impossible for the replication plugin to judge whether a passed SQL statement is valid to be replicated directly or whether a ROW event should be used (this decision would require parsing in the replicator...). In general, it is better if the replicator does not need to look inside replication events at all. (OK, there are requirements for SQL-level filtering, heterogeneous replication, query rewriting etc., which may be valid use cases as well.)
Another example of hairy details is all the extra information that can go with an SQL statement into the binary log. Things like current timestamp, random seed, user-set @variables, etc. To support a statement-based replication plugin, we probably have to expose all of this on the interface in a clean fashion.
SQL-level replication requires that the session context be maintained, passed to the replicator, and enforced on the applying side. This will be a bit complicated to implement, but is something that cannot be avoided. Support for session context management must be in the replication API, but the context can be presented as an opaque object and the replicator does not need to know about the details.

Seppo

-- http://www.codership.com seppo.jaakola@codership.com tel: +358 40 510 5938 skype: seppo_jaakola
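A small sketch of how these two points could look at the API level; all names are hypothetical and this is not an existing interface. The server picks which call to make, and the session context travels as an opaque blob that only the applying side unpacks.

#include <cstddef>

/* Opaque to the replicator: it only forwards the blob; the applying side
   unpacks it to restore time zone, random seed, user variables, etc. */
struct session_context {
  const void *data;
  size_t length;
};

class replicator {
public:
  virtual ~replicator() {}
  /* Statement-based event: the server has already decided SQL form is safe. */
  virtual int replicate_statement(const char *query, size_t query_len,
                                  const session_context &ctx) = 0;
  /* Row-based event: packed row image, no session context needed. */
  virtual int replicate_row(unsigned long table_id,
                            const void *row_image, size_t row_len) = 0;
};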
Hi, Alex! On Jan 22, Alex Yurchenko wrote:
So when refactoring the replication code and API we suggest thinking of replication as a redundancy service, and establishing a general API for such a service that can be utilized by different implementations with different qualities of service. In other words - make the whole replication system a plugin (like storage engines are), not only some measly filters.
That's exactly how I see it - and what I'd like to have as a long term goal. Regards, Sergei
Hi, Alex! Continuing the old discussion... On Jan 22, Alex Yurchenko wrote:
1) It is time to drop MASTER/SLAVE mentality. This has nothing to do with replication per se. For example multi-master Galera cluster is turned into master-slave simply by directing all writing transactions to a single node. Without a single change in nodes' configuration, let alone our replication API. So master-slave is purely a load balancer thing - the node that receives writes IS the master even if the slaves think otherwise.
I may still use words "master" and "slave" below, in the sense that the part of the code that takes the changes generated by local clients and sends them out can be called "master" and the part of the code that receives them and applies can be called "slave". Both can be active on the same node though.
2) It is time to drop SYNCHRONOUS/ASYNCHRONOUS mentality. Although Galera cluster currently supports only synchronous operation, it can be turned into asynchronous with rather small changes in the code - again without any changes to API. This is merely a quality of replication engine.
Agree.
So when refactoring the replication code and API we suggest thinking of replication as a redundancy service, and establishing a general API for such a service that can be utilized by different implementations with different qualities of service. In other words - make the whole replication system a plugin (like storage engines are), not only some measly filters.
Ok, here I describe a possible model of what it can look like in the server:

* there are replication _events_ - they represent changes to the data, like creation of a table, or updating of a row.

* there are event _generators_ or _producers_ - facilities that generate events, for example the "SBR producer" generates a stream of events with the SQL statements - for statement-based replication. There can also be an "RBR producer", or, for example, a "MyISAM physical producer" - that generates events in terms of pwrite() calls.

* there are event _consumers_ - they connect to producers and consume the generated events. For example, a filter, such as one that only allows changes to a certain table to be replicated, is both a consumer and a producer of events.

* when events are sent to slaves - it's again just a pair of producer/consumer - events on the master disappear in the consumer, events on the slave come out from a producer.

* events can be _nested_ - one INSERT ... SELECT statement is one SBR event, but it corresponds to many RBR events, and every RBR event may correspond to many "MyISAM pwrite()" events.

* not everything can be replicated at every level, for example table creation cannot be replicated row-based, InnoDB changes cannot be replicated with "MyISAM pwrite()" events.

* it is up to the event generation facility to make sure its stream of events is complete. It is implemented by fetching events from the upper level: for example, the RBR producer connects - as a consumer - to the SBR producer, and when there are SBR events without nested RBR events it simply reads the corresponding SBR events and sends them out.

* a consumer may know the event format and look at the data fields, or it may not. For example, a filter that adds checksums to events or a consumer that sends events to slaves does not need to care about the event format. But a "final consumer" - the one that ultimately applies the event on the slave side - apparently should know how to parse the event.

* there's no explicit global transaction ID here, but I presume there can be a filter that adds it to events. That would work, as long as replication decides on the commit order (which it does, even now in MySQL/MariaDB).

This model seems to allow both native MySQL replication - SBR, RBR, and mixed, with exactly the same protocol on the wire - and different extensions, like semisync or fully synchronous replication, heterogeneous replication, arbitrary transport protocols, and so on. It looks like it can be completely compatible with MySQL replication if necessary, or use something absolutely different - depending on what plugins are loaded and how they are connected. But the model itself has no notion of "master node" or "slave node", synchronous or asynchronous, binlog, MySQL protocol, relay log, or even SBR/RBR/MIXED modes.

Regards, Sergei
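To make the producer/consumer wiring above concrete, here is a minimal sketch; the class names are invented for illustration and none of this is existing server code.

#include <cstddef>
#include <vector>

struct rpl_event {
  const void *data;   /* opaque to consumers that do not parse events */
  size_t length;
};

class event_consumer {
public:
  virtual ~event_consumer() {}
  virtual void consume(const rpl_event &ev) = 0;
};

class event_producer {
  std::vector<event_consumer *> consumers;
public:
  void attach(event_consumer *c) { consumers.push_back(c); }
protected:
  void emit(const rpl_event &ev) {
    for (event_consumer *c : consumers)
      c->consume(ev);
  }
};

/* A checksum filter is both a consumer (of the producer above it) and a
   producer (for whatever is attached below it). */
class checksum_filter : public event_consumer, public event_producer {
public:
  void consume(const rpl_event &ev) override {
    /* ...compute a checksum over ev.data and wrap the event, then: */
    emit(ev);
  }
};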
Hi!
On Thu, 13 May 2010 16:36:41 +0200, Sergei Golubchik
I may still use words "master" and "slave" below, in the sense that the part of the code that takes the changes generated by local clients and sends them out can be called "master" and the part of the code that receives them and applies can be called "slave". Both can be active on the same node though.
Indeed. In fact we use them in the same sense internally for lack of better established terminology.
* there's no explicit global transaction ID here, but I presume there can be a filter that adds it to events. That would work, as long as replication decides on the commit order (which it does, even now in MySQL/MariaDB).
Well, I cannot really argue against what you're proposing, as it is all nice. However, as a side note, my impression is that you're proposing a generic message passing architecture, not a replication/redundancy model. The difference is that you can implement all these producer-consumer interfaces, but it is not enough to make a redundancy service - you'll be just passing messages around.

Global transaction ID is not just "nice-to-have". It is a cornerstone of the redundancy model that we proposed, in that it defines a state of the database, thus making it possible to talk about a database state, comparing states, recovering a state, transferring a state and so on. There (perhaps) can be other redundancy models, but your proposal seems to deal solely with message passing, filtering and protocol encapsulation, with no redundancy in any recognizable form.

This is not to say that your proposal is not good - as a middle-level implementation detail it may be the way to go. I just wanted to point out that it is not enough for a redundancy/replication system. And I'd be seriously curious to see how you could make it without pulling in global transaction IDs (at least implicitly, as is done in the current MySQL replication).

Regards, Alex

-- Alexey Yurchenko, Codership Oy, www.codership.com Skype: alexey.yurchenko, Phone: +358-400-516-011
Hi, Alex! On May 13, Alex Yurchenko wrote:
On Thu, 13 May 2010 16:36:41 +0200, Sergei Golubchik
wrote: * there's no explicit global transaction ID here, but I presume there can be a filter that adds it to events. That would work, as long as replication decides on the commit order (which it does, even now in MySQL/MariaDB).
Well, I cannot really argue against what you're proposing, as it is all nice.
However, as a side note, my impression is that you're proposing a generic message passing architecture, not a replication/redundancy model. The difference is that you can implement all these producer-consumer interfaces, but it is not enough to make a redundancy service - you'll be just passing messages around. Global transaction ID is not just "nice-to-have". It is a cornerstone of the redundancy model that we proposed, in that it defines a state of the database, thus making it possible to talk about a database state, comparing states, recovering a state, transferring a state and so on. There (perhaps) can be other redundancy models, but your proposal seems to deal solely with message passing, filtering and protocol encapsulation, with no redundancy in any recognizable form.
This is not to say that your proposal is not good - as a middle-level implementation detail it may be the way to go. Just wanted to point out that it is not enough for redundancy/replication system. And I'd be seriously curious to see how you could make it without pulling in global transaction IDs (at least implicitly, as it is done in the current MySQL replication)
Yes, it only describes how the data gets to the redundancy service, but not what happens there. I intentionally kept the details of redundancy out, to be able to satisfy a wide range of different implementations. For example, if I'd put a global transaction ID explicitly in the model, then MySQL replication would not fit into it - it has such a concept only implicitly, as you have noted. So, what I did was, as Robert Hodges put it, "pushed the can down the road", and let the redundancy service take care of the transaction IDs. But perhaps I'm biased and the model I've described is influenced by MySQL replication more than it should've been? Regards, Sergei
Hi!
On Wed, 19 May 2010 15:05:55 +0200, Sergei Golubchik
Yes, it only describes how the data get to the redundancy service, but not what happens there. I intentionally kept the details of redundancy out, to be able to satisfy a wide range of different implementations.
For example, if I'd put a global transaction ID explicitly in the model, then MySQL replication would not fit into it - it has such a concept only implicitly, as you have noted.
So, what I did was, as Robert Hodges put it, "pushed the can down the road", and let the redundancy service to take care of the transaction ids.
But perhaps I'm biased and the model I've described is influenced by MySQL replication more than it should've been ?
Oh, not really. I just wanted to note that while you were proposing a useful framework, you did not touch actual replication/redundancy specifics.

Speaking of current MySQL replication, I was skeptical from the beginning that it would fit into the new redundancy service in its current unmodified form. It is simply too integrated with the server for that (just think of all those HAVE_REPLICATION ifdefs). That's why I proposed to keep them side by side and not try to unify them. It would be a shame to adapt the new replication architecture to the existing MySQL replication. I believe it should go the other way around.

And seriously, mapping an integer to a file offset is child's play and a minor tweak compared to making MySQL replication a module of a new redundancy service. So it is just a matter of an administrative decision - will _MariaDB_ replication have explicit global transaction IDs or not.

Regards, Alex

-- Alexey Yurchenko, Codership Oy, www.codership.com Skype: alexey.yurchenko, Phone: +358-400-516-011
Alex Yurchenko
On Wed, 19 May 2010 15:05:55 +0200, Sergei Golubchik
wrote: Yes, it only describes how the data get to the redundancy service, but not what happens there. I intentionally kept the details of redundancy out, to be able to satisfy a wide range of different implementations.
For example, if I'd put a global transaction ID explicitly in the model, then MySQL replication would not fit into it - it has such a concept only implicitly, as you have noted.
So, what I did was, as Robert Hodges put it, "pushed the can down the road", and let the redundancy service to take care of the transaction ids.
But perhaps I'm biased and the model I've described is influenced by MySQL replication more than it should've been ?
Oh, not really. I just wanted to note that while you were proposing a useful framework, you did not touch actual replication/redundancy specifics.
Yes, I agree. I think what we need to do is have several layers in the API. So far I have identified three different layers:

1. Event generators and consumers. This is what Sergei discussed. The essentials of this layer are hooks in handler::write_row() and similar places that provide data about changes (row values for row-based replication, query texts for statement-based replication, etc). There is no binlog or global transaction ID at this layer; I think there may not even be a defined event format as such, just an API for consumer plugins to get the information (and for generator plugins to provide it).

2. Primary redundancy service and TC manager. There will be exactly one of these in a server. It controls the 2-phase commit among different engines and binlogs etc. (and handles recovery of these after a crash). And it controls the commit order, so it would be the place to implement the global transaction ID.

3. Default event format. I think it will be useful to have a standard replication event format at a high level. This would be optional, so plugins at layers 1 and 2 would be free to define their own format, but having a standard format at some level would allow re-using more code and not having to re-invent the wheel in every plugin. Maybe at this level there could also be some API for defining the encapsulation of a specific event format, so that a generic binlog or network transport could be written supporting multiple event formats.
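As a rough illustration of the layer 2 idea only (the class and method names below are invented, nothing here is a committed design): since this single service decides the commit order, it is also the natural place to hand out global transaction IDs that reflect that order.

#include <cstdint>
#include <mutex>

class primary_redundancy_service {
  std::mutex commit_order_mutex;
  uint64_t next_id;
public:
  primary_redundancy_service() : next_id(1) {}
  /* Called once per transaction inside the commit path; the returned ID is
     gapless and matches the actual commit order. */
  uint64_t assign_commit_id() {
    std::lock_guard<std::mutex> lock(commit_order_mutex);
    return next_id++;
  }
};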
Speaking of current MySQL replication, I was skeptical from the beginning that it will fit into new redundancy service in its current unmodified form. It is simply too integrated with the server for that (just think of all those HAVE_REPLICATION ifdefs). That's why I proposed to keep them side by side and not try to unify them.
Yes. So with respect to the above levels, I think the current binlog implementation can be built upon a generic layer 1 API without problems. But for layer 2, the existing binlog implementation would be side-by-side with other alternatives. And for layer 3 I think it would also be side-by-side. The existing binlog format is really not very extensible, and a more flexible format (maybe based on Google protobuffers like Drizzle does) sounds like a more likely way forward.

So for something like Galera, I think it would hook into the layer 1 API to get the events from statements. At layer 2, it would implement its own TC manager, which controls the commit process and recovery, and handles the synchronous replication algorithm. And for layer 3, maybe it would implement its own event format, or maybe it could use the default event format (and re-use the code to package such events on a master and apply such events on a slave), but implement its own transport for the events.

Sounds reasonable?

 - Kristian.
Kristian Nielsen
1. Event generators and consumers. This is what Sergei discussed. The essentials of this layer are hooks in handler::write_row() and similar places that provide data about changes (row values for row-based replication, query texts for statement-based replication, etc). There is no binlog or global transaction ID at this layer; I think there may not even be a defined event format as such, just an API for consumer plugins to get the information (and for generator plugins to provide it).
I started to write up some more concrete design for this part. Here is the link to the worklog (text also included below):

http://askmonty.org/worklog/Server-Sprint/?tid=120

This part describes the lowest layer with the generation of events, and also has a little bit of discussion on some possible layers above (replicated event stream and binlog/transport APIs).

This API aims to be useful for something like Tungsten to implement its own binlog format. It might also be usable for something like Galera, or alternatively Galera might want to hook into a higher layer providing a default replication stream and slave applier thread (I'm not sure which). Also this API would be used to implement the legacy MySQL binlog format for compatibility.

This is far from a final design; I plan to work much more on the details. However, I think it is important to start discussing more concretely the overall shape of the API. There are several important overall decisions I made already, which I _did_ consider carefully, but which are still open to be discussed. And any feedback in general would be most welcome. This project is a major addition to the server with the potential to greatly influence the future direction of server development (or not, if we get it wrong). And it's impossible for one person to think of everything on his own. So any feedback welcomed; meanwhile I'll continue expanding this and other parts of the design.

 - Kristian.

-----------------------------------------------------------------------
High-Level Specification

Generators and consumers
------------------------

We have the two concepts:

1. Event _generators_, that produce events describing all changes to data in a server.

2. Event consumers, that receive such events and use them in various ways.

An example of an event generator is the execution of SQL statements, which generates events like those used for statement-based replication. Another example is PBXT engine-level replication.

An example of an event consumer is the writing of the binlog on a master.

Event generators are not really plugins. Rather, there are specific points in the server where events are generated. However, a generator can be part of a plugin, for example a PBXT engine-level replication event generator would be part of the PBXT storage engine plugin. Event consumers on the other hand could be a plugin.

One generator can be stacked on top of another. This means that a generator on top (for example row-based events) will handle some events itself (eg. non-deterministic update in mixed-mode binlogging). Other events that it does not want to or cannot handle (for example deterministic delete or DDL) will be deferred to the generator below (for example statement-based events).

Materialisation (or not)
------------------------

A central decision is how to represent events that are generated in the API at the point of generation.

I want to avoid making the API require that events are materialised. By "materialised" I mean that all (or most) of the data for the event is written into memory in a struct/class used inside the server or serialised in a data buffer (byte buffer) in a format suitable for network transport or disk storage.

Using a non-materialised event means storing just a reference to appropriate context that allows retrieving all information for the event using accessors. Ie. typically this would be based on getting the event information from the THD pointer.
Some reasons to avoid using materialised events in the API:

- Replication events have a _lot_ of detailed context information that can be needed in events: user-defined variables, random seed, character sets, table column names and types, etc. etc. If we make the API based on materialisation, then the initial decision about which context information to include with which events will have to be done in the API, while ideally we want this decision to be done by the individual consumer plugin. There will thus be a conflict between what to include (to allow consumers access) and what to exclude (to avoid excessive needless work).

- Materialising means defining a very specific format, which will tend to make the API less generic and flexible.

- Unless the materialised format is made _very_ specific (and thus very inflexible), it is unlikely to be directly useful for transport (eg. binlog), so it will need to be re-materialised into a different format anyway, wasting work.

- If a generator on top handles an event, then we want to avoid wasting work materialising an event in a generator below which would be completely unused. Thus there would be a need for the upper generator to somehow notify the lower generator ahead of event generation time to not fire an event, complicating the API.

Some advantages of materialisation:

- Using an API based on passing around some well-defined struct event (or byte buffer) will be simpler than the complex class hierarchy proposed here with no requirement for materialisation.

- Defining a materialised format would allow an easy way to use the same consumer code on a generator that produces events at the source of execution and on a generator that produces events from eg. reading them from an event log.

Note that there can be some middle way, where some data is materialised and some is kept as a reference to context (eg. THD) only. This however loses most of the mentioned advantages of materialisation.

The design proposed here aims for as little materialisation as possible.

Default materialisation format
------------------------------

While the proposed API doesn't _require_ materialisation, we can still think about providing the _option_ for built-in materialisation. This could be useful if such materialisation is made suitable for transport to a different server (eg. no endian-dependence etc). If there is a facility for such materialisation built into the API, it becomes possible to write something like a generic binlog plugin or generic network transport plugin. This would be really useful for eg. PBXT engine-level replication, as it could be implemented without having to re-invent a binlog format.

I added in the proposed API a simple facility to materialise every event as a string of bytes. To use this, I still need to add a suitable facility to de-materialise the event.

However, it is still an open question whether such a facility will be at all useful. It still has some of the problems with materialisation mentioned above. And I think it is likely that a good binlog implementation will need to do more than just blindly copy opaque events from one endpoint to another. For example, it might need different event boundaries (merge and/or split events); it might need to augment or modify events, or inject new events, etc.

So I think maybe it is better to add such a generic materialisation facility on top of the basic event generator API.
Such a facility would provide materialisation of a replication event stream, not of individual events, so it would be more flexible in providing a good implementation. It would be implemented for all generators. It would be separate from the event generator API (so we have the flexibility to put a filter class in between generator and materialisation), and could also be separate from the actual transport handling stuff like fsync() of binlog files and socket connections etc. It would be paired with a corresponding applier API which would handle executing events on a slave.

Then we can have a default materialised event format, which is available, but not mandatory. So there can still be other formats alongside (like the legacy MySQL 5.1 binlog event format, and maybe Tungsten would have its own format).

Encapsulation
-------------

Another fundamental question about the design is the level of encapsulation used for the API.

At the implementation level, a lot of the work is basically to pull out all of the needed information from the THD object/context. The API I propose tries to _not_ expose the THD to consumers. Instead it provides accessor functions for all the bits and pieces relevant to each replication event, while the event class itself likely will be more or less just an encapsulated THD.

So an alternative would be to have a generic event that was just (type, THD). Then consumers could just pull out whatever information they want from the THD. The THD implementation is already exposed to storage engines. This would of course greatly reduce the size of the API, eliminating lots of class definitions and accessor functions. Though arguably it wouldn't really simplify the API, as the complexity would just be in understanding the THD class.

Note that we do not have to take any performance hit from using encapsulated accessors since compilers can inline them (though if inlining, then we do not get any ABI stability with respect to the THD implementation).

For now, the API is proposed without exposing the THD class. (Similar encapsulation could be added in the actual implementation to also not expose TABLE and similar classes.)

-----------------------------------------------------------------------
Low-Level Design

A consumer is implemented as a virtual class (interface). There is one virtual function for every event that can be received. A consumer would derive from the base class and override methods for the events it wants to receive.

There is one consumer interface for each generator. When a generator A is stacked on B, the consumer interface for A inherits from the interface for B. This way, when A defers an event to B, the consumer for A will receive the corresponding event from B.

There are methods for a consumer to register itself to receive events from each generator. I still need to find a way for a consumer in one plugin to register itself with a generator implemented in another plugin (eg. PBXT engine-level replication). I also need to add a way for consumers to de-register themselves.

The current design has consumer callbacks return 0 for success and an error code otherwise. I still need to think more about whether this is useful (ie. what is the semantics of returning an error from a consumer callback).

Each event passed to consumers is defined as a class with public accessor methods to a private context (which is mostly the THD).
My intention is to make all events passed around const, so that the same event can be passed to each of multiple registered consumers (and to emphasise that consumers do not have the ability to modify events). It still needs to be seen whether that const-ness will be feasible in practice without very heavy modification/constification of existing code.

What follows is a partial draft of a possible definition of the API as concrete C++ class definitions.

-----------------------------------------------------------------------

/*
  Virtual base class for generated replication events.

  This is the parent of events generated from all kinds of generators. Only
  child classes can be instantiated.

  This class can be used by code that wants to treat events in a generic
  way, without any knowledge of event details. I still need to decide
  whether such generic code is sensible.
*/
class rpl_event_base
{
  /*
    Maybe we will want the ability to materialise an event to a standard
    binary format. This could be achieved with a base method like this. The
    actual materialisation would be implemented in each deriving class. The
    public methods would provide different interfaces for specifying the
    buffer or for writing directly into IO_CACHE or file.
  */

  /* Return 0 on success, -1 on error, -2 on out-of-buffer. */
  int materialise(uchar *buffer, size_t buflen) const;
  /*
    Returns NULL on error or else malloc()ed buffer with materialised
    event, caller must free().
  */
  uchar *materialise() const;
  /* Same but using passed in memroot. */
  uchar *materialise(mem_root *memroot) const;
  /*
    Materialise to user-supplied writer function (could write directly to
    file or the like).
  */
  int materialise(int (*writer)(uchar *data, size_t len, void *context)) const;

  /*
    As for what to do with a materialised event, there are a couple of
    possibilities.

    One is to have a de_materialise() method somewhere that can construct
    an rpl_event_base (really a derived class of course) from a buffer or
    writer function. This would require each accessor function to
    conditionally read its data from either THD context or buffer (GCC is
    able to optimise several such conditionals in multiple accessor
    function calls into one conditional), or we can make all accessors
    virtual if the performance hit is acceptable.

    Another is to have different classes for accessing events read from
    materialised event data.

    Also, I still need to think about whether it is at all useful to be
    able to generically materialise an event at this level. It may be that
    any binlog/transport will in any case need to understand more of the
    format of events, so that such materialisation/transport is better done
    at a different layer.
  */

protected:
  /* Implementation which is the basis for materialise(). */
  virtual int do_materialise(int (*writer)(uchar *data, size_t len,
                                           void *context)) const = 0;

private:
  /* Virtual base class, private constructor to prevent instantiation. */
  rpl_event_base();
};


/*
  These are the event types output from the transaction event generator.

  This generator is not stacked on anything.

  The transaction event generator marks the start and end (commit or
  rollback) of transactions. It also gives information about whether the
  transaction was a full transaction or autocommitted statement, whether
  transactional tables were involved, whether non-transactional tables were
  involved, and XA information (ToDo).
*/

/* Base class for transaction events. */
class rpl_event_transaction_base : public rpl_event_base
{
public:
  /*
    Get the local transaction id.
    This id is only unique within one server. It is allocated whenever a
    new transaction is started. Can be used to identify events belonging to
    the same transaction in a binlog-like stream of events streamed in
    parallel among multiple transactions.
  */
  uint64_t get_local_trx_id() const { return thd->local_trx_id; };

  bool get_is_autocommit() const;

private:
  /* The context is the THD. */
  THD *thd;

  rpl_event_transaction_base(THD *_thd) : thd(_thd) { };
};

/* Transaction start event. */
class rpl_event_transaction_start : public rpl_event_transaction_base
{
};

/* Transaction commit. */
class rpl_event_transaction_commit : public rpl_event_transaction_base
{
public:
  /*
    The global transaction id is unique cross-server.

    It can be used to identify the position from which to start a slave
    replicating from a master.

    This global ID is only available once the transaction is decided to
    commit by the TC manager / primary redundancy service. This TC also
    allocates the ID and decides the exact semantics (can there be gaps,
    etc); however the format is fixed (cluster_id, running_counter).
  */
  struct global_transaction_id
  {
    uint32_t cluster_id;
    uint64_t counter;
  };

  const global_transaction_id *get_global_transaction_id() const;
};

/* Transaction rollback. */
class rpl_event_transaction_rollback : public rpl_event_transaction_base
{
};


/* Base class for statement events. */
class rpl_event_statement_base : public rpl_event_base
{
public:
  LEX_STRING get_current_db() const;
};

class rpl_event_statement_start : public rpl_event_statement_base
{
};

class rpl_event_statement_end : public rpl_event_statement_base
{
public:
  int get_errorcode() const;
};

class rpl_event_statement_query : public rpl_event_statement_base
{
public:
  LEX_STRING get_query_string();
  ulong get_sql_mode();
  const CHARSET_INFO *get_character_set_client();
  const CHARSET_INFO *get_collation_connection();
  const CHARSET_INFO *get_collation_server();
  const CHARSET_INFO *get_collation_default_db();

  /*
    Access to relevant flags that affect query execution.

    Use as
      if (ev->get_flags() & (uint32)STMT_FOREIGN_KEY_CHECKS) { ... }
  */
  enum flag_bits
  {
    STMT_FOREIGN_KEY_CHECKS,      // @@foreign_key_checks
    STMT_UNIQUE_KEY_CHECKS,       // @@unique_checks
    STMT_AUTO_IS_NULL,            // @@sql_auto_is_null
  };
  uint32_t get_flags();

  ulong get_auto_increment_offset();
  ulong get_auto_increment_increment();

  // And so on for: time zone; day/month names; connection id; LAST_INSERT_ID;
  // INSERT_ID; random seed; user variables.
  //
  // We probably also need get_uses_temporary_table(), get_used_user_vars(),
  // get_uses_auto_increment() and so on, so a consumer can get more
  // information about what kind of context information a query will need when
  // executed on a slave.
};

class rpl_event_statement_load_query : public rpl_event_statement_query
{
};

/*
  This event is fired with blocks of data for files read (from server-local
  file or client connection) for LOAD DATA.
*/
class rpl_event_statement_load_data_block : public rpl_event_statement_base
{
public:
  struct block
  {
    const uchar *ptr;
    size_t size;
  };
  block get_block() const;
};

/* Base class for row-based replication events. */
class rpl_event_row_base : public rpl_event_base
{
public:
  /*
    Access to relevant handler extra flags and other flags that affect row
    operations.

    Use as
      if (ev->get_flags() & (uint32)ROW_WRITE_CAN_REPLACE) { ... }
  */
  enum flag_bits
  {
    ROW_WRITE_CAN_REPLACE,              // HA_EXTRA_WRITE_CAN_REPLACE
    ROW_IGNORE_DUP_KEY,                 // HA_EXTRA_IGNORE_DUP_KEY
    ROW_IGNORE_NO_KEY,                  // HA_EXTRA_IGNORE_NO_KEY
    ROW_DISABLE_FOREIGN_KEY_CHECKS,     // ! @@foreign_key_checks
    ROW_DISABLE_UNIQUE_KEY_CHECKS,      // ! @@unique_checks
  };
  uint32_t get_flags();

  /* Access to list of tables modified. */
  class table_iterator
  {
  public:
    /* Returns table, NULL after last. */
    const TABLE *get_next();
  private:
    // ...
  };
  table_iterator get_modified_tables() const;

private:
  /* Context used to provide accessors. */
  THD *thd;

protected:
  rpl_event_row_base(THD *_thd) : thd(_thd) { }
};


class rpl_event_row_write : public rpl_event_row_base
{
public:
  const BITMAP *get_write_set() const;
  const uchar *get_after_image() const;
};

class rpl_event_row_update : public rpl_event_row_base
{
public:
  const BITMAP *get_read_set() const;
  const BITMAP *get_write_set() const;
  const uchar *get_before_image() const;
  const uchar *get_after_image() const;
};

class rpl_event_row_delete : public rpl_event_row_base
{
public:
  const BITMAP *get_read_set() const;
  const uchar *get_before_image() const;
};


/*
  Event consumer callbacks.

  An event consumer registers with an event generator to receive event
  notifications from that generator.

  The consumer has callbacks (in the form of virtual functions) for the
  individual event types the consumer is interested in. Only callbacks that
  are non-NULL will be invoked. If an event applies to multiple callbacks
  in a single callback struct, it will only be passed to the most specific
  non-NULL callback (so events never fire more than once per registration).

  The lifetime of the memory holding the event is only for the duration of
  the callback invocation, unless otherwise noted.

  Callbacks return 0 for success or error code (ToDo: does this make sense?).
*/

struct rpl_event_consumer_transaction
{
  virtual int trx_start(const rpl_event_transaction_start *) { return 0; }
  virtual int trx_commit(const rpl_event_transaction_commit *) { return 0; }
  virtual int trx_rollback(const rpl_event_transaction_rollback *) { return 0; }
};

/*
  Consuming statement-based events.

  The statement event generator is stacked on top of the transaction event
  generator, so we can receive those events as well.
*/
struct rpl_event_consumer_statement : public rpl_event_consumer_transaction
{
  virtual int stmt_start(const rpl_event_statement_start *) { return 0; }
  virtual int stmt_end(const rpl_event_statement_end *) { return 0; }

  virtual int stmt_query(const rpl_event_statement_query *) { return 0; }

  /* Data for a file used in LOAD DATA [LOCAL] INFILE. */
  virtual int stmt_load_data_block(const rpl_event_statement_load_data_block *)
  { return 0; }

  /*
    These are specific kinds of statements; if specified they override
    consume_stmt_query() for the corresponding event.
  */
  virtual int stmt_load_query(const rpl_event_statement_load_query *ev)
  { return stmt_query(ev); }
};

/*
  Consuming row-based events.

  The row event generator is stacked on top of the statement event
  generator.
*/
struct rpl_event_consumer_row : public rpl_event_consumer_statement
{
  virtual int row_write(const rpl_event_row_write *) { return 0; }
  virtual int row_update(const rpl_event_row_update *) { return 0; }
  virtual int row_delete(const rpl_event_row_delete *) { return 0; }
};

/*
  Registration functions.

  ToDo: Make a way to de-register.

  ToDo: Find a good way for a plugin (eg. PBXT) to publish a generator
  registration method.
*/

int rpl_event_transaction_register(const rpl_event_consumer_transaction *cbs);
int rpl_event_statement_register(const rpl_event_consumer_statement *cbs);
int rpl_event_row_register(const rpl_event_consumer_row *cbs);
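As an illustration only (not part of the proposal), a consumer plugin written against the draft interfaces above might look roughly like this; my_write_to_log() and my_plugin_init() are placeholders:

struct my_example_consumer : public rpl_event_consumer_row
{
  /* Row inserts: record the after-image. */
  virtual int row_write(const rpl_event_row_write *ev)
  {
    /* my_write_to_log(ev->get_after_image()); */
    return 0;
  }
  /* Statements not handled as row events still arrive from the statement
     generator below. */
  virtual int stmt_query(const rpl_event_statement_query *ev)
  {
    /* my_write_to_log(ev->get_query_string()); */
    return 0;
  }
  /* Commits come from the transaction generator at the bottom of the stack;
     a binlog-like consumer would record the global transaction ID here. */
  virtual int trx_commit(const rpl_event_transaction_commit *ev)
  {
    return 0;
  }
};

static my_example_consumer the_consumer;

/* Registering for row events also delivers statement and transaction events,
   since the consumer interfaces inherit the same way the generators stack. */
int my_plugin_init()
{
  return rpl_event_row_register(&the_consumer);
}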
Hi, Kristian! On Jun 24, Kristian Nielsen wrote:
----------------------------------------------------------------------- High-Level Specification
Generators and consumers
------------------------
We have the two concepts:
1. Event _generators_, that produce events describing all changes to data in a server.
2. Event consumers, that receive such events and use them in various ways.
An example of an event generator is the execution of SQL statements, which generates events like those used for statement-based replication. Another example is PBXT engine-level replication.
An example of an event consumer is the writing of the binlog on a master.
Some event generators are not really plugins. Rather, there are specific points in the server where events are generated. However, a generator can be part of a plugin, for example a PBXT engine-level replication event generator would be part of the PBXT storage engine plugin. And for example we could write a filter plugin, which would be stacked on top of an existing generator and provide the same event types and interfaces, but filtered in some way (for example by removing certain events on the master side, or by re-writing events in certain ways).
Event consumers on the other hand could be a plugin.
One generator can be stacked on top of another. This means that a generator on top (for example row-based events) will handle some events itself (eg. non-deterministic update in mixed-mode binlogging). Other events that it does not want to or cannot handle (for example deterministic delete or DDL) will be deferred to the generator below (for example statement-based events).
There's a problem with this idea. Say, Event B is nested in Event A:

... ... |<- Event A ... .. .. ->| .. .. ..
* * * * * |<- Event B ->| * * * *

This is fine. But what about

... ... |<- Event A ... ->| .. .. ..
* * * * * |<- Event B ->| * * * *

In the latter case no event is nested in the other, and no level can simply defer to the other. I don't know a solution for this, I'm just hoping the above situation is impossible. At least, I could not find an example of "overlapping" events.
Default materialisation format
------------------------------

While the proposed API doesn't _require_ materialisation, we can still think about providing the _option_ for built-in materialisation. This could be useful if such materialisation is made suitable for transport to a different server (eg. no endian-dependence etc). If there is a facility for such materialisation built-in to the API, it becomes possible to write something like a generic binlog plugin or generic network transport plugin. This would be really useful for eg. PBXT engine-level replication, as it could be implemented without having to re-invent a binlog format.
I added in the proposed API a simple facility to materialise every event as a string of bytes. To use this, I still need to add a suitable facility to de-materialise the event.
Couldn't that be done not in the API or generator, but as a filter somewhere up the chain ?
So I think maybe it is better to add such a generic materialisation facility on top of the basic event generator API.
Ah, right.
Encapsulation -------------
Another fundamental question about the design is the level of encapsulation used for the API.
At the implementation level, a lot of the work is basically to pull out all of the needed information from the THD object/context. The API I propose tries to _not_ expose the THD to consumers. Instead it provides accessor functions for all the bits and pieces relevant to
of course
each replication event, while the event class itself likely will be more or less just an encapsulated THD.
So an alternative would be to have a generic event that was just (type, THD). Then consumers could just pull out whatever information they want from the THD. The THD implementation is already exposed to storage engines. This would of course greatly reduce the size of the
no, it's not. THD is not exposed to engines (unless they define MYSQL_SERVER but then it's not our problem), they use accessor functions.
API, eliminating lots of class definitions and accessor functions. Though arguably it wouldn't really simplify the API, as the complexity would just be in understanding the THD class.
For now, the API is proposed without exposing the THD class. (Similar encapsulation could be added in actual implementation to also not expose TABLE and similar classes).
completely agree
----------------------------------------------------------------------- Low-Level Design
A consumer is implemented as a virtual class (interface). There is one virtual function for every event that can be received. A consumer would derive from
hm. This part I don't understand. How would that work ? A consumer wants to see a uniform stream of events, perhaps for sending them to a slave. Why would you need different consumers and different methods for different events ? I'd just have one method, receive_event(rpl_event_base *)
the base class and override methods for the events it wants to receive.
There are methods for a consumer to register itself to receive events from each generator. I still need to find a way for a consumer in one plugin to register itself with a generator implemented in another plugin (eg. PBXT engine-level replication). I also need to add a way for consumers to de-register themselves.
Let's say that all generators are hard-coded and statically compiled in. You can think about how to dynamically register them (e.g. pbxt) later.
The current design has consumer callbacks return 0 for success and error code otherwise. I still need to think more about whether this is useful (ie. what is the semantics of returning an error from a consumer callback).
Each event passed to consumers is defined as a class with public accessor methods to a private context (which is mostly the THD).
My intention is to make all events passed around const, so that the same event can be passed to each of multiple registered consumers (and to emphasise that consumers do not have the ability to modify events). It still needs to be seen whether that const-ness will be feasible in practice without very heavy modification/constification of existing code.
----------------------------------------------------------------------- /* Virtual base class for generated replication events.
This is the parent of events generated from all kinds of generators. Only child classes can be instantiated.
This class can be used by code that wants to treat events in a generic way, without any knowledge of event details. I still need to decide whether such generic code is sensible.
sure it is. write event to binlog. send it to a slave. add a checksum, encrypt, compress - all these consumers can treat an event as an opaque stream of bytes.
*/ class rpl_event_base { ... int materialise(int (*writer)(uchar *data, size_t len, void *context)) const; ... Also, I still need to think about whether it is at all useful to be able to generically materialise an event at this level. It may be that any binlog/transport will in any case need to understand more of the format of events, so that such materialisation/transport is better done at a different layer.
Right, I'm doubtful too. Say, to materialize a statement level event you need to know exactly what bits of the context you want to include. When replicating to MariaDB it's one set, when replicating to an identically configured MariaDB of the same version it's another set, and when replicating to, say, DB2, it's probably a different (larger) set.
};
/* The global transaction id is unique cross-server.
It can be used to identify the position from which to start a slave replicating from a master.
This global ID is only available once the transaction is decided to commit by the TC manager / primary redundancy service. This TC also allocates the ID and decides the exact semantics (can there be gaps, etc); however the format is fixed (cluster_id, running_counter).
uhm. XID format is defined by the XA standard. An XID consists of:
- format ID (unsigned long)
- global transaction ID - up to 64 bytes
- branch qualifier - up to 64 bytes

As your transaction id is smaller, you will need to consider the XID a part of the "context" - in cases where the XID was generated externally. Same about the binlog position - which is a "transaction id" in MySQL replication. It doesn't fit into your scheme, so it will have to be a part of the context. And unless the redundancy service will be allowed to ignore your transaction ids, MySQL native replication will not fit into the API.

Regards, Sergei
Sergei Golubchik
Hi, Kristian!
Hi, thanks for your comments! A couple of questions inline, and some comments/thoughts.
On Jun 24, Kristian Nielsen wrote:
At the implementation level, a lot of the work is basically to pull out all of the needed information from the THD object/context. The API I propose tries to _not_ expose the THD to consumers. Instead it provides accessor functions for all the bits and pieces relevant to
of course
each replication event, while the event class itself likely will be more or less just an encapsulated THD.
So an alternative would be to have a generic event that was just (type, THD). Then consumers could just pull out whatever information they want from the THD. The THD implementation is already exposed to storage engines. This would of course greatly reduce the size of the
no, it's not. THD is not exposed to engines (unless they define MYSQL_SERVER but then it's not our problem), they use accessor functions.
Ah, I see. Ok good, so it makes sense to use accessor functions in the replication APIs also, with no trace of THD.
API, eliminating lots of class definitions and accessor functions. Though arguably it wouldn't really simplify the API, as the complexity would just be in understanding the THD class.
For now, the API is proposed without exposing the THD class. (Similar encapsulation could be added in actual implementation to also not expose TABLE and similar classes).
completely agree
Ok, so some follow up questions:

1. Do I understand correctly that you agree that the API should also encapsulate TABLE and similar classes? These _are_ exposed to storage engines as far as I can see.

2. If TABLE and so on should be encapsulated, there will be the issue of having iterators to run over columns, etc. Do we already have standard classes for this that could be used? Or should I model this on the iterators of the Standard C++ library, for example?

(I would like to make the new API fit in as well as possible with the existing MySQL/MariaDB code, which you know much better).
A consumer is implented as a virtual class (interface). There is one virtual function for every event that can be received. A consumer would derive from
hm. This part I don't understand. How would that work ? A consumer wants to see a uniform stream of events, perhaps for sending them to a slave. Why would you need different consumers and different methods for different events ?
I'd just have one method, receive_event(rpl_event_base *)
Ok, so do I understand you correctly that class rpl_event_base would have a type field, and the consumer could then down-cast to the appropriate specific event class based on the type?

receive_event(const rpl_event_base *generic_event)
{
  switch (generic_event->type)
  {
  case rpl_event_base::RPL_EVENT_STATEMENT_QUERY:
    const rpl_event_statement_query *ev=
      static_cast<const rpl_event_statement_query *>(generic_event);
    do_stuff(ev->get_query_string(), ...);
    break;
  case rpl_event_base::RPL_EVENT_ROW_UPDATE:
    const rpl_event_row_update *ev=
      static_cast<const rpl_event_row_update *>(generic_event);
    do_stuff(ev->get_after_image(), ...);
    break;
  ...
  }
}

I have always disliked having such type field and upcasting. So I tried to make an API where it was not needed. Like this:

class my_event_consumer
{
  int stmt_query(const rpl_event_statement_query *ev)
  {
    do_stuff(ev->get_query_string(), ...);
  }
  int row_update(const rpl_event_row_update *ev)
  {
    do_stuff(ev->get_after_image(), ...);
  }
  ...
};
/* The global transaction id is unique cross-server.
It can be used to identify the position from which to start a slave replicating from a master.
This global ID is only available once the transaction is decided to commit by the TC manager / primary redundancy service. This TC also allocates the ID and decides the exact semantics (can there be gaps, etc); however the format is fixed (cluster_id, running_counter).
uhm. XID format is defined by the XA standard. An XID consists of - format ID (unsigned long) - global transaction ID - up to 64 bytes - branch qualifier - up to 64 bytes
as your transaction id is smaller, you will need to consider XID a part of the "context" - in cases where XID was generated externally.
Same about binlog position - which is a "transaction id" in the MySQL replication. It doesn't fit into your scheme, so it will have to be a part of the context. And unless the redundancy service will be allowed to ignore your transaction ids, MySQL native replication will not fit into the API.
Yes, good points. Ok, so my idea with the global transaction ID is following the previous discussion, that there can be a primary redundancy plugin, and this gets to control the commit order and create the global transaction IDs. And the global transaction ID is used to allow slaves to easily synchronise to any master. As long as a slave commits the last global transaction ID applied, it can connect to any master and know where to start replicating (or determine if the slave is actually ahead of the would-be master). Etc. (I do not know if XID can be used for this purpose, but even if not your point is still valid). So maybe it is wrong to fix a particular global transaction ID format at this level of API. One option is to have only the local transaction ID at this level of API. Then the primary redundancy plugin / TC manager should expose an API that allows consumers (and others) to look up the global transaction ID from the local transaction ID (I believe it will need to maintain such mapping anyway). Another option is to expose a global transaction ID of generic format at this layer (we could even use the XA standard XID format).
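To illustrate the first option, the lookup facility on the primary redundancy service / TC manager could be something as small as this (just a sketch, the names are made up; it reuses the global_transaction_id struct from the draft rpl_event_transaction_commit class):

class rpl_trx_id_mapper
{
public:
  /*
    Return 0 and fill *out once the local transaction has been assigned a
    global transaction ID (ie. once its commit has been decided); return
    non-zero if no global ID exists (yet) for that local transaction.
  */
  virtual int lookup_global_id(
      uint64_t local_trx_id,
      rpl_event_transaction_commit::global_transaction_id *out) const = 0;
  virtual ~rpl_trx_id_mapper() { }
};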
class rpl_event_base { ... int materialise(int (*writer)(uchar *data, size_t len, void *context)) const; ... Also, I still need to think about whether it is at all useful to be able to generically materialise an event at this level. It may be that any binlog/transport will in any case need to understand more of the format of events, so that such materialisation/transport is better done at a different layer.
Right, I'm doubtful too. Say, to materialize a statement level event you need to know exactly what bits of the context you want to include. When replicating to MariaDB it's one set, when replicating to an identically configured MariaDB of the same version it's another set, and when replicating to, say, DB2, it's probably a different (larger) set.
Yes, exactly. So that's the main reason I'd like to have a non-materialised API, and then possibly build materialisation on top. What I have been thinking is to have a default (but not mandatory) event format. I am thinking to maybe use Google protocol buffers (they seem fairly good for this purpose, and they are quite popular, eg. Monty is planning to use them for dynamic columns). With such a format, it would be possible to write generic plugins for a binlog implementation, direct transport to slave, checksum/encrypt/compress etc. etc. Which I agree would be nice (and such plugins don't really want to have to handle complete materialisation of any possible event themselves from scratch). Incidentally, I think the existing binlog format is really hopeless to use with such generic plugins; it seems intricately tied to a particular binlog implementation (like including master binlog file names and file offsets inside of events).
One generator can be stacked on top of another. This means that a generator on top (for example row-based events) will handle some events itself (eg. non-deterministic update in mixed-mode binlogging). Other events that it does not want to or cannot handle (for example deterministic delete or DDL) will be defered to the generator below (for example statement-based events).
There's a problem with this idea. Say, Event B is nested in Event A:
... ... |<- Event A ... .. .. ->| .. .. ..
* * * * * |<- Event B ->| * * * *
This is fine. But what about
... ... |<- Event A ... ->| .. .. ..
* * * * * |<- Event B ->| * * * *
In the latter case no event is nested in the other, and no level can simply defer to the other.
I don't know a solution for this, I'm just hoping the above situation is impossible. At least, I could not find an example of "overlapping" events.
Another way of thinking about this is that we have one layer above handling (or not handling) an event that can be generated below. So if a statement is handled using row-based replication events, the row-based replication event generator on top will choose to discard the corresponding event from the statement-based generator below. If it is not handled, the row-based generator will pass through the event from the statement-based one. (This is one reason I wanted event generation to be very cheap (no materialisation); I prefer this way of generating below and discarding above to having the layer above set and clear flags (or whatever) for the layer below about whether to generate events or not.)

So one case where this becomes a problem is if we have a multi-table update where one table is PBXT and another is not, and we are using PBXT engine-level replication on top of statement-based replication. In this case, one half of the statement-based event is handled by the layer above, but the other is not. So we cannot deal with this situation. (We could of course think of ways to handle this. For example, modify the statement event to include a flag to only touch the non-PBXT tables when applied on the slave. This would correspond to slicing up the events to make them be nested properly in one another in the nested-event description. Probably it is better just to not support such a scenario, throwing an error.)
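To make the "generate below, discard above" idea concrete, a minimal sketch (the class and member names here are hypothetical, not part of the proposed API):

class row_generator
{
public:
  /* Called with each statement event produced by the generator below. */
  int statement_event_from_below(const rpl_event_statement_query *ev)
  {
    if (handled_as_rows)
      return 0;                        /* discard: row events already cover it */
    return forward_to_consumers(ev);   /* pass the statement event through */
  }
private:
  bool handled_as_rows;                /* decided while executing the statement */
  int forward_to_consumers(const rpl_event_statement_query *ev);
};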
I added in the proposed API a simple facility to materialise every event as a string of bytes. To use this, I still need to add a suitable facility to de-materialise the event.
Couldn't that be done not in the API or generator, but as a filter somewhere up the chain ?
Yes. It's interesting that it could be a filter/generator higher in the stack, I had not thought about that. - Kristian.
Hi, Kristian!
API, eliminating lots of class definitions and accessor functions. Though arguably it wouldn't really simplify the API, as the complexity would just be in understanding the THD class.
For now, the API is proposed without exposing the THD class. (Similar encapsulation could be added in actual implementation to also not expose TABLE and similar classes).
completely agree
Ok, so some follow up questions:
1. Do I understand correctly that you agree that the API should also encapsulate TABLE and similar classes? These _are_ exposed to storage engines as far as I can see.
I think it's ok to use TABLE and Field as storage engines are using them. It would be good to encapsulate them, of course, but I'd say there's no need to try to do it at all costs.
2. If TABLE and so on should be encapsulated, there will be the issue of having iterators to run over columns, etc. Do we already have standard classes for this that could be used? Or should I model this on the iterators of the Standard C++ library, for example?
We have List and an iterator over it. Alternatively, you can return an array and let the caller iterate it any way it wants.
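For example (just a sketch of the idiom, assuming the event accessors hand out the server's List from sql/sql_list.h):

#include "sql_list.h"                /* List, List_iterator */

static void walk_tables(List<TABLE> &tables)
{
  List_iterator<TABLE> it(tables);
  TABLE *table;
  while ((table= it++))
  {
    /* ... look at the table and its fields ... */
  }
}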
(I would like to make the new API fit in as well as possible with the existing MySQL/MariaDB code, which you know much better).
A consumer is implemented as a virtual class (interface). There is one virtual function for every event that can be received. A consumer would derive from
hm. This part I don't understand. How would that work ? A consumer wants to see a uniform stream of events, perhaps for sending them to a slave. Why would you need different consumers and different methods for different events ?
I'd just have one method, receive_event(rpl_event_base *)
Ok, so do I understand you correctly that class rpl_event_base would have a type field, and the consumer could then down-cast to the appropriate specific event class based on the type?
receive_event(const rpl_event_base *generic_event)
{
  switch (generic_event->type)
  {
  case rpl_event_base::RPL_EVENT_STATEMENT_QUERY:
    const rpl_event_statement_query *ev=
      static_cast<const rpl_event_statement_query *>(generic_event);
    do_stuff(ev->get_query_string(), ...);
    break;
  case rpl_event_base::RPL_EVENT_ROW_UPDATE:
    const rpl_event_row_update *ev=
      static_cast<const rpl_event_row_update *>(generic_event);
    do_stuff(ev->get_after_image(), ...);
    break;
  ...
  }
}

I have always disliked having such type field and upcasting. So I tried to make an API where it was not needed. Like this:
class my_event_consumer
{
  int stmt_query(const rpl_event_statement_query *ev)
  {
    do_stuff(ev->get_query_string(), ...);
  }
  int row_update(const rpl_event_row_update *ev)
  {
    do_stuff(ev->get_after_image(), ...);
  }
  ...
};
Okay, now I see what you mean. I don't like downcasting either. On the other hand, I don't want to force plugins that work on an event as a whole to implement methods for every particular type of event. It may be possible to do both: virtual methods for every event type, as you proposed, but not abstract - the default implementation calls a generic receive_event(). A plugin can then either implement a family of per-event methods or just the generic one. But if the above wouldn't work and we have to choose, I'd prefer a simpler interface with one generic receive_event().
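Roughly like this, I suppose (sketch only; the event class names are from the draft API, the rest is made up):

struct rpl_event_consumer
{
  /* Generic fallback: plugins that treat events opaquely override only this. */
  virtual int receive_event(const rpl_event_base *ev) { return 0; }

  /* Per-event callbacks; the defaults defer to the generic one, so a plugin
     can override any subset of them. */
  virtual int stmt_query(const rpl_event_statement_query *ev)
  { return receive_event(ev); }
  virtual int row_update(const rpl_event_row_update *ev)
  { return receive_event(ev); }
  /* ... and so on for the other event types ... */

  virtual ~rpl_event_consumer() { }
};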
One generator can be stacked on top of another. This means that a generator on top (for example row-based events) will handle some events itself (eg. non-deterministic update in mixed-mode binlogging). Other events that it does not want to or cannot handle (for example deterministic delete or DDL) will be defered to the generator below (for example statement-based events).
There's a problem with this idea. Say, Event B is nested in Event A:
... ... |<- Event A ... .. .. ->| .. .. ..
* * * * * |<- Event B ->| * * * *
This is fine. But what about
... ... |<- Event A ... ->| .. .. ..
* * * * * |<- Event B ->| * * * *
In the latter case no event is nested in the other, and no level can simply defer to the other.
I don't know a solution for this, I'm just hoping the above situation is impossible. At least, I could not find an example of "overlapping" events.
Another way of thinking about this is that we have one layer above handling (or not handling) an event that can be generated below. ... So one case where this becomes a problem is if we have a multi-table update where one table is PBXT and another is not, and we are using PBXT engine-level replication on top of statement-based replication. In this case, one half of the statement-based event is handled by the layer above, but the other is not. So we cannot deal with this situation.
On the contrary, this is quite easy. Even a CREATE ... SELECT is a mix of statement-based and row-based. A simple solution would be to replicate it completely statement-based - that is, to discard the row-based part of the event. We can do that because the statement-level description of the event is sufficient - the row-based event is completely nested within a statement-based one (other, more complex solutions are possible too). I was describing a case when events overlap, but neither one is completely nested within the other. This case I know no solution for, but I hope it is never possible in practice. Regards, Sergei
On Thu, 13 May 2010 16:36:41 +0200, Sergei Golubchik
* not everything can be replicated at every level, for example table creation cannot be replicated row-based, InnoDB changes cannot be replicated with "MyISAM pwrite()" events
We're doing it as kinda row based ("structure based" ?) by using the table protobuf message as a standard interchange format for it. This enables some really nice things - like replicating to non-SQL systems. Or even just sanely replicating to a SQL system that doesn't share your exact syntax and defaults for DDL (i.e. anybody that isn't you, at the same version). -- Stewart Smith
Hi, Stewart! On May 14, Stewart Smith wrote:
On Thu, 13 May 2010 16:36:41 +0200, Sergei Golubchik
wrote: * not everything can be replicated at every level, for example table creation cannot be replicated row-based, InnoDB changes cannot be replicated with "MyISAM pwrite()" events
We're doing it as kinda row based ("structure based" ?) with using the table protobuf message as a standard interchange format for it.
Right, I know. What I described does not specify that it should be row based or should use protobuf - but it can do that too, probably could replicate to and from drizzle using native drizzle replication protocol, row based with protobufs. With an appropriate plugin, of course.
This enables some really nice things - like replicating to non-SQL systems. Or even just sanely replicating to a SQL system that doesn't share your exact syntax and defaults for DDL (i.e. anybody that isn't you, at the same version).
Yup. Regards, Sergei
On Fri, Jan 22, 2010 at 6:21 AM, Kristian Nielsen
Let the discussion begin!
The global transaction ID project done by Justin at Google is worth reviewing. In addition to supporting automated slave failover it also has options to make slave state crash-proof and add binlog event checksums. I doubt the patch should be reused, as the MySQL replication interface must be improved if we are to innovate -- but the wiki has a lot of details. * http://code.google.com/p/google-mysql-tools/wiki/GlobalTransactionIds * http://code.launchpad.net/~jtolmer/mysql-server/global-trx-ids -- Mark Callaghan mdcallag@gmail.com
On Sun, 24 Jan 2010 14:27:05 -0800, MARK CALLAGHAN
On Fri, Jan 22, 2010 at 6:21 AM, Kristian Nielsen
wrote: Let the discussion begin!
The global transaction ID project done by Justin at Google is worth reviewing. In addition to supporting automated slave failover it also has options to make slave state crash-proof and add binlog event checksums. I doubt the patch should be reused, as the MySQL replication interface must be improved if we are to innovate -- but the wiki has a lot of details. * http://code.google.com/p/google-mysql-tools/wiki/GlobalTransactionIds * http://code.launchpad.net/~jtolmer/mysql-server/global-trx-ids
Hi,

The global transaction ID is a cornerstone concept of any replication system which aspires to be pluggable, extensible and to go beyond basic master-slave. It is hardly possible to even start designing the rest of the API without first settling on the global transaction ID. This is one of the reasons why http://forge.mysql.com/wiki/ReplicationFeatures/ReplicationInterface can be dismissed without much consideration.

What's good about http://code.google.com/p/google-mysql-tools/wiki/GlobalTransactionIds:

1) It introduces the concept of atomic database changesets. (Which, ironically, it calls "groups" due to dreaded binlog heritage.)

2) It correctly identifies that (part of) the ID should be a monotonic ordinal number.

3) It correctly identifies that the global transaction ID is generated by the redundancy service - in that case "MYSQL_LOG::write(Log_event *) (sql/log.cc, line 1708)".

What's bad about it:

1) It fails to explicitly recognize that IDs should be a continuous sequence. In the implementation they are, but it is never stated explicitly; the only explicit requirement is monotonicity. Perhaps this is a minor omission.

2) It fails to address multi-master: (server_id, group_id) is not going to work - such pairs cannot be linearly ordered and, therefore, compared. And from the perspective of the node that needs to apply the changeset - does server_id really matter? It may be good for debugging, but it can't be a part of a global transaction ID.

3) No general theory is put behind it, it is just an attempt to fix the concrete binlog implementation. In fact it is just one huge implementation detail. Inability to address the multi-master case is a direct consequence of this.

In the end it is not very useful. Whatever good points are there are trivial. They, and even more, can be achieved by 15 minutes of abstract thinking. You don't need to know the MySQL binlog format for that. In fact, you should forget about it unless you want to end up with something like (server_id, group_id).

I'll take this opportunity to put forth some theory behind the global transaction IDs as we see it at Codership.

1. We have an abstract set of data subject to replication/logging. It can be a whole database, a schema, a table, a row. Let's call it a Replication Set (RS).

2. RS is undergoing changes in time which can be represented as a series of atomic changes. Let's call it RS History. That it is a _series_ is trivial but important - otherwise we can't reproduce historical RS state evolution. Each RS change is represented by a changeset. Since it is a series, RS changesets can be enumerated with a sequence of natural numbers without gaps within a given RS History. Here comes the first component of a global transaction ID: sequence number (seqno).

3. However there can be more than one RS. Moreover, the same RS can end up in different clusters and undergo different changes. So, to achieve truly global unambiguity each changeset, in addition to seqno, should be marked with a RS History ID. Obviously seqnos from different histories are logically incomparable. Therefore RS History ID can be any globally unique identifier, with no need for < or > operations. This is the second component of global transaction ID.

One possible implementation for that can be a (UUID, long long) pair.

How the redundancy service will generate those IDs is an implementation detail. For binlog/master-slave replication it is obviously trivial, even in its current state. Changing the binlog format and mapping seqnos to file offsets is no big feat.
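Expressed as code, such an ID could look roughly like this (a sketch only; field sizes and names are arbitrary):

#include <stdint.h>
#include <string.h>

/* Global transaction ID as described above: an RS History ID (here a
   16-byte UUID) plus a gapless sequence number within that history. */
struct rs_global_trx_id
{
  unsigned char history_id[16];
  uint64_t      seqno;
};

/* IDs are comparable only within one and the same RS History. */
static inline bool same_history(const rs_global_trx_id &a,
                                const rs_global_trx_id &b)
{
  return memcmp(a.history_id, b.history_id, sizeof(a.history_id)) == 0;
}

static inline bool comes_before(const rs_global_trx_id &a,
                                const rs_global_trx_id &b)
{
  return same_history(a, b) && a.seqno < b.seqno;
}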
What is not so obvious here is that since global transaction ID is generated by logging/replication service, it is that service that defines the order of commits, not vice versa. As a result transaction should first be passed to that service and only then committed. For one-way master-slave replication the order of operations is not so important. However for multi-master it is crucial. Note that the actual replication/logging can still happen asynchronously, but replication service must generate transaction ID before it is committed. Thanks, Alex -- Alexey Yurchenko, Codership Oy, www.codership.com Skype: alexey.yurchenko, Phone: +358-400-516-011
Alex Yurchenko
The global transaction ID is a cornerstone concept of a any replication system which aspires to be pluggable, extensible and go beyond basic master-slave. It is hardly possible to even start designing the rest of the API without first setting on global transaction ID. This is one of the
Some very interesting thoughts, thanks! (and sorry for the delay in continuing the discussion). Some questions and comments below.
What's good about http://code.google.com/p/google-mysql-tools/wiki/GlobalTransactionIds:
1) It introduces the concept of atomic database changesets. (Which, ironically, it calls "groups" due to dreaded binlog heritage)
Right. So this is what I think of as "transaction", though it need not be (eg. NDB replication uses "epoch", which is a collection of multiple transactions replicated as a whole).
2) It correctly identifies that (the part of) ID should be a monotonic ordinal number.
Ok. But should it increase in order of transaction start? Or in order of transaction commit? Or maybe your point is that it should be monotonic, but that the actual order of changesets is an implementation detail of the particular replication system? I'd really like to hear your thoughts on this.
3) It correctly identifies that the global transaction ID is generated by redundancy service - in that case "MYSQL_LOG::write(Log_event *) (sql/log.cc, line 1708)"
Right. So the _concept_ of global transaction ID needs to be part of the framework. But the actual _implementation_ of that global transaction ID is an implementation detail of the redundancy service. Correct?
What's bad about it:
2) It fails to address multi-master: (server_id, group_id) is not going to work - such pairs cannot be linearly ordered and, therefore, compared. And from the perspective of the node that needs to apply the changeset - does server_id really matter? It may be good for debugging, but it can't be a part of a global transaction ID.
I don't understand why it would not work for multi-master. Can you elaborate? Especially given that you later suggest yourself:
One possible implementation for that can be (UUID, long long) pair.
How is this different from (server_id, group_id)? (I'd like to understand).

My thoughts on multi-master are that there are multiple scenarios, depending on how tightly coupled the nodes of the clusters are with respect to transactions:

1. Tightly-coupled cluster, that maintains a cluster-wide serialised transaction concept. NDB replication is like this (for epochs, which are based on NDB global checkpoint IDs AFAIR). In this case, the entire cluster effectively is a single server wrt. replication, so it's not really multi-master for this discussion.

2. One can imagine a cluster that outwards presents a consistent transactional view, yet internally does not have a total ordering on transactions. Transaction T1 only needs to be ordered after T2 if it can be proven (in terms of communication within the system) that T1 committed before T2. If there is no evidence one way or the other to the actual order of T1 and T2 (eg. they ran in parallel against disjoint data), there is no a priori reason they should have to be linearly ordered. If on the other hand there is such evidence (eg. T2 waited for a lock released by T1 commit), then there would have to be such linear ordering. I think NDB individual transactions are a bit like this. So this would make transaction IDs only partially ordered (but any arbitrary extension to a total order would be valid).

3. Uncoupled multiple masters, like multiple nodes in MySQL asynchronous replication. In this case there really is no total linear order, as the transactions are not coordinated between servers. At most there would be an approximate order like a time stamp, but that would not be 100% reliable due to time skew and such.

So I don't understand what issue you had in mind for multi-master replication here?
I'll take this opportunity to put forth some theory behind the global transaction IDs as we see it at Codership.
1. We have an abstract set of data subject to replication/logging. It can be a whole database, a schema, a table, a row. Lets call it a Replication Set (RS).
2. RS is undergoing changes in time which can be represented as a series of atomic changes. Let's call it RS History. That it is a _series_ is trivial but important - otherwise we can't reproduce historical RS state evolution. Each RS change is represented by a changeset. Since it is a series, RS changesets can be enumerated with a sequence of natural numbers without gaps within a given RS History. Here comes the first component of a global transaction ID: sequence number (seqno).
Right, the series _can_ be linearly ordered. But does it have to be? Suppose we enforce an artificial linear order among some changesets that have no a priori data dependencies between them (either order of applying them would be correct). It seems to me that this needlessly discards an opportunity for parallelism. Given that lack of parallelism (on the slave end) is arguably the biggest performance bottleneck of current MySQL replication, this is something to avoid if possible. Isn't it enough that changesets are partially ordered, as long as that partial order encodes all dependencies that need to be respected? In other words, is it necessary to enforce such total order at the level of the API, instead of just introducing it at the point where it becomes needed, and if so, why?
3. However there can be more than one RS. Moreover, the same RS can end up in different clusters and undergo different changes. So, to achieve truly global unambiguity each changeset, in addition to seqno, should be marked with a RS History ID. Obviously seqnos from different histories are logically incomparable. Therefore RS History ID can be any globally unique identifier, with no need for < or > operations. This is the second component of global transaction ID.
Can you give an example of what an "RS History" would be? It was not 100% clear to me.
What is not so obvious here is that since global transaction ID is generated by logging/replication service, it is that service that defines the order of commits, not vice versa. As a result transaction should first be passed to that service and only then committed. For one-way master-slave replication the order of operations is not so important. However for multi-master it is crucial. Note that the actual replication/logging can still happen asynchronously, but replication service must generate transaction ID before it is committed.
Right. But is this really possible? After all, the commit may fail, so we don't know before if we will actually commit or rollback... we probably don't even know the actual order of commits in parallel threads of execution until after actual commit, unless we take a global lock around (global transaction id, commit); and then it sounds like we get into the issue with broken group commit again? So we need two-phase commit here between the replication service and the storage engines, right? And in this case, it shouldn't matter which comes first, commit or replication service, does it? We need to be very careful with the interface here, two-phase commit can be really expensive needing multiple disk flushes for every transaction to ensure reliable crash recovery. And we really do not want to re-do the mistake of breaking group commit that we had for years after two-phase commit was introduced for the MySQL binlog. (I am also thinking that generating the global transaction ID at commit time is too late in the general case. The current MySQL replication buffers everything until commit time, but I would like to see us support also logging events during transactions. And this will require an ID generated at transaction start. But maybe this is a different ID than "global transaction ID", which will still be needed to know commit order of course.) - Kristian.
On Mon, 15 Mar 2010 12:29:14 +0100, Kristian Nielsen
2) It correctly identifies that (the part of) ID should be a monotonic ordinal number.
Ok. But should it increase in order of transaction start? Or in order of transaction commit?
I think this is easy: until transaction is committed, it does not exist. Even with MyISAM a certain time passes from query parsing to actually modifying table data so that the changes become visible to others. So this moment, when changes become visible to others defines the order of database modifications. Transaction start order is hardly of any interest to anybody except for the very curious.
Or maybe your point is that it should be monotonic, but that the actual order of changesets is an implementation detail of the particular replication system? I'd really like to hear your thoughts on this.
3) It correctly identifies that the global transaction ID is generated by redundancy service - in that case "MYSQL_LOG::write(Log_event *) (sql/log.cc, line 1708)"
Right. So the _concept_ of global transaction ID needs to be part of the framework. But the actual _implementation_ of that global transaction ID is an implementation detail of the redundancy service. Correct?
Point 3) actually answers the questions above, but, perhaps, not in an obvious manner. There are two things: - global transaction ID format and comparison operations are defined by the API. I elaborated on it in the previous e-mail. But the values are filled by the redundancy service. That's one of its "services". - commit order should follow the global transaction ID order. So the ORDER of commit operations is effectively decided by redundancy service. And here maybe lies the biggest controversy of all.
What's bad about it:
2) It fails to address multi-master: (server_id, group_id) is not going to work - such pairs cannot be linearly ordered and, therefore, compared. And from the perspective of the node that needs to apply the changeset - does server_id really matter? It may be good for debugging, but it can't be a part of a global transaction ID.
I don't understand why it would not work for multi-master. Can you elaborate?
Suppose you have node1, node2 and node3. Node1 and node2 are acting as masters. So node1 sends (node1, 11), (node1, 12). Node2 sends (node2, 2), (node2, 3). Partial order of commits is well established, but node3 still won't be able to tell in what general order it should apply those transactions. Does (node1, 11) go before (node2, 3)? why?
Especially given that you later suggest yourself:
One possible implementation for that can be (UUID, long long) pair.
How is this different from (server_id, group_id)? (I'd like to understand).
My thoughts on multi-master is that there are multiple scenarios, depending on how tightly coupled the nodes of the clusters are with respect to transactions:
1. Tightly-coupled cluster, that maintain a cluster-wide serialised transaction concept. NDB replication is like this (for epochs, which are based on NDB global checkpoint IDs AFAIR). In this case, the entire cluster effectively is a single server wrt. replication, so it's not really multi-master for this discussion.
2. One can imagine a cluster that outwards presents a consistent transactional view, yet internally does not have a total ordering on transactions. Transaction T1 only needs to be ordered after T2 if it can be proven (in terms of communication within the system) that T1 committed before T2. If there is no evidence one way or the other to the actual order of T1 and T2 (eg. they ran in parallel against disjoint data), there is no a priori reason they should have to be linearly ordered. If on the other hand there is such evidence (eg. T2 waited for a lock released by T1 commit), then there would have to be such linear ordering. I think NDB individual transactions are a bit like this. So this would make transaction IDs only partially ordered (but any arbitrary extension to a total order would be valid).

Regarding how this differs from (server_id, group_id): it is different in that the UUID in that proposal is a replication sequence UUID. It is one per cluster. All events generated by cluster nodes have the same UUID. And it serves to distinguish replication events from different clusters, explicitly making them incomparable (the same way that (server_id, group_id) pairs with different server_id's are incomparable). So you can compare only transaction IDs with the same UUID.
I think "a cluster that outwards presents a consistent transactional view, yet internally does not have a total ordering on transactions" is an internally contradictory concept. Suppose node1 committed T1, but not T2 yet, and node2 committed T2, but not T1 yet. Now do database dumps from those nodes. Which one will represent a consistent transactional view? I guess you're trying to make a cause for parallel transaction applying on slaves here. You can do parallel out of order applying alright. It is just when you have committed T2 but not T1 yet, the node does not have a T2 state. And it does not have a T0 state. It has an undefined state. Only after T1 is committed it suddenly becomes T2. There can be several ways to deal with it. For example, you do apply T2, but you don't commit it. And after you have applied T1, you commit both of them in one go. The point is that if you want _replication_, you gotta have a linear order. Without it you can't compare db states and therefore can't say that this db is a replica of that db. What you end up with is a so called "eventual consistency", and "eventual" may never happen.
3. Uncoupled multiple masters, like multiple nodes in MySQL asynchronous replication. In this case there really is no total linear order, as the transactions are not coordinated between servers. At most there would be an approximate order like a time stamp, but that would be not 100% reliable due to time skew and such.
That's what the concept of Replication Set was about below. In the scenario described above you effectively have multiple replication sets and multiple logical clusters. So each one of them will have its own global transaction ID sequence. If they are so much uncoupled there is no point in considering them a single cluster.
So I don't understand what issue you had in mind for multi-master replication here?
I'll take this opportunity to put forth some theory behind the global transaction IDs as we see it at Codership.
1. We have an abstract set of data subject to replication/logging. It can be a whole database, a schema, a table, a row. Lets call it a Replication Set (RS).
2. RS is undergoing changes in time which can be represented as a series of atomic changes. Let's call it RS History. That it is a _series_ is trivial but important - otherwise we can't reproduce historical RS state evolution. Each RS change is represented by a changeset. Since it is a series, RS changesets can be enumerated with a sequence of natural numbers without gaps within a given RS History. Here comes the first component of a global transaction ID: sequence number (seqno).
Right, the series _can_ be linearly ordered. But does it have to be?
An ability to unambiguously identify a database state by a global transaction ID.

Suppose we enforce an artificial linear order among some changesets that have no a priori data dependencies between them (either order of applying them would be correct). It seems to me that this needlessly discards an opportunity for parallelism. Given that lack of parallelism (on the slave end) is arguably the biggest performance bottleneck of current MySQL replication, this is something to avoid if possible.
I guess I have addressed this above already. In short, parallel applying does not need to mean out-of-order commit.
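To illustrate (a standalone sketch, nothing to do with any actual server code): applier threads can work in parallel, and only the commit step is released strictly in seqno order.

#include <condition_variable>
#include <cstdint>
#include <mutex>

/* Appliers apply changesets concurrently, then wait here for their turn to
   commit, so the commit order follows the global transaction ID order. */
class commit_order_gate
{
public:
  void wait_for_turn(uint64_t seqno)
  {
    std::unique_lock<std::mutex> lk(m);
    cv.wait(lk, [&]{ return next_to_commit == seqno; });
  }
  void committed(uint64_t seqno)
  {
    std::lock_guard<std::mutex> lk(m);
    next_to_commit= seqno + 1;
    cv.notify_all();
  }
private:
  std::mutex m;
  std::condition_variable cv;
  uint64_t next_to_commit= 1;     /* first seqno in the stream */
};

/*
  Usage in an applier thread (apply_changes() / do_commit() are placeholders):

    apply_changes(trx);              // runs in parallel with other appliers
    gate.wait_for_turn(trx.seqno);   // serialise only the commit step
    do_commit(trx);
    gate.committed(trx.seqno);
*/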
Isn't it enough that changesets are partially ordered, as long as that partial order encodes all dependencies that need to be respected?
In other words, is it necessary to enforce such total order at the level of the API, instead of just introducing it at the point where it becomes needed, and if so, why?
There are at least 2 problems with not having linearly ordered commits:

1) Database states are incomparable. It is impossible (ok, impractically hard) to tell if two databases have identical contents and determine which transactions are missing. How will you join a node to a cluster in this case?

2) Causality preservation. Suppose you commit something on master and now want to select results from slave. How will you tell that the slave already has committed these results?

I don't exactly understand how you separate "level of the API" and "the point where it becomes needed", do you mean code separation or making linear ordering an optional feature of the redundancy service? In any case, I think we all understand quite well what IS database replication and what are redundancy guarantees when commits are linearly ordered. I'd like to hear a more detailed concept of redundancy in the absence of linearly ordered commits. What is redundancy in that case, what are the guarantees? How will recovery be performed, and so on.

3. However there can be more than one RS. Moreover, the same RS can end up in different clusters and undergo different changes. So, to achieve truly global unambiguity each changeset, in addition to seqno, should be marked with a RS History ID. Obviously seqnos from different histories are logically incomparable. Therefore RS History ID can be any globally unique identifier, with no need for < or > operations. This is the second component of global transaction ID.
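To make the two components concrete, here is a minimal sketch (purely illustrative - the names and types are not an agreed API) of what such a global transaction ID could look like, with comparison defined only within a single RS History:

  // Illustrative only: global transaction ID = (RS History ID, seqno).
  #include <array>
  #include <cstdint>
  #include <optional>

  struct GlobalTrxId {
    std::array<uint8_t, 16> history_uuid;  // RS History ID: opaque, only ==/!= is meaningful
    uint64_t seqno;                        // gapless sequence number within one history
  };

  // Ordering is only defined within the same RS History; IDs from different
  // histories are incomparable, so no result is returned for them.
  inline std::optional<int> compare(const GlobalTrxId &a, const GlobalTrxId &b) {
    if (a.history_uuid != b.history_uuid) return std::nullopt;
    if (a.seqno == b.seqno) return 0;
    return a.seqno < b.seqno ? -1 : 1;
  }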
Can you give an example of what an "RS History" would be? It was not 100% clear to me.
It is the sequence of changes that happens to RS. Like UPDATE t1 WHERE...; INSERT INTO t2 VALUES...; etc. Perhaps you could hint at what is not clear about it?
What is not so obvious here is that since global transaction ID is generated by logging/replication service, it is that service that defines the order of commits, not vice versa. As a result transaction should first be passed to that service and only then committed. For one-way master-slave replication the order of operations is not so important. However for multi-master it is crucial. Note that the actual replication/logging can still happen asynchronously, but replication service must generate transaction ID before it is committed.
Right. But is this really possible? After all, the commit may fail, so we don't know before if we will actually commit or rollback... we probably don't even know the actual order of commits in parallel threads of execution until after actual commit, unless we take a global lock around (global transaction id, commit); and then it sounds like we get into the issue with broken group commit again?
Why, it is possible alright. Galera replication works this way. There always is a point after which commit just can't fail. When it's been decided that it is committed.
So we need two-phase commit here between the replication service and the storage engines, right? And in this case, it shouldn't matter which comes first, commit or replication service, does it?
We need to be very careful with the interface here, two-phase commit can be really expensive needing multiple disk flushes for every transaction to ensure reliable crash recovery. And we really do not want to re-do the mistake of breaking group commit that we had for years after two-phase commit was introduced for the MySQL binlog.
I don't think that you need 2PC between redundancy service and the storage engines, because redundancy service never fails. Well, when it fails, you have something more important to worry about than disk flushes anyways. As for the order, the problem is that in general you never know in what order redundancy service will log/replicate transaction until you ask it. It is crucial for multi-master replication as you never know what will be the total order of transactions until you replicate and receive them. It is important for (semi)synchronous replication. It also makes sense in regular asynchronous master-slave as you can do replication concurrently with committing. So asking redundancy service first is good all around.
(I am also thinking that generating the global transaction ID at commit time is too late in the general case. The current MySQL replication buffers everything until commit time, but I would like to see us support also logging events during transactions. And this will require an ID generated at transaction start. But maybe this is a different ID than "global transaction ID", which will still be needed to know commit order of course.)
This is exactly the problem we're facing now with support for LOAD DATA (and other potentially huge transactions). There clearly is the need for two IDs in that case. But as you have noticed, the semantics of these two IDs is rather different. The global transaction ID we've been talking about so far needs to be linearly ordered (according to my understanding at least) and does not need to have a server_id, while the ID generated at transaction start does not have to be linearly ordered and SHOULD have a server_id as part of it (to ensure uniqueness, manageability and simplify debugging). Perhaps it is a good time to settle on terminology. Besides, 'global transaction ID' is just prohibitively long for intensive discussions and variable names. Does anyone have better suggestions on how to call these IDs?
- Kristian.
Regards, Alex -- Alexey Yurchenko, Codership Oy, www.codership.com Skype: alexey.yurchenko, Phone: +358-400-516-011
Alex Yurchenko
On Mon, 15 Mar 2010 12:29:14 +0100, Kristian Nielsen
wrote:
One possible implementation for that can be (UUID, long long) pair.
How is this different from (server_id, group_id)? (I'd like to understand).
It is different in that the UUID in that proposal is a replication sequence UUID. It is one per cluster. All events generated by cluster nodes have the same UUID. And it serves to distinguish replication events from different clusters, explicitly making them incomparable (the same way that (server_id, group_id) pairs with different server_id's are incomparable). So that you can compare only transaction IDs with the same UUID.
Ok, thanks for the clarification, I think I see now. "server_id" does not work, as there can be multiple different server_ids in a single cluster. We need every server in the same cluster (NDB, Galera, ...) to use the same UUID in the global transaction ID.
I guess you're trying to make a case for parallel transaction applying on slaves here. You can do parallel out-of-order applying alright. It is just when you have committed T2 but not T1 yet, the node does not have a T2 state. And it does not have a T0 state. It has an undefined state. Only after T1 is committed does it suddenly become T2. There can be several ways to deal with it. For example, you do apply T2, but you don't commit it. And after you have applied T1, you commit both of them in one go.
Agree. Even with parallel transaction application, one would want to commit in some fixed (from the origin) order. (Following above example, even if there is no evidence on master as to whether T1 or T2 came first, we will want the order to be consistent among multiple slaves, so that there is no possibility to see T1-without-T2 on one slave and T2-without-T1 on another).
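As a rough illustration of "apply in parallel, commit in the origin's order" (my sketch, not something proposed in the thread), a slave-side gate could let applier threads work concurrently and only serialize the final commit by seqno:

  // Sketch: applier threads apply changesets concurrently, but each one waits
  // here until its seqno is the next allowed to commit.
  #include <condition_variable>
  #include <cstdint>
  #include <mutex>

  class CommitOrderGate {
    std::mutex mtx_;
    std::condition_variable cv_;
    uint64_t next_seqno_ = 1;   // first transaction in the RS History
  public:
    // Called by an applier thread after it has applied (but not committed) seqno.
    void wait_for_turn(uint64_t seqno) {
      std::unique_lock<std::mutex> lk(mtx_);
      cv_.wait(lk, [&] { return seqno == next_seqno_; });
    }
    // Called right after the engine commit of seqno has completed.
    void commit_done(uint64_t seqno) {
      std::lock_guard<std::mutex> lk(mtx_);
      next_seqno_ = seqno + 1;
      cv_.notify_all();
    }
  };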
There are at least 2 problems with not having linearly ordered commits:
1) Database states are incomparable. It is impossible (ok, impractically hard) to tell if two databases have identical contents and determine which transactions are missing. How will you join a node to a cluster in this case?
Yes. The issue of joining a cluster is also a good point. After reading your comments, I tend to agree that global transaction IDs should be totally ordered.
I don't exactly understand how you separate "level of the API" and "the point where it becomes needed", do you mean code separation or making linear ordering an optional feature of the redundancy service?
(I meant that the comparison provided by the API could be a partial order, and if it returned T1 and T2 as "equivalent", any plugin that needs a total order could just pick T1 or T2 arbitrarily as the first. But I agree now that even if the choice is arbitrary, it should be enforced in the API for consistency.)
3. However there can be more than one RS. Moreover, the same RS can end up in different clusters and undergo different changes. So, to achieve truly global unambiguity each changeset, in addition to seqno, should be marked with a RS History ID. Obviously seqnos from different histories are logically incomparable. Therefore RS History ID can be any globally unique identifier, with no need for < or > operations. This is the second component of global transaction ID.
Can you give an example of what an "RS History" would be? It was not 100% clear to me.
It is the sequence of changes that happens to RS. Like UPDATE t1 WHERE...; INSERT INTO t2 VALUES...; etc. Perhaps you could hint at what is not clear about it?
I think about an RS as for example a set of tables in a set of schemas. And of an RS History as for example a binlog. So what is not clear to me is how the IDs get assigned to an RS, and to an RS History. Is it assigned by the DBA in server configuration? Is it assigned automatically by the server? Or assigned somehow by the redundancy service? Also, it is not clear to me when the RS ID and RS History ID will differ. So if I have a simple MySQL-style master->slave replication, will I have one or two replication sets? If I have one master and two slaves, which of the three servers will have the same RS ID, and which will have a different one? Likewise, if I have a two-level MySQL-style replication master->slave1->slave2, how many RS Histories will there be, and how many of those will have different/same RS History IDs?
What is not so obvious here is that since global transaction ID is generated by logging/replication service, it is that service that defines the order of commits, not vice versa. As a result transaction should first be passed to that service and only then committed. For one-way master-slave replication the order of operations is not so important. However for multi-master it is crucial. Note that the actual replication/logging can still happen asynchronously, but replication service must generate transaction ID before it is committed.
I don't think that you need 2PC between redundancy service and the storage engines, because redundancy service never fails. Well, when it fails, you have something more important to worry about than disk flushes anyways.
Hm, I don't understand... What if the redundancy service is invoked, it logs (or whatever) the changeset. Then the machine crashes before the engine(s) have time to commit. When the server comes back up, how do we avoid that the redundancy service (and hence the slaves) will have the change, but the master will not? The ability to implement reliable crash-recovery is something I see requested a lot, and I think it is really important. In current MySQL replication, there is 2-phase commit between storage engine and binlog. I do not see how this can be avoided in the general case (without losing the ability to recover after crash).
As for the order, the problem is that in general you never know in what order redundancy service will log/replicate transaction until you ask it. It is crucial for multi-master replication as you never know what will be the total order of transactions until you replicate and receive them. It is important for (semi)synchronous replication. It also makes sense in regular asynchronous master-slave as you can do replication concurrently with committing. So asking redundancy service first is good all around.
Of course, this means that there can only be one redundancy service plugin at a time, doesn't it?

But true, it is essential that storage engine(s) and redundancy service(s) commit in the same order. It is however not yet clear to me that asking redundancy service first is sufficient for this, nor that it is necessary.

So let me write up the steps to commit a transaction, and maybe the answer will be clearer to me:

It seems to me that to ensure same order in participating engine(s) and redundancy service(s), we will need a server-global commit lock, and a call into each participant while that lock is taken:

  lock(commit_order_lock);
  redundancy_service->txn_fix_order(thd);
  engine->txn_fix_order(thd);
  unlock(commit_order_lock);

The idea is that in this call, each participant will do the minimum amount of work to ensure that the transaction `thd' will be the next transaction committed, *if* it is committed. Eg. insert the transaction data into the transaction log/binlog buffer (but not necessarily write it to disk yet).

Next, the transaction needs to be prepared in all participants, ie. two-phase commit for crash recovery. I think this can be done without a global lock:

  redundancy_service->txn_prepare(thd);
  engine->txn_prepare(thd);

In this step, each participant must ensure that the transaction is persistent (eg. fflush()). Also note that both this and the previous step may fail in either engine, because of some internal error, or because of a crash. Thus each participant must be prepared up to here to roll back the transaction, either immediately or during crash recovery after bringing the server back up. However, once this step is successfully completed for all participants, the transaction can no longer be rolled back.

The group commit facility comes in here. Since this step is outside of the global lock commit_order_lock, there could be multiple transactions waiting in txn_prepare(). And each participant is free to use a single fflush() to persist all of them at once. This is crucial for good performance.

(For maximum flexibility, we might even want this:

  redundancy_service->txn_prepare_start(thd);
  engine->txn_prepare_start(thd);
  redundancy_service->txn_prepare_wait(thd);
  engine->txn_prepare_wait(thd);

This would allow both participants to run their fflush() or whatever in parallel, which seems to be desirable.)

(Note that participants would be free to implement all actions in txn_fix_order() and have txn_prepare_{start,wait} be empty, at the cost of losing group commit functionality. Likewise, txn_prepare_wait() could be empty, just the API leaves the option of having the split open.)

Finally, if all participants prepared their transactions without problems, they will be really committed:

  redundancy_service->txn_commit(thd);
  engine->txn_commit(thd);

(I am not sure if we need to split into _start and _wait here. I seem to recall that this step also needs to persist to disk (fflush()), but just now I do not see why it would need it).

In the case of crash recovery, the server will ask each participant for the list of transactions that had successfully completed the txn_prepare_wait() step. For each transaction, if it completed in all participants it will be txn_commit()'ed, otherwise it will be rolled back.

So with this, is there a need to invoke the redundancy service first? If we do, it will be possible to pass the global transaction ID on to all the subsequent steps. So yes, that seems desirable. Any other reasons?
Another point: It seems to me that in the general case, it will limit us to require that no gaps in the global transaction ID can be introduced. I do not see how to make group commit possible in this case (following above steps)? So, any comments/objections to the above sketch of the commit part of an API?
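For reference, the sketch above can be restated as a small compilable example; `Participant' stands in for both the redundancy service and a storage engine, and everything beyond the call names used above is assumed for illustration only:

  #include <mutex>
  #include <vector>

  struct THD;  // the server's per-connection object, opaque in this sketch

  struct Participant {           // implemented by the redundancy service and by each engine
    virtual bool txn_fix_order(THD *thd) = 0;  // minimal work to pin this trx as next in commit order
    virtual bool txn_prepare(THD *thd) = 0;    // make the trx persistent enough to survive a crash
    virtual void txn_commit(THD *thd) = 0;
    virtual void txn_rollback(THD *thd) = 0;
    virtual ~Participant() = default;
  };

  static std::mutex commit_order_lock;

  // Returns true if the transaction was committed, false if it was rolled back.
  bool commit_transaction(THD *thd, std::vector<Participant *> &parts) {
    auto rollback_all = [&] { for (Participant *p : parts) p->txn_rollback(thd); };

    {
      // Step 1: fix the commit order, the same order in every participant.
      std::lock_guard<std::mutex> guard(commit_order_lock);
      for (Participant *p : parts)
        if (!p->txn_fix_order(thd)) { rollback_all(); return false; }
    }

    // Step 2: prepare outside the global lock, so several transactions can be
    // waiting here and a participant may persist them all with one flush.
    for (Participant *p : parts)
      if (!p->txn_prepare(thd)) { rollback_all(); return false; }

    // Step 3: all participants prepared successfully; no rollback past this point.
    for (Participant *p : parts) p->txn_commit(thd);
    return true;
  }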
(I am also thinking that generating the global transaction ID at commit time is too late in the general case. The current MySQL replication buffers everything until commit time, but I would like to see us support also logging events during transactions. And this will require an ID generated at transaction start. But maybe this is a different ID than "global transaction ID", which will still be needed to know commit order of course.)
This is exactly the problem we're facing now with support for LOAD DATA (and other potentially huge transactions). There clearly is the need for two IDs in that case. But as you have noticed, the semantics of these two IDs is rather different. Global transaction ID we've been talking about so far needs to be linearly ordered (according to my understanding at least) and does not need to have a server_id, while the ID generated at transaction start does not have to be linearly ordered and SHOULD have a server_id as part of it (to ensure uniqueness, manageability and simplify debugging).
Yes, I agree, these are different. Basically, at transaction start one will probably make a local transaction ID, and tag all events with this ID to know they are from the same transaction (changeset). But for the final commit event, one needs to assign the global transaction ID. The local IDs just need to be unique, the global ID needs to be totally ordered. I think I was confusing these two a bit previously.
I think "a cluster that outwards presents a consistent transactional view, yet internally does not have a total ordering on transactions" is an internally contradictory concept. Suppose node1 committed T1, but not T2 yet, and node2 committed T2, but not T1 yet. Now do database dumps from those nodes. Which one will represent a consistent transactional view?
I am thinking about shared-nothing storage engines like NDB, where tables are partitioned, with partitions being located on different machines.

If T1 and T2 both involve node1 and node2, they will need to be linearly ordered.

However, if T1 involves only node1, and T2 involves only node2, and there is no communication between node1 and node2 while T1 and T2 run, then there is no way to distinguish whether T1 or T2 committed first.

So we can use (engine local) transaction IDs (T1, node1) and (T2, node2). And if T1==T2, then that would mean that no outside entity has evidence as to whether T1 or T2 was committed first, so we do not actually need a defined ordering in this case. On the other hand, if such evidence could exist, then we will need T1 and T2 to be ordered accordingly.
On Wed, 17 Mar 2010 10:48:50 +0100, Kristian Nielsen
Can you give an example of what an "RS History" would be? It was not 100% clear to me.
It is the sequence of changes that happens to RS. Like UPDATE t1 WHERE...; INSERT INTO t2 VALUES...; etc. Perhaps you could hint at what is not clear about it?
I think about an RS as for example a set of tables in a set of schemas. And of an RS History as for example a binlog.
So what is not clear to me is how the IDs get assigned to an RS, and to an RS History. Is it assigned by the DBA in server configuration? Is it assigned automatically by the server? Or assigned somehow by the redundancy service?
Also, it is not clear to me when the RS ID and RS History ID will differ.
Ok, that's a good question. The way I see it is

1) RS ID. Obviously RS is an object and as such needs some way to refer to it. Who decides what goes into RS and that we even need one? - The redundancy service user. So he creates it and he names it. This ID has to be unique just to the extent that there'd be no confusion between two RS's should they meet in the same context. Say, it can be a human-readable name. In trivial cases, when RS is just a whole database, like in our wsrep API, it could be a cluster name. Galera nodes refuse to connect to nodes with a different cluster name. In the general case, when there could be several RS's on one physical box (e.g. one for replication and one for binlogging), we'll have to introduce a concept of a logical cluster - i.e. all processes which participate in replication of a given RS.

2) RS History ID. Suppose you have a cluster that replicates "RS1". And currently it is at T12345. A node joins the cluster and says: "I have 'RS1' at T12344". What we can have here:

- the node was previously a member of this cluster, went down for maintenance, meanwhile the cluster committed one more trx and all that is left is to send T12345 to this node.
- the node never was a member of this cluster, the 'RS1' name stands there by operator mistake. This node needs a total state snapshot.
- the node was previously a member of this cluster, lost connectivity to it at T12000, but went on accepting local queries with hopes for subsequent conflict resolution (i.e. effectively spawned a new cluster of its own). There needs to be a conflict resolution phase (hopefully this redundancy plugin supports it).
- something else happened.

So in all cases but the first one we need to be able to tell that T12344 is not really comparable to T12345, because the state of that node does not really correspond to the state of the cluster at T12344, because they underwent different sequences of changes. So here comes RS History ID. It has to be globally unique and is generated by the redundancy service (because it is the one who knows what's going on), just like the global transaction sequence number. E.g. in Galera it is a time-based UUID.

So RS ID is just an RS name. RS History ID is a part of a global transaction ID. There does not have to be a 1-to-1 relation between the two.

So if I have a simple MySQL-style master->slave replication, will I have one or two replication sets? If I have one master and two slaves, which of the three servers will have the same RS ID, and which will have a different one?
All of them will have the same RS ID, as they participate in replication of the same RS. Moreover, they'll have the same RS History ID, as they obviously have the same history of RS changes, which is defined single-handedly by the master.
Likewise, if I have a two-level MySQL-style replication master->slave1->slave2, how many RS Histories will there be, and how many of those will have different/same RS History IDs?
Same as above. In this example - one.
I don't think that you need 2PC between redundancy service and the storage engines, because redundancy service never fails. Well, when it fails, you have something more important to worry about than disk flushes anyways.
Hm, I don't understand...
What if the redundancy service is invoked, it logs (or whatever) the changeset. Then the machine crashes before the engine(s) have time to commit. When the server comes back up, how do we avoid that the redundancy service, (and hence the slaves) will have the change, but the master will not?
The ability to implement reliable crash-recovery is something I see requested a lot, and I think it is really important.
Well, this is what redundancy service is for: when you crash, you can recover, isn't it? My understanding is that you do want the redundancy service to have the change and that you can take it from there and commit uncommitted. This is how Galera works and I believe this is how semi-sync in MySQL works (there's hardly any other way for synchronous replication to work) In a sense you can see the redundancy service as a glorified transaction journal.
In current MySQL replication, there is 2-phase commit between storage engine and binlog. I do not see how this can be avoided in the general case (without loosing the ability to recover after crash).
I might be not qualified enough to comment on this particular solution as I don't know that code, perhaps other developers can. AFAIK, binlog in MySQL is treated as some kind of a storage engine and perhaps this is why 2PC is used, but it's a big mistake. Binlog is not a storage engine (yes, they both write to disk, but that's where the similarities end). It needs a different interface - exactly what we're discussing right now.

My thinking is that while the engines need to have a guarantee that the binlog write was successful, binlog (or any other redundancy service) does not need the same guarantee from the engines. So instead of full-blown 2PC you need a simple if()/else clause:

  if (redundancy_service->pre_commit() == SUCCESS) { /* commit */ } else { /* rollback */ }

More on that below.

As for the order, the problem is that in general you never know in what order redundancy service will log/replicate transaction until you ask it. It is crucial for multi-master replication as you never know what will be the total order of transactions until you replicate and receive them. It is important for (semi)synchronous replication. It also makes sense in regular asynchronous master-slave as you can do replication concurrently with committing. So asking redundancy service first is good all around.

Of course, this means that there can only be one redundancy service plugin at a time, doesn't it?
It sounds like that in the general case. But realistically, there'd be few plugins which have to decide on the global transaction ID, others can be told. E.g. unlike multi-master replication, a binlog plugin can easily log events in the order it is told to. So you can effectively stack redundancy plugins, putting the one which decides on global transaction ID on top. The others just follow the defined order. So you can have the following stack:

  multi-master synchronous plugin (for HA)
  ----
  master-slave asynchronous plugin (for fanning out)
  ----
  binlog plugin (for local recovery after crash)
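A hedged sketch of that stacking idea (the interface here is invented purely to show the flow): the topmost plugin assigns the global transaction seqno, and the plugins below are simply told it and log/replicate in that order.

  #include <cstdint>
  #include <vector>

  struct Changeset;   // opaque transaction payload

  struct RedundancyPlugin {
    // Top plugin: ignores assigned_seqno and returns the seqno it decided on.
    // Lower plugins: receive the already-decided seqno and follow it.
    virtual uint64_t replicate(const Changeset &cs, uint64_t assigned_seqno) = 0;
    virtual ~RedundancyPlugin() = default;
  };

  // plugins[0] decides the order (e.g. synchronous multi-master), the rest
  // (asynchronous fan-out, binlog) just follow it.
  uint64_t replicate_through_stack(const Changeset &cs,
                                   std::vector<RedundancyPlugin *> &plugins) {
    uint64_t seqno = plugins.at(0)->replicate(cs, /*assigned_seqno=*/0);
    for (size_t i = 1; i < plugins.size(); ++i)
      plugins[i]->replicate(cs, seqno);
    return seqno;
  }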
But true, it is essential that storage engine(s) and redundancy service(s) commit in the same order. It is however not yet clear to me that asking redundancy service first is sufficient for this, nor that it is necessary.

As far as I understand, any (semi)synchronous replication requires that.
So let me write up the steps to commit a transaction, and maybe the answer will be clearer to me:
It seems to me that to ensure same order in participating engine(s) and redundancy service(s), we will need a server-global commit lock, and a call into each participant while that lock is taken:
lock(commit_order_lock); redundancy_service->txn_fix_order(thd); engine->txn_fix_order(thd); unlock(commit_order_lock);
The idea is that in this call, each participant will do the minimum amount of work to ensure that the transaction `thd' will be the next transaction committed, *if* it is committed. Eg. insert the transaction data into the transaction log/binlog buffer (but not necessarily write it to disk yet).
Next, the transaction needs to be prepared in all participants, ie. two-phase commit for crash recovery. I think this can be done without a global lock:
redundancy_service->txn_prepare(thd); engine->txn_prepare(thd);
In this step, each participant must ensure that the transaction is persistent (eg. fflush()). Also note that both this and the previous step may fail in either engine, because of some internal error, or because of a crash. Thus each participant must be prepared up to here to roll back the transaction, either immediately or during crash recovery after bringing the server back up.

However, once this step is successfully completed for all participants, the transaction can no longer be rolled back.
The group commit facility comes in here. Since this step is outside of the global lock commit_order_lock, there could be multiple transactions waiting in txn_prepare(). And each participant is free to use a single fflush() to persist all of them at once. This is crucial for good performance.
(For maximum flexibility, we might even want this:
redundancy_service->txn_prepare_start(thd); engine->txn_prepare_start(thd);
redundancy_service->txn_prepare_wait(thd); engine->txn_prepare_wait(thd);
This would allow both participants to run their fflush() or whatever in parallel, which seems to be desirable.)
(Note that participants would be free to implement all actions in txn_fix_order() and have txn_prepare_{start,wait} be empty, at the cost of losing group commit functionality. Likewise, txn_prepare_wait() could be empty, just the API leaves the option of having the split open.)
Finally, if all participants prepared their transactions without problems, they will be really committed:
redundancy_service->txn_commit(thd); engine->txn_commit(thd);
(I am not sure if we need to split into _start and _wait here. I seem to recall that this step also needs to persist to disk (fflush()), but just now I do not see why it would need it).
In the case of crash recovery, the server will ask each participant for the list of transactions that had successfully completed the txn_prepare_wait() step. For each transaction, if it completed in all participants it will be txn_commit()'ed, otherwise it will be rolled back.
So with this, is there a need to invoke the redundancy service first? If we do, it will be possible to pass the global transaction ID on to all the subsequent steps. So yes, that seems desirable. Any other reasons?
Another point: It seems to me that in the general case, it will limit us to require that no gaps in the global transaction ID can be introduced. I do not see how to make group commit possible in this case (following above steps)?
So, any comments/objections to the above sketch of the commit part of an API?
Hm. I'm not well versed in MySQL engine interfaces, so I might be missing something here. But why can't we have something like this instead:

  engine->txn_prepare(thd);

  if (redundancy_service->pre_commit(thd) == SUCCESS)
  {
     engine->txn_commit(thd);
     redundancy_service->post_commit(thd);
  }
  else
  {
     engine->txn_rollback(thd);
  }

The pre_commit() call locks commit_order_lock in the right order. post_commit() releases commit_order_lock and the resources that might be allocated by the transaction in the redundancy service.

Or, if txn_commit() is too heavy an operation, then maybe like that?:

  engine->txn_prepare(thd);

  if (redundancy_service->pre_commit(thd) == SUCCESS)
  {
     engine->txn_fix_order(thd);
     redundancy_service->post_commit(thd);
     engine->txn_commit(thd);
  }
  else
  {
     engine->txn_rollback(thd);
  }
I think "a cluster that outwards presents a consistent transactional view, yet internally does not have a total ordering on transactions" is an internally contradictory concept. Suppose node1 committed T1, but not T2 yet, and node2 committed T2, but not T1 yet. Now do database dumps from those nodes. Which one will represent a consistent transactional view?

I am thinking about shared-nothing storage engines like NDB, where tables are partitioned, with partitions being located on different machines.

If T1 and T2 both involve node1 and node2, they will need to be linearly ordered.

However, if T1 involves only node1, and T2 involves only node2, and there is no communication between node1 and node2 while T1 and T2 run, then there is no way to distinguish whether T1 or T2 committed first.

So we can use (engine local) transaction IDs (T1, node1) and (T2, node2). And if T1==T2, then that would mean that no outside entity has evidence as to whether T1 or T2 was committed first, so we do not actually need a defined ordering in this case. On the other hand, if such evidence could exist, then we will need T1 and T2 to be ordered accordingly.
So in this case we avoid having to do extra communication across all nodes just to synchronise the "next transaction ID" counter if it is not needed.
Oh, yes, this is a corner case which is not easily described by the suggested model. But I guess it is highly unconventional and does not really qualify as redundancy or replication in the defined sense. My understanding is that one model never fits all. If we need something like NDB, it is better to develop a separate model and API for that, rather than try to fit everything in one.
... But I think you are right that in practice one would impose a total order anyway, to get a consistent view. So if T1==T2, one would arbitrarily define that T1 came before T2, since node1 < node2.
- Kristian.
Regards, Alex -- Alexey Yurchenko, Codership Oy, www.codership.com Skype: alexey.yurchenko, Phone: +358-400-516-011
Alex, I think this discussion is getting really interesting, and I understand
your points much better now, thanks for your many comments!
Alex Yurchenko
On Wed, 17 Mar 2010 10:48:50 +0100, Kristian Nielsen
wrote:
So what is not clear to me is how the IDs get assigned to an RS, and to an RS History. Is it assigned by the DBA in server configuration? Is it assigned automatically by the server? Or assigned somehow by the redundancy service?
Ok, that's a good question. The way I see it is
<snip>
underwent different sequences of changes. So here comes RS History ID. It has to be globally unique and is generated by the redundancy service (because it is the one who knows what's going on), just like global transaction sequence number. E.g. in Galera it is a time-based UUID.
Ok, so I think this means that the redundancy service will generate a new RS History ID when it starts a new RS History. Exactly when and how this happens will then depend on the particular plugin.
What if the redundancy service is invoked, it logs (or whatever) the changeset. Then the machine crashes before the engine(s) have time to commit. When the server comes back up, how do we avoid that the redundancy service, (and hence the slaves) will have the change, but the master will not?
The ability to implement reliable crash-recovery is something I see requested a lot, and I think it is really important.
Well, this is what redundancy service is for: when you crash, you can recover, isn't it? My understanding is that you do want the redundancy service to have the change and that you can take it from there and commit uncommitted. This is how Galera works and I believe this is how semi-sync in MySQL works (there's hardly any other way for synchronous replication to work)
Aha, now I see what you mean! So the redundancy service would provide the durability, using its transaction log or whatever. This is quite different from current MySQL, where every engine, as well as the binlog code, is individually responsible for durability, and then consequently the 2-phase commit is needed to ensure that everyone has the same idea of what should be durable.

So this is actually quite an appealing idea! With this system, each storage engine (and secondary replication plugins like binlog) would no longer need to ensure that transactions are durable, just that they are able to recover after a crash into a consistent state.

This could translate concretely into (for example) setting innodb_flush_log_at_trx_commit=0 without risking to lose transactions. So no fflush() per commit in the engine (and I think therefore no need for group commit). So only the (primary) redundancy service would need to fflush() (or whatever) to achieve durability.

So this is quite interesting! The whole issue of coordinating the state between engines, master binlog thread, and slave threads, in the case of crash recovery, is quite a mess in current MySQL, and this indeed sounds much cleaner.

So let me list some potential issues with this that spring to my mind, and see how they could be resolved:

1. On crash recovery, the engines must be able to tell where in the RS History they have recovered themselves to, so that they can correctly be supplied with the remaining transactions that they must apply.

The best way to do this is probably to extend the storage engine API (and the engines) with a facility for this, so that the engines can log it in their transaction log. However, it seems simple enough to emulate it for engines that do not implement such extensions. It should be sufficient to create a hidden table inside the engine with a single "transaction_id" column, and just insert a new row with the ID just before committing in the engine.

Now, engine commit must be expected to fail from time to time. So it seems hard to supply the engine with the global transaction ID (as that will not be generated until engine commit_prepare has succeeded). But we don't actually need the _global_, linearly ordered ID for this. It should be sufficient with the local transaction ID generated at the start of the transaction, as long as the redundancy service records the mapping between this and the global transaction IDs. And of course if we choose to rollback later then there will be no matching global transaction ID (which is fine). So this sounds doable.

2. Multi-engine transactions. So, if a transaction spans multiple engines, during crash recovery each engine may recover to a different point in time. In this case there will be transactions that need to be replayed against engine E1 but not against E2.

For row-based events, this seems doable. It should be possible to filter the events to include only those for tables in the relevant engine. Now for statement-based events, this is a problem. Eg. a multi-table update against multiple engines will be hard to apply to just one engine ...

Anyway, I think this is not a showstopper. It seems ok to say that statement-based replication on multi-engine transactions will not guarantee reliable crash recovery. Especially since InnoDB currently is the only transactional engine that supports statement-based replication ... non-transactional engines are in any case impossible to reliably crash-recover of course.

So this is definitely an interesting idea.
Quite different from how current MySQL replication/binlogging works.
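A possible shape of that recovery procedure, under the assumptions discussed above (each engine can report its recovered position, e.g. from a hidden transaction_id table, and row events can be filtered per engine); all names here are illustrative, not an agreed API:

  #include <cstdint>
  #include <vector>

  struct RowEvent { uint64_t seqno; int engine_id; /* ... row data ... */ };

  struct Engine {
    int id;
    virtual uint64_t recovered_seqno() = 0;        // last seqno durably committed in this engine
    virtual void apply(const RowEvent &ev) = 0;
    virtual ~Engine() = default;
  };

  struct RedundancyLog {
    // Changesets with seqno > from, in commit order.
    virtual std::vector<RowEvent> events_after(uint64_t from) = 0;
    virtual ~RedundancyLog() = default;
  };

  // After a crash, bring every engine up to the redundancy service's state.
  void crash_recover(std::vector<Engine *> &engines, RedundancyLog &log) {
    for (Engine *e : engines) {
      uint64_t pos = e->recovered_seqno();
      for (const RowEvent &ev : log.events_after(pos))
        if (ev.engine_id == e->id)                 // replay only this engine's rows
          e->apply(ev);
    }
  }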
My thinking is that while the engines need to have a guarantee that binlog write was successful, binlog (or any other redundancy service) does not need the same guarantee from the engines. So instead of full blown 2PC you need a simple if()/else clause:
if (redundancy_service->pre_commit() == SUCCESS) { /* commit */ } else { /* rollback */ }
Yes. Until now, I did not consider the possibility that crashed engines could be recovered from the local log in the redundancy service.
Of course, this means that there can only be one redundancy service plugin at a time, doesn't it?
It sounds like that in the general case. But realistically, there'd be few plugins which have to decide on the global transaction ID, others can be told. E.g. unlike multi-master replication, binlog plugin can easily log events in the order it is told to. So you can effectively stack redundancy plugins putting the one which decides on global transaction ID on top. The others just follow the defined order. So you can have the following stack:
  multi-master synchronous plugin (for HA)
  ----
  master-slave asynchronous plugin (for fanning out)
  ----
  binlog plugin (for local recovery after crash)
Yes, agree.
But true, it is essential that storage engine(s) and redundancy service(s) commit in the same order. It is however not yet clear to me that asking redundancy service first is sufficient for this, nor that it is necessary.
As far as I understand, any (semi)synchronous replication requires that.
Right ... I have to study (semi)synchronous replication techniques a bit more, I'm not too familiar with them.
Hm. I'm not well versed in MySQL engine interfaces, so I might be missing something here. But why can't we have something like this instead:
engine->txn_prepare(thd);
if (redundancy_service->pre_commit(thd) == SUCCESS) { engine->txn_commit(thd); redundancy_service->post_commit(thd); } else { engine->txn_rollback(thd); }
pre_commit() call locks commit_order_lock in the right order. post_commit() releases commit_order_lock and the resources that might be allocated by transaction in the redundancy service. Or, if txn_commit() is too heavy an operation, then maybe like that?:
engine->txn_prepare(thd);
if (redundancy_service->pre_commit(thd) == SUCCESS) { engine->txn_fix_order(thd); redundancy_service->post_commit(thd); engine->txn_commit(thd); } else { engine->txn_rollback(thd); }
There are (at least) two issues of interest:

1. Can engines commit in different order from redundancy service (and if they can, does it matter)

2. Can we support group commit (or obtain equivalent performance in some other way)?

To give durability, the redundancy service will typically need to fflush() when it commits and generates the global transaction ID. So having group commit means that while such fflush() for transaction T1 is running, other transactions T2, T3, ... Tn are free to start (though not complete of course) committing in the redundancy service. So that they can queue up, and once T1 fflush() is done, a single fflush() will be sufficient to commit all of T2, ... Tn.

So this means that T2 must be able to complete engine->txn_prepare() while T1 is still running inside redundancy_service->pre_commit(). There cannot be a global lock around all of the above code.

So this in turn means that T3 will be able to complete engine->txn_prepare() _before_ T2, even though T3 will commit _after_ T2 (at least in terms of redundancy service).

So the question is if T3 can complete engine->txn_prepare() before T2, yet complete txn_commit() after T2. I do not see why not, but I do not know if this is actually the case, I will need to check.

By the way, is it even necessary to have redundancy_service->post_commit()? It seems to me that it probably is not needed?

(Though something like pre_commit() and post_commit() will probably be needed to support the SQL-level 2-phase commit/XA. This is the SQL syntax the user can use to do two-phase commit between multiple systems used in the application).

- Kristian.
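For illustration, the group commit behaviour described in the message above (T2..Tn queue up while T1 is being flushed, then one flush makes the whole batch durable) could look roughly like this inside a redundancy service; this is only a sketch under the stated assumptions, not a proposed implementation:

  #include <condition_variable>
  #include <cstdint>
  #include <mutex>

  class GroupCommitter {
    std::mutex mtx_;
    std::condition_variable cv_;
    bool flush_in_progress_ = false;
    uint64_t flushed_up_to_ = 0;   // highest seqno made durable so far
    uint64_t queued_up_to_  = 0;   // highest seqno waiting to be flushed

    void flush_to_disk(uint64_t /*up_to*/) { /* one write + fsync of the log */ }

  public:
    // Called with the transaction's already-assigned global seqno; returns
    // once that seqno is durable on disk.
    void durable_commit(uint64_t seqno) {
      std::unique_lock<std::mutex> lk(mtx_);
      if (seqno > queued_up_to_) queued_up_to_ = seqno;
      while (flushed_up_to_ < seqno) {
        if (flush_in_progress_) { cv_.wait(lk); continue; }  // join the next batch
        flush_in_progress_ = true;
        uint64_t batch_end = queued_up_to_;
        lk.unlock();
        flush_to_disk(batch_end);   // one flush covers every queued transaction
        lk.lock();
        flush_in_progress_ = false;
        flushed_up_to_ = batch_end;
        cv_.notify_all();
      }
    }
  };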
On Thu, 18 Mar 2010 15:18:40 +0100, Kristian Nielsen
Alex, I think this discussion is getting really interesting, and I understand your points much better now, thanks for your many comments!
Glad to hear that. It'd be very sad otherwise ;)
Ok, so I think this means that the redundancy service will generate a new RS History ID when it starts a new RS History. Exactly when and how this happens will then depend on the particular plugin.
That's precisely the way to put it: "starts a new RS History". When a Galera node starts with undefined initial state, it generates a new history UUID. With that, it will never join a cluster without taking a state snapshot first. Likewise, if it is the first node of the cluster, no other node will join it without a state snapshot. This works remarkably well in preventing state inconsistency due to operator errors.
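The join-time decision this enables can be sketched as follows (illustrative names only; the real logic in a given plugin such as Galera may differ):

  #include <array>
  #include <cstdint>

  struct NodeState {
    std::array<uint8_t, 16> history_uuid;   // RS History ID
    uint64_t seqno;                          // last committed transaction in that history
  };

  enum class JoinAction { SendMissingChangesets, SendStateSnapshot };

  JoinAction decide_join(const NodeState &cluster, const NodeState &joiner) {
    if (joiner.history_uuid != cluster.history_uuid)
      return JoinAction::SendStateSnapshot;      // seqnos not comparable across histories
    if (joiner.seqno <= cluster.seqno)
      return JoinAction::SendMissingChangesets;  // catch up from joiner.seqno + 1
    // Joiner claims to be ahead of the cluster within the same history:
    // treat as inconsistent and fall back to a snapshot in this sketch.
    return JoinAction::SendStateSnapshot;
  }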
Well, this is what redundancy service is for: when you crash, you can recover, isn't it? My understanding is that you do want the redundancy service to have the change and that you can take it from there and commit uncommitted. This is how Galera works and I believe this is how semi-sync in MySQL works (there's hardly any other way for synchronous replication to work)
Aha, now I see what you mean!
So the redundancy service would provide the durability, using its transaction log or whatever. This is quite different from current MySQL, where every engine, as well as the binlog code, is individually responsible for durability, and then consequently the 2-phase commit is needed to ensure that everyone has the same idea of what should be durable.
So this is actually quite an appealing idea! With this system, each storage engine (and secondary replication plugins like binlog) would no longer need to ensure that transactions are durable, just that they are able to recover after a crash into a consistent state.
Yes, the idea of this model is that the main purpose of redundancy is durability, which, depending on a plugin, can be of much higher degree than flush to disk (e.g. binlog to a remote machine with the point-in-time recovery ability). There is a subtle moment here however: an asynchronous redundancy plugin won't give you that durability. So, again there are several ways to go about that: some applications may be willing to trade some durability (few last transactions) for speed. Or you can still have an option to ensure durability on engine level. Maybe something else.
This could translate concretely into (for example) setting innodb_flush_log_at_trx_commit=0 without risking to lose transactions. So no fflush() per commit in the engine (and I think therefore no need for group commit).

Exactly. That's our default setting for Galera.
So only the (primary) redundancy service would need to fflush() (or whatever) to achieve durability.
So this is quite interesting! The whole issue of coordinating the state between engines, master binlog thread, and slave threads, in the case of crash recovery, is quite a mess in current MySQL, and this indeed sounds much cleaner.
So let me list some potential issues with this that spring to my mind, and see how they could be resolved:
1. On crash recovery, the engine must be able to tell where in the RS History they have recovered themselves to, so that they can correctly be supplied with the remaining transactions that they must apply.
The best way to do this is probably to extend the storage engine API (and the engines) with a facility for this, so that the engines can log it in their transaction log. However, it seems simple enough to emulate it for engines that do not implement such extensions. It should be sufficient to create a hidden table inside the engine with a single "transaction_id" column, and just insert a new row with the ID just before committing in the engine.
Now, engine commit must be expected to fail from time to time. So it seems hard to supply the engine with the global transaction ID (as that will not be generated until engine commit_prepare has succeeded).
But we don't actually need the _global_, linearly ordered ID for this. It should be sufficient with the local transaction ID generated at the start of the transaction, as long as the redundancy service records the mapping between this and the global transaction IDs. And of course if we choose to rollback later then there will be no matching global transaction ID (which is fine).
This is an interesting option indeed. However 1) the mapping itself should be durable, so every plugin must design the way to recover it in the case of crash. 2) global transaction ID better be an integral part of the server state, otherwise it will complicate state snapshot transfer. E.g. it will be insufficient to just use mysqldump or copy db files, you'll need to carry the global transaction ID along.

The above issues are of course resolvable one way or another. It is just not obvious what is easier: fix the engine or implement a workaround (if you consider it in all entirety).

This does not end here. If we want flexibility to define Redundancy Sets (like limiting the scope of replication to certain tables or have several RS's per server) we'll need to be able to maintain several global transaction IDs and map them to RS somehow, as well as store the description of Redundancy Sets themselves. Probably in a table like (rs_name, table_list, plugin_stack, history_uuid, global_trx_ID). The simplest application: graceful server restart. (Notice that since each RS has a separate plugin stack, there are no complications on plugin side.)

The above consideration removes the global transaction ID per engine, but brings in a separate table that should implicitly participate in every transaction. I see too many variants here which I can't really assess. I'd like to hear what DB guys would say.

But true, it is essential that storage engine(s) and redundancy service(s) commit in the same order. It is however not yet clear to me that asking redundancy service first is sufficient for this, nor that it is necessary.

As far as I understand, any (semi)synchronous replication requires that.
Right ... I have to study (semi)synchronous replication techniques a bit more, I'm not too familiar with them.
Ok, I might be too rash here. (Semi)synchronous guarantee is that you don't send OK to the client until the change is replicated to (at least some) other nodes. So technically you can do replication after local commit. Still there are 3 pros to call redundancy service before commit as I mentioned before:

1) it is required for consistent ordering in multi-master case
2) it gives you durability with synchronous plugins
3) if you go asynchronous replication, why not start doing it really asynchronously, that is ASAP?

Note that if you call redundancy service last, you still have to hold commit_order_lock for the duration of the call, so you don't win anything this way.
Hm. I'm not well versed in MySQL engine interfaces, so I might be missing something here. But why can't we have something like this instead:
engine->txn_prepare(thd);
if (redundancy_service->pre_commit(thd) == SUCCESS) { engine->txn_commit(thd); redundancy_service->post_commit(thd); } else { engine->txn_rollback(thd); }
pre_commit() call locks commit_order_lock in the right order. post_commit() releases commit_order_lock and the resources that might be allocated by transaction in the redundancy service. Or, if txn_commit() is too heavy an operation, then maybe like that?:
engine->txn_prepare(thd);
if (redundancy_service->pre_commit(thd) == SUCCESS) { engine->txn_fix_order(thd); redundancy_service->post_commit(thd); engine->txn_commit(thd); } else { engine->txn_rollback(thd); }
There are (at least) two issues of interest:
1. Can engines commit in different order from redundancy service (and if they can, does it matter)
2. Can we support group commit (or obtain equivalent performance in some other way)?
To give durability, the redundancy service will typically need to fflush() when it commits and generates the global transaction ID. So having group commit means that while such fflush() for transaction T1 is running, other transactions T2, T3, ... Tn are free to start (though not complete of course) committing in the redundancy service. So that they can queue up, and once T1 fflush() is done, a single fflush() will be sufficient to commit all of T2, ... Tn.
So this means that T2 must be able to complete engine->txn_prepare() while T1 is still running inside redundancy_service->pre_commit(). There cannot be a global lock around all of the above code.
So this in turn means that T3 will be able to complete engine->txn_prepare() _before_ T2, even though T3 will commit _after_ T2 (at least in terms of redundancy service).
So the question is if T3 can complete engine->txn_prepare() before T2, yet complete txn_commit() after T2. I do not see why not, but I do not know if this is actually the case, I will need to check.
Hm, how is it different from how it is done currently in MariaDB? Does txn_commit() have to follow the same order as txn_prepare()? If not, then the commit ordering imposed by redundancy service should not be a problem.
By the way, is it even necessary to have redundancy_service->post_commit()? It seems to me that it probably is not needed?
(Though something like pre_commit() and post_commit() will probably be needed to support the SQL-level 2-phase commit/XA. This is the SQL syntax the user can use to do two-phase commit between multiple systems used in the application).
During local transaction execution some resources will inevitably be allocated for the transaction by the redundancy service (row modifications, etc.), there might be some buffers shared between the engine and the redundancy service (opaque to the redundancy service indeed), and finally, in the above code redundancy_service->pre_commit() locks commit_order_lock. The purpose of redundancy_service->post_commit() is to release those resources and the lock.

Moreover you will need redundancy_service->post_rollback() to release resources in case of transaction rollback - didn't show it in pseudocode for simplicity.

Regards, Alex -- Alexey Yurchenko, Codership Oy, www.codership.com Skype: alexey.yurchenko, Phone: +358-400-516-011
Alex Yurchenko
Yes, the idea of this model is that the main purpose of redundancy is durability, which, depending on a plugin, can be of much higher degree than flush to disk (e.g. binlog to a remote machine with the point-in-time recovery ability).
Yes.
There is a subtle moment here however: an asynchronous redundancy plugin won't give you that durability.
Well, an asynchronous replication with a local binlog like current MySQL replication could provide such durability, by replaying from local binlog.
So, again there are several ways to go about that: some applications may be willing to trade some durability (few last transactions) for speed. Or you can still have an option to ensure durability on engine level. Maybe something else.
Agree. There will be flexibility and various configuration options.
This is an interesting option indeed. However 1) the mapping itself should be durable, so every plugin must design the way to recover it in the case of crash. 2) global transaction ID better be integral part of the server state, otherwise it will complicate state snapshot transfer. E.g. it will be insufficient to just use mysqldump or copy db files, you'll need to carry global transaction ID along.
The above issues are of course resolvable one way or another. It is just not obvious what is easier: fix the engine or implement a workaround (if you consider it in all entirety)
Well, we want to fix all engines we can, and leave an option for backwards compatibility with engines we cannot. The problem is not so much if it is easy to fix the engine, the question is if it is possible? We have to pre_commit the transaction in the engine before we know the global transaction ID. And once the transaction is pre-committed, we cannot add anything to it, so cannot add the global transaction ID. Or is there a way around this?
Ok, I might be too rash here. (Semi)synchronous guarantee is that you don't send OK to client until the change is replicated to (at least some) other nodes. So technically you can do replication after local commit. Still there are 3 pros to call redundancy service before commit as I mentioned before: 1) it is required for consistent ordering in multi-master case
Right (I think with multi-master you mean synchronous replication like Galera. If so, this has probably been a source of slight confusion for me, as I think of multi-master replication as MySQL-style asynchronous replication with dual masters).
2) it gives you durability with synchronous plugins
Yes.
3) if you go asynchronous replication, why not start doing it really asynchronously, that is ASAP?
Yes, my thought as well ;-)
Note, that if you call redundancy service last, you still have to hold commit_order_lock for the duration of the call, so you don't win anything this way.
Yes. I think it makes sense for a primary redundancy service that provides durability and defines the global transaction ID, to be invoked first.
By the way, is it even necessary to have redundancy_service->post_commit()? It seems to me that it probably is not needed?
During local transaction execution some resources will be inevitably allocated for transaction by redundancy service (row modifications, etc.), there might be some buffers shared between the engine and redundancy service (opaque to redundancy service indeed), and finally, in the above code redundancy_service->pre_commit() locks commit_order_lock. redundancy_service->post_commit() purpose is to release those resources and the lock.
Yes. Makes sense. - Kristian.
On Fri, 19 Mar 2010 13:16:30 +0100, Kristian Nielsen wrote:

Alex Yurchenko writes:

This is an interesting option indeed. However 1) the mapping itself should be durable, so every plugin must design the way to recover it in the case of crash. 2) global transaction ID better be integral part of the server state, otherwise it will complicate state snapshot transfer. E.g. it will be insufficient to just use mysqldump or copy db files, you'll need to carry global transaction ID along.
The above issues are of course resolvable one way or another. It is just not obvious what is easier: fix the engine or implement a workaround (if you consider it in all entirety)
Well, we want to fix all engines we can, and leave an option for backwards compatibility with engines we cannot.
The problem is not so much if it is easy to fix the engine, the question is if it is possible? We have to pre_commit the transaction in the engine before we know the global transaction ID. And once the transaction is pre-committed, we cannot add anything to it, so cannot add the global transaction ID. Or is there a way around this?
I think that you were absolutely right here from the start: the redundancy service has to maintain the mapping local_trx_ID->global_trx_ID. I just forgot that we will expose the local transaction ID to the redundancy service anyways. Notice however the possible many-to-1 relation between redundancy plugins and RS, and therefore the global transaction ID. So I'd suggest that a unit other than the redundancy plugin would maintain this mapping.

Alex
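Such a mapping unit could be as simple as the following sketch (an assumed interface, shown only to make the idea concrete): the unit records the pairing once the commit order has been decided, and rolled-back transactions simply never get an entry.

  #include <cstdint>
  #include <mutex>
  #include <optional>
  #include <unordered_map>

  class TrxIdMap {
    std::mutex mtx_;
    std::unordered_map<uint64_t, uint64_t> local_to_global_;
  public:
    // Called once the global seqno for this transaction has been decided.
    void record(uint64_t local_trx_id, uint64_t global_seqno) {
      std::lock_guard<std::mutex> lk(mtx_);
      local_to_global_[local_trx_id] = global_seqno;
    }
    // Returns nothing for transactions that were rolled back (never mapped).
    std::optional<uint64_t> lookup(uint64_t local_trx_id) {
      std::lock_guard<std::mutex> lk(mtx_);
      auto it = local_to_global_.find(local_trx_id);
      if (it == local_to_global_.end()) return std::nullopt;
      return it->second;
    }
  };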
On Mon, Mar 22, 2010 at 2:47 AM, Alex Yurchenko
Notice however possible many-to-1 relation between redundancy plugins and RS and therefore - global transaction ID. So I'd suggest that a unit other than redundancy plugin would maintain this mapping.
Alternatively, a redundancy plugin could do it, and in the case of many plugins the api will designate which one is responsible for it (and this could be configurable). But it is arguably a waste of effort to require all redundancy services to implement the same functionality, and could complicate debugging (each will have different bugs). So in practice you're probably right.

henrik -- email: henrik.ingo@avoinelama.fi tel: +358-40-5697354 www: www.avoinelama.fi/~hingo book: www.openlife.cc
Alex Yurchenko
On Thu, 18 Mar 2010 15:18:40 +0100, Kristian Nielsen
wrote:
Hm, how is it different from how it is done currently in MariaDB? Does txn_commit() have to follow the same order as txn_prepare()? If not, then the commit ordering imposed by redundancy service should not be a problem.
Ok, I checked, and indeed there is no requirement that prepare is done in the same order as commit. In fact, there seems to be no requirement on ordering of commit among different engines and binlog at all in the server itself!

(Since the XA/2pc in MySQL assumes every engine ensures durability by itself, there is no requirement for any ordering. In case of a crash, each engine will be able to recover every transaction successfully prepared, so it is just a matter of deciding which of them to commit and which to rollback.)

So agree, there is no problem with the redundancy service imposing some order, with the purpose of enabling recovery even without durability guarantee by each individual engine.

----

Now, InnoDB _does_ have a requirement to commit in the same order as the binlog (due to InnoBackup; if not same commit order, the snapshot made by the backup may not correspond to any position in the binlog, which breaks restore).

The way this is implemented in InnoDB is by taking a global mutex in InnoDB prepare(), which is only released in InnoDB commit().

This is a really bad way to do things :-(. It means that only one (InnoDB) transaction can be running the code between prepare() and commit(). Since this is where the binlog is written (and the point where the redundancy service makes the transaction durable in our discussions), this makes it impossible to do group commit!

Again, I think a good solution to this is to have an (optional) storage engine callback fix_commit_order(). This will be called after successful prepare(), but before commit(). It should do the minimum amount of work necessary to make sure that transactions are committed in the order that fix_commit_order() is called. The upper layer (/redundancy service) will call fix_commit_order() for all transaction participants under a global mutex, ensuring correct order.

  lock(global_commit_order_mutex)
  fix_binlog_or_redundancy_service_commit_order()
  for (each storage engine)
    engine->fix_commit_order()
  unlock(global_commit_order_mutex)

(If same commit order is not needed, the fix_commit_order() can be NULL, and if all fix_commit_order() are NULL there is no need to take the mutexes).

Then InnoDB does not need to hold a global mutex across prepare() / commit(). In fact all it needs to do in fix_commit_order() is to put the transaction into a sorted list of pending commits. Then each transaction in commit() needs only wait until it is first in this list, which is _much_ better than hanging in prepare() waiting for _all_ transactions to commit! (There are other implementations possible also, of course).

- Kristian.
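To illustrate the "sorted list of pending commits" variant, here is one possible engine-side sketch (invented names; not necessarily how InnoDB would actually implement it): fix_commit_order() only records the place in the queue, and commit() waits until its transaction is at the head.

  #include <condition_variable>
  #include <cstdint>
  #include <deque>
  #include <mutex>

  class EngineCommitQueue {
    std::mutex mtx_;
    std::condition_variable cv_;
    std::deque<uint64_t> pending_;   // transaction ids in the fixed commit order
  public:
    // Called (under the server's global commit_order mutex) in commit order.
    void fix_commit_order(uint64_t trx_id) {
      std::lock_guard<std::mutex> lk(mtx_);
      pending_.push_back(trx_id);
    }
    // Called from the engine's commit(): wait for this trx to reach the head,
    // do the ordered part of the commit, then let the next transaction proceed.
    void commit_in_order(uint64_t trx_id) {
      std::unique_lock<std::mutex> lk(mtx_);
      cv_.wait(lk, [&] { return !pending_.empty() && pending_.front() == trx_id; });
      /* ... ordered part of the engine commit goes here ... */
      pending_.pop_front();
      cv_.notify_all();
    }
  };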
On Mon, 29 Mar 2010 00:02:09 +0200, Kristian Nielsen wrote:
Alex Yurchenko writes:
On Thu, 18 Mar 2010 15:18:40 +0100, Kristian Nielsen wrote:
Hm, how is it different from how it is done currently in MariaDB? Does txn_commit() have to follow the same order as txn_prepare()? If not, then the commit ordering imposed by redundancy service should not be a problem.
Ok, I checked, and indeed there is no requirement that prepare is done in same order as commit.
In fact, there seems to be no requirement on ordering of commit among different engines and binlog at all in the server itself!
(Since the XA/2pc in MySQL assumes every engine ensures durability by itself, there is no requirement for any ordering. In case of a crash, each engine will be able to recover every transaction successfully prepared, so it is just a matter of deciding which of them to commit and which to rollback.)
So agree, there is no problem with the redundancy service imposing some order, with the purpose of enabling recovery even without durability guarantee by each individual engine.
----
Now, InnoDB _does_ have a requirement to commit in the same order as the binlog (due to InnoBackup; if not same commit order, the snapshot made by the backup may not correspond to any position in the binlog, which breaks restore).
The way this is implemented in InnoDB is by taking a global mutex in InnoDB prepare(), which is only released in InnoDB commit().
This is a really bad way to do things :-(. It means that only one (InnoDB) transaction can be running the code between prepare() and commit(). Since this is where the binlog is written (and the point where the redundancy service makes the transaction durable in our discussions), this makes it impossible to do group commit!
The way I understood the above is that the global mutex is taken in InnoDB prepare() solely to synchronize binlog and InnoDB commits. Is that so? If it is, then it is precisely the thing we want to achieve, but instead of locking the global mutex in InnoDB prepare() we'll be doing it in redundancy_service->pre_commit() as discussed earlier:

    innodb->prepare();

    if (redundancy_service->pre_commit() == SUCCESS) // locks commit_order mtx
    {
        innodb->commit();
        redundancy_service->post_commit();           // unlocks commit_order mtx
    }
    ...

This way the global lock in innodb->prepare() can be naturally removed without any additional provisions. Am I missing something?

On the other hand, if we can reduce the amount of commit ordering operations to the absolute minimum, as you suggest below, it would only benefit performance. I'm just not sure about names. Essentially this means splitting commit() into 2 parts: the one that absolutely must be run under commit_order mutex protection and another that can be run outside of the critical section. I guess in that setup all actual IO can easily go into the 2nd part.
Again, I think a good solution to this is to have an (optional) storage engine callback fix_commit_order(). This will be called after successful prepare(), but before commit(). It should do the minimum amount of work necessary to make sure that transactions are committed in the order that fix_commit_order() is called. The upper layer (/redundancy service) will call fix_commit_order() for all transaction participants under a global mutex, ensuring correct order.
    lock(global_commit_order_mutex)
    fix_binlog_or_redundancy_service_commit_order()
    for (each storage engine)
        engine->fix_commit_order()
    unlock(global_commit_order_mutex)
(If same commit order is not needed, the fix_commit_order() can be NULL, and if all fix_commit_order() are NULL there is no need to take the muxes).
Then InnoDB does not need to hold a global mutex across prepare() / commit(). In fact all it needs to do in fix_commit_order() is to put the
What I'd like to correct here is that ordering is needed at least in redundancy service. You need global trx ID. And I believe storage engines won't be able to do without it either - otherwise we'll need to deal with holes in commit sequence during recovery. Also, I'd suggest to move the global_commit_order_mutex into what goes by "fix_binlog_or_redundancy_service_commit_order()" (the name is misleading - redundancy service determines the order, it does not have to fix it) in the above pseudocode. Locking it outside may seriously reduce concurrency.
transaction into a sorted list of pending commits. Then each transaction in commit() needs only wait until it is first in this list, which is _much_ better than hanging in prepare() waiting for _all_ transactions to commit!
(There are other implementations possible also, of course.)
- Kristian.
Regards, Alex -- Alexey Yurchenko, Codership Oy, www.codership.com Skype: alexey.yurchenko, Phone: +358-400-516-011
Alex Yurchenko writes:
On Mon, 29 Mar 2010 00:02:09 +0200, Kristian Nielsen wrote:
The way I understood the above is that global mutex is taken in InnoDB prepare() solely to synchronize binlog and InnoDB commits. Is that so? If
Yes.
it is, than it is precisely the thing we want to achieve, but instead of locking global mutex in Innodb prepare() we'll be doing it in redundancy_service->pre_commit() as discussed earlier:
innodb->prepare();
    innodb->prepare();

    if (redundancy_service->pre_commit() == SUCCESS) // locks commit_order mtx
    {
        innodb->commit();
        redundancy_service->post_commit();           // unlocks commit_order mtx
    }
    ...
Yes. This way will prevent group commit in InnoDB, as here innodb->commit() does fsync() under a global mutex.
This way global lock in innnodb->prepare() can be naturally removed without any additional provisions. Am I missing something?
Agree that this removes the need for innodb to take its lock in prepare() and release in commit().
On the other hand, if we can reduce the amount of commit ordering operations to the absolute minimum, as you suggest below, it would only benefit performance. I'm just not sure about names. Essentially this means splitting commit() into 2 parts: the one that absolutely must be run under commit_order mutex protection and another that can be run outside of the critical section. I guess in that setup all actual IO can easily go into the 2nd part.
Yes (I did not think long about the names, probably better names can be devised).
    lock(global_commit_order_mutex)
    fix_binlog_or_redundancy_service_commit_order()
    for (each storage engine)
        engine->fix_commit_order()
    unlock(global_commit_order_mutex)
What I'd like to correct here is that ordering is needed at least in redundancy service. You need global trx ID. And I believe storage engines won't be able to do without it either - otherwise we'll need to deal with holes in commit sequence during recovery.
Yes.
Also, I'd suggest to move the global_commit_order_mutex into what goes by "fix_binlog_or_redundancy_service_commit_order()" (the name is misleading - redundancy service determines the order, it does not have to fix it) in the above pseudocode. Locking it outside may seriously reduce concurrency.
Agree (in fact, though I did not say so explicitly, I thought of the entire pseudo code above as being in fact implemented inside the redundancy service plugin). - Kristian.
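Pulling this sub-thread together, here is a hedged sketch of what the upper-layer commit path could look like if, as agreed above, the pseudocode lives inside the redundancy service plugin. The names (log_and_assign_global_trx_id, wait_durable, fix_commit_order) follow the discussion loosely; everything else is an illustrative stand-in rather than the real server API:

    #include <mutex>
    #include <vector>

    struct Trx {};                                  // per-transaction context (stand-in)

    struct Engine {
      bool prepare(Trx &) { return true; }          // make the trx recoverable
      void fix_commit_order(Trx &) {}               // optional: record commit order, cheap
      void commit(Trx &) {}                         // expensive part: may fsync, can be grouped
      bool has_fix_commit_order = true;
    };

    struct RedundancyService {
      void log_and_assign_global_trx_id(Trx &) {}   // ordered: defines the global commit order
      void wait_durable(Trx &) {}                   // unordered: e.g. a shared/group fsync of the log
    };

    std::mutex global_commit_order_mutex;

    bool commit_transaction(RedundancyService &rs, std::vector<Engine> &engines, Trx &trx)
    {
      for (Engine &e : engines)
        if (!e.prepare(trx))
          return false;                             // caller would roll back all participants

      {
        // Ordered critical section: short, no I/O, held only long enough to
        // assign the global trx ID and tell each engine its commit order.
        std::lock_guard<std::mutex> lk(global_commit_order_mutex);
        rs.log_and_assign_global_trx_id(trx);
        for (Engine &e : engines)
          if (e.has_fix_commit_order)               // callback is optional
            e.fix_commit_order(trx);
      }

      rs.wait_durable(trx);                         // group commit can happen here
      for (Engine &e : engines)
        e.commit(trx);                              // each engine waits for its turn internally
      return true;
    }

With the mutex held only around the ordering step, several transactions can share one fsync of the log and of the engines, which is the group commit this thread is after.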
This is a really long thread so a summary elsewhere would be great for
people like me.
I think Alex mentioned that he needs the commit protocol to be changed
so that the binlog/commit-log/commit-service/redundancy-service
guarantees commit and the storage engine does not. If that is the
case, the storage engine can do async commits. As long as it recovers
to some point in time and tells the binlog what the point in time was
(must know XID), then the binlog can give it the transactions it lost
during crash recovery. Here 'binlog' is what guarantees commit and
could be something other than a file on the master. I want something
like this. It means that we don't need to use XA internally which
currently costs 3 fsyncs per commit (2 shared, 1 not). We are changing
MySQL to really do group commit and that will change the cost to 3
shared fsyncs. But what I think you have described here is a huge
improvement.
As a further optimization, I want a callback that is called after the
binlog entries are written for a transaction and before the wait for
group commit on the fsync is done. That callback will be used to
release row locks (optionally) held by the transaction.
-- Mark Callaghan mdcallag@gmail.com
MARK CALLAGHAN writes:
This is a really long thread so a summary elsewhere would be great for people like me.
I agree that the discussion has become quite long. I summarised the group commit part of it on my blog: http://kristiannielsen.livejournal.com/12254.html http://kristiannielsen.livejournal.com/12408.html http://kristiannielsen.livejournal.com/12553.html
I think Alex mentioned that he needs the commit protocol to be changed so that the binlog/commit-log/commit-service/redundancy-service guarantees commit and the storage engine does not. If that is the case, the storage engine can do async commits. As long as it recovers to some point in time and tells the binlog what the point in time was (must know XID), then the binlog can give it the transactions it lost during crash recovery. Here 'binlog' is what guarantees commit and could be something other than a file on the master. I want something like this. It means that we don't need to use XA internally which currently costs 3 fsyncs per commit (2 shared, 1 not). We are changing MySQL to really do group commit and that will change the cost to 3 shared fsyncs. But what I think you have described here is a huge improvement.
Yes, it sounds quite promising.
As a further optimization, I want a callback that is called after the binlog entries are written for a transaction and before the wait for group commit on the fsync is done. That callback will be used to release row locks (optionally) held by the transaction.
I think the point here is that the locks must not be released until the order in the binlog has been determined, right? So that any transaction order enforced by the log will be the same on the slave. So the callback might be called before or after the actual write of the binlog, but only after (not before) determining the order of such write? I think this could be handled by the xa_prepare_fast() and/or the commit_fast() callbacks that I propose in the third article referenced above. BTW, it was great to discuss these issues with you at the MySQL Conference! - Kristian.
On Fri, Apr 23, 2010 at 2:35 AM, Kristian Nielsen wrote:
MARK CALLAGHAN writes:
This is a really long thread so a summary elsewhere would be great for people like me.
I agree that the discussion has become quite long. I summarised the group commit part of it on my blog:
http://kristiannielsen.livejournal.com/12254.html http://kristiannielsen.livejournal.com/12408.html http://kristiannielsen.livejournal.com/12553.html
Hopefully I sent a few more people to your posts -- http://www.facebook.com/MySQLatFacebook
Someone even asked about MariaDB.
I think the point here is that the locks must not be released until the order in the binlog has been determined, right? So that any transaction order enforced by the log will be the same on the slave. So the callback might be called before or after the actual write of the binlog, but only after (not before) determining the order of such write?
I think this could be handled by the xa_prepare_fast() and/or the commit_fast() callbacks that I propose in the third article referenced above.
I think it can. For us, row lock release order determines the order in which the binlog is written. You are trying to fix this the right way. We must make compromises to get this running soon in production.
BTW, it was great to discuss these issues with you at the MySQL Conference!
I enjoyed that too. I hope to read more about the changes that MariaDB will do for replication and especially the new replication projects done by the Galera and Tungsten teams. -- Mark Callaghan mdcallag@gmail.com
MARK CALLAGHAN writes:
As a further optimization, I want a callback that is called after the binlog entries are written for a transaction and before the wait for group commit on the fsync is done. That callback will be used to release row locks (optionally) held by the transaction.
I was thinking about this idea of releasing row locks early. Specifically about two scenarios: 1) releasing row locks early before having fully decided whether to commit or roll back a transaction; 2) releasing row locks early before changes in the transaction are made visible to other transactions.

For issue 1: Transaction A releases row locks at end of innodb_xa_prepare(). Transaction B starts modifying rows touched by transaction A (eg. UPDATE t SET a=a+1). Transaction A then needs to rollback due to eg. failure during binlog write/fsync. Will transaction B now end up with the wrong value for a?

For issue 2: Transaction A inserts a row and releases locks early before making the row visible to other transactions. Transaction B inserts the same row. Will we get a duplicate key in the database (corruption) due to the uniqueness check in transaction B not seeing the row from A? Or will we get a bogus uniqueness violation in transaction B if A rolls back?

Have you already found an answer for issues such as these?

One way to avoid these kinds of issues seems to me to be

 A. Do not release row locks until changes made by the transaction have been made visible.
 B. Do not make changes visible until we are fully decided to commit the transaction.

So basically, what I am asking is if you are doing A+B, or if you found a way to release row locks even earlier. Or put another way, exactly how early are you able to release row locks?

---

What I currently have is two callbacks for the engine:

 - One that is called after engine prepare() fsync, and before binlog write/fsync.
 - One that is called after binlog write/fsync, and before engine commit() fsync.

These callbacks are guaranteed to be called in the same order that transactions are written to the binlog. So we can probably release row locks in the second callback. The question is if we can release earlier, in the first callback? And if not, do I need to add a callback that is called after the binlog is successfully written, but before it is fsync'ed to disk?

 - Kristian.
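To make the placement of those two callbacks concrete, here is a small illustrative sketch of the per-transaction commit pipeline. The hook names (after_prepare_ordered, after_binlog_ordered) are invented for the example; the point is only where in the sequence each one fires:

    // Illustrative pipeline for one transaction; all names are stand-ins.
    struct Engine {
      void prepare_and_fsync() {}       // transaction becomes recoverable in the engine
      void after_prepare_ordered() {}   // hook 1: binlog order decided, binlog not yet written
      void after_binlog_ordered() {}    // hook 2: binlog written/fsync'ed, engine commit pending
      void commit_and_fsync() {}
    };

    struct Binlog {
      void write_trx() {}
      void fsync() {}                   // in a group commit design this fsync is shared
    };

    void commit_one_transaction(Engine &engine, Binlog &binlog)
    {
      engine.prepare_and_fsync();

      // Both hooks are invoked in binlog commit order across transactions,
      // so releasing row locks here cannot let a later transaction overtake
      // this one in the binlog.
      engine.after_prepare_ordered();   // hook 1: after prepare, before binlog write/fsync

      binlog.write_trx();
      binlog.fsync();

      engine.after_binlog_ordered();    // hook 2: after binlog is durable, before engine commit

      engine.commit_and_fsync();
    }

Whether row locks may already be released in the first hook is exactly the open question (issues 1 and 2) raised in the mail above.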
Hi Kristian
I don't know why I'm reading this on a Sunday morning, but just a comment without thinking much:
On Fri, Apr 30, 2010 at 10:32 PM, Kristian Nielsen wrote:
I was thinking about this idea of releasing row locks early. Specifically about two scenarios: 1) releasing row locks early before having fully decided whether to commit or roll back a transaction; 2) releasing row locks early before changes in the transaction are made visible to other transactions.
For issue 1: Transaction A releases row locks at end of innodb_xa_prepare(). Transaction B starts modifying rows touched by transaction A (eg. UPDATE t SET a=a+1). Transaction A then needs to rollback due to eg. failure during binlog write/fsync. Will transaction B now end up with the wrong value for a?
Yes? It seems to me you could still allow this with the condition that transaction B cannot commit before A has committed. So in the implementation, A needs to maintain a list of dependent transactions, and if A is rolled back, also B will be rolled back.

What is completely unclear to me, and I will not spend time thinking about this today, is whether this maps cleanly to any of the ANSI SQL isolation levels, since we are kind of seeing uncommitted data, kind of not...

henrik -- email: henrik.ingo@avoinelama.fi tel: +358-40-5697354 www: www.avoinelama.fi/~hingo book: www.openlife.cc
Hi!
"Henrik" == Henrik Ingo
writes:
Henrik> Hi Kristian
Henrik> I don't know why I'm reading this on a Sunday morning, but just a
Henrik> comment without thinking much:
Henrik> On Fri, Apr 30, 2010 at 10:32 PM, Kristian Nielsen
Henrik>
I was thinking about this idea of releasing row locks early. Specifically about two scenarios: 1) releasing row locks early before having fully decided whether to commit or roll back a transaction; 2) releasing row locks early before changes in the transaction are made visible to other transactions.
For issue 1: Transaction A releases row locks at end of innodb_xa_prepare(). Transaction B starts modifying rows touched by transaction A (eg. UPDATE t SET a=a+1). Transaction A then needs to rollback due to eg. failure during binlog write/fsync. Will transaction B now end up with the wrong value for a?
It depends on how InnoDB works when it comes to MVCC.

In no case should InnoDB be allowed to change a row that was changed by another not committed transaction. This would effectively stop InnoDB from being able to rollback the not committed transaction. You can't roll back something that is changed by anyone else, at least the way I think InnoDB (and Maria) is implementing MVCC.

However, one should be able to release all read locks early.

One could theoretically be able to remove the write locks too, if InnoDB is using the version id to check if the row is visible/updated or not before accepting another transaction to change it. This is how Falcon detected conflicts.

Regards,
Monty
On Tue, Mar 16, 2010 at 7:32 AM, Alex Yurchenko wrote:
I think "a cluster that outwards presents a consistent transactional view, yet internally does not have a total ordering on transactions" is an internally contradictory concept. Suppose node1 committed T1, but not T2 yet, and node2 committed T2, but not T1 yet. Now do database dumps from those nodes. Which one will represent a consistent transactional view?
I'm not sure if this is the same as Kristian is thinking about, but I don't see it as internally contradictory. See below...
I guess you're trying to make a cause for parallel transaction applying on slaves here. You can do parallel out of order applying alright. It is just
This is relevant to being able to do parallel slave execution, but not restricted to that. Actually, what Kristian is describing can even happen on a single-node, multi-threaded database. It is not strictly an issue with clusters at all. I'll borrow my old example from http://forge.mysql.com/worklog/task.php?id=4648

We have 2 tables t1 and t2, and 4 transactions applied in a sequence. Assume that all 4 transactions are done "at the same time" by 4 different threads. The sequence randomly ends up being:

    /* Assume all columns are "0" to start */
    /* 0 */ (SET AUTOCOMMIT=1)
    /* 1 */ UPDATE t1 SET f1=1, f2=2, timestamp=001 WHERE id=1;
    /* 2 */ UPDATE t2 SET f1=3, f2=4, timestamp=002 WHERE id=1;
    /* 3 */ UPDATE t1 SET f1=5, f2=6, timestamp=003 WHERE id=1;
    /* 4 */ UPDATE t2 SET f1=7, f2=8, timestamp=004 WHERE id=1;

Now, depending on how the database internally commits transactions and writes a transaction log, it may impose a total ordering on these transactions or not. If you now replay these transactions (such as apply them on a slave), what happens if you do:

    /* 1 */ UPDATE t1 SET f1=1, f2=2, timestamp=001 WHERE id=1;
    /* 3 */ UPDATE t1 SET f1=5, f2=6, timestamp=003 WHERE id=1;
    /* 4 */ UPDATE t2 SET f1=7, f2=8, timestamp=004 WHERE id=1;
    /* The following statement is discarded since NEW.timestamp < OLD.timestamp */
    /* 2 */ UPDATE t2 SET f1=3, f2=4, timestamp=002 WHERE id=1;

Depending on what you want to achieve, skipping transaction #2 is ok or bad. The (slave) state between #3 and #4 is now something that never existed in the original (master) database. However, taken the assumption that transactions were independently executed "at the same time", the ordering was determined by chance, so it is a state that *could have existed* on the original database. Whether it actually could have existed depends on our assumptions of what the application is doing. If transactions 1-4 are really independent, it could have existed. If they are interrelated - such as tables t1 and t2 depending on each other - the application could never actually produce such a state.

I agree with Alex that if a 100% total ordering is not imposed on the transactions, it is not correct to say that the second replay (the slave) is a *replica* of the first run (the master), since it can end up in states that never existed on the master. However the original sentence "a cluster that outwards presents a consistent transactional view, yet internally does not have a total ordering on transactions" is not contradictory in itself, it is something that can happen even on a single node database.

The argument in favor of "allowing" behavior like the above is that we should be allowed to assume that an application always commits transactions that produce consistent end states. So if some "nearby" transactions are then replicated in a different order, each state is still meaningful to the application, even if some states may not have actually existed on the master if you look at the database as a whole.

As an example, a bank has a database that reflects how much money we have in our accounts. If I transfer 100€ to you, then the withdrawal and deposit are done in the same transaction and the database state is consistent at the end of the transaction. But if I give 100€ to charity and you win 100€ on the lottery, these are independent transactions. For practical purposes no application should care which way these transactions were committed.
Note that also other properties, for instance the total amount of money the bank has, would still always be correct regardless of which way those 2 individual transactions were committed. So if we now replicate the bank database, it is not in practice important to replicate the transactions in exactly the same sequence as they happened to execute on the master database. It is only in theory that it is a problem that the slave may have states (when looking at the database as a whole) that didn't actually exist on the master and therefore are not replicas.

So while in theory any replication should of course reproduce exactly the same states as the "original" database, in practice, if you can get significant performance gains by not imposing a total ordering between each and every transaction, that is a practical approach that should be allowed. (At least by an option, since ultimately it is the app developer or DBA who can decide if such replication is "safe" or not.)
The point is that if you want _replication_, you gotta have a linear order. Without it you can't compare db states and therefore can't say that this db is a replica of that db. What you end up with is a so called "eventual consistency", and "eventual" may never happen.
Theoretically speaking, yes.
3. Uncoupled multiple masters, like multiple nodes in MySQL asynchronous replication. In this case there really is no total linear order, as the transactions are not coordinated between servers. At most there would be an approximate order like a time stamp, but that would be not 100% reliable due to time skew and such.
That's what the concept of Replication Set was about below. In the scenario described above you effectively have multiple replication sets and multiple logical clusters. So each one of them will have its own global transaction ID sequence. If they are so much uncoupled there is no point in considering them a single cluster.
What you say above is significant. At least for myself, when speaking of "replication" I think of solving both the problem of running a cluster (as you define as tightly coupled above) and replication between "uncoupled" databases (replication sets), such as 2 datacenters across 2 continents. Trying to force a strict total ordering across 2 datacenters seemed like a bad idea, now I understand you never intended to do that.
I guess I have addressed this above already. In short, parallel applying does not need to mean out-of-order commit.
You still end up with some form of group commit. Continuing from above, there are also applications that do multi-master replication but the masters are "uncoupled" out of necessity. An example is when you want to have geographical redundancy and allow writes to all datacenters. As we have agreed above, it is not practical to force a total ordering globally across the changesets in each datacenter, but they still replicate to each other. (Using conflict resolution such as based on a timestamp, or trying to guarantee that one datacenter is always the single master for a certain subset of records.) Apparently our mobile phone networks work this way (the HLR). In such a setup, you can do parallel applying all you want, since there is no "one true" ordering of the transactions anyway.
In other words, is it necessary to enforce such total order at the level of the API, instead of just introducing it at the point where it becomes needed, and if so, why?
There are at least 2 problems with not having linearly ordered commits:
1) Database states are incomparable. It is impossible (ok, impractically hard) to tell if two databases have identical contents and determine which transactions are missing. How will you join a node to a cluster in this case?
2) Causality preservation. Suppose you commit something on master and now want to select results from slave. How will you tell that slave already has committed these results?
I don't exactly understand how you separate "level of the API" and "the point where is becomes needed", do you mean code separation or making linear ordering an optional feature of the redundancy service?
In any case, I think we all understand quite well what database replication IS and what the redundancy guarantees are when commits are linearly ordered. I'd like to hear a more detailed concept of redundancy in the absence of linearly ordered commits. What is redundancy in that case, what are the guarantees? How will recovery be performed, and so on.
I think you yourself also admitted that in some cases the "replicas" cannot be so tightly coupled that a globally consistent linear ordering would be possible/desirable. It is true that some operations like joining a new node become simple and convenient when you have a linear ordering to refer to. But there are situations when you don't have that, for some reason, the most obvious use case being geographical replication.
I don't think that you need 2PC between redundancy service and the storage engines, because redundancy service never fails. Well, when it fails, you have something more important to worry about than disk flushes anyways.
How does synchronous replication happen without 2PC? henrik (choosing to take part in something interesting rather than doing my real work :-) -- email: henrik.ingo@avoinelama.fi tel: +358-40-5697354 www: www.avoinelama.fi/~hingo book: www.openlife.cc
Hi Ingo!
Your e-mail is totally relevant and I have almost nothing there to respond to in particular - it's all as you say, I have no essential remarks. Instead I want to respond to it in whole, thus I'll omit a lengthy quote; suffice to say that it is a direct response.
The problem is that you cannot really design and program by use cases,
unorthodox as it may sound. You cannot throw an arbitrary bunch of use
cases as input and get code as output (that is in a finite time and of
finite quality). Whether you like it or not, you always program some model.
It is by definition that a program is a description of some model. If you
have not settled on a model, you're in trouble - and that's where mysql
replication is. This is a direct consequence of trying to satisfy a bunch
of use cases without first putting them in a perspective of some general
abstract model. I mention this not to belittle anything or anyone -
everybody makes mistakes. But the subject of this thread is "Ideas for
improving MariaDB/MySQL replication", and so mistakes should be learned
upon, but not repeated.
Let me refer to the following analogy: suppose you want to create a
transport agency. To transport stuff. You know, people, animals, cargo -
stuff. There's a billion of use cases. But when you get to it you have
models to choose (thankfully there are already models for that, you don't
have to develop one). E.g. you can transport by air or by land. And each of
these models has it own laws and limitations. Like you reliably cannot
transport by land faster than at 200 km/h. You cannot transport a lot of
cargo by air, as well as you can't have stops every 10km to pick up
passengers. So you gotta settle on the model that suits you most.
Now you can say that why? Why not choose both models? Well, notice that
they are still models. There is a whole lot of other use cases that you
cannot satisfy by them. Next, do you know many companies that do both land
and air transportation? You can own both of them indeed, but for the sake
of efficiency they'll be different companies because aside from
load()/unload() functions their interfaces, internals and logistics are
likely to be very different.
This is a clumsy analogy indeed, but I hope it helps.
So now we have a proposed model based on Redundancy Sets, linearly ordered
global transaction IDs and ordered commits. We pretty much understand how
it will work, what sort of redundancy it will provide and, as you agreed,
is easy to use for recovery and node joining. It satisfies a whole bunch of
use cases, even those where ordering of commits is not strictly required.
Perhaps we won't be able to have some optimizations where we could have had
them without ordering of commits, but the benefit of such optimizations is
highly questionable IMO. MySQL/Galera is a practical implementation of such
model, may be not exactly what we want to achieve here, but it gives a good
estimate of performance and performance is good.
Now this model may not fit, for instance, NDB-like use case. What options
do we have here?
1) Extend somehow the proposed model to satisfy NDB use case. I don't see
it likely. Because, as you agreed, NDB is not really about redundancy, it
is about performance. Redundancy is quite specific there. And it is not by
chance that it is hard to migrate applications to use it.
2) Develop a totally different model to describe NDB use case and have it
as a different API. Which is exactly what it is right now if I'm not
mistaken. So that it just falls out of scope of today's topic.
There is one more option - just forget about NDB use case which may be
there only because there is nothing better. There are other ways to get
partitioning and replication to work together without pushing them behind
the same interface. E.g. you can have "replication cluster" of "partition
clusters" - or "partition cluster" of "replication clusters" (i.e. each
replication cluster replicating a single partition)
Disclaimer: NDB use case was taken as an example.
The bottom line - you can't just say that sometimes we don't need total ordering of commits. You gotta put it in the model.
On Wed, 17 Mar 2010 13:03:02 +0200, Henrik Ingo wrote:
I don't think that you need 2PC between redundancy service and the storage engines, because redundancy service never fails. Well, when it fails, you have something more important to worry about than disk flushes anyways.
How does synchronous replication happen without 2PC?
It does, it does. E.g. it does so in MySQL/Galera, see my response to Kristian. Actually how can it work otherwise? What is the meaning of prepare() in replication step? How can engine commit fail at this point except for the crash? Regards, Alex -- Alexey Yurchenko, Codership Oy, www.codership.com Skype: alexey.yurchenko, Phone: +358-400-516-011
On Wed, Mar 17, 2010 at 9:01 PM, Alex Yurchenko wrote:
The problem is that you cannot really design and program by use cases, unorthodox as it may sound. You cannot throw an arbitrary bunch of use cases as input and get code as output (that is in a finite time and of finite quality). Whether you like it or not, you always program some model.
Uh, I'm not sure I can accept this proposition. At least it seems contradictory to MariaDB's vision of being a practical, user and customer driven, database. As I see it, for real world applications, you should always start with use cases. But it is ok if you want to come back to me and say that a subset of use cases should be discarded because they are too difficult to service, or even contradict each other. But just saying that you'd like to implement an abstract model without connection to any use cases sounds dangerous to me. I'm also a fan of abstract thinking though. Sometimes you can get great innovations from starting with a nice abstract model, and then ask yourself which real world problems it would (and would not) solve. Either way, you end up with anchoring yourself in real world use cases.
It is by definition that a program is a description of some model. If you have not settled on a model, you're in trouble - and that's where mysql replication is. This is a direct consequence of trying to satisfy a bunch of use cases without first putting them in a perspective of some general abstract model.
Yes. It is ok to say that just use cases without some "umbrella" like an abstract model will just lead to chaos.
So now we have a proposed model based on Redundancy Sets, linearly ordered global transaction IDs and ordered commits. We pretty much understand how it will work, what sort of redundancy it will provide and, as you agreed, is easy to use for recovery and node joining. It satisfies a whole bunch of use cases, even those where ordering of commits is not strictly required. Perhaps we won't be able to have some optimizations where we could have had them without ordering of commits, but the benefit of such optimizations is highly questionable IMO. MySQL/Galera is a practical implementation of such model, may be not exactly what we want to achieve here, but it gives a good estimate of performance and performance is good.
Back on track: So the API should of course implement something which has as broad applicability as possible. This is the whole point of questioning you, since now you have just suggested a model which happens to nicely satisfy Galera's needs :-) But another real-world argument you can make is that we don't need parallel replication for speed, because at least Galera does well without it. That should then be benchmarked by someone. The real-world requirement here is after all "speed", not "parallel replication".
Now this model may not fit, for instance, NDB-like use case. What options do we have here?
1) Extend somehow the proposed model to satisfy NDB use case. I don't see it likely. Because, as you agreed, NDB is not really about redundancy, it is about performance. Redundancy is quite specific there. And it is not by chance that it is hard to migrate applications to use it. <cut>
Actually, I don't think the issues with migration/performance have anything at all to do with how it does replication. (It does have to do with the partitioning/sharding and just limitations of the MySQL storage engine interface.)

But we should distinguish 2 things here: How NDB does its own cluster internal (node-to-node) replication can for our purposes be considered as an engine-internal issue. Otoh MySQL Cluster also uses the standard MySQL replication and binlog. From there we can derive some interesting behavior that we should certainly support in the replication API. Ie hypothetically MySQL Cluster could use our replication api for geographical replication, as it uses MySQL replication today, but there could also be some other engine with these same requirements. The requirements I can think of are:

1) As Kristian explained, transactions are often committed on only one pair or a few pairs of nodes, but not all nodes (partitions) in the cluster. The only cluster-global (or database-global) sync point is the epoch, which will collect several transactions packed between cluster-wide heartbeats. To restore to a consistent cluster-wide state, you must choose the border between 2 epochs, not just any transaction.

-> A consequence for the mysql binlog and replication is that the whole epoch is today considered one large transaction. I don't know if this has any consequence for our discussion, other than the "transactions" (epochs) being large. A nice feature here could be support for "groups of transactions" (not to be confused with group commit) or sub-transactions, whichever way you prefer to look at it. This way an engine like NDB could send information about both the epoch and each individual transaction inside the epoch to the redundancy services. (The redundancy services then may or may not use that info, but the API could support it.)

2) Sending of transactions to the mysql binlog is asynchronous, totally decoupled from the actual commit that happens in the data nodes. The reason is that a central binlog would otherwise become a bottleneck in an otherwise distributed cluster.

-> This is ok also in our current discussion. If the engine doesn't want to include the replication api in a commit, it just doesn't do so and there's nothing we can or need to do about it. For instance in the case of NDB it is NDB that gives you adequate guarantees for redundancy; the use of the mysql binlog is for other reasons. (Asynchronous geographical replication, and potentially playback and point-in-time restoring of transactions.)

3) Transactions arrive at the mysql binlog in a somewhat random order, and it is impossible to know which order they actually committed in. Due to (2) NDB does not want to sync with a central provider of global transaction IDs either.

-> When transactions arrive at the replication api, the NDB side may just act as if they are being committed, even if they already have been committed in the engine. The replication api would then happily assign global transaction IDs to the transactions. As in (2), this makes redundancy services behind this api unusable for database recovery or node recovery; the engine must guarantee that functionality (which they do today anyway, in particular NDB).

-> Transactions "committed" to the replication api become linearly ordered, even if this order does not 100% correspond to the real order in which the engine committed them originally. However, I don't see a problem with this at this point.
-> Assuming that there would be benefit on an asynchronous slave to do parallel replication, it would be advantageous to be able to commit transactions "out of order". For instance if we introduce the concept of transaction groups (or sub-transactions), a slave could decide to commit transactions in random order inside a group, but would have to sync at the boundary of a transaction group. (This requirement may in fact worsen performance, since in every epoch you would still have to wait for the longest running transaction.)

So those are the requirements I could derive from having NDB use our to-be-implemented API. My conclusion from the above is that we should consider adding to the model the concept of a transaction group, which:

-> the engine (or MariaDB server, for multi-engine transactions?) MAY provide information of which transactions had been committed within the same group.

-> If such information was provided, a redundancy service MAY process transactions inside a group in parallel or out of order, but MUST make sure that all transactions in transaction group G1 are processed/committed before the first transaction in G2 is processed/committed.
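As a hedged sketch of how such transaction-group information could surface at the API level, the fragment below attaches a group id (for example an NDB epoch) to each replicated transaction and shows a slave-side apply loop honouring the MUST rule. All names are invented for the illustration, not a proposed interface:

    #include <cstdint>
    #include <vector>

    struct ReplicatedTrx {
      uint64_t global_trx_id;
      uint64_t group_id;        // transactions sharing a group_id may be applied
                                // in parallel or out of order; 0 = no group info
    };

    void apply_one(const ReplicatedTrx &) {}   // stub: hand the trx to an applier thread
    void wait_for_appliers() {}                // stub: barrier until all handed-out trxs committed

    // Everything in group G1 is committed before the first transaction of G2 starts.
    void apply_stream(const std::vector<ReplicatedTrx> &stream)
    {
      uint64_t current_group = 0;
      for (const ReplicatedTrx &trx : stream) {
        if (trx.group_id != current_group) {
          wait_for_appliers();                 // group boundary: sync point
          current_group = trx.group_id;
        }
        apply_one(trx);                        // within a group: any order is allowed
      }
      wait_for_appliers();
    }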
2) Develop a totally different model to describe NDB use case and have it as a different API. Which is exactly what it is right now if I'm not mistaken. So that it just falls out of scope of today's topic.
We should not include the NDB internal replication in this discussion. Or we might, in the sense that real world examples can give good ideas on implementation details and corner cases. But it is not a requirement. How NDB uses the MySQL row based replication is imho an interesting topic to take into account. henrik -- email: henrik.ingo@avoinelama.fi tel: +358-40-5697354 www: www.avoinelama.fi/~hingo book: www.openlife.cc
On Sat, 20 Mar 2010 13:52:47 +0200, Henrik Ingo wrote:
On Wed, Mar 17, 2010 at 9:01 PM, Alex Yurchenko wrote:
The problem is that you cannot really design and program by use cases, unorthodox as it may sound. You cannot throw an arbitrary bunch of use cases as input and get code as output (that is in a finite time and of finite quality). Whether you like it or not, you always program some model.
Uh, I'm not sure I can accept this proposition. At least it seems contradictory to MariaDB's vision of being a practical, user and customer driven, database.
I do understand the desire to marry marketing to software design, but they are simply unrelated areas of human activity. "Computer science" is called "science" because there are real laws which no marketing genius can invalidate. So YMMV.
As I see it, for real world applications, you should always start with use cases. But it is ok if you want to come back to me and say that a subset of use cases should be discarded because they are too difficult to service, or even contradict each other. But just saying that you'd like to implement an abstract model without connection to any use cases sounds dangerous to me.
I never suggested to implement a model without connection to use cases, and I believe I went to sufficient lengths to explain how proposed model can satisfy a broad range of use cases. What I was saying, that you're always programming a model, not use cases and therefore anything that you want to implement must be expressed in terms of the model. In this connection saying that you have a use case that does not need linearly ordered commits really means nothing. Either you need to propose another model, live with linearly ordered commits or drop the case. Either way it has no effect on the design of this model implementation, because linearly ordered commits IS the model. You cannot throw them out without breaking the rest of the concept. So much for the usefulness of use cases in high-level design: some of them fit, some of them don't.
I'm also a fan of abstract thinking though. Sometimes you can get great innovations from starting with a nice abstract model, and then ask yourself which real world problems it would (and would not) solve.
And that's exactly what I'm trying to do in this thread - start with a model, not use cases.
Either way, you end up with anchoring yourself in real world use cases.
Well, when you start with a model, it means that you use it as a reference stick to accept or reject use cases, doesn't it? So that makes the model an anchor. And leaves use cases only as means to see how practical the model is. And there is another curious property to models: the more abstract is the model (i.e. the less it is rooted in use cases), the more use cases it can satisfy. Once you stop designing specifically for asynchronous replication, you find out that the same scheme works for synchronous too.
So now we have a proposed model based on Redundancy Sets, linearly ordered global transaction IDs and ordered commits. We pretty much understand how it will work, what sort of redundancy it will provide and, as you agreed, is easy to use for recovery and node joining. It satisfies a whole bunch of use cases, even those where ordering of commits is not strictly required. Perhaps we won't be able to have some optimizations where we could have had them without ordering of commits, but the benefit of such optimizations is highly questionable IMO. MySQL/Galera is a practical implementation of such model, may be not exactly what we want to achieve here, but it gives a good estimate of performance and performance is good.
Back on track: So the API should of course implement something which has as broad applicability as possible. This is the whole point of questioning you, since now you have just suggested a model which happens to nicely satisfy Galera's needs :-)
Well, this may seem like it because Galera is the only explicit implementation of that model. But the truth is Galera is possible only because this model was explicitly followed. And this model didn't come out of thin air. It is a result of years of research and experience - not only ours. For example, MySQL|MariaDB is already implementing large portion of the proposed model by representing evolution of a database as a _series_ of atomic changes recorded in a binlog. In fact it had global transaction IDs from day one. They are just expressed in the way that makes sense only in the context of a given file on a given server. Had they been recognized as global transaction IDs, implementing a mapping from a file offset to an ordinal number is below trivial. Then we would not be having 3rd party patches applicable only to MySQL 5.0. (Let's face it, global transaction IDs in master-slave replication are so trivial they are practically built in.) The reason why there is no nice replication API in MariaDB yet is that this model was never explicitly recognized. And API is a description of a model. You cannot describe what you don't recognize ;) So in reality I am not proposing anything new or specific to Galera. I'm just suggesting to recognize what you already have there (and proposing the abstractions to express it). <cut>
So those are the requirements I could derive from having NDB use our to-be-implemented API. My conclusion from the above is that we should consider adding to the model the concept of a transaction group, which: -> the engine (or MariaDB server, for multi-engine transactions?) MAY provide information of which transactions had been committed within the same group. -> If such information was provided, a redundancy service MAY process transactions inside a group in parallel or out of order, but MUST make sure that all transactions in transaction group G1 are processed/committed before the first transaction in G2 is processed/comitted.
Well, that's a pretty cool concept. One way to call it is "controlled eventual consistency". But does the redundancy service have to know about it?

First of all, these groups are just superpositions of individual atomic transactions. That is, this CAN be implemented on top of the current model. Secondly, transaction applying is done by the engine, so the engine or the server HAS to have support for this, both on the master and on the slave side. So why not keep the redundancy service API free from that at all? Consider this scheme:

    Database Server    |    Redundancy Service
    (database data)    |    (redundancy information)
                       |
                 Redundancy API

The task of the redundancy service is to store and provide redundancy information that can be used in restoring the database to a desired state. Keeping the information and using it are two different things. The purpose of the API is to separate one part of the program from the logic of another. So I'd keep the model and the API as simple and as free from the server details as possible.

What it means here: the redundancy service stores atomic database changes in a certain order and it guarantees that it will return these changes in the same order. This is sufficient to restore the database to any state it had. It is up to the server in what order it will apply these changes and if it wants to skip some states. (This assumes that the changesets are opaque to the redundancy service and the server can include whatever information it wants in them, including ordering prefixes.)
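As an illustration of how small such an interface could be, here is a hedged sketch of a redundancy service that only stores opaque changesets and returns them in the same order. The names and signatures are invented for the example (and a real implementation would of course need durability and locking), not a proposed API:

    #include <cstddef>
    #include <cstdint>
    #include <string>
    #include <vector>

    // Changesets are opaque to the redundancy service; the server may embed
    // whatever it wants in them (column names, ordering prefixes, group ids, ...).
    using Changeset = std::string;

    class RedundancyService {
    public:
      // Ordered: assigns the next global transaction ID and stores the change.
      uint64_t store(const Changeset &cs) {
        log_.push_back(cs);
        return log_.size();                 // global trx IDs are simply 1, 2, 3, ...
      }
      // Guaranteed to return changes in exactly the order they were stored,
      // starting after 'from_id'; usable for recovery, node joining, slaves.
      std::vector<Changeset> fetch_after(uint64_t from_id) const {
        if (from_id >= log_.size())
          return {};
        return std::vector<Changeset>(log_.begin() + static_cast<std::ptrdiff_t>(from_id),
                                      log_.end());
      }
    private:
      std::vector<Changeset> log_;          // stand-in for the durable, ordered log
    };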
We should not include the NDB internal replication in this discussion.
It was taken solely as an example of a real world use case where you may not have linearly ordered commits. Regards, Alex -- Alexey Yurchenko, Codership Oy, www.codership.com Skype: alexey.yurchenko, Phone: +358-400-516-011
Meta discussion first, replication discussion below :-)
On Mon, Mar 22, 2010 at 4:41 PM, Alex Yurchenko wrote:
Uh, I'm not sure I can accept this proposition. At least it seems contradictory to MariaDB's vision of being a practical, user and customer driven, database.
I do understand the desire to marry marketing to software design, but they are simply unrelated areas of human activity. "Computer science" is called "science" because there are real laws which no marketing genius can invalidate. So YMMV.
It is not marketing. Science can produce things with practical value, and things with little or no practical value. We want to produce things with practical value.
As I see it, for real world applications, you should always start with use cases.
I never suggested to implement a model without connection to use cases, and I believe I went to sufficient lengths to explain how proposed model can satisfy a broad range of use cases. What I was saying, that you're always programming a model, not use cases and therefore anything that you want to implement must be expressed in terms of the model.
This is true. Skipping the part where you create a model leads to chaos.
In this connection saying that you have a use case that does not need linearly ordered commits really means nothing. Either you need to propose another model, live with linearly ordered commits or drop the case. Either way it has no effect on the design of this model implementation, because linearly ordered commits IS the model. You cannot throw them out without breaking the rest of the concept. So much for the usefulness of use cases in high-level design: some of them fit, some of them don't.
I'm not sure about where Kristian is, but at least my participation is based on the assumption that we are still exploring the proposed model to see if we like it or whether we should modify it or have a different model. This assessment is based on asking what use cases are served well by the model.
I'm also a fan of abstract thinking though. Sometimes you can get great innovations from starting with a nice abstract model, and then ask yourself which real world problems it would (and would not) solve.
And that's exactly what I'm trying to do in this thread - start with a model, not use cases.
Either way, you end up with anchoring yourself in real world use cases.
Well, when you start with a model, it means that you use it as a reference stick to accept or reject use cases, doesn't it? So that makes the model an anchor. And leaves use cases only as means to see how practical the model is.
No, this is what I disagree with. You could propose a model that is sound in a theoretical sense, but useless in practice because it doesn't serve any use cases that real world users are interested in. So the use cases are the reference stick to accept or reject the model. But also the full set of use cases is not set in stone. We can decide that we like a model because it serves many use cases and then we reject the use cases not served by it.
And there is another curious property to models: the more abstract is the model (i.e. the less it is rooted in use cases), the more use cases it can satisfy. Once you stop designing specifically for asynchronous replication, you find out that the same scheme works for synchronous too.
True. Abstract thinking sure is a win, there's no question about that. But universities are also full of those scientists who produce little of practical value. I worked one year at HUT - it was the most relaxed job I ever had, there is no requirement to produce anything useful unless you really want to. My masters thesis contributes something new to the field of eLearning, that nobody had researched before. But if I had to explain the main points of it in a business world, I could do so in 60 seconds. The rest is just "scientific fluff". Good science has practical value (sometimes apparent only after decades). But not everything that happens in science is good science.
Back on track: So the API should of course implement something which has as broad applicability as possible. This is the whole point of questioning you, since now you have just suggested a model which happens to nicely satisfy Galera's needs :-)
Well, this may seem like it because Galera is the only explicit implementation of that model. But the truth is Galera is possible only because this model was explicitly followed. And this model didn't come out of thin air. It is a result of years of research and experience - not only ours.
Yes. The model certainly looks sound and promising, no question about that. I think the discussion is more about corner cases.
For example, MySQL|MariaDB is already implementing large portion of the proposed model by representing evolution of a database as a _series_ of atomic changes recorded in a binlog. In fact it had global transaction IDs from day one. They are just expressed in the way that makes sense only in the context of a given file on a given server. Had they been recognized as global transaction IDs, implementing a mapping from a file offset to an ordinal number is below trivial. Then we would not be having 3rd party patches applicable only to MySQL 5.0. (Let's face it, global transaction IDs in master-slave replication are so trivial they are practically built in.) The reason why there is no nice replication API in MariaDB yet is that this model was never explicitly recognized. And API is a description of a model. You cannot describe what you don't recognize ;)
Yes.
So in reality I am not proposing anything new or specific to Galera. I'm just suggesting to recognize what you already have there (and proposing the abstractions to express it).
And imho this joint effort is looking really promising all in all, since so many experts are exchanging their wisdom. (Not really counting myself here, although I've read many white papers about replication :-)
<cut>
So those are the requirements I could derive from having NDB use our to-be-implemented API. My conclusion from the above is that we should consider adding to the model the concept of a transaction group, which: -> the engine (or MariaDB server, for multi-engine transactions?) MAY provide information of which transactions had been committed within the same group. -> If such information was provided, a redundancy service MAY process transactions inside a group in parallel or out of order, but MUST make sure that all transactions in transaction group G1 are processed/committed before the first transaction in G2 is processed/comitted.
Well, that's a pretty cool concept. One way to call it is "controlled eventual consistency". But does redundancy service have to know about it?
If the redundancy service does not know about it, how would the information be transmitted by it??? For instance take the example of the binlog, which is a redundancy service in this model. If it supported this information (which it MAY do), it of course has to save it in some format in the binlog file.
First of all, these groups are just superpositions of individual atomic transactions. That is, this CAN be implemented on top of the current model.
Yes, this is the intent.
Secondly, transaction applying is done by the engine, so the engine or the server HAS to have a support for this, both on the master and on the slave side. So why not keep the redundancy service API free from that at all? Consider this scheme:
Database Server       |       Redundancy Service
(database data)       |       (redundancy information)
                      |
                Redundancy API
The task of redundancy service is to store and provide redundancy information that can be used in restoring the database to a desired state. Keeping the information and using it - two different things. The purpose of API is to separate one part of the program from the logic of another. So I'd keep the model and the API as simple and free from the server details as possible.
What it means here: redundancy service stores atomic database changes in a certain order and it guarantees that it will return these changes in the same order. This is sufficient to restore the database to any state it had. It is up to the server in what order it will apply these changes and if it wants to skip some states. (This assumes that the changesets are opaque to redundancy service and the server can include whatever information it wants in them, including ordering prefixes)
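Expressed as a rough interface sketch (invented names, not an actual proposed header), the contract described here is simply: opaque changesets in, the same changesets back out, in the same total order; everything else stays on the server side of the line.

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Changesets are opaque byte blobs to the redundancy service; the server
    // may embed whatever it wants in them, including ordering prefixes.
    struct Changeset {
        uint64_t seqno;                      // position in the global history
        std::vector<unsigned char> data;     // opaque payload
    };

    // Minimal contract: store changesets in a linear order and return them
    // in exactly that order.  How durability is achieved (binlog file, group
    // communication, ...) is an implementation detail of the service.
    class RedundancyService {
    public:
        virtual ~RedundancyService() = default;
        // Store a changeset and return the seqno assigned to it.
        virtual uint64_t store(const std::vector<unsigned char>& data) = 0;
        // Return up to max_count changesets with seqno greater than `after`,
        // strictly in seqno order.
        virtual std::vector<Changeset> fetch_after(uint64_t after,
                                                   std::size_t max_count) = 0;
    };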
Ok, this is an interesting distinction you make.

So in current MySQL/MariaDB, one place where transactions are applied to a replica is the slave SQL thread. Conceptually I've always thought of this as "part of replication code". You propose here that this should be a common module on the MariaDB server side of the API, rather than part of each redundancy service. I guess this may make sense.

This opens up a new field of questions related to the user interface of all this. Typically, or "how things are today", a user will initiate replication/redundancy related events from the side of the redundancy service. Eg if I want to setup mysql statement based replication, there is a set of commands to do that. If I want to recover the database by replaying the binlog file, there is a set of binlog specific tools to do that. Each redundancy service solves some problems from its own specific approach, and provides a user interface for those tasks. So I guess at some point it will be interesting to see what the command interface to all this will look like and whether I use something specific to the redundancy service or some general MariaDB command set to make replication happen.

At least the application of replicated transactions certainly should not be part of each storage engine. From the engine point of view, applying a set of replicated transactions should be "just another transaction". For the engine it should not matter if a transaction comes from the application, mysqldump, or a redundancy service. (There may be small details: when the application does a transaction, we need a new global txn id, but when applying a replicated transaction, the id is already there.)

henrik

--
email: henrik.ingo@avoinelama.fi
tel: +358-40-5697354
www: www.avoinelama.fi/~hingo
book: www.openlife.cc
Quoting Henrik Ingo
Meta discussion first, replication discussion below :-)
On Mon, Mar 22, 2010 at 4:41 PM, Alex Yurchenko
wrote: Uh, I'm not sure I can accept this proposition. At least it seems contradictory to MariaDB's vision of being a practical, user and customer driven, database.
I do understand the desire to marry marketing to software design, but they are simply unrelated areas of human activity. "Computer science" is called "science" because there are real laws which no marketing genius can invalidate. So YMMV.
It is not marketing. Science can produce things with practical value, and things with little or no practical value. We want to produce things with practical value.
As I see it, for real world applications, you should always start with the use cases.

I never suggested to implement a model without connection to use cases, and I believe I went to sufficient lengths to explain how the proposed model can satisfy a broad range of use cases. What I was saying is that you're always programming a model, not use cases, and therefore anything that you want to implement must be expressed in terms of the model.
This is true. Skipping the part where you create a model leads to chaos.
In this connection saying that you have a use case that does not need linearly ordered commits really means nothing. Either you need to propose another model, live with linearly ordered commits or drop the case. Either way it has no effect on the design of this model implementation, because linearly ordered commits IS the model. You cannot throw them out without breaking the rest of the concept. So much for the usefulness of use cases in high-level design: some of them fit, some of them don't.
I'm not sure about where Kristian is, but at least my participation is based on the assumption that we are still exploring the proposed model to see if we like it or whether we should modify it or have a different model. This assessment is based on asking what use cases are served well by the model.
I'm also a fan of abstract thinking though. Sometimes you can get great innovations from starting with a nice abstract model, and then ask yourself which real world problems it would (and would not) solve.
And that's exactly what I'm trying to do in this thread - start with a model, not use cases.
Either way, you end up with anchoring yourself in real world use cases.
Well, when you start with a model, it means that you use it as a reference stick to accept or reject use cases, doesn't it? So that makes the model an anchor. And leaves use cases only as means to see how practical the model is.
No, this is what I disagree with. You could propose a model that is sound in a theoretical sense, but useless in practice because it doesn't serve any use cases that real world users are interested in. So the use cases are the reference stick to accept or reject the model. But also the full set of use cases is not set in stone. We can decide that we like a model because it serves many use cases and then we reject the use cases not served by it.
And there is another curious property to models: the more abstract the model is (i.e. the less it is rooted in use cases), the more use cases it can satisfy. Once you stop designing specifically for asynchronous replication, you find out that the same scheme works for synchronous too.
True. Abstract thinking sure is a win, there's no question about that. But universities are also full of those scientists who produce little of practical value. I worked one year at HUT - it was the most relaxed job I ever had, there is no requirement to produce anything useful unless you really want to. My masters thesis contributes something new to the field of eLearning, that nobody had researched before. But if I had to explain the main points of it in a business world, I could do so in 60 seconds. The rest is just "scientific fluff".
Good science has practical value (sometimes apparent only after decades). But not everything that happens in science is good science.
Back on track: So the API should of course implement something which has as broad applicability as possible. This is the whole point of questioning you, since now you have just suggested a model which happens to nicely satisfy Galera's needs :-)
Well, this may seem like it because Galera is the only explicit implementation of that model. But the truth is Galera is possible only because this model was explicitly followed. And this model didn't come out of thin air. It is a result of years of research and experience - not only ours.
Yes. The model certainly looks sound and promising, no question about that. I think the discussion is more about corner cases.
For example, MySQL|MariaDB is already implementing large portion of the proposed model by representing evolution of a database as a _series_ of atomic changes recorded in a binlog. In fact it had global transaction IDs from day one. They are just expressed in the way that makes sense only in the context of a given file on a given server. Had they been recognized as global transaction IDs, implementing a mapping from a file offset to an ordinal number is below trivial. Then we would not be having 3rd party patches applicable only to MySQL 5.0. (Let's face it, global transaction IDs in master-slave replication are so trivial they are practically built in.) The reason why there is no nice replication API in MariaDB yet is that this model was never explicitly recognized. And API is a description of a model. You cannot describe what you don't recognize ;)
Yes.
So in reality I am not proposing anything new or specific to Galera. I'm just suggesting to recognize what you already have there (and proposing the abstractions to express it).
And imho this joint effort is looking really promising all in all, since so many experts are exchanging their wisdom. (Not really counting myself here, although I've read many white papers about replication :-)
<cut>
So those are the requirements I could derive from having NDB use our to-be-implemented API. My conclusion from the above is that we should consider adding to the model the concept of a transaction group, which:
-> the engine (or MariaDB server, for multi-engine transactions?) MAY provide information of which transactions had been committed within the same group.
-> If such information was provided, a redundancy service MAY process transactions inside a group in parallel or out of order, but MUST make sure that all transactions in transaction group G1 are processed/committed before the first transaction in G2 is processed/committed.
Well, that's a pretty cool concept. One way to call it is "controlled eventual consistency". But does redundancy service have to know about it?
If the redundancy service does not know about it, how would the information be transmitted by it??? For instance take the example of the binlog, which is a redundancy service in this model. If it supported this information (which it MAY do), it of course has to save it in some format in the binlog file.
First of all, these groups are just superpositions of individual atomic transactions. That is, this CAN be implemented on top of the current model.
Yes, this is the intent.
Secondly, transaction applying is done by the engine, so the engine or the server HAS to have a support for this, both on the master and on the slave side. So why not keep the redundancy service API free from that at all? Consider this scheme:
Database Server       |       Redundancy Service
(database data)       |       (redundancy information)
                      |
                Redundancy API
The task of redundancy service is to store and provide redundancy information that can be used in restoring the database to a desired state. Keeping the information and using it - two different things. The purpose of API is to separate one part of the program from the logic of another. So I'd keep the model and the API as simple and free from the server details as possible.
What it means here: redundancy service stores atomic database changes in a certain order and it guarantees that it will return these changes in the same order. This is sufficient to restore the database to any state it had. It is up to the server in what order it will apply these changes and if it wants to skip some states. (This assumes that the changesets are opaque to redundancy service and the server can include whatever information it wants in them, including ordering prefixes)
Ok, this is an interesting distinction you make.
So in current MySQL/MariaDB, one place where transactions are applied to a replica is the slave SQL thread. Conceptually I've always thought of this as "part of replication code". You propose here that this should be a common module on the MariaDB server side of the API, rather than part of each redundancy service. I guess this may make sense.
This opens up a new field of questions related to the user interface of all this. Typically, or "how things are today", a user will initiate replication/redundancy related events from the side of the redundancy service. Eg if I want to setup mysql statement based replication, there is a set of commands to do that. If I want to recover the database by replaying the binlog file, there is a set of binlog specific tools to do that. Each redundancy service solves some problems from its own specific approach, and provides a user interface for those tasks. So I guess at some point it will be interesting to see what the command interface to all this will look like and whether I use something specific to the redundancy service or some general MariaDB command set to make replication happen.
This replication model will eventually influence the user interface. So far, in the Galera project, we have postponed user interface changes for the future. Partly because our intention is to be transparent to native MySQL, and partly because we wanted to get end user requirements for the management first. For us, this MariaDB replication project comes just at the right time to lay the groundwork for replication management syntax.
At least the application of replicated transactions certainly should not be part of each storage engine. From the engine point of view, applying a set of replicated transactions should be "just another transaction". For the engine it should not matter if a transaction comes from the application, mysqldump, or a redundancy service. (There may be small details: when the application does a transaction, we need a new global txn id, but when applying a replicated transaction, the id is already there.)
Yes, but no. E.g. Galera replication has this strange need to use prioritized transactions for applying. The DBMS should have the responsibility to provide high-priority sessions for replication appliers.
On Tue, Mar 23, 2010 at 10:40 AM,
At least the application of replicated transactions certainly should not be part of each storage engine. From the engine point of view, applying a set of replicated transactions should be "just another transaction". For the engine it should not matter if a transaction comes from the application, mysqldump, or a redundancy service. (There may be small details: when the application does a transaction, we need a new global txn id, but when applying a replicated transaction, the id is already there.)
Yes, but no. E.g. Galera replication has this strange need to use prioritized transactions for applying. The DBMS should have the responsibility to provide high-priority sessions for replication appliers.
So Galera only works with the engines you have hacked to support it? (Which I fear is InnoDB only?)

Like I said, we can still list a set of details the engine needs to support, but in general of course we should be able to replicate from any table/engine into any table/engine without having "replication support" in each engine separately.

But on the face of it, your requirement does make sense for synchronous replication. (And for the traditional asynchronous replication, users generally want the lag to be as small as possible too.)

henrik

--
email: henrik.ingo@avoinelama.fi
tel: +358-40-5697354
www: www.avoinelama.fi/~hingo
book: www.openlife.cc
On Tue, 23 Mar 2010 13:03:34 +0200, Henrik Ingo
On Tue, Mar 23, 2010 at 10:40 AM,
wrote: At least the application of replicated transactions certainly should not be part of each storage engine. From the engine point of view, applying a set of replicated transactions should be "just another transaction". For the engine it should not matter if a transaction comes from the application, mysqldump, or a redundancy service. (There may be small details: when the application does a transaction, we need a new global txn id, but when applying a replicated transaction, the id is already there.)
Yes, but no. E.g. Galera replication has this strange need to use prioritized transactions for applying. The DBMS should have the responsibility to provide high-priority sessions for replication appliers.
So Galera only works with the engines you have hacked to support it? (Which I fear is InnoDB only?)
Like I said, we can still list a set of details the engine needs to support, but in general of course we should be able to replicate from any table/engine into any table/engine without having "replication support" in each engine separately.
But on the face of it, your requirement does make sense for synchronous replication. (And for the traditional asynchronous replication, users generally want the lag to be as small as possible too.)
henrik
High-priority here means that this transaction can abort any other that stands in its way. Actually this is a requirement only to have Galera working in multi-master mode. And it does not look like an engine requirement, rather a lock manager feature (rather useful though - on slave it guarantees that transactions received from master will never deadlock with local transactions).

So I think the server may or may not provide such a feature, it should not really affect the redundancy API. I'd say that it belongs to the integration layer.

Alex
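For what it's worth, the lock-manager rule described above could be expressed roughly like this (hypothetical types, purely to make the behaviour concrete):

    // Sketch of the conflict rule: when a replication applier ("high
    // priority") requests a lock held by a local transaction, the local
    // transaction is told to roll back instead of making the applier wait
    // or risking a deadlock with it.
    struct Txn {
        bool high_priority;       // true for transactions applied by replication
        bool must_abort = false;  // set when chosen as the victim
    };

    // Called by the lock manager when `requester` wants a lock that
    // `holder` currently owns.  Returns true if the holder is victimized.
    inline bool resolve_conflict(Txn& requester, Txn& holder) {
        if (requester.high_priority && !holder.high_priority) {
            holder.must_abort = true;   // brute-force: abort the local txn
            return true;                // requester will get the lock
        }
        return false;                   // normal waiting / deadlock detection
    }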
On Tue, 23 Mar 2010 10:12:53 +0200, Henrik Ingo
Meta discussion first, replication discussion below :-)
<cut>
I guess we can consider meta-discussion closed for now unless someone wants to add to it. I'm content ;)
So those are the requirements I could derive from having NDB use our to-be-implemented API. My conclusion from the above is that we should consider adding to the model the concept of a transaction group, which:
-> the engine (or MariaDB server, for multi-engine transactions?) MAY provide information of which transactions had been committed within the same group.
-> If such information was provided, a redundancy service MAY process transactions inside a group in parallel or out of order, but MUST make sure that all transactions in transaction group G1 are processed/committed before the first transaction in G2 is processed/committed.
Well, that's a pretty cool concept. One way to call it is "controlled eventual consistency". But does redundancy service have to know about it?
If the redundancy service does not know about it, how would the information be transmitted by it??? For instance take the example of the binlog, which is a redundancy service in this model. If it supported this information (which it MAY do), it of course has to save it in some format in the binlog file.
First of all, these groups are just superpositions of individual atomic transactions. That is, this CAN be implemented on top of the current model.
Yes, this is the intent.
Secondly, transaction applying is done by the engine, so the engine or the server HAS to have a support for this, both on the master and on the slave side. So why not keep the redundancy service API free from that at all? Consider this scheme:
Database Server       |       Redundancy Service
(database data)       |       (redundancy information)
                      |
                Redundancy API
The task of redundancy service is to store and provide redundancy information that can be used in restoring the database to a desired state. Keeping the information and using it - two different things. The purpose of API is to separate one part of the program from the logic of another. So I'd keep the model and the API as simple and free from the server details as possible.

What it means here: redundancy service stores atomic database changes in a certain order and it guarantees that it will return these changes in the same order. This is sufficient to restore the database to any state it had. It is up to the server in what order it will apply these changes and if it wants to skip some states. (This assumes that the changesets are opaque to redundancy service and the server can include whatever information it wants in them, including ordering prefixes)
Ok, this is an interesting distinction you make.
So in current MySQL/MariaDB, one place where transactions are applied to a replica is the slave SQL thread. Conceptually I've always thought of this as "part of replication code". You propose here that this should be a common module on the MariaDB server side of the API, rather than part of each redundancy service.
Yes.
I guess this may make sense.
Well, it is of course a matter of debate, but not all of the redundancy-related code has to be encompassed by the redundancy API. The main purpose of an API is to hide implementation details, and it goes both ways: we want to hide the redundancy details from the server, and likewise we want to hide the server details from the redundancy service. Thus flexibility and maintainability are achieved. And the thinner the API, the better. That is one of the reasons for identifying the model - this is the best way to see what this API should contain. To put it another way, there are APIs and there is integration code that holds them together. Like, for example, the code that we exchanged with Kristian.
This opens up a new field of questions related to the user interface of all this. Typically, or "how things are today", a user will initiate replication/redundancy related events from the side of the redundancy service. Eg if I want to setup mysql statement based replication, there is a set of commands to do that. If I want to recover the database by replaying the binlog file, there is a set of binlog specific tools to do that. Each redundancy service solves some problems from its own specific approach, and provides a user interface for those tasks. So I guess at some point it will be interesting to see what the command interface to all this will look like and whether I use something specific to the redundancy service or some general MariaDB command set to make replication happen.
It does not so much depend on where you draw the API line, but more on what aspects of the model you want to expose to the user. Most probably - all. Thus we'll need the ability to create a replication set, add plugins to its stack (perhaps first create the stack) and configure individual plugin instances.

Setting variables is definitely not enough for that, so you'll need either a special set of commands, something along the GRANT line, or, considering that replication configuration tends to be highly structured and you'll keep it in the tables, a special (don't laugh yet) storage engine where you will be able to modify table contents using regular SQL, and this engine will in turn call corresponding API calls. I think there could be a number of benefits in such an arrangement, although I'm not sure about performance.
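As a purely illustrative sketch of that last idea (all names invented), such a "configuration engine" would not store rows at all; a row written through SQL would be translated into a call on the corresponding plugin instance:

    #include <map>
    #include <string>

    // Hypothetical configuration hook exposed by a replication plugin.
    struct ReplicationPlugin {
        virtual ~ReplicationPlugin() = default;
        virtual bool set_option(const std::string& name,
                                const std::string& value) = 0;
    };

    // Sketch: an INSERT/UPDATE of a row (plugin_name, option, value) on a
    // table in this engine becomes a configuration call on the plugin,
    // instead of being stored as data.
    class ConfigEngine {
    public:
        void register_plugin(const std::string& name, ReplicationPlugin* p) {
            plugins_[name] = p;
        }
        bool write_row(const std::string& plugin, const std::string& option,
                       const std::string& value) {
            auto it = plugins_.find(plugin);
            return it != plugins_.end() && it->second->set_option(option, value);
        }
    private:
        std::map<std::string, ReplicationPlugin*> plugins_;
    };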
At least the application of replicated transactions certainly should not be part of each storage engine. From the engine point of view, applying a set of replicated transactions should be "just another transaction". For the engine it should not matter if a transaction comes from the application, mysqldump, or a redundancy service. (There may be small details: when the application does a transaction, we need a new global txn id, but when applying a replicated transaction, the id is already there.)
Certainly. I think this goes without question. What I meant back there was that either the engine or the server should be capable of parallel (out-of-order is interesting only if it is parallel, right?) applying, and for the purposes of recovery it will be no longer enough for the engine to just maintain the last committed transaction ID, it'll have to keep the list of uncommitted transactions from the last group.

--
Alexey Yurchenko, Codership Oy, www.codership.com
Skype: alexey.yurchenko, Phone: +358-400-516-011
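A rough sketch (invented names, not an actual MariaDB structure) of the extra recovery bookkeeping described in the previous message: instead of a single last-committed ID, the engine tracks the last fully committed group plus the members of the partially committed one.

    #include <cstddef>
    #include <cstdint>
    #include <set>

    // Recovery state when transactions within a group may commit in parallel.
    struct ApplyRecoveryState {
        uint64_t last_complete_group = 0;            // fully committed groups
        std::set<uint64_t> committed_in_current_group;  // partial next group

        // Record a commit from the group following last_complete_group.
        // group_size is the total number of transactions in that group.
        void record_commit(uint64_t seqno, std::size_t group_size) {
            committed_in_current_group.insert(seqno);
            if (committed_in_current_group.size() == group_size) {
                ++last_complete_group;               // the group is now complete
                committed_in_current_group.clear();
            }
        }
    };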
+1 on a replication heartbeat w/configurable heartbeat time. We'd want to issue a heartbeat pulse every 1 sec in our environment to measure *true* replication latency.
Jeremy
Hi Jeremy,
Thanks for the input. I'm not sure what you mean with true replication latency here.

Anyway, the replication system can internally measure latencies from the point where the replication event was passed for replication until it was received/applied in the receiving end. And these latency measurements can be averaged and visualized for the end user, e.g. through status variables.

Or is there something more you are looking for...?
Cheers, Seppo
--
http://www.codership.com seppo.jaakola@codership.com
skype: seppo_jaakola
If the connection between a slave and master is interrupted, the slave won't report itself as being "behind" until the slave's network timeout to the master expires and it reconnects (assuming it can).
Jeremy
On Mon, 25 Jan 2010 10:47:23 -0800, Jeremy Zawodny
If the connection between a slave and master is interrupted, the slave won't report itself as being "behind" until the slave's network timeout to the master expires and it reconnects (assuming it can).
Jeremy
Hi Jeremy,

Does it mean that you want to detect network outages using the heartbeat? If so, won't it be easier to just set the master-slave connection timeout to the desirable value and let the replication system handle it? I'm asking this not from the position of improving current replication in the short term, but from the position of whether a heartbeat function is needed in the replication API?

Thanks,
Alex
-- Alexey Yurchenko, Codership Oy, www.codership.com Skype: alexey.yurchenko, Phone: +358-400-516-011
On Mon, Jan 25, 2010 at 11:47 AM, Alex Yurchenko < alexey.yurchenko@codership.com> wrote:
On Mon, 25 Jan 2010 10:47:23 -0800, Jeremy Zawodny
wrote: If the connection between a slave and master is interrupted, the slave won't report itself as being "behind" until the slave's network timeout to the master expires and it reconnects (assuming it can).
Jeremy
Hi Jeremy,
Does it mean that you want to detect network outages using the heartbeat? If so, won't it be easier to just set the master-slave connection timeout to the desirable value and let the replication system handle it? I'm asking this not from the position of improving current replication in the short term, but from the position of whether a heartbeat function is needed in the replication API?
Not at all.

My point is that you cannot (today) rely on MySQL slaves to "self report" in all situations. With a heartbeat that the master sends every N seconds and the slave expects to see every N seconds, it's possible to see latency in the replication stream regardless of the cause--even during weird network outages that the slave won't notice until the timeout is reached.

Does that make more sense?

Jeremy
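To make the mechanism concrete, a minimal sketch (invented names; a real implementation would be a protocol event and would have to account for clock skew between master and slave):

    #include <chrono>
    #include <cstdint>

    // Master emits a timestamped heartbeat every N seconds; the slave
    // compares the timestamp against its own clock when it applies the event.
    struct HeartbeatEvent {
        int64_t sent_unix_micros;   // master clock at send time
    };

    inline int64_t now_unix_micros() {
        using namespace std::chrono;
        return duration_cast<microseconds>(
                   system_clock::now().time_since_epoch()).count();
    }

    // Slave side: latency observed for one heartbeat, in microseconds.
    inline int64_t heartbeat_latency(const HeartbeatEvent& hb) {
        return now_unix_micros() - hb.sent_unix_micros;
    }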
On Mon, 25 Jan 2010 11:51:24 -0800, Jeremy Zawodny
On Mon, Jan 25, 2010 at 11:47 AM, Alex Yurchenko < alexey.yurchenko@codership.com> wrote:
On Mon, 25 Jan 2010 10:47:23 -0800, Jeremy Zawodny
wrote: If the connection between a slave and master is interrupted, the slave won't report itself as being "behind" until the slave's network timeout to the master expires and it reconnects (assuming it can).
Jeremy
Hi Jeremy,
Does it mean that you want to detect network outages using the heartbeat? If so, won't it be easier to just set the master-slave connection timeout to the desirable value and let the replication system handle it? I'm asking this not from the position of improving current replication in the short term, but from the position of whether a heartbeat function is needed in the replication API?
Not at all.
My point is that you cannot (today) rely on MySQL slaves to "self report" in all situations. With a heartbeat that the master sends every N seconds and the slave expects to see every N seconds, it's possible to see latency in the replication stream regardless of the cause--even during weird network outages that the slave won't notice until the timeout is reached.
Does that make more sense?
Perhaps, but not for me really ;). Actually I still don't understand what exactly people mean by "heartbeat" here. According to http://dev.mysql.com/doc/refman/5.5/en/change-master-to.html it is simply a transport keepalive. It does not even measure latency. To me it is not a "feature" actually, it is a bugfix.

I realize that you are looking for a solution to the present day situation. What I'm trying to figure out is whether it should be a part of the replication API. So let me ask it differently. Suppose you have a replication system that provides a cluster membership service (members come and go) and provides you with a latency measurement (averaged over a specified window). Would you still need an explicit heartbeat "feature"?

Regards,
Alex

--
Alexey Yurchenko, Codership Oy, www.codership.com
Skype: alexey.yurchenko, Phone: +358-400-516-011
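A small sketch of what "latency averaged over a specified window" could mean in code (names invented for illustration): keep the last N samples and expose their mean, for example through a status variable.

    #include <cstddef>
    #include <cstdint>
    #include <deque>
    #include <numeric>

    // Rolling average of the most recent latency samples.
    class LatencyWindow {
    public:
        explicit LatencyWindow(std::size_t window) : window_(window) {}
        void add(int64_t latency_micros) {
            samples_.push_back(latency_micros);
            if (samples_.size() > window_) samples_.pop_front();
        }
        double average() const {
            if (samples_.empty()) return 0.0;
            return std::accumulate(samples_.begin(), samples_.end(), 0.0) /
                   samples_.size();
        }
    private:
        std::size_t window_;
        std::deque<int64_t> samples_;
    };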
Yup; that's the ideal way for WAN based replication. But for LAN, there is no point other than adding extra overhead.

--
Thanks
Venu
Sent from iPhone
On Jan 25, 2010, at 11:47 AM, Alex Yurchenko wrote: On Mon, 25 Jan 2010 10:47:23 -0800, Jeremy Zawodny
If the connection between a slave and master is interrupted, the
slave
won't
report itself as being "behind" until the slave's network timeout
to the
master expires and it reconnect (assuming it can). Jeremy Hi Jeremy, Does it mean that you want to detect network outages using the
heartbeat?
If so, won't it be easier to just set master-slave connection
timeout to
the desirable value and let replication system to handle it? I'm
asking
this not from position of improving current replication in short
term, but
from position of whether a heartbeat function is needed in
replication API? Thanks,
Alex On Mon, Jan 25, 2010 at 10:44 AM, Hi Jeremy, Thanks for the input. I'm not sure what you mean with true
replication
latency here.
Anyway, the replication system can internally measure latencies
from the point where the replication event was passed for
replication
until it was received/applied in the receiving end. And these
latency
measurements can be averaged and visualized for end user .e.g.
through
status variables. Or is there something more you are looking for...? Cheers, Seppo --
http://www.codership.com seppo.jaakola@codership.com
skype: seppo_jaakola Quoting Jeremy Zawodny issue a heartbeat pules every 1 sec in our environment to measure
*true*
replication latency. Jeremy On Fri, Jan 22, 2010 at 6:21 AM, Kristian Nielsen
The three companies Continuent, Codership, and Monty Program are
planning to
start working on some enhancements to the replication system in
MariaDB,
together with anyone interested in joining in. At this stage, there are no fixed directions for the project,
and to
do
this
in as open a way possible with the maximum community involvement
and
interest,
we agreed to start with an email discussion on the
maria-developers@mailing
list. So consider it started! The plan so far is: 1) The parties taking this initiative, MP, Continuent, and
Codership,
present
their own ideas in this thread on maria-developers@ (and everyone
else
who
wants to chime in at this stage). 2) Once we have some concrete suggestions as a starting point,
we use
this
to
reach out in a broader way with some blog posts on planetmysql /
planetmariadb
to encourage further input and discussions for possible
directions of
the
project. Eventually we want to end up with a list of the most
important
goals
and a possible roadmap for replication enhancements. (It is best to have something concrete as a basis of a broad
community
discussion/process). To start of, here are some points of interest that I collected.
Everyone
please chime in with your own additional points, as well as
comments
and
further details on these one. Three areas in particular seem to be of high interest in the
community
currently (not excluding any other areas): - High Availability
--
Alexey Yurchenko,
Codership Oy, www.codership.com
Skype: alexey.yurchenko, Phone: +358-400-516-011
On Jan 25, 2010, at 12:44 PM, Jeremy Zawodny wrote:
+1 on a replication heartbeat w/configurable heartbeat time. We'd want to issue a heartbeat pulse every 1 sec in our environment to measure *true* replication latency.
Jeremy
This is in MySQL 5.5 already. See MASTER_HEARTBEAT_PERIOD at: http://dev.mysql.com/doc/refman/5.5/en/change-master-to.html
Regards,
Harrison
--
Harrison C. Fisk, MySQL Staff Support Engineer
MySQL @ Sun Microsystems, Inc., http://www.sun.com/mysql/
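For reference, a one-second heartbeat is configured on the slave roughly like this (a minimal sketch; other CHANGE MASTER options omitted):

  STOP SLAVE;
  CHANGE MASTER TO MASTER_HEARTBEAT_PERIOD = 1;  -- seconds; fractional values down to ms are allowed
  START SLAVE;
  -- heartbeats received so far can then be watched with:
  SHOW STATUS LIKE 'Slave_received_heartbeats';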
Hi Kristian,
Thanks for kicking this thread off. I have had a bit of a busy week so it
has taken a while to get around summarizing Continuent thoughts on
improvements.
First of all, we Continuent Tungsten folk have a certain set of problems we
solve with replication. Here are the key use cases:
1. Making full DBMS copies for high availability and to scale reads
(traditional MySQL use pattern). For this we need simple, fast replication
of all databases as well as the ability to provision efficiently from full
database copies.
2. Supporting scalable data topologies for SaaS and other complex database
architectures that have multiple services generating and receiving data.
This can include patterns like:
* Mastering data in two locations and replicating into a single server
* Multi-master replication across sites
* Replicating different sets of data from one master to different slaves
3. Replicating heterogeneously between MySQL and other databases like Oracle. This requires the ability to filter and transform data easily. Another use case of heterogeneous replication is copying across databases of the same type for application upgrades and migration between database versions.
4. Ensuring full data protection such that data, once committed, are not
lost or corrupted. This includes replicating [semi-]synchronously to
slaves, performing consistency checks on data, performing point-in-time
restoration of data (e.g., using backups + a change log), etc.
We can do these things for the most part with MySQL replication as it
currently stands by reading binlogs and taking things from there. (That's
how Tungsten works). Here are some features that would make it easier to
work with the existing replication implementation:
1.) Synchronous replication. It's enough if replication slaves can hold up
commit notices on the master. The MySQL 5.5 features look like a good start
but I have not started the implementation and have therefore not hit the
sharp corners.
2.) CRCs. CRCs and other built-in features like native consistency checks
seem like the most glaring omission in the current binlog implementation.
It's difficult to ensure correct operation without them, and indeed I'm
confident many MySQL bugs can be traced back to the lack of features to make
data corruption more easily visible.
3.) Self-contained transactions with adequate metadata. Row replication
does not supply column names or keys which makes SQL generation difficult
especially when there are differences in master and slave schema. Also,
session variables like FOREIGN_KEY_CHECKS affect DDL but are not embedded in
the transaction where they are used. Finally, character set support is a
little scary based on my one experience in that area. You have to read code
to get master lists of character sets; semantics are very unclear.
4.) Global transaction IDs. We can put these in ourselves but it would be even better if MySQL generated them in the logs so we can just use the binlog directly. Unlike Galera, we don't require sequence numbers to be without gaps; it's enough if they are comparable and monotonically increasing. (I don't even think forbidding gaps is a good idea but Alexey and I need to discuss that over a beer in Helsinki.) Also, transaction IDs need an unambiguous source ID or epoch number encoded in the ID so that you can detect diverging serialization histories. This is a nasty little problem that can lead to big accidents in the field.
5.) Simple-to-read formats. The binlog structure has grown pretty hairy
over the years. I would love to have protobuf messages like what Drizzle is
doing.
6.) Fewer bugs and weirdnesses. Handling of commits in the binlog has switched back and forth over the years. Mark Callaghan has documented some of the strange XA behavior and persistent holes in the protocol in this area. Statement replication support has led to some kludges that make you feel faint when looking at them. (Example: temp tables, which basically don't work.)
In fact, you could summarize 2-6 as making the binlog (whether written to disk or not) into a consistent "database" that you can move elsewhere and apply without having to add extra metadata, such as global IDs or table column names. Currently we have to regenerate the log, which is a huge waste of resources, and also have to depend on external information to derive schema definitions.
Finally, since there is already talk about rewriting replication from
scratch, I would like to point out for the sake of discussion a few things
that the current MySQL replication in my opinion does well. Any future
system must match them.
1.) Serialized transactions. MySQL provides a list of transactions in serial order. All robust replication solutions need this, and MySQL essentially gives it to you, modulo problems with XA and ill-behaved table types like MyISAM that don't have guaranteed durability.
2.) Fast. MySQL replication really rips as long as you don't have slow statements that block application on slaves and don't hit problems like the infamous InnoDB broken group commit bug (#13669) reported by Peter Zaitsev.
3.) Network protocol. The network protocol for slaves is an excellent
feature. It means you can fetch events and build new forms of replication
without adding native code, which is a huge productivity boost.
4.) Robust. There is no lack of problems with MySQL replication but
realistically any new implementations will have a high bar to function
equally well. Plugin approaches like that used by Drizzle are very flexible
but they also tend to have a kick-the-can-down-the-road effect in that it's
up to plugins to provide a robust implementation. This in turn takes a long
time to do well unless plugins cut down the problem size, for example by
omitting statement replication.
5.) Works now. Starting from something that already works is sometimes the
best approach. Radically new approaches take a long time to pay off and
more often than not fail to see the light of day.
One final suggestion I would propose is that it might make sense to take a
two-pronged approach to replication as follows:
* Logical replication based on an enhanced form of today's MySQL replication
with substantial clean-up of existing code, simplification/enhancement of
binlog event formats, and other features that we can readily agree upon in
short order.
* Physical replication based on change-sets that can be applied at the level
of storage engine recovery logs. (Databases like MyISAM with no logs don't
get to play.) PostgreSQL has this now and it's really great. This might be
where to fit in the Galera ideas. Since this approach would be completely
new it could also take a more radical line without wrecking what is already
there. It would also fill a gap in the MySQL architecture--while there has
been logical replication for many years, there is no physical replication.
Anyway, I look forward to further discussion.
Cheers, Robert
(On behalf of Continuent and Tungsten)
Thanks Robert, this is comprehensive enough :)
I'll just address the consistency checking requirement here, as I believe this is quite a widely accepted goal as well.
Tungsten uses a special consistency table for passing consistency checking information, which is treated in a special way by replication. This kind of consistency checking can be managed by replication system configuration and does not necessarily require any new calls in the replication API (except the configuration, which must be there anyway).
Tungsten consistency checking technology works very well, and there
is no need to "fix it" in any way. However, this method is not directly
usable for multi master replication, because the target node(s) may
have committed some transactions not yet seen in the originating
master node, and these new transactions can interfere with the
consistency check query.
To support multi-master topologies, the replication API could support consistency checking directly. At first look, it seems that (at least) two calls would be needed for it:
* replication_prepare_consistency_check(), to start a checking transaction with a consistent snapshot.
* replication_check_consistency(query, result), to run the wanted SELECT query in the prepared transaction and compare the results on all nodes.
But how to trigger the consistency checking? That is a question that does not touch the replication API directly, but affects the SQL layer. There could be a totally new consistency check command, or SELECT could have a new option for triggering a consistency check. The consistency call could also be passed as a comment (for SELECT), like:
SELECT MD5(...) FROM ...; /* replication-consistency-check */
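Spelled out a bit (the table and columns here are just made-up examples), the check that each node would run inside the prepared snapshot transaction might look like:

  START TRANSACTION WITH CONSISTENT SNAPSHOT;
  -- deterministic per-node checksum; the resulting value is compared across nodes
  SELECT MD5(GROUP_CONCAT(CONCAT_WS('#', id, name) ORDER BY id SEPARATOR ',')) AS chk
    FROM example_table;
  COMMIT;
  -- (sketch only: GROUP_CONCAT is limited by group_concat_max_len, so a real
  --  implementation would checksum in ranges/chunks)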
But I don't want to dive any deeper into this specification at this point. I just wanted to bring to attention the possibility of opening up consistency checking in the replication API and SQL layers, and that multi-master consistency checking requires special attention.
Cheers, Seppo
--
http://www.codership.com seppo.jaakola@codership.com
Hi Seppo,
On 1/29/10 4:36 AM PST, "seppo.jaakola@codership.com" wrote:
Tungsten consistency checking technology works very well, and there is no need to "fix it" in any way. However, this method is not directly usable for multi master replication, because the target node(s) may have committed some transactions not yet seen in the originating master node, and these new transactions can interfere with the consistency check query.
I'm sure I should really know this already, but does your statement amount to saying the consistency check needs to execute in total order with other change sets? Or are there other data visibility issues along the lines of non-repeatable reads that would cause you to get different answers on different servers?
Hi Robert,
I'm sure I should really know this already, but does your statement amount to saying the consistency check needs to execute in total order with other change sets? Or are there other data visibility issues along the lines of non-repeatable reads that would cause you to get different answers on different servers?
Consistency checking needs to be run in the same read view on all nodes (this works only for MVCC engines). We can use the total order monitor to control the start of the checking transactions; the actual check query can then be relaxed from total order, which is actually crucial for this to work efficiently, as total order operations are very expensive.
Seppo
--
http://www.codership.com seppo.jaakola@codership.com
Hi Seppo,
Thanks, that was my assumption as well, but life tends to be a little more complex than theory.
Cheers, Robert
Robert Hodges writes:
First of all, we Continuent Tungsten folk have a certain set of problems we solve with replication. Here are the key use cases:
3. Replicating heterogeneously between MySQL and other database like Oracle. This requires the ability to filter and transform data easily. Another use case of heterogeneous replication is copying across databases of the same for application upgrades and migration between database versions.
Yes, this is quite interesting, and somewhat different from normal MySQL->MySQL replication.
4. Ensuring full data protection such that data, once committed, are not lost or corrupted. This includes replicating [semi-]synchronously to slaves, performing consistency checks on data, performing point-in-time restoration of data (e.g., using backups + a change log), etc.
And also reliable crash recovery, which I think is not there in 5.1, and in 5.5 is implemented in a way that I fear comes with too high a performance cost for many of the applications that need it the most (too many fsync()s).
how Tungsten works). Here are some features that would make it easier to work with the existing replication implementation:
1.) Synchronous replication. It's enough if replication slaves can hold up commit notices on the master. The MySQL 5.5 features look like a good start but I have not started the implementation and have therefore not hit the sharp corners.
As I understand it, synchronous replication based on current 5.5 features would first start commit on master, then send binlog to slave, then run and commit transaction on slave, then finish commit on master. So transactions-per-second rate would be quite limited. But of course there are many applications where load is light and this would be useful.
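(For reference, the 5.5 semi-sync setup that the above flow refers to is enabled roughly as follows; the timeout value is just an example:)

  -- on the master:
  INSTALL PLUGIN rpl_semi_sync_master SONAME 'semisync_master.so';
  SET GLOBAL rpl_semi_sync_master_enabled = 1;
  SET GLOBAL rpl_semi_sync_master_timeout = 1000;  -- ms to wait before falling back to async
  -- on the slave:
  INSTALL PLUGIN rpl_semi_sync_slave SONAME 'semisync_slave.so';
  SET GLOBAL rpl_semi_sync_slave_enabled = 1;
  STOP SLAVE IO_THREAD; START SLAVE IO_THREAD;  -- reconnect so the setting takes effect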
2.) CRCs. CRCs and other built-in features like native consistency checks seem like the most glaring omission in the current binlog implementation. It's difficult to ensure correct operation without them, and indeed I'm confident many MySQL bugs can be traced back to the lack of features to make data corruption more easily visible.
Yes. (If there were a good SQL-level way of matching binlog position/transaction ID with MVCC snapshot version, consistency checks on tables could be implemented very well and flexibly from outside the server and replication framework. This would be a very nice feature to have. Though not without problems to implement...)
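(The closest one can get today is the mysqldump-style sequence below, but it needs a global read lock, which is exactly what one would want to avoid:)

  FLUSH TABLES WITH READ LOCK;
  SHOW MASTER STATUS;                          -- note binlog file and position
  START TRANSACTION WITH CONSISTENT SNAPSHOT;  -- snapshot now corresponds to that position
  UNLOCK TABLES;
  -- table checksums run in this transaction see the data exactly as of the
  -- recorded binlog position, and can be compared against a slave that has
  -- applied up to the same position
  COMMIT;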
3.) Self-contained transactions with adequate metadata. Row replication does not supply column names or keys which makes SQL generation difficult especially when there are differences in master and slave schema. Also,
Right... but do you suggest putting the entire table definition of every table into every transaction? Sounds a bit bloated perhaps? The row-level binlogging in MySQL is based on column index, not column name. But I understand that the ability to generate SQL (which is based on column name rather than index) would be nice.
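(To illustrate with a made-up table: a row event today carries values by position only, so a heterogeneous applier must get the column names from somewhere else before it can emit SQL:)

  CREATE TABLE customers (id INT PRIMARY KEY, name VARCHAR(50), email VARCHAR(100));
  INSERT INTO customers VALUES (42, 'Alice', 'alice@example.com');
  -- the row event logs roughly (42, 'Alice', 'alice@example.com') keyed by
  -- column index; to regenerate a statement such as
  --   INSERT INTO customers (id, name, email) VALUES (42, 'Alice', 'alice@example.com');
  -- for another DBMS, the applier must already know the names (and keys) from
  -- an external copy of the schema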
session variables like FOREIGN_KEY_CHECKS affect DDL but are not embedded in the transaction where they are used. Finally, character set support is a
Can you elaborate? Wouldn't this also cause bugs in MySQL replication itself?
little scary based on my one experience in that area. You have to read code to get master lists of character sets; semantics are very unclear.
What are the issues with character set?
and I need to discuss that over a beer in Helsinki.) Also, transactions IDs need an unambiguous source ID or epoch number encoded in the ID so that you can detect diverging serialization histories. This nasty little problem that can lead to big accidents in the field.
Can you elaborate? I don't understand exactly what a "source ID" or "epoch number" would be. Can you give an example?
In fact, you could summarize 2-6 as making the binlog (whether written to disk or not) into a consistent "database" that you can move elsewhere and apply without having to add extra metadata, such as global IDs or table column names. Currently we have to regenerate the log which is a huge waste of resources and also have to depend external information to derive schema definitions.
Finally, since there is already talk about rewriting replication from scratch, I would like to point out for the sake of discussion a few things that the current MySQL replication in my opinion does well. Any future system must match them.
4.) Robust. There is no lack of problems with MySQL replication but realistically any new implementations will have a high bar to function equally well. Plugin approaches like that used by Drizzle are very flexible but they also tend to have a kick-the-can-down-the-road effect in that it's up to plugins to provide a robust implementation. This in turn takes a long time to do well unless plugins cut down the problem size, for example by omitting statement replication.
Yes. I think this is a very good point. Eg. many of your points could be answered merely by making MySQL binlogging pluggable and letting Tungsten (and everyone else) just implement their own logging to fit their particular purpose. But there is also a lot to be said for providing a single really useful binlog implementation. It does not sound appealing for users to have 3 or 4 different binlogs on their systems, each supporting a particular plugin (we already have two with the engine's internal transactional log, which is arguably one too many).
2.) Fast. MySQL replication really rips as long as you don't have slow statements that block application on slaves or don't hit problems like the infamous InnoDB broken group commit bug (#13669) reported by Peter Zaitsev.
Well, this may be true for in-memory working sets. But if you have a larger system that does not fit in main memory and is bottlenecked by the performance of the disk system, the single-threaded slave really hurts. It makes it really hard to scale up on the disk I/O on the slave. Everybody who is into larger systems seems to mention this.
* Logical replication based on an enhanced form of today's MySQL replication with substantial clean-up of existing code, simplification/enhancement of binlog event formats, and other features that we can readily agree upon in short order.
Yes. So one could imagine making replication pluggable, and moving the existing MySQL binlog into a plugin for backwards compatibility. Then we could write another plugin with an enhanced, not backward compatible binlog containing these enhancements in a more extensible format (eg. column names in transactions would hurt no-one if they could be easily switched on or off). (Not sure if the current MySQL replication could be suitably extended without a separate plugin, but if so then so much the better).
- Kristian.
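(Purely as a sketch of what that could look like for a user, reusing the existing plugin mechanisms; none of these plugin or variable names exist today:)

  INSTALL PLUGIN binlog_extended SONAME 'binlog_extended.so';  -- hypothetical plugin
  SET GLOBAL binlog_writer = 'binlog_extended';                -- hypothetical: choose the binlog implementation
  SET GLOBAL binlog_extended_column_names = ON;                -- hypothetical: column names in row events on/off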
Robert Hodges writes:
In fact, you could summarize 2-6 as making the binlog (whether written to disk or not) into a consistent "database" that you can move elsewhere and apply without having to add extra metadata, such as global IDs or table column names. Currently we have to regenerate the log which is a huge waste of resources and also have to depend external information to derive schema definitions.
So how should we go about making such improvements to the MySQL binlog, seen from the point of view of Tungsten? I am thinking about the overall approach. What would be possible ways forward for Tungsten?
1. If we make the binlog writer pluggable, would Tungsten want to implement its own plugin, with its own binlog format specially tailored for its needs?
2. Or would the ideal case be that MariaDB created a new standard binlog format, based on Google protobuffers or whatever, which would be used by standard replication, Tungsten, and others?
3. Or would it be better for Tungsten just to keep the current binlog format, merely adding some smaller enhancements needed in Tungsten (as discussed earlier in this thread)?
One issue is how much benefit will come from a new binlog format, weighed against how much effort will be needed on the part of Tungsten to implement, support, and maintain the new code. Does Tungsten foresee implementing a new input binlog format for Tungsten and/or a new binlog plugin for the server? Or is it more realistic to work on improving the current code?
The other issue is how important compatibility with unmodified Sun/Oracle MySQL is. Will it be a requirement to keep Tungsten compatible with "pure" MySQL? Or will it be acceptable to require patches to be installed for full functionality? Or does Tungsten already today require a custom server installation, so that being based on MariaDB in the future is not a problem, as long as the server is compatible with MySQL from a user perspective?
These questions are of course important to plan how this project can fit in best with Tungsten. So I would be very interested to hear your thoughts on this.
- Kristian.
I have been occupied with conferences/travel, and will be so next week also. But I still wanted to kick off the next part of this thread.
I would like to introduce the perspective of what current replication users are missing the most from the current implementation. I found this blog post (by Mats Kindahl) interesting in this respect: http://mysqlmusings.blogspot.com/2007/06/replication-poll-and-our-plans-for....
It describes a poll on http://dev.mysql.com/ on most wanted replication features, with the following top-5 as result:
1. Online check that Master and Slave tables are consistent
2. Multi-source replication: replicating from several masters to one slave
3. Multi-threaded application of data on slave to improve performance
4. Conflict resolution: earlier replicated rows are not applied
5. Semi-synchronous replication: transaction copied to slave before commit
To this list I would add:
6. Reliable crash safety/recovery of slaves
7. Facilities to implement simple and robust slave takeover from old master
8. Easy slave provisioning
Now, obviously in the current project we cannot do all of these :-). But the idea is to discuss new APIs, and I think it is important to have in mind what important new features we need to implement in the next say 2-5 years. We would want to help/facilitate such development in the API design where possible/relevant. So please add to the above list if you think something is missing.
Of the above list, I think point 3 (parallel replication) is particularly important to keep in mind:
- It is one of the top limitations always mentioned by "big" users of replication.
- It is a complex task, and so will need a lot of time and effort to get working (so the sooner we start the better!)
- As far as I can determine, the MySQL replication team is not really working on this.
Point 5 seems to be already implemented in 5.5.1. Points 2 and 4 have already been started by the MySQL replication team according to the (old) blog by Mats (not sure of current status though). Point 8 may already be there with things like LVM snapshots or xtrabackup (not sure).
---
One thing I notice in this list is that the concept of global transaction ID seems to play into many of them. Certainly 1, 3, 5, 6, 7, and maybe others also. And it was already discussed in this thread also. So I'm thinking that to get more concrete, global transaction ID seems like the place to start. I will send more thoughts on this in follow-up mails, in particular with respect to possibilities for implementing parallel replication.
- Kristian.
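(As a purely illustrative sketch of the kind of global transaction ID discussed above and earlier in the thread; the table and column names are made up:)

  -- one conceivable shape: an unambiguous origin, an epoch bumped when a new
  -- master takes over, and a monotonically increasing (gaps allowed) sequence
  CREATE TABLE replication_position (
    source_id CHAR(36) NOT NULL,  -- server that originally committed the transaction
    epoch     BIGINT   NOT NULL,  -- incremented on failover / divergence
    seqno     BIGINT   NOT NULL,  -- commit order within (source_id, epoch)
    PRIMARY KEY (source_id, epoch, seqno)
  );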
participants (13)
- Alex Yurchenko
- Harrison Fisk
- Henrik Ingo
- Jeremy Zawodny
- Kristian Nielsen
- MARK CALLAGHAN
- Michael Widenius
- Robert Hodges
- seppo.jaakola@codership.com
- Sergei Golubchik
- Sergei Golubchik
- Stewart Smith
- Venu Kalyan