Alex Yurchenko <alexey.yurchenko@codership.com> writes:
On Wed, 19 May 2010 15:05:55 +0200, Sergei Golubchik <serg@askmonty.org> wrote:
Yes, it only describes how the data get to the redundancy service, but not what happens there. I intentionally kept the details of redundancy out, to be able to satisfy a wide range of different implementations.
For example, if I'd put a global transaction ID explicitly in the model, then MySQL replication would not fit into it - it has such a concept only implicitly, as you have noted.
So, what I did was, as Robert Hodges put it, "pushed the can down the road", and let the redundancy service take care of the transaction IDs.
But perhaps I'm biased, and the model I've described is influenced by MySQL replication more than it should've been?
Oh, not really. I just wanted to note that while you were proposing a useful framework, you did not touch on the actual replication/redundancy specifics.
Yes, I agree. I think what we need to do is have several layers in the API. So far I have identified three different layers:

1. Event generators and consumers. This is what Sergei discussed. The essence of this layer is hooks in handler::write_row() and similar places that provide data about changes (row values for row-based replication, query texts for statement-based replication, etc.). There is no binlog or global transaction ID at this layer; I think there may not even be a defined event format as such, just an API for consumer plugins to get the information (and for generator plugins to provide it). A rough sketch of what such an interface might look like follows after this list.

2. Primary redundancy service and TC manager. There will be exactly one of these in a server. It controls the two-phase commit among the different engines, binlogs, etc. (and handles their recovery after a crash). It also controls the commit order, so it would be the place to implement the global transaction ID.

3. Default event format. I think it will be useful to have a standard replication event format at a high level. This would be optional, so plugins at layers 1 and 2 would be free to define their own format, but having a standard format at some level would let us re-use more code and not have to re-invent the wheel in every plugin. Maybe at this level there could also be an API for defining the encapsulation of a specific event format, so that a generic binlog or network transport could be written supporting multiple event formats.
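To make layer 1 a bit more concrete, here is a minimal sketch of what such a consumer plugin interface could look like. All of the names (rpl_event_consumer, row_event, stmt_event) are hypothetical illustrations, not a proposed final API:

    #include <cstddef>

    /* Raw change data handed out by the generator hooks; deliberately no
       serialized event format at this layer. */
    struct row_event  { const void *before_image; const void *after_image; std::size_t len; };
    struct stmt_event { const char *query; std::size_t query_len; };

    /* A consumer plugin implements this interface; the server calls it from
       hooks in handler::write_row() and similar places. */
    class rpl_event_consumer {
    public:
      virtual ~rpl_event_consumer() = default;

      /* One call per changed row (row-based data). */
      virtual void row_changed(const row_event &ev) = 0;

      /* One call per statement (statement-based data). */
      virtual void stmt_executed(const stmt_event &ev) = 0;

      /* Transaction boundaries; note there is no global transaction ID
         here, that belongs to layer 2. */
      virtual void trx_start() = 0;
      virtual void trx_end(bool committed) = 0;
    };

The point being that generators just push raw change data through such hooks; any event format, binlog, or transport would be built on top by the plugins.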
Speaking of current MySQL replication, I was skeptical from the beginning that it would fit into the new redundancy service in its current, unmodified form. It is simply too integrated with the server for that (just think of all those HAVE_REPLICATION ifdefs). That's why I proposed keeping them side by side and not trying to unify them.
Yes. So with respect to the above layers, I think the current binlog implementation can be built upon a generic layer-1 API without problems. But at layer 2, the existing binlog implementation would be side by side with other alternatives. And at layer 3 I think it would also be side by side: the existing binlog format is really not very extensible, and a more flexible format (maybe based on Google protocol buffers, like Drizzle does) sounds like a more likely way forward.

So for something like Galera, I think it would hook into the layer-1 API to get the events from statements. At layer 2, it would implement its own TC manager, which controls the commit process and recovery and handles the synchronous replication algorithm. And at layer 3, maybe it would implement its own event format, or maybe it could use the default event format (and re-use the code to package such events on a master and apply them on a slave), but implement its own transport for the events.

Sounds reasonable?

 - Kristian.
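PS: To illustrate the layer-2 idea, here is a very rough sketch of what a pluggable TC manager interface might look like, together with a trivial single-node implementation. All names (tc_manager, local_tc_manager, ...) are hypothetical, just for illustration:

    #include <cstdint>
    #include <mutex>

    /* Hypothetical layer-2 interface: exactly one instance per server
       drives the two-phase commit across engines/binlogs and fixes the
       commit order. */
    class tc_manager {
    public:
      virtual ~tc_manager() = default;
      virtual bool prepare() = 0;          // run the prepare phase in all participants
      virtual std::uint64_t commit() = 0;  // decide the commit order, return a global trx ID
      virtual void recover() = 0;          // resolve prepared transactions after a crash
    };

    /* Trivial single-node implementation: a mutex serializes commits and
       a counter stands in for the global transaction ID.  A Galera-style
       plugin would instead run its synchronous replication protocol in
       commit(). */
    class local_tc_manager : public tc_manager {
      std::mutex commit_lock_;
      std::uint64_t next_trx_id_ = 1;
    public:
      bool prepare() override { return true; }  // assume all engines prepared OK
      std::uint64_t commit() override {
        std::lock_guard<std::mutex> guard(commit_lock_);
        return next_trx_id_++;                  // ID assigned in commit order
      }
      void recover() override {}                // toy version: nothing to resolve
    };

The point is that whatever implements this layer is the single serialization point for commits, which is why the global transaction ID naturally lives here and not at layer 1.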