Sergei Golubchik <serg@askmonty.org> writes:
Hi, Kristian!
Hi, thanks for your comments! A couple of questions inline, and some comments/thoughts.
On Jun 24, Kristian Nielsen wrote:
At the implementation level, a lot of the work is basically to pull out all of the needed information from the THD object/context. The API I propose tries to _not_ expose the THD to consumers. Instead it provides accessor functions for all the bits and pieces relevant to
of course
each replication event, while the event class itself likely will be more or less just an encapsulated THD.
So an alternative would be to have a generic event that was just (type, THD). Then consumers could just pull out whatever information they want from the THD. The THD implementation is already exposed to storage engines. This would of course greatly reduce the size of the
no, it's not. THD is not exposed to engines (unless they define MYSQL_SERVER but then it's not our problem), they use accessor functions.
Ah, I see. Ok good, so it makes sense to use accessor functions in the replication APIs also, with no trace of THD.
API, eliminating lots of class definitions and accessor functions. Though arguably it wouldn't really simplify the API, as the complexity would just be in understanding the THD class.
For now, the API is proposed without exposing the THD class. (Similar encapsulation could be added in actual implementation to also not expose TABLE and similar classes).
completely agree
Ok, so some follow-up questions:

1. Do I understand correctly that you agree that the API should also encapsulate TABLE and similar classes? These _are_ exposed to storage engines as far as I can see.

2. If TABLE and so on should be encapsulated, there will be the issue of having iterators to run over columns, etc. Do we already have standard classes for this that could be used? Or should I model this on the iterators of the Standard C++ library, for example? (I would like the new API to fit in as well as possible with the existing MySQL/MariaDB code, which you know much better.)
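To make question 2 concrete, here is a minimal sketch of what an encapsulated, Standard-C++-style iterable view of a table's columns might look like. All class and member names here are hypothetical illustrations, not existing MariaDB identifiers:

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical encapsulated view of one column of a replicated row.
class rpl_column_view {
public:
  rpl_column_view(std::string name, std::string value)
    : name_(std::move(name)), value_(std::move(value)) {}
  const std::string &name() const { return name_; }
  const std::string &value() const { return value_; }
private:
  std::string name_;
  std::string value_;
};

// Hypothetical replacement for exposing TABLE directly: the event would
// hand consumers this view, iterable in the Standard C++ library style.
class rpl_table_view {
public:
  void add_column(std::string name, std::string value) {
    columns_.emplace_back(std::move(name), std::move(value));
  }
  std::vector<rpl_column_view>::const_iterator begin() const {
    return columns_.begin();
  }
  std::vector<rpl_column_view>::const_iterator end() const {
    return columns_.end();
  }
  std::size_t column_count() const { return columns_.size(); }
private:
  std::vector<rpl_column_view> columns_;
};
```

A consumer would then loop over columns with an ordinary range-based for, without ever seeing the TABLE class.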
A consumer is implemented as a virtual class (interface). There is one virtual function for every event that can be received. A consumer would derive from
hm. This part I don't understand. How would that work? A consumer wants to see a uniform stream of events, perhaps for sending them to a slave. Why would you need different consumers and different methods for different events?
I'd just have one method, receive_event(rpl_event_base *)
Ok, so do I understand you correctly that class rpl_event_base would have a type field, and the consumer could then down-cast to the appropriate specific event class based on the type?

    receive_event(const rpl_event_base *generic_event)
    {
      switch (generic_event->type)
      {
      case rpl_event_base::RPL_EVENT_STATEMENT_QUERY:
      {
        const rpl_event_statement_query *ev=
          static_cast<const rpl_event_statement_query *>(generic_event);
        do_stuff(ev->get_query_string(), ...);
        break;
      }
      case rpl_event_base::RPL_EVENT_ROW_UPDATE:
      {
        const rpl_event_row_update *ev=
          static_cast<const rpl_event_row_update *>(generic_event);
        do_stuff(ev->get_after_image(), ...);
        break;
      }
      ...
      }
    }

I have always disliked having such a type field and downcasting, so I tried to make an API where it was not needed. Like this:

    class my_event_consumer
    {
      int stmt_query(const rpl_event_statement_query *ev)
      { do_stuff(ev->get_query_string(), ...); }
      int row_update(const rpl_event_row_update *ev)
      { do_stuff(ev->get_after_image(), ...); }
      ...
    };

Maybe it was a stupid idea. I don't mind doing the simpler one with just a receive_event() method and a type field.

(Actually, I think my dislike is mainly of class hierarchies which start out with full abstraction, taking great care that everything type-specific is handled inside generic virtual methods of the base class. Then at some point this gets tricky, and bits and pieces of outside code start to inspect the type and do downcasts and type-specific stuff. You end up with something that has all the complexity of a polymorphic class hierarchy, but none of the elegance. That is not the case here, as the events are just data containers; they do not have complex logic attached.)
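For reference, the type-field variant sketched above can be made into a small self-contained example. The class and accessor names follow the mail's sketch but are otherwise hypothetical (do_stuff is replaced by returning a label so the dispatch is observable):

```cpp
#include <cassert>
#include <string>
#include <utility>

class rpl_event_base {
public:
  enum event_type { RPL_EVENT_STATEMENT_QUERY, RPL_EVENT_ROW_UPDATE };
  explicit rpl_event_base(event_type t) : type(t) {}
  virtual ~rpl_event_base() {}
  event_type type;  // consumers dispatch on this field
};

class rpl_event_statement_query : public rpl_event_base {
public:
  explicit rpl_event_statement_query(std::string q)
    : rpl_event_base(RPL_EVENT_STATEMENT_QUERY), query(std::move(q)) {}
  const std::string &get_query_string() const { return query; }
private:
  std::string query;
};

class rpl_event_row_update : public rpl_event_base {
public:
  explicit rpl_event_row_update(std::string after)
    : rpl_event_base(RPL_EVENT_ROW_UPDATE), after_image(std::move(after)) {}
  const std::string &get_after_image() const { return after_image; }
private:
  std::string after_image;
};

// One receive_event() entry point, downcasting on the type field.
std::string receive_event(const rpl_event_base *generic_event) {
  switch (generic_event->type) {
  case rpl_event_base::RPL_EVENT_STATEMENT_QUERY: {
    const rpl_event_statement_query *ev =
      static_cast<const rpl_event_statement_query *>(generic_event);
    return "stmt:" + ev->get_query_string();
  }
  case rpl_event_base::RPL_EVENT_ROW_UPDATE: {
    const rpl_event_row_update *ev =
      static_cast<const rpl_event_row_update *>(generic_event);
    return "row:" + ev->get_after_image();
  }
  }
  return "unknown";
}
```

Note that each case needs its own braces, since a declaration with an initializer may not be jumped over inside a switch.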
  /*
    The global transaction id is unique across servers.

    It can be used to identify the position from which to start a slave
    replicating from a master.

    This global ID only becomes available once the TC manager / primary
    redundancy service decides that the transaction will commit. This TC
    also allocates the ID and decides the exact semantics (whether there
    can be gaps, etc.); however, the format is fixed:
    (cluster_id, running_counter).
  */
uhm. XID format is defined by the XA standard. An XID consists of:
- format ID (unsigned long)
- global transaction ID (up to 64 bytes)
- branch qualifier (up to 64 bytes)
as your transaction id is smaller, you will need to consider XID a part of the "context" - in cases where XID was generated externally.
Same about the binlog position, which is the "transaction id" of MySQL native replication. It doesn't fit into your scheme, so it will have to be part of the context. And unless the redundancy service is allowed to ignore your transaction ids, MySQL native replication will not fit into the API.
Yes, good points.

Ok, so my idea with the global transaction ID follows the previous discussion: there can be a primary redundancy plugin, which gets to control the commit order and create the global transaction IDs. The global transaction ID is used to allow slaves to easily synchronise to any master. As long as a slave commits the last global transaction ID applied, it can connect to any master and know where to start replicating (or determine that the slave is actually ahead of the would-be master). Etc. (I do not know if XID can be used for this purpose, but even if not, your point is still valid.)

So maybe it is wrong to fix a particular global transaction ID format at this level of the API.

One option is to have only the local transaction ID at this level of the API. Then the primary redundancy plugin / TC manager should expose an API that allows consumers (and others) to look up the global transaction ID from the local transaction ID (I believe it will need to maintain such a mapping anyway).

Another option is to expose a global transaction ID of generic format at this layer (we could even use the XA standard XID format).
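A minimal sketch of the first option, assuming the (cluster_id, running_counter) format from the API comment above and a hypothetical local-to-global lookup exposed by the TC manager (all names here are illustrative, not part of any existing API):

```cpp
#include <cassert>
#include <cstdint>
#include <map>

// Fixed-format global transaction ID, as described in the API comment.
struct rpl_global_trx_id {
  uint32_t cluster_id;       // identifies the originating cluster
  uint64_t running_counter;  // allocated by the TC manager at commit time
};

// A slave deciding where to start replicating only needs an ordering
// between IDs from the same cluster.
inline bool precedes(const rpl_global_trx_id &a, const rpl_global_trx_id &b) {
  return a.cluster_id == b.cluster_id && a.running_counter < b.running_counter;
}

// Hypothetical mapping the primary redundancy plugin / TC manager would
// maintain from local transaction IDs to global ones.
class rpl_trx_id_map {
public:
  void record(uint64_t local_id, rpl_global_trx_id global_id) {
    map_[local_id] = global_id;
  }
  // Returns true and fills *out once the transaction's commit is decided.
  bool lookup(uint64_t local_id, rpl_global_trx_id *out) const {
    std::map<uint64_t, rpl_global_trx_id>::const_iterator it =
      map_.find(local_id);
    if (it == map_.end()) return false;
    *out = it->second;
    return true;
  }
private:
  std::map<uint64_t, rpl_global_trx_id> map_;
};
```

With this split, consumers see only local IDs in the event stream and ask the plugin for the global ID when they need to expose a synchronisation point.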
  class rpl_event_base
  {
    ...
    int materialise(int (*writer)(uchar *data, size_t len, void *context)) const;
    ...
  };

Also, I still need to think about whether it is at all useful to be able to generically materialise an event at this level. It may be that any binlog/transport will in any case need to understand more of the format of events, so that such materialisation/transport is better done at a different layer.
Right, I'm doubtful too. Say, to materialize a statement-level event you need to know exactly which bits of the context you want to include. When replicating to MariaDB it's one set, when replicating to an identically configured MariaDB of the same version it's another set, and when replicating to, say, DB2, it's probably a different (larger) set.
Yes, exactly. So that's the main reason I'd like to have a non-materialised API, and then possibly build materialisation on top.

What I have been thinking is to have a default (but not mandatory) event format. I am thinking of maybe using Google protocol buffers (they seem fairly good for this purpose, and they are quite popular; eg. Monty is planning to use them for dynamic columns). With such a format, it would be possible to write generic plugins for a binlog implementation, direct transport to a slave, checksum/encrypt/compress, etc. Which I agree would be nice (and such plugins don't really want to have to handle complete materialisation of any possible event themselves from scratch).

Incidentally, I think the existing binlog format is really hopeless to use with such generic plugins; it seems intricately tied to a particular binlog implementation (like including master binlog file names and file offsets inside events).
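As an illustration of how a consumer might drive the materialise(writer, ...) interface quoted above: the writer callback appends each chunk to a growing buffer via the opaque context pointer. The proposed signature does not show how the context reaches materialise(), so I assume here that it is passed alongside the writer; the demo_event below is a stand-in for a real event class, purely for illustration:

```cpp
#include <cassert>
#include <cstddef>
#include <string>

typedef unsigned char uchar;

// Writer callback in the proposed shape: receives one chunk plus the
// opaque context, here a std::string acting as the wire buffer.
static int append_writer(uchar *data, size_t len, void *context) {
  std::string *out = static_cast<std::string *>(context);
  out->append(reinterpret_cast<char *>(data), len);
  return 0;  // 0 = success; non-zero would abort materialisation
}

// Stand-in for an event class; a real one would serialise all its fields.
struct demo_event {
  std::string query;
  int materialise(int (*writer)(uchar *data, size_t len, void *context),
                  void *context) const {
    return writer(
        reinterpret_cast<uchar *>(const_cast<char *>(query.data())),
        query.size(), context);
  }
};
```

A generic checksum/compress/transport plugin could be written purely against this callback interface, without knowing the event's internal layout.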
One generator can be stacked on top of another. This means that a generator on top (for example row-based events) will handle some events itself (eg. a non-deterministic update in mixed-mode binlogging). Other events that it does not want to or cannot handle (for example a deterministic delete, or DDL) will be deferred to the generator below (for example statement-based events).
There's a problem with this idea. Say, Event B is nested in Event A:
    |<----------- Event A ----------->|
          |<----- Event B ----->|
This is fine. But what about
    |<----- Event A ----->|
               |<----- Event B ----->|
In the latter case neither event is nested in the other, and neither level can simply defer to the other.
I don't know a solution for this, I'm just hoping the above situation is impossible. At least, I could not find an example of "overlapping" events.
Another way of thinking about this is that we have one layer above handling (or not handling) an event that can be generated below. So if a statement is handled using row-based replication events, the row-based replication event generator on top will choose to discard the corresponding event from the statement-based generator below. If it is not handled, the row-based layer will pass through the event from the statement-based layer.

(This is one reason I wanted event generation to be very cheap (no materialisation); I prefer this way of generating below and discarding above to having the layer above set and clear flags (or whatever) for the layer below about whether to generate events or not.)

So one case where this becomes a problem is if we have a multi-table update where one table is PBXT and another is not, and we are using PBXT engine-level replication on top of statement-based replication. In this case, one half of the statement-based event is handled by the layer above, but the other is not. So we cannot deal with this situation.

(We could of course think of ways to handle this. For example, modify the statement event to include a flag to only touch the non-PBXT tables when applied on the slave. This would correspond to slicing up the events so that they nest properly in one another in the nested-event description. Probably it is better just to not support such a scenario, throwing an error.)
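The "generate below, discard above" rule described here can be sketched as follows. The layer on top sees every event the layer below generates and either replaces it with its own events or passes it through unchanged. All names and the deterministic/non-deterministic split are illustrative assumptions, not real MariaDB API:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Stand-in for a statement-based event generated by the layer below.
struct stmt_event {
  std::string query;
  bool deterministic;  // non-deterministic statements go row-based
};

// Hypothetical row-based layer stacked on top of the statement-based one.
class row_based_layer {
public:
  // Returns the events actually emitted downstream for one statement.
  std::vector<std::string> process(const stmt_event &ev) {
    std::vector<std::string> out;
    if (!ev.deterministic) {
      // Handled here: emit row events, discard the statement event.
      out.push_back("ROW_EVENTS_FOR(" + ev.query + ")");
    } else {
      // Not handled here: pass through the statement-based event.
      out.push_back("STMT(" + ev.query + ")");
    }
    return out;
  }
};
```

The key property is that the layer below always generates its event cheaply; the decision to keep or discard it is made entirely by the layer above, with no flags flowing downward.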
I added in the proposed API a simple facility to materialise every event as a string of bytes. To use this, I still need to add a suitable facility to de-materialise the event.
Couldn't that be done not in the API or generator, but as a filter somewhere up the chain ?
Yes. It's interesting that it could be a filter/generator higher in the stack; I had not thought about that.

 - Kristian.
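To illustrate the filter-up-the-chain idea: a filter could operate purely on the materialised byte string, framing it with a checksum on the way out and verifying/stripping it on the way in, without understanding the event format at all. This is a hypothetical sketch (the framing layout and checksum are made up for illustration):

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Simple positional checksum over the materialised bytes (illustrative).
static uint32_t checksum32(const std::string &data) {
  uint32_t sum = 0;
  for (std::string::size_type i = 0; i < data.size(); i++)
    sum = sum * 31 + static_cast<unsigned char>(data[i]);
  return sum;
}

// Filter "materialise" direction: prepend a 4-byte big-endian checksum.
std::string wrap_event(const std::string &payload) {
  uint32_t sum = checksum32(payload);
  std::string out;
  out.push_back(static_cast<char>(sum >> 24));
  out.push_back(static_cast<char>(sum >> 16));
  out.push_back(static_cast<char>(sum >> 8));
  out.push_back(static_cast<char>(sum));
  return out + payload;
}

// Filter "de-materialise" direction: verify and strip the header.
bool unwrap_event(const std::string &framed, std::string *payload) {
  if (framed.size() < 4) return false;
  uint32_t sum = 0;
  for (int i = 0; i < 4; i++)
    sum = (sum << 8) | static_cast<unsigned char>(framed[i]);
  std::string body = framed.substr(4);
  if (sum != checksum32(body)) return false;
  *payload = body;
  return true;
}
```

Because the filter only sees opaque bytes, the same code works regardless of which generator produced the event, which is exactly what makes placing it higher in the stack attractive.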