Kristian Nielsen <knielsen@knielsen-hq.org> writes:
1. Event generators and consumers. This is what Sergei discussed. The essence of this layer is hooks in handler::write_row() and similar places that provide data about changes (row values for row-based replication, query texts for statement-based replication, etc). There is no binlog or global transaction ID at this layer; I think there may not even be a defined event format as such, just an API for consumer plugins to get the information (and for generator plugins to provide it).
I started to write up a more concrete design for this part. Here is the link to the worklog (text also included below):

    http://askmonty.org/worklog/Server-Sprint/?tid=120

This part describes the lowest layer, the generation of events, and also has a little discussion of some possible layers above (replicated event stream and binlog/transport APIs).

This API aims to be useful for something like Tungsten to implement its own binlog format. It might also be usable for something like Galera, or alternatively Galera might want to hook into a higher layer providing a default replication stream and slave applier thread (I'm not sure which). This API would also be used to implement the legacy MySQL binlog format for compatibility.

This is far from a final design; I plan to work much more on the details. However, I think it is important to start discussing the overall shape of the API more concretely. There are several important overall decisions I have made already, which I _did_ consider carefully, but which are still open to discussion. And any feedback in general would be most welcome.

This project is a major addition to the server with the potential to greatly influence the future direction of server development (or not, if we get it wrong). And it's impossible for one person to think of everything on his own.

So any feedback is welcome; meanwhile I'll continue expanding this and other parts of the design.

 - Kristian.

-----------------------------------------------------------------------
High-Level Specification

Generators and consumers
------------------------

We have two concepts:

1. Event _generators_, that produce events describing all changes to data in a server.

2. Event _consumers_, that receive such events and use them in various ways.

One example of an event generator is the execution of SQL statements, which generates events like those used for statement-based replication. Another example is PBXT engine-level replication.
An example of an event consumer is the writing of the binlog on a master.

Event generators are not really plugins. Rather, there are specific points in the server where events are generated. However, a generator can be part of a plugin; for example, a PBXT engine-level replication event generator would be part of the PBXT storage engine plugin. Event consumers, on the other hand, could be plugins.

One generator can be stacked on top of another. This means that a generator on top (for example row-based events) will handle some events itself (eg. non-deterministic update in mixed-mode binlogging). Other events that it does not want to or cannot handle (for example deterministic delete or DDL) will be deferred to the generator below (for example statement-based events).

Materialisation (or not)
------------------------

A central decision is how to represent events that are generated in the API at the point of generation. I want to avoid making the API require that events are materialised. By "materialised" I mean that all (or most) of the data for the event is written into memory in a struct/class used inside the server, or serialised in a data buffer (byte buffer) in a format suitable for network transport or disk storage.

Using a non-materialised event means storing just a reference to appropriate context that allows all information for the event to be retrieved using accessors. Ie. typically this would be based on getting the event information from the THD pointer.

Some reasons to avoid using materialised events in the API:

 - Replication events have a _lot_ of detailed context information that can be needed in events: user-defined variables, random seed, character sets, table column names and types, etc. etc. If we make the API based on materialisation, then the initial decision about which context information to include with which events will have to be made in the API, while ideally we want this decision to be made by the individual consumer plugin.
   There will thus be a conflict between what to include (to allow consumers access) and what to exclude (to avoid excessive needless work).

 - Materialising means defining a very specific format, which will tend to make the API less generic and flexible.

 - Unless the materialised format is made _very_ specific (and thus very inflexible), it is unlikely to be directly useful for transport (eg. binlog), so it will need to be re-materialised into a different format anyway, wasting work.

 - If a generator on top handles an event, then we want to avoid wasting work materialising an event in a generator below which would be completely unused. Thus there would be a need for the upper generator to somehow notify the lower generator ahead of event generation time not to fire an event, complicating the API.

Some advantages of materialisation:

 - Using an API based on passing around some well-defined struct event (or byte buffer) would be simpler than the complex class hierarchy proposed here with no requirement for materialisation.

 - Defining a materialised format would allow an easy way to use the same consumer code on a generator that produces events at the source of execution and on a generator that produces events by eg. reading them from an event log.

Note that there can be some middle way, where some data is materialised and some is kept as a reference to context (eg. THD) only. This however loses most of the mentioned advantages of materialisation.

The design proposed here aims for as little materialisation as possible.

Default materialisation format
------------------------------

While the proposed API doesn't _require_ materialisation, we can still think about providing the _option_ for built-in materialisation. This could be useful if such materialisation is made suitable for transport to a different server (eg. no endian-dependence etc).
If there is a facility for such materialisation built in to the API, it becomes possible to write something like a generic binlog plugin or generic network transport plugin. This would be really useful for eg. PBXT engine-level replication, as it could be implemented without having to re-invent a binlog format.

I added in the proposed API a simple facility to materialise every event as a string of bytes. To use this, I still need to add a suitable facility to de-materialise the event.

However, it is still an open question whether such a facility will be at all useful. It still has some of the problems with materialisation mentioned above. And I think it is likely that a good binlog implementation will need to do more than just blindly copy opaque events from one endpoint to another. For example, it might need different event boundaries (merge and/or split events); it might need to augment or modify events, or inject new events, etc.

So I think maybe it is better to add such a generic materialisation facility on top of the basic event generator API. Such a facility would provide materialisation of a replication event stream, not of individual events, so would be more flexible in providing a good implementation. It would be implemented for all generators. It would be separate from the event generator API (so we have flexibility to put a filter class in-between generator and materialisation), and could also be separate from the actual transport handling stuff like fsync() of binlog files and socket connections etc. It would be paired with a corresponding applier API which would handle executing events on a slave.

Then we can have a default materialised event format, which is available, but not mandatory. So there can still be other formats alongside (like the legacy MySQL 5.1 binlog event format, and maybe Tungsten would have its own format).

Encapsulation
-------------

Another fundamental question about the design is the level of encapsulation used for the API.
At the implementation level, a lot of the work is basically to pull out all of the needed information from the THD object/context. The API I propose tries to _not_ expose the THD to consumers. Instead it provides accessor functions for all the bits and pieces relevant to each replication event, while the event class itself will likely be more or less just an encapsulated THD.

So an alternative would be to have a generic event that is just (type, THD). Then consumers could just pull out whatever information they want from the THD. The THD implementation is already exposed to storage engines. This would of course greatly reduce the size of the API, eliminating lots of class definitions and accessor functions. Though arguably it wouldn't really simplify the API, as the complexity would just be in understanding the THD class.

Note that we do not have to take any performance hit from using encapsulated accessors, since compilers can inline them (though if they are inlined, we do not get any ABI stability with respect to the THD implementation).

For now, the API is proposed without exposing the THD class. (Similar encapsulation could be added in the actual implementation to also not expose TABLE and similar classes.)

-----------------------------------------------------------------------
Low-Level Design

A consumer is implemented as a virtual class (interface). There is one virtual function for every event that can be received. A consumer would derive from the base class and override methods for the events it wants to receive.

There is one consumer interface for each generator. When a generator A is stacked on B, the consumer interface for A inherits from the interface for B. This way, when A defers an event to B, the consumer for A will receive the corresponding event from B.

There are methods for a consumer to register itself to receive events from each generator.
I still need to find a way for a consumer in one plugin to register itself with a generator implemented in another plugin (eg. PBXT engine-level replication). I also need to add a way for consumers to de-register themselves.

The current design has consumer callbacks return 0 for success and an error code otherwise. I still need to think more about whether this is useful (ie. what the semantics are of returning an error from a consumer callback).

Each event passed to consumers is defined as a class with public accessor methods to a private context (which is mostly the THD).

My intention is to make all events passed around const, so that the same event can be passed to each of multiple registered consumers (and to emphasise that consumers do not have the ability to modify events). It remains to be seen whether that const-ness will be feasible in practice without very heavy modification/constification of existing code.

What follows is a partial draft of a possible definition of the API as concrete C++ class definitions.

-----------------------------------------------------------------------

/*
  Virtual base class for generated replication events.

  This is the parent of events generated from all kinds of generators. Only
  child classes can be instantiated.

  This class can be used by code that wants to treat events in a generic way,
  without any knowledge of event details. I still need to decide whether such
  generic code is sensible.
*/
class rpl_event_base
{
public:
  /*
    Maybe we will want the ability to materialise an event to a standard
    binary format. This could be achieved with a base method like this. The
    actual materialisation would be implemented in each deriving class. The
    public methods would provide different interfaces for specifying the
    buffer or for writing directly into IO_CACHE or file.
  */

  /* Return 0 on success, -1 on error, -2 on out-of-buffer.
  */
  int materialise(uchar *buffer, size_t buflen) const;
  /*
    Returns NULL on error or else malloc()ed buffer with materialised event,
    caller must free().
  */
  uchar *materialise() const;
  /* Same but using passed in memroot. */
  uchar *materialise(mem_root *memroot) const;
  /*
    Materialise to user-supplied writer function (could write directly to file
    or the like).
  */
  int materialise(int (*writer)(uchar *data, size_t len, void *context)) const;

  /*
    As for what to do with a materialised event, there are a couple of
    possibilities.

    One is to have a de_materialise() method somewhere that can construct an
    rpl_event_base (really a derived class of course) from a buffer or writer
    function. This would require each accessor function to conditionally read
    its data from either THD context or buffer (GCC is able to optimise
    several such conditionals in multiple accessor function calls into one
    conditional), or we can make all accessors virtual if the performance hit
    is acceptable.

    Another is to have different classes for accessing events read from
    materialised event data.

    Also, I still need to think about whether it is at all useful to be able
    to generically materialise an event at this level. It may be that any
    binlog/transport will in any case need to understand more of the format of
    events, so that such materialisation/transport is better done at a
    different layer.
  */

protected:
  /* Implementation which is the basis for materialise(). */
  virtual int do_materialise(int (*writer)(uchar *data, size_t len,
                                           void *context)) const = 0;

  /*
    Virtual base class. The constructor is protected rather than private, so
    that the class cannot be instantiated directly but derived event classes
    can still construct their base.
  */
  rpl_event_base();
};


/*
  These are the event types output from the transaction event generator.

  This generator is not stacked on anything.

  The transaction event generator marks the start and end (commit or rollback)
  of transactions.
  It also gives information about whether the transaction was a full
  transaction or autocommitted statement, whether transactional tables were
  involved, whether non-transactional tables were involved, and XA information
  (ToDo).
*/

/* Base class for transaction events. */
class rpl_event_transaction_base : public rpl_event_base
{
public:
  /*
    Get the local transaction id. This ID is only unique within one server.
    It is allocated whenever a new transaction is started.
    Can be used to identify events belonging to the same transaction in a
    binlog-like stream of events streamed in parallel among multiple
    transactions.
  */
  uint64_t get_local_trx_id() const { return thd->local_trx_id; }

  bool get_is_autocommit() const;

protected:
  /* Protected (as in rpl_event_row_base) so derived events can construct. */
  rpl_event_transaction_base(THD *_thd) : thd(_thd) { }

private:
  /* The context is the THD. */
  THD *thd;
};

/* Transaction start event. */
class rpl_event_transaction_start : public rpl_event_transaction_base
{

};

/* Transaction commit. */
class rpl_event_transaction_commit : public rpl_event_transaction_base
{
public:
  /*
    The global transaction id is unique cross-server.

    It can be used to identify the position from which to start a slave
    replicating from a master.

    This global ID is only available once the transaction is decided to commit
    by the TC manager / primary redundancy service. This TC also allocates the
    ID and decides the exact semantics (can there be gaps, etc); however the
    format is fixed (cluster_id, running_counter).
  */
  struct global_transaction_id
  {
    uint32_t cluster_id;
    uint64_t counter;
  };

  const global_transaction_id *get_global_transaction_id() const;
};

/* Transaction rollback. */
class rpl_event_transaction_rollback : public rpl_event_transaction_base
{

};


/* Base class for statement events.
*/
class rpl_event_statement_base : public rpl_event_base
{
public:
  LEX_STRING get_current_db() const;
};

class rpl_event_statement_start : public rpl_event_statement_base
{

};

class rpl_event_statement_end : public rpl_event_statement_base
{
public:
  int get_errorcode() const;
};

class rpl_event_statement_query : public rpl_event_statement_base
{
public:
  /* Accessors are const so they can be called on the const events consumers get. */
  LEX_STRING get_query_string() const;
  ulong get_sql_mode() const;
  const CHARSET_INFO *get_character_set_client() const;
  const CHARSET_INFO *get_collation_connection() const;
  const CHARSET_INFO *get_collation_server() const;
  const CHARSET_INFO *get_collation_default_db() const;

  /*
    Access to relevant flags that affect query execution.

    Use as if (ev->get_flags() & (uint32)STMT_FOREIGN_KEY_CHECKS) { ... }

    Each flag is a distinct bit, so that the test above works as a mask.
  */
  enum flag_bits
  {
    STMT_FOREIGN_KEY_CHECKS = 1 << 0,  // @@foreign_key_checks
    STMT_UNIQUE_KEY_CHECKS  = 1 << 1,  // @@unique_checks
    STMT_AUTO_IS_NULL       = 1 << 2,  // @@sql_auto_is_null
  };
  uint32_t get_flags() const;

  ulong get_auto_increment_offset() const;
  ulong get_auto_increment_increment() const;

  // And so on for: time zone; day/month names; connection id; LAST_INSERT_ID;
  // INSERT_ID; random seed; user variables.
  //
  // We probably also need get_uses_temporary_table(), get_used_user_vars(),
  // get_uses_auto_increment() and so on, so a consumer can get more
  // information about what kind of context information a query will need when
  // executed on a slave.
};

class rpl_event_statement_load_query : public rpl_event_statement_query
{

};

/*
  This event is fired with blocks of data for files read (from server-local
  file or client connection) for LOAD DATA.
*/
class rpl_event_statement_load_data_block : public rpl_event_statement_base
{
public:
  struct block
  {
    const uchar *ptr;
    size_t size;
  };
  block get_block() const;
};

/* Base class for row-based replication events. */
class rpl_event_row_base : public rpl_event_base
{
public:
  /*
    Access to relevant handler extra flags and other flags that affect row
    operations.

    Use as if (ev->get_flags() & (uint32)ROW_WRITE_CAN_REPLACE) { ... }

    Each flag is a distinct bit, so that the test above works as a mask.
  */
  enum flag_bits
  {
    ROW_WRITE_CAN_REPLACE          = 1 << 0,  // HA_EXTRA_WRITE_CAN_REPLACE
    ROW_IGNORE_DUP_KEY             = 1 << 1,  // HA_EXTRA_IGNORE_DUP_KEY
    ROW_IGNORE_NO_KEY              = 1 << 2,  // HA_EXTRA_IGNORE_NO_KEY
    ROW_DISABLE_FOREIGN_KEY_CHECKS = 1 << 3,  // ! @@foreign_key_checks
    ROW_DISABLE_UNIQUE_KEY_CHECKS  = 1 << 4,  // ! @@unique_checks
  };
  uint32_t get_flags() const;

  /* Access to list of tables modified. */
  class table_iterator
  {
  public:
    /* Returns table, NULL after last. */
    const TABLE *get_next();
  private:
    // ...
  };
  table_iterator get_modified_tables() const;

private:
  /* Context used to provide accessors. */
  THD *thd;

protected:
  rpl_event_row_base(THD *_thd) : thd(_thd) { }
};

class rpl_event_row_write : public rpl_event_row_base
{
public:
  const BITMAP *get_write_set() const;
  const uchar *get_after_image() const;
};

class rpl_event_row_update : public rpl_event_row_base
{
public:
  const BITMAP *get_read_set() const;
  const BITMAP *get_write_set() const;
  const uchar *get_before_image() const;
  const uchar *get_after_image() const;
};

class rpl_event_row_delete : public rpl_event_row_base
{
public:
  const BITMAP *get_read_set() const;
  const uchar *get_before_image() const;
};


/*
  Event consumer callbacks.

  An event consumer registers with an event generator to receive event
  notifications from that generator.

  The consumer has callbacks (in the form of virtual functions) for the
  individual event types the consumer is interested in. Only callbacks that
  are non-NULL will be invoked. If an event applies to multiple callbacks in a
  single callback struct, it will only be passed to the most specific non-NULL
  callback (so events never fire more than once per registration).

  The lifetime of the memory holding the event is only for the duration of the
  callback invocation, unless otherwise noted.

  Callbacks return 0 for success or error code (ToDo: does this make sense?).
*/

struct rpl_event_consumer_transaction
{
  virtual int trx_start(const rpl_event_transaction_start *) { return 0; }
  virtual int trx_commit(const rpl_event_transaction_commit *) { return 0; }
  virtual int trx_rollback(const rpl_event_transaction_rollback *) { return 0; }
};

/*
  Consuming statement-based events.

  The statement event generator is stacked on top of the transaction event
  generator, so we can receive those events as well.
*/
struct rpl_event_consumer_statement : public rpl_event_consumer_transaction
{
  virtual int stmt_start(const rpl_event_statement_start *) { return 0; }
  virtual int stmt_end(const rpl_event_statement_end *) { return 0; }

  virtual int stmt_query(const rpl_event_statement_query *) { return 0; }

  /* Data for a file used in LOAD DATA [LOCAL] INFILE. */
  virtual int stmt_load_data_block(const rpl_event_statement_load_data_block *)
  { return 0; }

  /*
    These are specific kinds of statements; if specified they override
    stmt_query() for the corresponding event.
  */
  virtual int stmt_load_query(const rpl_event_statement_load_query *ev)
  { return stmt_query(ev); }
};

/*
  Consuming row-based events.

  The row event generator is stacked on top of the statement event generator.
*/
struct rpl_event_consumer_row : public rpl_event_consumer_statement
{
  virtual int row_write(const rpl_event_row_write *) { return 0; }
  virtual int row_update(const rpl_event_row_update *) { return 0; }
  virtual int row_delete(const rpl_event_row_delete *) { return 0; }
};


/*
  Registration functions.

  ToDo: Make a way to de-register.

  ToDo: Find a good way for a plugin (eg. PBXT) to publish a generator
  registration method.
*/

int rpl_event_transaction_register(const rpl_event_consumer_transaction *cbs);
int rpl_event_statement_register(const rpl_event_consumer_statement *cbs);
int rpl_event_row_register(const rpl_event_consumer_row *cbs);
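
To make the stacking of consumer interfaces a bit more concrete, here is a minimal stand-alone sketch of how a consumer might look from the plugin side. It is only an illustration under simplifying assumptions, not part of the proposed API: every name prefixed with demo_ is hypothetical, the events carry plain data instead of wrapping a THD, and the registry is a plain vector standing in for something like rpl_event_row_register(). It shows a consumer of the row generator also receiving statement and transaction events through inheritance, and a LOAD DATA query event falling back to the generic stmt_query() callback because the more specific callback is not overridden.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in events: plain data instead of accessors over a THD.
struct demo_event_trx_commit { unsigned long long local_trx_id; };
struct demo_event_stmt_query { std::string query; };
struct demo_event_stmt_load_query : demo_event_stmt_query { };
struct demo_event_row_write { std::string table; };

// Consumer interface for the transaction generator (not stacked on anything).
struct demo_consumer_transaction
{
  virtual ~demo_consumer_transaction() { }
  virtual int trx_commit(const demo_event_trx_commit *) { return 0; }
};

// The statement generator is stacked on the transaction generator, so its
// consumer interface inherits the transaction callbacks.
struct demo_consumer_statement : demo_consumer_transaction
{
  virtual int stmt_query(const demo_event_stmt_query *) { return 0; }
  // More specific callback; defaults to the generic stmt_query().
  virtual int stmt_load_query(const demo_event_stmt_load_query *ev)
  { return stmt_query(ev); }
};

// The row generator is stacked on the statement generator.
struct demo_consumer_row : demo_consumer_statement
{
  virtual int row_write(const demo_event_row_write *) { return 0; }
};

// A consumer that records a text line for each event it is interested in.
struct demo_logging_consumer : demo_consumer_row
{
  std::vector<std::string> log;

  int trx_commit(const demo_event_trx_commit *ev) override
  { log.push_back("commit " + std::to_string(ev->local_trx_id)); return 0; }
  int stmt_query(const demo_event_stmt_query *ev) override
  { log.push_back("query " + ev->query); return 0; }
  int row_write(const demo_event_row_write *ev) override
  { log.push_back("write " + ev->table); return 0; }
};

// Hypothetical registry, standing in for a real registration function.
static std::vector<demo_consumer_row *> demo_row_consumers;

int demo_row_register(demo_consumer_row *consumer)
{
  demo_row_consumers.push_back(consumer);
  return 0;
}

// Register one consumer, fire three events at it, and return what it logged.
std::vector<std::string> demo_run()
{
  demo_logging_consumer consumer;
  demo_row_consumers.clear();
  int rc = demo_row_register(&consumer);
  assert(rc == 0);
  (void) rc;

  demo_event_row_write write_ev{"t1"};
  demo_event_stmt_load_query load_ev;
  load_ev.query = "LOAD DATA INFILE 'f' INTO TABLE t1";
  demo_event_trx_commit commit_ev{42};

  for (demo_consumer_row *c : demo_row_consumers)
  {
    c->row_write(&write_ev);        // handled by the row generator itself
    c->stmt_load_query(&load_ev);   // deferred to the statement level
    c->trx_commit(&commit_ev);      // deferred to the transaction level
  }

  demo_row_consumers.clear();       // do not leave a dangling pointer behind
  return consumer.log;
}
```

Calling demo_run() yields three log lines, one per event. The LOAD DATA event is logged through stmt_query(), since demo_logging_consumer does not override stmt_load_query(). In the real design the generator would decide which of the stacked levels fires for a given change; here the calls are simply made by hand.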