-----------------------------------------------------------------------
                              WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: Replication API for stacked event generators
CREATION DATE..: Mon, 07 Jun 2010, 13:13
SUPERVISOR.....: Knielsen
IMPLEMENTOR....:
COPIES TO......:
CATEGORY.......: Server-RawIdeaBin
TASK ID........: 120 (http://askmonty.org/worklog/?tid=120)
VERSION........: Server-9.x
STATUS.........: Un-Assigned
PRIORITY.......: 60
WORKED HOURS...: 2
ESTIMATE.......: 0 (hours remain)
ORIG. ESTIMATE.: 0

PROGRESS NOTES:

-=-=(Knielsen - Thu, 24 Jun 2010, 14:28)=-=-
High-Level Specification modified.
--- /tmp/wklog.120.old.7341 2010-06-24 14:28:17.000000000 +0000
+++ /tmp/wklog.120.new.7341 2010-06-24 14:28:17.000000000 +0000
@@ -1 +1,159 @@
+Generators and consumers
+------------------------
+
+We have two concepts:
+
+1. Event _generators_, which produce events describing all changes to data
+   in a server.
+
+2. Event _consumers_, which receive such events and use them in various
+   ways.
+
+An example of an event generator is the execution of SQL statements, which
+generates events like those used for statement-based replication. Another
+example is PBXT engine-level replication.
+
+An example of an event consumer is the writing of the binlog on a master.
+
+Event generators are not really plugins. Rather, there are specific points
+in the server where events are generated. However, a generator can be part
+of a plugin; for example, a PBXT engine-level replication event generator
+would be part of the PBXT storage engine plugin.
+
+Event consumers, on the other hand, could be plugins.
+
+One generator can be stacked on top of another. This means that the
+generator on top (for example row-based events) will handle some events
+itself (eg. a non-deterministic update in mixed-mode binlogging).
+Other events that it does not want to or cannot handle (for example a
+deterministic delete, or DDL) will be deferred to the generator below (for
+example statement-based events).
+
+
+Materialisation (or not)
+------------------------
+
+A central decision is how the API should represent events at the point of
+generation.
+
+I want to avoid making the API require that events are materialised. By
+"materialised" I mean that all (or most) of the data for the event is
+written into memory in a struct/class used inside the server, or serialised
+into a data buffer (byte buffer) in a format suitable for network transport
+or disk storage.
+
+Using a non-materialised event means storing just a reference to the
+appropriate context, from which all information about the event can be
+retrieved through accessors. Typically this would be based on getting the
+event information from the THD pointer.
+
+Some reasons to avoid requiring materialised events in the API:
+
+ - Replication events carry a _lot_ of detailed context information that
+   consumers may need: user-defined variables, random seed, character sets,
+   table column names and types, etc. If we base the API on
+   materialisation, then the initial decision about which context
+   information to include with which events has to be made in the API,
+   while ideally we want this decision to be made by the individual
+   consumer plugin. There will thus be a conflict between what to include
+   (to give consumers access) and what to exclude (to avoid excessive
+   needless work).
+
+ - Materialising means defining a very specific format, which will tend to
+   make the API less generic and flexible.
+
+ - Unless the materialised format is made _very_ specific (and thus very
+   inflexible), it is unlikely to be directly useful for transport
+   (eg. binlog), so it will need to be re-materialised into a different
+   format anyway, wasting work.
+
+ - If a generator on top handles an event, then we want to avoid wasting
+   work materialising an event in a generator below, where it would go
+   completely unused. There would thus need to be a way for the upper
+   generator to notify the lower generator, ahead of event generation time,
+   not to fire an event, complicating the API.
+
+Some advantages of materialisation:
+
+ - An API based on passing around a well-defined event struct (or byte
+   buffer) would be simpler than the complex class hierarchy proposed here,
+   which has no requirement for materialisation.
+
+ - Defining a materialised format would make it easy to use the same
+   consumer code both on a generator that produces events at the source of
+   execution and on a generator that produces events by eg. reading them
+   from an event log.
+
+Note that there can be a middle way, where some data is materialised and
+some is kept only as a reference to context (eg. the THD). This, however,
+loses most of the advantages of materialisation mentioned above.
+
+The design proposed here aims for as little materialisation as possible.
+
+
+Default materialisation format
+------------------------------
+
+While the proposed API doesn't _require_ materialisation, we can still
+think about providing the _option_ of built-in materialisation. This could
+be useful if such materialisation is made suitable for transport to a
+different server (eg. no endian-dependence etc). If such a materialisation
+facility is built into the API, it becomes possible to write something like
+a generic binlog plugin or a generic network transport plugin. This would
+be really useful for eg. PBXT engine-level replication, as it could then be
+implemented without having to re-invent a binlog format.
+
+I added to the proposed API a simple facility to materialise every event as
+a string of bytes. To use this, I still need to add a corresponding
+facility to de-materialise the event.
+
+However, it is still an open question whether such a facility will be
+useful at all. It still has some of the problems with materialisation
+mentioned above. And I think it is likely that a good binlog implementation
+will need to do more than just blindly copy opaque events from one endpoint
+to another. For example, it might need different event boundaries (merging
+and/or splitting events); it might need to augment or modify events, or
+inject new events, etc.
+
+So I think it may be better to add such a generic materialisation facility
+on top of the basic event generator API. Such a facility would provide
+materialisation of a replication event stream, not of individual events,
+and so would be more flexible in allowing a good implementation. It would
+be implemented for all generators. It would be separate from the event
+generator API (so that we have the flexibility to put a filter class
+between generator and materialisation), and could also be separate from the
+actual transport handling, such as fsync() of binlog files and socket
+connections. It would be paired with a corresponding applier API which
+would handle executing events on a slave.
+
+Then we can have a default materialised event format which is available but
+not mandatory, so other formats can still exist alongside it (like the
+legacy MySQL 5.1 binlog event format, and maybe Tungsten would have its own
+format).
+
+
+Encapsulation
+-------------
+
+Another fundamental question about the design is the level of encapsulation
+used for the API.
+
+At the implementation level, a lot of the work is basically to pull all of
+the needed information out of the THD object/context. The API I propose
+tries _not_ to expose the THD to consumers. Instead it provides accessor
+functions for all the bits and pieces relevant to each replication event,
+while the event class itself will likely be more or less just an
+encapsulated THD.
+
+So an alternative would be to have a generic event that was just
+(type, THD). Consumers could then pull whatever information they want out
+of the THD. The THD implementation is already exposed to storage engines.
+This would of course greatly reduce the size of the API, eliminating lots
+of class definitions and accessor functions. Arguably, though, it wouldn't
+really simplify the API, as the complexity would just move into
+understanding the THD class.
+
+Note that we do not have to take any performance hit from using
+encapsulated accessors, since compilers can inline them (though with
+inlining we do not get any ABI stability with respect to the THD
+implementation).
+
+For now, the API is proposed without exposing the THD class. (Similar
+encapsulation could be added in the actual implementation to also avoid
+exposing TABLE and similar classes.)

-=-=(Knielsen - Thu, 24 Jun 2010, 12:04)=-=-
Dependency created: 107 now depends on 120

-=-=(Knielsen - Thu, 24 Jun 2010, 11:59)=-=-
High Level Description modified.
--- /tmp/wklog.120.old.516 2010-06-24 11:59:24.000000000 +0000
+++ /tmp/wklog.120.new.516 2010-06-24 11:59:24.000000000 +0000
@@ -11,4 +11,4 @@
 Event generators can be stacked, and a generator may defer event generation
 to the next one down the stack. For example, the row-level replication event
-generator may defer DLL to the statement-level replication event generator.
+generator may defer DDL to the statement-level replication event generator.

-=-=(Knielsen - Mon, 21 Jun 2010, 08:35)=-=-
Research and design thoughts.

Worked 2 hours and estimate 0 hours remain (original estimate increased by
2 hours).

DESCRIPTION:

A part of the replication project, MWL#107.

Events are produced by event generators. Examples are:

 - Generation of statement-based replication events
 - Generation of row-based events
 - Generation of PBXT engine-level replication events

and maybe the reading of events from the relay log on a slave may also be
an example of generating events.
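As a rough illustration of how such generators could be stacked, with an upper generator deferring some events to the one below, consider the sketch below. All names here (the classes, the string-label scheme) are invented for illustration only; this is not the proposed API.

```cpp
#include <cassert>
#include <string>

/* Hypothetical sketch of generator stacking; names are invented. */
class event_generator {
public:
  virtual ~event_generator() {}
  /* Returns a label describing which generator produced the event. */
  virtual std::string generate(const std::string &change) = 0;
};

/* Bottom of the stack: statement-based events can describe any change. */
class stmt_generator : public event_generator {
public:
  std::string generate(const std::string &change)
  {
    return "statement:" + change;
  }
};

/* Top of the stack: row-based events.  DDL cannot be expressed as row
   changes, so it is deferred to the generator below. */
class row_generator : public event_generator {
  event_generator *below;
public:
  row_generator(event_generator *b) : below(b) {}
  std::string generate(const std::string &change)
  {
    if (change == "DDL")
      return below->generate(change);   /* defer down the stack */
    return "row:" + change;
  }
};
```

In this toy model a deterministic UPDATE is handled by the row generator itself, while DDL falls through to the statement generator, mirroring the deferral described above.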
Event generators can be stacked, and a generator may defer event generation
to the next one down the stack. For example, the row-level replication event
generator may defer DDL to the statement-level replication event generator.


HIGH-LEVEL SPECIFICATION:

Generators and consumers
------------------------

We have two concepts:

1. Event _generators_, which produce events describing all changes to data
   in a server.

2. Event _consumers_, which receive such events and use them in various
   ways.

An example of an event generator is the execution of SQL statements, which
generates events like those used for statement-based replication. Another
example is PBXT engine-level replication.

An example of an event consumer is the writing of the binlog on a master.

Event generators are not really plugins. Rather, there are specific points
in the server where events are generated. However, a generator can be part
of a plugin; for example, a PBXT engine-level replication event generator
would be part of the PBXT storage engine plugin.

Event consumers, on the other hand, could be plugins.

One generator can be stacked on top of another. This means that the
generator on top (for example row-based events) will handle some events
itself (eg. a non-deterministic update in mixed-mode binlogging). Other
events that it does not want to or cannot handle (for example a
deterministic delete, or DDL) will be deferred to the generator below (for
example statement-based events).


Materialisation (or not)
------------------------

A central decision is how the API should represent events at the point of
generation.

I want to avoid making the API require that events are materialised. By
"materialised" I mean that all (or most) of the data for the event is
written into memory in a struct/class used inside the server, or serialised
into a data buffer (byte buffer) in a format suitable for network transport
or disk storage.
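The contrast between a materialised and a non-materialised event might look roughly like the sketch below. The `THD` here is a tiny stand-in struct, and all other names are invented for illustration; the real classes would look quite different.

```cpp
#include <cassert>
#include <string>

/* Stand-in for the server's THD context; the real class is much larger. */
struct THD {
  std::string query_string;
  unsigned long thread_id;
};

/* Non-materialised event: just a reference to context.  Information is
   fetched on demand through accessors, so nothing is copied unless a
   consumer actually asks for it. */
struct non_materialised_event {
  const THD *thd;
  const std::string &query() const { return thd->query_string; }
};

/* Materialised event: the data is copied up front into a buffer in some
   fixed serialisation format (here simply the raw query text). */
struct materialised_event {
  std::string buffer;
};

materialised_event materialise(const THD &thd)
{
  materialised_event ev;
  ev.buffer = thd.query_string;   /* copy happens whether needed or not */
  return ev;
}
```

The point of the sketch is the cost model: the non-materialised form defers all work to the accessor call, while the materialised form pays for the copy (and fixes a format) at generation time.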
Using a non-materialised event means storing just a reference to the
appropriate context, from which all information about the event can be
retrieved through accessors. Typically this would be based on getting the
event information from the THD pointer.

Some reasons to avoid requiring materialised events in the API:

 - Replication events carry a _lot_ of detailed context information that
   consumers may need: user-defined variables, random seed, character sets,
   table column names and types, etc. If we base the API on
   materialisation, then the initial decision about which context
   information to include with which events has to be made in the API,
   while ideally we want this decision to be made by the individual
   consumer plugin. There will thus be a conflict between what to include
   (to give consumers access) and what to exclude (to avoid excessive
   needless work).

 - Materialising means defining a very specific format, which will tend to
   make the API less generic and flexible.

 - Unless the materialised format is made _very_ specific (and thus very
   inflexible), it is unlikely to be directly useful for transport
   (eg. binlog), so it will need to be re-materialised into a different
   format anyway, wasting work.

 - If a generator on top handles an event, then we want to avoid wasting
   work materialising an event in a generator below, where it would go
   completely unused. There would thus need to be a way for the upper
   generator to notify the lower generator, ahead of event generation time,
   not to fire an event, complicating the API.

Some advantages of materialisation:

 - An API based on passing around a well-defined event struct (or byte
   buffer) would be simpler than the complex class hierarchy proposed here,
   which has no requirement for materialisation.

 - Defining a materialised format would make it easy to use the same
   consumer code both on a generator that produces events at the source of
   execution and on a generator that produces events by eg.
reading them from an event log.

Note that there can be a middle way, where some data is materialised and
some is kept only as a reference to context (eg. the THD). This, however,
loses most of the advantages of materialisation mentioned above.

The design proposed here aims for as little materialisation as possible.


Default materialisation format
------------------------------

While the proposed API doesn't _require_ materialisation, we can still
think about providing the _option_ of built-in materialisation. This could
be useful if such materialisation is made suitable for transport to a
different server (eg. no endian-dependence etc). If such a materialisation
facility is built into the API, it becomes possible to write something like
a generic binlog plugin or a generic network transport plugin. This would
be really useful for eg. PBXT engine-level replication, as it could then be
implemented without having to re-invent a binlog format.

I added to the proposed API a simple facility to materialise every event as
a string of bytes. To use this, I still need to add a corresponding
facility to de-materialise the event.

However, it is still an open question whether such a facility will be
useful at all. It still has some of the problems with materialisation
mentioned above. And I think it is likely that a good binlog implementation
will need to do more than just blindly copy opaque events from one endpoint
to another. For example, it might need different event boundaries (merging
and/or splitting events); it might need to augment or modify events, or
inject new events, etc.

So I think it may be better to add such a generic materialisation facility
on top of the basic event generator API. Such a facility would provide
materialisation of a replication event stream, not of individual events,
and so would be more flexible in allowing a good implementation. It would
be implemented for all generators.
It would be separate from the event generator API (so that we have the
flexibility to put a filter class between generator and materialisation),
and could also be separate from the actual transport handling, such as
fsync() of binlog files and socket connections. It would be paired with a
corresponding applier API which would handle executing events on a slave.

Then we can have a default materialised event format which is available but
not mandatory, so other formats can still exist alongside it (like the
legacy MySQL 5.1 binlog event format, and maybe Tungsten would have its own
format).


Encapsulation
-------------

Another fundamental question about the design is the level of encapsulation
used for the API.

At the implementation level, a lot of the work is basically to pull all of
the needed information out of the THD object/context. The API I propose
tries _not_ to expose the THD to consumers. Instead it provides accessor
functions for all the bits and pieces relevant to each replication event,
while the event class itself will likely be more or less just an
encapsulated THD.

So an alternative would be to have a generic event that was just
(type, THD). Consumers could then pull whatever information they want out
of the THD. The THD implementation is already exposed to storage engines.
This would of course greatly reduce the size of the API, eliminating lots
of class definitions and accessor functions. Arguably, though, it wouldn't
really simplify the API, as the complexity would just move into
understanding the THD class.

Note that we do not have to take any performance hit from using
encapsulated accessors, since compilers can inline them (though with
inlining we do not get any ABI stability with respect to the THD
implementation).

For now, the API is proposed without exposing the THD class. (Similar
encapsulation could be added in the actual implementation to also avoid
exposing TABLE and similar classes.)
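The accessor-based encapsulation described above, where the event class is more or less an encapsulated THD and the accessors can be inlined at no runtime cost, might be sketched like this. The `THD` here is a minimal stand-in struct and the accessor names are invented; nothing below is the actual proposed API.

```cpp
#include <cassert>
#include <string>

/* Stand-in for the server's THD; the real class is far larger. */
struct THD {
  unsigned long thread_id;
  std::string query_string;
  unsigned long random_seed;
};

/* Hypothetical event class: consumers never see the THD directly, only
   accessors for the pieces relevant to this event.  Because the accessors
   are inline, the compiler can reduce them to direct field reads, so the
   encapsulation need not cost anything at runtime (at the price of no ABI
   stability against changes in the THD layout). */
class query_event {
  const THD *thd;
public:
  explicit query_event(const THD *t) : thd(t) {}
  unsigned long thread_id() const { return thd->thread_id; }
  const std::string &query() const { return thd->query_string; }
  unsigned long random_seed() const { return thd->random_seed; }
};
```

A consumer written against `query_event` never names the THD, so the THD class could later change shape (or be hidden entirely) without touching consumer code, which is the main argument for this level of encapsulation.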
ESTIMATED WORK TIME

ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v3.5.9)