[Maria-developers] Updated (by Guest): Store in binlog text of statements that caused RBR events (47)
by worklog-noreply@askmonty.org 28 Jun '10
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: Store in binlog text of statements that caused RBR events
CREATION DATE..: Sat, 15 Aug 2009, 23:48
SUPERVISOR.....: Monty
IMPLEMENTOR....:
COPIES TO......: Knielsen, Serg
CATEGORY.......: Server-Sprint
TASK ID........: 47 (http://askmonty.org/worklog/?tid=47)
VERSION........: Server-9.x
STATUS.........: Complete
PRIORITY.......: 60
WORKED HOURS...: 72
ESTIMATE.......: 0 (hours remain)
ORIG. ESTIMATE.: 35
PROGRESS NOTES:
-=-=(Guest - Mon, 28 Jun 2010, 11:54)=-=-
Status updated.
--- /tmp/wklog.47.old.915 2010-06-28 11:54:12.000000000 +0000
+++ /tmp/wklog.47.new.915 2010-06-28 11:54:12.000000000 +0000
@@ -1 +1 @@
-Code-Review
+Complete
-=-=(Alexi - Thu, 24 Jun 2010, 09:49)=-=-
Final implementation cleanup, testing, help Percona with build issues related to the WL
Worked 5 hours and estimate 0 hours remain (original estimate increased by 5 hours).
-=-=(Alexi - Thu, 24 Jun 2010, 09:47)=-=-
Making rpl- and binlog-tests stable w.r.t. adding new binlog events.
Worked 25 hours and estimate 0 hours remain (original estimate increased by 25 hours).
-=-=(Knielsen - Mon, 21 Jun 2010, 08:32)=-=-
Final review.
Assist with some problems applying the patch.
Worked 1 hour and estimate 0 hours remain (original estimate increased by 1 hour).
-=-=(Guest - Thu, 17 Jun 2010, 00:38)=-=-
Dependency deleted: 39 no longer depends on 47
-=-=(Knielsen - Mon, 07 Jun 2010, 07:13)=-=-
Help debug some test failures seen in Buildbot.
Worked 6 hours and estimate 0 hours remain (original estimate increased by 6 hours).
-=-=(Knielsen - Mon, 31 May 2010, 06:49)=-=-
Help Alexi debug+fix some test problems in the patch.
Worked 4 hours and estimate 0 hours remain (original estimate unchanged).
-=-=(Knielsen - Tue, 25 May 2010, 08:29)=-=-
Help debug strange problem in mysqlbinlog.test.
Worked 1 hour and estimate 4 hours remain (original estimate unchanged).
-=-=(Knielsen - Mon, 17 May 2010, 08:45)=-=-
Merge with latest trunk and run Buildbot tests.
Worked 1 hour and estimate 5 hours remain (original estimate unchanged).
-=-=(Knielsen - Wed, 05 May 2010, 13:53)=-=-
Review of fixes to first review done. No new issues found.
Worked 2 hours and estimate 6 hours remain (original estimate unchanged).
------------------------------------------------------------
-=-=(View All Progress Notes, 38 total)=-=-
http://askmonty.org/worklog/index.pl?tid=47&nolimit=1
DESCRIPTION:
Store in the binlog (and show in mysqlbinlog output) the text of statements that
caused RBR events.
This is needed for (list from Monty):
- Easier to understand why updates happened
- Would make it easier to find out where in the application things went
wrong (as you can search for exact strings)
- Allows one to filter things based on comments in the statement.
The cost of this is that the binlog may grow to approximately 2x in size
(especially inserts of big blobs would be a bit painful), so this should
be an optional feature.
HIGH-LEVEL SPECIFICATION:
Content
~~~~~~~
1. Annotate_rows_log_event
2. Server option: --binlog-annotate-rows-events
3. Server option: --replicate-annotate-rows-events
4. mysqlbinlog option: --print-annotate-rows-events
5. mysqlbinlog output
1. Annotate_rows_log_event [ ANNOTATE_ROWS_EVENT ]
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Describes the query that caused the corresponding rows events. It has an empty
post-header and carries the query text in its data part. Example:
************************
ANNOTATE_ROWS_EVENT
************************
00000220 | B6 A0 2C 4B | time_when = 1261215926
00000224 | 33 | event_type = 51
00000225 | 64 00 00 00 | server_id = 100
00000229 | 36 00 00 00 | event_len = 54
0000022D | 56 02 00 00 | log_pos = 00000256
00000231 | 00 00 | flags = <none>
------------------------
00000233 | 49 4E 53 45 | query = "INSERT INTO t1 VALUES (1), (2), (3)"
00000237 | 52 54 20 49 |
0000023B | 4E 54 4F 20 |
0000023F | 74 31 20 56 |
00000243 | 41 4C 55 45 |
00000247 | 53 20 28 31 |
0000024B | 29 2C 20 28 |
0000024F | 32 29 2C 20 |
00000253 | 28 33 29 |
************************
In the binary log, the Annotate_rows event follows the (possible) 'BEGIN' Query
event and precedes the first of the Table_map events which accompany the
corresponding rows events. (See the example in the "mysqlbinlog output" section below.)
2. Server option: --binlog-annotate-rows-events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Tells the master to write Annotate_rows events to the binary log.
* Variable Name: binlog_annotate_rows_events
* Scope: Global & Session
* Access Type: Dynamic
* Data Type: bool
* Default Value: OFF
NOTE. The session value allows annotating only selected statements:
...
SET SESSION binlog_annotate_rows_events=ON;
... statements to be annotated ...
SET SESSION binlog_annotate_rows_events=OFF;
... statements not to be annotated ...
3. Server option: --replicate-annotate-rows-events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Tells the slave to reproduce Annotate_rows events received from the master
in its own binary log (sensible only in combination with the log-slave-updates option).
* Variable Name: replicate_annotate_rows_events
* Scope: Global
* Access Type: Read only
* Data Type: bool
* Default Value: OFF
NOTE. Why do we need this additional 'replicate' option? Why not have the
slave reproduce these events whenever its binlog-annotate-rows-events
global value is ON? Because, for example, we may want to configure a
slave that reproduces Annotate_rows events while keeping its global
binlog-annotate-rows-events = OFF as the default value for
the client threads (see also "How slave treats replicate-annotate-rows-events
option" in the LLD part).
4. mysqlbinlog option: --print-annotate-rows-events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
With this option, mysqlbinlog prints the content of Annotate_rows events (if
the binary log contains them). Without this option (i.e. by default),
mysqlbinlog skips Annotate_rows events.
5. mysqlbinlog output
~~~~~~~~~~~~~~~~~~~~~
With --print-annotate-rows-events, mysqlbinlog outputs Annotate_rows events
in a form like this:
...
# at 1646
#091219 12:45:26 server id 100 end_log_pos 1714 Query thread_id=1
exec_time=0 error_code=0
SET TIMESTAMP=1261215926/*!*/;
BEGIN
/*!*/;
# at 1714
# at 1812
# at 1853
# at 1894
# at 1938
#091219 12:45:26 server id 100 end_log_pos 1812 Query: `DELETE t1, t2 FROM
t1 INNER JOIN t2 INNER JOIN t3 WHERE t1.a=t2.a AND t2.a=t3.a`
#091219 12:45:26 server id 100 end_log_pos 1853 Table_map: `test`.`t1`
mapped to number 16
#091219 12:45:26 server id 100 end_log_pos 1894 Table_map: `test`.`t2`
mapped to number 17
#091219 12:45:26 server id 100 end_log_pos 1938 Delete_rows: table id 16
#091219 12:45:26 server id 100 end_log_pos 1982 Delete_rows: table id 17
flags: STMT_END_F
...
LOW-LEVEL DESIGN:
Content
~~~~~~~
1. Annotate_rows event number
2. Outline of Annotate_rows event behavior
3. How Master writes Annotate_rows events to the binary log
4. How slave treats replicate-annotate-rows-events option
5. How slave IO thread requests Annotate_rows events
6. How master executes the request
7. How slave SQL thread processes Annotate_rows events
8. General remarks
1. Annotate_rows event number
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
To avoid possible event number conflicts with MySQL/Sun, we leave a gap
between the last MySQL event number and the Annotate_rows event number:
enum Log_event_type
{ ...
INCIDENT_EVENT= 26,
// New MySQL event numbers are to be added here
MYSQL_EVENTS_END,
MARIA_EVENTS_BEGIN= 51,
// New Maria event numbers start from here
ANNOTATE_ROWS_EVENT= 51,
ENUM_END_EVENT
};
together with the corresponding extension of the 'post_header_len' array in the
Format description event. (This extension does not affect the compatibility
of the binary log.) Here is how the Format description event looks with
this extension:
************************
FORMAT_DESCRIPTION_EVENT
************************
00000004 | A1 A0 2C 4B | time_when = 1261215905
00000008 | 0F | event_type = 15
00000009 | 64 00 00 00 | server_id = 100
0000000D | 7F 00 00 00 | event_len = 127
00000011 | 83 00 00 00 | log_pos = 00000083
00000015 | 01 00 | flags = LOG_EVENT_BINLOG_IN_USE_F
------------------------
00000017 | 04 00 | binlog_ver = 4
00000019 | 35 2E 32 2E | server_ver = 5.2.0-MariaDB-alpha-debug-log
..... ...
0000004B | A1 A0 2C 4B | time_created = 1261215905
0000004F | 13 | common_header_len = 19
------------------------
post_header_len
------------------------
00000050 | 38 | 56 - START_EVENT_V3 [1]
..... ...
00000069 | 02 | 2 - INCIDENT_EVENT [26]
0000006A | 00 | 0 - RESERVED [27]
..... ...
00000081 | 00 | 0 - RESERVED [50]
00000082 | 00 | 0 - ANNOTATE_ROWS_EVENT [51]
************************
2. Outline of Annotate_rows event behavior
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Each Annotate_rows_log_event object has two private members describing the
corresponding query:
char *m_query_txt;
uint m_query_len;
When the object is created for writing to a binary log, this query is taken
from 'thd' (for brevity, below we omit the 'Annotate_rows_log_event::' prefix
as well as other implementation details):
Annotate_rows_log_event(THD *thd)
{
m_query_txt = thd->query();
m_query_len = thd->query_length();
}
When the object is read from a binary log, the query is taken from the buffer
containing the binary log representation of the event (this buffer is allocated
in the Log_event object from which all log events are derived):
Annotate_rows_log_event(char *buf, uint event_len,
Format_description_log_event *desc)
{
m_query_len = event_len - desc->common_header_len;
m_query_txt = buf + desc->common_header_len;
}
The events are written to the binary log by the Log_event::write() member,
which calls the virtual write_data_header() and write_data_body() members
("data header" and "post header" are synonyms in replication terminology).
In our case, the data header is empty and the data body is just the query:
bool write_data_body(IO_CACHE *file)
{
return my_b_safe_write(file, (uchar*) m_query_txt, m_query_len);
}
Printing the event is just printing the query:
void Annotate_rows_log_event::print(FILE *file, PRINT_EVENT_INFO *pinfo)
{
my_b_printf(&pinfo->head_cache, "\tQuery: `%s`\n", m_query_txt);
}
3. How Master writes Annotate_rows events to the binary log
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The event is written to the binary log just before the group of Table_map
events which precede the corresponding Rows events (one query may generate
several Table_map events in the binary log, but the corresponding
Annotate_rows event must be written only once, before the first Table_map
event; hence the boolean variable 'with_annotate' below):
int write_locked_table_maps(THD *thd)
{ ...
bool with_annotate= thd->variables.binlog_annotate_rows_events;
...
for (uint i= 0; i < ... <number of tables> ...; ++i)
{ ...
thd->binlog_write_table_map(table, ..., with_annotate);
with_annotate= 0; // write the Annotate event at most once
...
}
...
}
int THD::binlog_write_table_map(TABLE *table, ..., bool with_annotate)
{ ...
Table_map_log_event the_event(...);
...
if (with_annotate)
{
Annotate_rows_log_event anno(this);
mysql_bin_log.write(&anno);
}
mysql_bin_log.write(&the_event);
...
}
4. How slave treats replicate-annotate-rows-events option
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The replicate-annotate-rows-events option is treated just as the session
value of the binlog_annotate_rows_events variable for the slave IO and
SQL threads. This setting is done during initialization of these threads:
pthread_handler_t handle_slave_io(void *arg)
{
THD *thd= new THD;
...
init_slave_thread(thd, SLAVE_THD_IO);
...
}
pthread_handler_t handle_slave_sql(void *arg)
{
THD *thd= new THD;
...
init_slave_thread(thd, SLAVE_THD_SQL);
...
}
int init_slave_thread(THD* thd, SLAVE_THD_TYPE thd_type)
{ ...
thd->variables.binlog_annotate_rows_events=
opt_replicate_annotate_rows_events;
...
}
5. How slave IO thread requests Annotate_rows events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
If the replicate-annotate-rows-events option is not set on a slave, there
is no need for the master to send Annotate_rows events to that slave. The slave
(or mysqlbinlog in the remote case), before requesting a binlog dump via the
COM_BINLOG_DUMP command, informs the master whether it should send these
events by executing the newly added COM_BINLOG_DUMP_OPTIONS_EXT server
command:
case COM_BINLOG_DUMP_OPTIONS_EXT:
thd->binlog_dump_flags_ext= packet[0];
my_ok(thd);
break;
Note. We add this new command instead of reusing COM_BINLOG_DUMP to avoid
possible conflicts with MySQL/Sun.
6. How master executes the request
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
case COM_BINLOG_DUMP:
{ ...
flags= uint2korr(packet + 4);
...
mysql_binlog_send(thd, ..., flags);
...
}
void mysql_binlog_send(THD* thd, ..., ushort flags)
{ ...
Log_event::read_log_event(&log, packet, ...);
...
if ((*packet)[EVENT_TYPE_OFFSET + 1] != ANNOTATE_ROWS_EVENT ||
flags & BINLOG_SEND_ANNOTATE_ROWS_EVENT)
{
my_net_write(net, packet->ptr(), packet->length());
}
...
}
7. How slave SQL thread processes Annotate_rows events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The slave processes each received event by "applying" it, i.e. by
calling the Log_event::apply_event() function, which in turn calls
the virtual do_apply_event() member specific to each type of
event.
int exec_relay_log_event(THD* thd, Relay_log_info* rli)
{ ...
Log_event *ev = next_event(rli);
...
apply_event_and_update_pos(ev, ...);
if (ev->get_type_code() != FORMAT_DESCRIPTION_EVENT)
delete ev;
...
}
int apply_event_and_update_pos(Log_event *ev, ...)
{ ...
ev->apply_event(...);
...
}
int Log_event::apply_event(...)
{
return do_apply_event(...);
}
What does it mean to "apply" an Annotate_rows event? It means to set the
current thd query to the one described by the event, i.e. to the query which
caused the subsequent Rows events (see "How Master writes Annotate_rows
events to the binary log" to follow what happens further when the subsequent
Rows events are applied):
int Annotate_rows_log_event::do_apply_event(...)
{
thd->set_query(m_query_txt, m_query_len);
}
NOTE. I am not sure, but possibly the current values of thd->query and
thd->query_length should be saved before calling set_query() and
restored when the Annotate_rows_log_event object is deleted.
Is it really needed?
After calling this do_apply_event() function, we may not delete the
Annotate_rows_log_event object immediately (see exec_relay_log_event()
above) because thd->query now points to the string inside this object.
We may keep the pointer to this object in the Relay_log_info:
class Relay_log_info
{
public:
...
void set_annotate_event(Annotate_rows_log_event*);
Annotate_rows_log_event* get_annotate_event();
void free_annotate_event();
...
private:
Annotate_rows_log_event* m_annotate_event;
};
The saved Annotate_rows object should be deleted when all corresponding
Rows events have been processed:
int exec_relay_log_event(THD* thd, Relay_log_info* rli)
{ ...
Log_event *ev= next_event(rli);
...
apply_event_and_update_pos(ev, ...);
if (rli->get_annotate_event() && is_last_rows_event(ev))
rli->free_annotate_event();
else if (ev->get_type_code() == ANNOTATE_ROWS_EVENT)
rli->set_annotate_event((Annotate_rows_log_event*) ev);
else if (ev->get_type_code() != FORMAT_DESCRIPTION_EVENT)
delete ev;
...
}
where
bool is_last_rows_event(Log_event* ev)
{
Log_event_type type= ev->get_type_code();
if (IS_ROWS_EVENT_TYPE(type))
{
Rows_log_event* rows= (Rows_log_event*)ev;
return rows->get_flags(Rows_log_event::STMT_END_F);
}
return 0;
}
#define IS_ROWS_EVENT_TYPE(type) ((type) == WRITE_ROWS_EVENT || \
(type) == UPDATE_ROWS_EVENT || \
(type) == DELETE_ROWS_EVENT)
8. General remarks
~~~~~~~~~~~~~~~~~~
Kristian noticed that introducing a new log event type should be coordinated
somehow with MySQL/Sun:
Kristian: The numeric code for this event must be assigned carefully.
It should be coordinated with MySQL/Sun, otherwise we can get into a
situation where MySQL uses the same numeric code for one event that
MariaDB uses for ANNOTATE_ROWS_EVENT, which would make merging the two
impossible.
Alex: I reserved about 20 numbers to avoid possible conflicts
with MySQL.
Kristian: Still, I think it would be appropriate to send a polite email
to internals@lists.mysql.com about this, suggesting to reserve the
event number.
-----------------------------------------------------------------------
WorkLog (v3.5.9)
[Maria-developers] Updated (by Guest): Store in binlog text of statements that caused RBR events (47)
by worklog-noreply@askmonty.org 28 Jun '10
by worklog-noreply@askmonty.org 28 Jun '10
28 Jun '10
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: Store in binlog text of statements that caused RBR events
CREATION DATE..: Sat, 15 Aug 2009, 23:48
SUPERVISOR.....: Monty
IMPLEMENTOR....:
COPIES TO......: Knielsen, Serg
CATEGORY.......: Server-Sprint
TASK ID........: 47 (http://askmonty.org/worklog/?tid=47)
VERSION........: Server-9.x
STATUS.........: Complete
PRIORITY.......: 60
WORKED HOURS...: 72
ESTIMATE.......: 0 (hours remain)
ORIG. ESTIMATE.: 35
PROGRESS NOTES:
-=-=(Guest - Mon, 28 Jun 2010, 11:54)=-=-
Status updated.
--- /tmp/wklog.47.old.915 2010-06-28 11:54:12.000000000 +0000
+++ /tmp/wklog.47.new.915 2010-06-28 11:54:12.000000000 +0000
@@ -1 +1 @@
-Code-Review
+Complete
-=-=(Alexi - Thu, 24 Jun 2010, 09:49)=-=-
Final implementation cleanup, testing, help Percona with build issues related to the WL
Worked 5 hours and estimate 0 hours remain (original estimate increased by 5 hours).
-=-=(Alexi - Thu, 24 Jun 2010, 09:47)=-=-
Making rpl- and binlog-tests stable w.r.t. adding new binlog events.
Worked 25 hours and estimate 0 hours remain (original estimate increased by 25 hours).
-=-=(Knielsen - Mon, 21 Jun 2010, 08:32)=-=-
Final review.
Assist with some problems applying the patch.
Worked 1 hour and estimate 0 hours remain (original estimate increased by 1 hour).
-=-=(Guest - Thu, 17 Jun 2010, 00:38)=-=-
Dependency deleted: 39 no longer depends on 47
-=-=(Knielsen - Mon, 07 Jun 2010, 07:13)=-=-
Help debug some test failures seen in Buildbot.
Worked 6 hours and estimate 0 hours remain (original estimate increased by 6 hours).
-=-=(Knielsen - Mon, 31 May 2010, 06:49)=-=-
Help Alexi debug+fix some test problems in the patch.
Worked 4 hours and estimate 0 hours remain (original estimate unchanged).
-=-=(Knielsen - Tue, 25 May 2010, 08:29)=-=-
Help debug strange problem in mysqlbinlog.test.
Worked 1 hour and estimate 4 hours remain (original estimate unchanged).
-=-=(Knielsen - Mon, 17 May 2010, 08:45)=-=-
Merge with latest trunk and run Buildbot tests.
Worked 1 hour and estimate 5 hours remain (original estimate unchanged).
-=-=(Knielsen - Wed, 05 May 2010, 13:53)=-=-
Review of fixes to first review done. No new issues found.
Worked 2 hours and estimate 6 hours remain (original estimate unchanged).
------------------------------------------------------------
-=-=(View All Progress Notes, 38 total)=-=-
http://askmonty.org/worklog/index.pl?tid=47&nolimit=1
DESCRIPTION:
Store in binlog (and show in mysqlbinlog output) texts of statements that
caused RBR events
This is needed for (list from Monty):
- Easier to understand why updates happened
- Would make it easier to find out where in application things went
wrong (as you can search for exact strings)
- Allow one to filter things based on comments in the statement.
The cost of this can be that the binlog will be approximately 2x in size
(especially insert of big blob's would be a bit painful), so this should
be an optional feature.
HIGH-LEVEL SPECIFICATION:
Content
~~~~~~~
1. Annotate_rows_log_event
2. Server option: --binlog-annotate-rows-events
3. Server option: --replicate-annotate-rows-events
4. mysqlbinlog option: --print-annotate-rows-events
5. mysqlbinlog output
1. Annotate_rows_log_event [ ANNOTATE_ROWS_EVENT ]
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Describes the query which caused the corresponding rows events. Has empty
post-header and contains the query text in its data part. Example:
************************
ANNOTATE_ROWS_EVENT
************************
00000220 | B6 A0 2C 4B | time_when = 1261215926
00000224 | 33 | event_type = 51
00000225 | 64 00 00 00 | server_id = 100
00000229 | 36 00 00 00 | event_len = 54
0000022D | 56 02 00 00 | log_pos = 00000256
00000231 | 00 00 | flags = <none>
------------------------
00000233 | 49 4E 53 45 | query = "INSERT INTO t1 VALUES (1), (2), (3)"
00000237 | 52 54 20 49 |
0000023B | 4E 54 4F 20 |
0000023F | 74 31 20 56 |
00000243 | 41 4C 55 45 |
00000247 | 53 20 28 31 |
0000024B | 29 2C 20 28 |
0000024F | 32 29 2C 20 |
00000253 | 28 33 29 |
************************
In binary log, Annotate_rows event follows the (possible) 'BEGIN' Query event
and precedes the first of Table map events which accompany the corresponding
rows events. (See example in the "mysqlbinlog output" section below.)
2. Server option: --binlog-annotate-rows-events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Tells the master to write Annotate_rows events to the binary log.
* Variable Name: binlog_annotate_rows_events
* Scope: Global & Session
* Access Type: Dynamic
* Data Type: bool
* Default Value: OFF
NOTE. Session values allows to annotate only some selected statements:
...
SET SESSION binlog_annotate_rows_events=ON;
... statements to be annotated ...
SET SESSION binlog_annotate_rows_events=OFF;
... statements not to be annotated ...
3. Server option: --replicate-annotate-rows-events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Tells the slave to reproduce Annotate_rows events recieved from the master
in its own binary log (sensible only in pair with log-slave-updates option).
* Variable Name: replicate_annotate_rows_events
* Scope: Global
* Access Type: Read only
* Data Type: bool
* Default Value: OFF
NOTE. Why do we additionally need this 'replicate' option? Why not to make
the slave to reproduce this events when its binlog-annotate-rows-events
global value is ON? Well, because, for example, we may want to configure
the slave which should reproduce Annotate_rows events but has global
binlog-annotate-rows-events = OFF meaning this to be the default value for
the client threads (see also "How slave treats replicate-annotate-rows-events
option" in LLD part).
4. mysqlbinlog option: --print-annotate-rows-events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
With this option, mysqlbinlog prints the content of Annotate_rows events (if
the binary log does contain them). Without this option (i.e. by default),
mysqlbinlog skips Annotate_rows events.
5. mysqlbinlog output
~~~~~~~~~~~~~~~~~~~~~
With --print-annotate-rows-events, mysqlbinlog outputs Annotate_rows events
in a form like this:
...
# at 1646
#091219 12:45:26 server id 100 end_log_pos 1714 Query thread_id=1
exec_time=0 error_code=0
SET TIMESTAMP=1261215926/*!*/;
BEGIN
/*!*/;
# at 1714
# at 1812
# at 1853
# at 1894
# at 1938
#091219 12:45:26 server id 100 end_log_pos 1812 Query: `DELETE t1, t2 FROM
t1 INNER JOIN t2 INNER JOIN t3 WHERE t1.a=t2.a AND t2.a=t3.a`
#091219 12:45:26 server id 100 end_log_pos 1853 Table_map: `test`.`t1`
mapped to number 16
#091219 12:45:26 server id 100 end_log_pos 1894 Table_map: `test`.`t2`
mapped to number 17
#091219 12:45:26 server id 100 end_log_pos 1938 Delete_rows: table id 16
#091219 12:45:26 server id 100 end_log_pos 1982 Delete_rows: table id 17
flags: STMT_END_F
...
LOW-LEVEL DESIGN:
Content
~~~~~~~
1. Annotate_rows event number
2. Outline of Annotate_rows event behavior
3. How Master writes Annotate_rows events to the binary log
4. How slave treats replicate-annotate-rows-events option
5. How slave IO thread requests Annotate_rows events
6. How master executes the request
7. How slave SQL thread processes Annotate_rows events
8. General remarks
1. Annotate_rows event number
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
To avoid possible event numbers conflict with MySQL/Sun, we leave a gap
between the last MySQL event number and the Annotate_rows event number:
enum Log_event_type
{ ...
INCIDENT_EVENT= 26,
// New MySQL event numbers are to be added here
MYSQL_EVENTS_END,
MARIA_EVENTS_BEGIN= 51,
// New Maria event numbers start from here
ANNOTATE_ROWS_EVENT= 51,
ENUM_END_EVENT
};
together with the corresponding extension of 'post_header_len' array in the
Format description event. (This extension does not affect the compatibility
of the binary log). Here is how Format description event looks like with
this extension:
************************
FORMAT_DESCRIPTION_EVENT
************************
00000004 | A1 A0 2C 4B | time_when = 1261215905
00000008 | 0F | event_type = 15
00000009 | 64 00 00 00 | server_id = 100
0000000D | 7F 00 00 00 | event_len = 127
00000011 | 83 00 00 00 | log_pos = 00000083
00000015 | 01 00 | flags = LOG_EVENT_BINLOG_IN_USE_F
------------------------
00000017 | 04 00 | binlog_ver = 4
00000019 | 35 2E 32 2E | server_ver = 5.2.0-MariaDB-alpha-debug-log
..... ...
0000004B | A1 A0 2C 4B | time_created = 1261215905
0000004F | 13 | common_header_len = 19
------------------------
post_header_len
------------------------
00000050 | 38 | 56 - START_EVENT_V3 [1]
..... ...
00000069 | 02 | 2 - INCIDENT_EVENT [26]
0000006A | 00 | 0 - RESERVED [27]
..... ...
00000081 | 00 | 0 - RESERVED [50]
00000082 | 00 | 0 - ANNOTATE_ROWS_EVENT [51]
************************
2. Outline of Annotate_rows event behavior
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Each Annotate_rows_log_event object has two private members describing the
corresponding query:
char *m_query_txt;
uint m_query_len;
When the object is created for writing to a binary log, this query is taken
from 'thd' (for short, below we omit the 'Annotate_rows_log_event::' prefix
as well as other implementation details):
Annotate_rows_log_event(THD *thd)
{
m_query_txt = thd->query();
m_query_len = thd->query_length();
}
When the object is read from a binary log, the query is taken from the buffer
containing the binary log representation of the event (this buffer is allocated
in Log_event object from which all Log events are derived):
Annotate_rows_log_event(char *buf, uint event_len,
Format_description_log_event *desc)
{
m_query_len = event_len - desc->common_header_len;
m_query_txt = buf + desc->common_header_len;
}
The events are written to the binary log by the Log_event::write() member
which calls virtual write_data_header() and write_data_body() members
("data header" and "post header" are synonym in replication terminology).
In our case, data header is empty and data body is just the query:
bool write_data_body(IO_CACHE *file)
{
return my_b_safe_write(file, (uchar*) m_query_txt, m_query_len);
}
Printing the event is just printing the query:
void Annotate_rows_log_event::print(FILE *file, PRINT_EVENT_INFO *pinfo)
{
my_b_printf(&pinfo->head_cache, "\tQuery: `%s`\n", m_query_txt);
}
3. How Master writes Annotate_rows events to the binary log
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The event is written to the binary log just before the group of Table_map
events which precede corresponding Rows events (one query may generate
several Table map events in the binary log, but the corresponding
Annotate_rows event must be written only once before the first Table map
event; hence the boolean variable 'with_annotate' below):
int write_locked_table_maps(THD *thd)
{ ...
bool with_annotate= thd->variables.binlog_annotate_rows_events;
...
for (uint i= 0; i < ... <number of tables> ...; ++i)
{ ...
thd->binlog_write_table_map(table, ..., with_annotate);
with_annotate= 0; // write Annotate_event not more than once
...
}
...
}
int THD::binlog_write_table_map(TABLE *table, ..., bool with_annotate)
{ ...
Table_map_log_event the_event(...);
...
if (with_annotate)
{
Annotate_rows_log_event anno(this);
mysql_bin_log.write(&anno);
}
mysql_bin_log.write(&the_event);
...
}
4. How slave treats replicate-annotate-rows-events option
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The replicate-annotate-rows-events option is treated just as the session
value of the binlog_annotate_rows_events variable for the slave IO and
SQL threads. This setting is done during initialization of these threads:
pthread_handler_t handle_slave_io(void *arg)
{
THD *thd= new THD;
...
init_slave_thread(thd, SLAVE_THD_IO);
...
}
pthread_handler_t handle_slave_sql(void *arg)
{
THD *thd= new THD;
...
init_slave_thread(thd, SLAVE_THD_SQL);
...
}
int init_slave_thread(THD* thd, SLAVE_THD_TYPE thd_type)
{ ...
thd->variables.binlog_annotate_rows_events=
opt_replicate_annotate_rows_events;
...
}
5. How slave IO thread requests Annotate_rows events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
If the replicate-annotate-rows-events option is not set on a slave, there
is no need for master to send Annotate_rows events to this slave. The slave
(or mysqlbinlog in remote case), before requesting binlog dump via the
COM_BINLOG_DUMP command, informs the master whether it should send these
events by executing the newly added COM_BINLOG_DUMP_OPTIONS_EXT server
command:
case COM_BINLOG_DUMP_OPTIONS_EXT:
thd->binlog_dump_flags_ext= packet[0];
my_ok(thd);
break;
Note. We add this new command and don't use COM_BINLOG_DUMP to avoid possible
conflicts with MySQL/Sun.
6. How master executes the request
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
case COM_BINLOG_DUMP:
{ ...
flags= uint2korr(packet + 4);
...
mysql_binlog_send(thd, ..., flags);
...
}
void mysql_binlog_send(THD* thd, ..., ushort flags)
{ ...
Log_event::read_log_event(&log, packet, ...);
...
if ((*packet)[EVENT_TYPE_OFFSET + 1] != ANNOTATE_ROWS_EVENT ||
flags & BINLOG_SEND_ANNOTATE_ROWS_EVENT)
{
my_net_write(net, packet->ptr(), packet->length());
}
...
}
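The filtering condition can be isolated into a small predicate. This sketch
uses stand-in names (a toy event enum and flag constant) and merely restates
the condition from the fragment above: an event is forwarded unless it is an
Annotate_rows event and the slave did not request those:

```cpp
#include <cassert>
#include <cstdint>

// Illustrative stand-ins for the real event type codes and dump flags.
enum toy_event_type
{
  TOY_QUERY_EVENT,
  TOY_TABLE_MAP_EVENT,
  TOY_ANNOTATE_ROWS_EVENT
};
static const uint16_t TOY_SEND_ANNOTATE_FLAG= 1U << 0;

// The condition from mysql_binlog_send(): forward the event unless it is
// an Annotate_rows event and the slave did not ask for them.
static bool should_send(toy_event_type type, uint16_t flags)
{
  return type != TOY_ANNOTATE_ROWS_EVENT || (flags & TOY_SEND_ANNOTATE_FLAG);
}
```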
7. How slave SQL thread processes Annotate_rows events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The slave processes each received event by "applying" it, i.e. by calling
the Log_event::apply_event() function, which in turn calls the virtual
do_apply_event() member specific to each type of event.
int exec_relay_log_event(THD* thd, Relay_log_info* rli)
{ ...
Log_event *ev = next_event(rli);
...
apply_event_and_update_pos(ev, ...);
if (ev->get_type_code() != FORMAT_DESCRIPTION_EVENT)
delete ev;
...
}
int apply_event_and_update_pos(Log_event *ev, ...)
{ ...
ev->apply_event(...);
...
}
int Log_event::apply_event(...)
{
return do_apply_event(...);
}
What does it mean to "apply" an Annotate_rows event? It means to set the
current thd query to the one described by the event, i.e. to the query that
caused the subsequent Rows events (see "How Master writes Annotate_rows
events to the binary log" to follow what happens further when the subsequent
Rows events are applied):
int Annotate_rows_log_event::do_apply_event(...)
{
thd->set_query(m_query_txt, m_query_len);
}
NOTE. I am not sure, but possibly the current values of thd->query and
thd->query_length should be saved before calling set_query() and restored
when the Annotate_rows_log_event object is deleted.
Is this really needed?
After calling this do_apply_event() function we may not delete the
Annotate_rows_log_event object immediately (see exec_relay_log_event()
above), because thd->query now points to the string inside this object.
We may keep the pointer to this object in the Relay_log_info:
class Relay_log_info
{
public:
...
void set_annotate_event(Annotate_rows_log_event*);
Annotate_rows_log_event* get_annotate_event();
void free_annotate_event();
...
private:
Annotate_rows_log_event* m_annotate_event;
};
The saved Annotate_rows object should be deleted once all corresponding
Rows events have been processed:
int exec_relay_log_event(THD* thd, Relay_log_info* rli)
{ ...
Log_event *ev= next_event(rli);
...
apply_event_and_update_pos(ev, ...);
if (rli->get_annotate_event() && is_last_rows_event(ev))
rli->free_annotate_event();
else if (ev->get_type_code() == ANNOTATE_ROWS_EVENT)
rli->set_annotate_event((Annotate_rows_log_event*) ev);
else if (ev->get_type_code() != FORMAT_DESCRIPTION_EVENT)
delete ev;
...
}
where
bool is_last_rows_event(Log_event* ev)
{
Log_event_type type= ev->get_type_code();
if (IS_ROWS_EVENT_TYPE(type))
{
Rows_log_event* rows= (Rows_log_event*)ev;
return rows->get_flags(Rows_log_event::STMT_END_F);
}
return 0;
}
#define IS_ROWS_EVENT_TYPE(type) ((type) == WRITE_ROWS_EVENT || \
(type) == UPDATE_ROWS_EVENT || \
(type) == DELETE_ROWS_EVENT)
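The lifetime rule above, keep the Annotate_rows event alive until the Rows
event carrying STMT_END_F has been applied, then free it, can be exercised
with a toy model. All names here are illustrative stand-ins for the real
event and Relay_log_info classes:

```cpp
#include <cassert>
#include <memory>

// Toy event types: an annotate event, a table map, a rows event in the
// middle of a statement, and the rows event carrying STMT_END_F.
enum toy_type { TOY_ANNOTATE, TOY_TABLE_MAP, TOY_ROWS, TOY_ROWS_STMT_END };

struct toy_event { toy_type type; };

struct toy_rli
{
  std::unique_ptr<toy_event> annotate;  // owned while Rows events run
};

// Models one iteration of exec_relay_log_event(); returns whether an
// Annotate event is still being held after this event was processed.
static bool toy_exec(toy_rli &rli, toy_type t)
{
  std::unique_ptr<toy_event> ev(new toy_event{t});
  bool last_rows= (t == TOY_ROWS_STMT_END);
  if (rli.annotate && last_rows)
    rli.annotate.reset();                 // free_annotate_event()
  else if (t == TOY_ANNOTATE)
    rli.annotate= std::move(ev);          // set_annotate_event()
  // any other event is deleted here, when ev goes out of scope
  return static_cast<bool>(rli.annotate);
}
```

Feeding a statement's event sequence through the model shows the Annotate
object surviving the intermediate events and being freed at statement end.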
8. General remarks
~~~~~~~~~~~~~~~~~~
Kristian noticed that introducing a new log event type should be coordinated
somehow with MySQL/Sun:
Kristian: The numeric code for this event must be assigned carefully.
It should be coordinated with MySQL/Sun, otherwise we can get into a
situation where MySQL uses the same numeric code for one event that
MariaDB uses for ANNOTATE_ROWS_EVENT, which would make merging the two
impossible.
Alex: I reserved about 20 numbers to avoid possible conflicts
with MySQL.
Kristian: Still, I think it would be appropriate to send a polite email
to internals(a)lists.mysql.com about this and suggesting to reserve the
event number.
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v3.5.9)
[Maria-developers] Updated (by Knielsen): Replication API for stacked event generators (120)
by worklog-noreply@askmonty.org 28 Jun '10
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: Replication API for stacked event generators
CREATION DATE..: Mon, 07 Jun 2010, 13:13
SUPERVISOR.....: Knielsen
IMPLEMENTOR....: Knielsen
COPIES TO......:
CATEGORY.......: Server-Sprint
TASK ID........: 120 (http://askmonty.org/worklog/?tid=120)
VERSION........: Server-9.x
STATUS.........: Assigned
PRIORITY.......: 60
WORKED HOURS...: 2
ESTIMATE.......: 0 (hours remain)
ORIG. ESTIMATE.: 0
PROGRESS NOTES:
-=-=(Knielsen - Mon, 28 Jun 2010, 07:11)=-=-
High-Level Specification modified.
--- /tmp/wklog.120.old.18416 2010-06-28 07:11:09.000000000 +0000
+++ /tmp/wklog.120.new.18416 2010-06-28 07:11:09.000000000 +0000
@@ -14,10 +14,14 @@
An example of an event consumer is the writing of the binlog on a master.
-Event generators are not really plugins. Rather, there are specific points in
-the server where events are generated. However, a generator can be part of a
-plugin, for example a PBXT engine-level replication event generator would be
-part of the PBXT storage engine plugin.
+Some event generators are not really plugins. Rather, there are specific
+points in the server where events are generated. However, a generator can be
+part of a plugin, for example a PBXT engine-level replication event generator
+would be part of the PBXT storage engine plugin. And for example we could
+write a filter plugin, which would be stacked on top of an existing generator
+and provide the same event types and interfaces, but filtered in some way (for
+example by removing certain events on the master side, or by re-writing events
+in certain ways).
Event consumers on the other hand could be a plugin.
@@ -85,6 +89,9 @@
some is kept as reference to context (eg. THD) only. This however loses most
of the mentioned advantages for materialisation.
+Also, the non-materialising interface should be a good interface on top of
+which to build a materialising interface.
+
The design proposed here aims for as little materialisation as possible.
-=-=(Knielsen - Thu, 24 Jun 2010, 14:29)=-=-
Category updated.
--- /tmp/wklog.120.old.7440 2010-06-24 14:29:38.000000000 +0000
+++ /tmp/wklog.120.new.7440 2010-06-24 14:29:38.000000000 +0000
@@ -1 +1 @@
-Server-RawIdeaBin
+Server-Sprint
-=-=(Knielsen - Thu, 24 Jun 2010, 14:29)=-=-
Status updated.
--- /tmp/wklog.120.old.7440 2010-06-24 14:29:38.000000000 +0000
+++ /tmp/wklog.120.new.7440 2010-06-24 14:29:38.000000000 +0000
@@ -1 +1 @@
-Un-Assigned
+Assigned
-=-=(Knielsen - Thu, 24 Jun 2010, 14:29)=-=-
Low Level Design modified.
--- /tmp/wklog.120.old.7355 2010-06-24 14:29:11.000000000 +0000
+++ /tmp/wklog.120.new.7355 2010-06-24 14:29:11.000000000 +0000
@@ -1 +1,385 @@
+A consumer is implemented as a virtual class (interface). There is one virtual
+function for every event that can be received. A consumer would derive from
+the base class and override methods for the events it wants to receive.
+
+There is one consumer interface for each generator. When a generator A is
+stacked on B, the consumer interface for A inherits from the interface for
+B. This way, when A defers an event to B, the consumer for A will receive the
+corresponding event from B.
+
+There are methods for a consumer to register itself to receive events from
+each generator. I still need to find a way for a consumer in one plugin to
+register itself with a generator implemented in another plugin (eg. PBXT
+engine-level replication). I also need to add a way for consumers to
+de-register themselves.
+
+The current design has consumer callbacks return 0 for success and error code
+otherwise. I still need to think more about whether this is useful (ie. what
+is the semantics of returning an error from a consumer callback).
+
+Each event passed to consumers is defined as a class with public accessor
+methods to a private context (which is mostly the THD).
+
+My intention is to make all events passed around const, so that the same event
+can be passed to each of multiple registered consumers (and to emphasise that
+consumers do not have the ability to modify events). It still needs to be seen
+whether that const-ness will be feasible in practice without very heavy
+modification/constification of existing code.
+
+What follows is a partial draft of a possible definition of the API as
+concrete C++ class definitions.
+
+-----------------------------------------------------------------------
+
+/*
+ Virtual base class for generated replication events.
+
+ This is the parent of events generated from all kinds of generators. Only
+ child classes can be instantiated.
+
+ This class can be used by code that wants to treat events in a generic way,
+ without any knowledge of event details. I still need to decide whether such
+ generic code is sensible.
+*/
+class rpl_event_base
+{
+ /*
+ Maybe we will want the ability to materialise an event to a standard
+ binary format. This could be achieved with a base method like this. The
+ actual materialisation would be implemented in each deriving class. The
+ public methods would provide different interfaces for specifying the
+ buffer or for writing directly into IO_CACHE or file.
+ */
+
+ /* Return 0 on success, -1 on error, -2 on out-of-buffer. */
+ int materialise(uchar *buffer, size_t buflen) const;
+ /*
+ Returns NULL on error or else malloc()ed buffer with materialised event,
+ caller must free().
+ */
+ uchar *materialise() const;
+ /* Same but using passed in memroot. */
+ uchar *materialise(mem_root *memroot) const;
+ /*
+ Materialise to user-supplied writer function (could write directly to file
+ or the like).
+ */
+ int materialise(int (*writer)(uchar *data, size_t len, void *context)) const;
+
+ /*
+ As for what to do with a materialised event, there are a couple of
+ possibilities.
+
+ One is to have a de_materialise() method somewhere that can construct an
+ rpl_event_base (really a derived class of course) from a buffer or writer
+ function. This would require each accessor function to conditionally read
+ its data from either THD context or buffer (GCC is able to optimise
+ several such conditionals in multiple accessor function calls into one
+ conditional), or we can make all accessors virtual if the performance hit
+ is acceptable.
+
+ Another is to have different classes for accessing events read from
+ materialised event data.
+
+ Also, I still need to think about whether it is at all useful to be able
+ to generically materialise an event at this level. It may be that any
+ binlog/transport will in any case need to understand more of the format of
+ events, so that such materialisation/transport is better done at a
+ different layer.
+ */
+
+protected:
+ /* Implementation which is the basis for materialise(). */
+ virtual int do_materialise(int (*writer)(uchar *data, size_t len,
+ void *context)) const = 0;
+
+private:
+ /* Virtual base class, private constructor to prevent instantiation. */
+ rpl_event_base();
+};
+
+
+/*
+ These are the event types output from the transaction event generator.
+
+ This generator is not stacked on anything.
+
+ The transaction event generator marks the start and end (commit or rollback)
+ of transactions. It also gives information about whether the transaction was
+ a full transaction or autocommitted statement, whether transactional tables
+ were involved, whether non-transactional tables were involved, and XA
+ information (ToDo).
+*/
+
+/* Base class for transaction events. */
+class rpl_event_transaction_base : public rpl_event_base
+{
+public:
+ /*
+ Get the local transaction id. This id is only unique within one server.
+ It is allocated whenever a new transaction is started.
+ Can be used to identify events belonging to the same transaction in a
+ binlog-like stream of events streamed in parallel among multiple
+ transactions.
+ */
+ uint64_t get_local_trx_id() const { return thd->local_trx_id; };
+
+ bool get_is_autocommit() const;
+
+private:
+ /* The context is the THD. */
+ THD *thd;
+
+ rpl_event_transaction_base(THD *_thd) : thd(_thd) { };
+};
+
+/* Transaction start event. */
+class rpl_event_transaction_start : public rpl_event_transaction_base
+{
+
+};
+
+/* Transaction commit. */
+class rpl_event_transaction_commit : public rpl_event_transaction_base
+{
+public:
+ /*
+ The global transaction id is unique cross-server.
+
+ It can be used to identify the position from which to start a slave
+ replicating from a master.
+
+ This global ID is only available once the transaction is decided to commit
+ by the TC manager / primary redundancy service. This TC also allocates the
+ ID and decides the exact semantics (can there be gaps, etc); however the
+ format is fixed (cluster_id, running_counter).
+ */
+ struct global_transaction_id
+ {
+ uint32_t cluster_id;
+ uint64_t counter;
+ };
+
+ const global_transaction_id *get_global_transaction_id() const;
+};
+
+/* Transaction rollback. */
+class rpl_event_transaction_rollback : public rpl_event_transaction_base
+{
+
+};
+
+
+/* Base class for statement events. */
+class rpl_event_statement_base : public rpl_event_base
+{
+public:
+ LEX_STRING get_current_db() const;
+};
+
+class rpl_event_statement_start : public rpl_event_statement_base
+{
+
+};
+
+class rpl_event_statement_end : public rpl_event_statement_base
+{
+public:
+ int get_errorcode() const;
+};
+
+class rpl_event_statement_query : public rpl_event_statement_base
+{
+public:
+ LEX_STRING get_query_string();
+ ulong get_sql_mode();
+ const CHARSET_INFO *get_character_set_client();
+ const CHARSET_INFO *get_collation_connection();
+ const CHARSET_INFO *get_collation_server();
+ const CHARSET_INFO *get_collation_default_db();
+
+ /*
+ Access to relevant flags that affect query execution.
+
+ Use as if (ev->get_flags() & (uint32)STMT_FOREIGN_KEY_CHECKS) { ... }
+ */
+ enum flag_bits
+ {
+ STMT_FOREIGN_KEY_CHECKS, // @@foreign_key_checks
+ STMT_UNIQUE_KEY_CHECKS, // @@unique_checks
+ STMT_AUTO_IS_NULL, // @@sql_auto_is_null
+ };
+ uint32_t get_flags();
+
+ ulong get_auto_increment_offset();
+ ulong get_auto_increment_increment();
+
+ // And so on for: time zone; day/month names; connection id; LAST_INSERT_ID;
+ // INSERT_ID; random seed; user variables.
+ //
+ // We probably also need get_uses_temporary_table(), get_used_user_vars(),
+ // get_uses_auto_increment() and so on, so a consumer can get more
+ // information about what kind of context information a query will need when
+ // executed on a slave.
+};
+
+class rpl_event_statement_load_query : public rpl_event_statement_query
+{
+
+};
+
+/*
+ This event is fired with blocks of data for files read (from server-local
+ file or client connection) for LOAD DATA.
+*/
+class rpl_event_statement_load_data_block : public rpl_event_statement_base
+{
+public:
+ struct block
+ {
+ const uchar *ptr;
+ size_t size;
+ };
+ block get_block() const;
+};
+
+/* Base class for row-based replication events. */
+class rpl_event_row_base : public rpl_event_base
+{
+public:
+ /*
+ Access to relevant handler extra flags and other flags that affect row
+ operations.
+
+ Use as if (ev->get_flags() & (uint32)ROW_WRITE_CAN_REPLACE) { ... }
+ */
+ enum flag_bits
+ {
+ ROW_WRITE_CAN_REPLACE, // HA_EXTRA_WRITE_CAN_REPLACE
+ ROW_IGNORE_DUP_KEY, // HA_EXTRA_IGNORE_DUP_KEY
+ ROW_IGNORE_NO_KEY, // HA_EXTRA_IGNORE_NO_KEY
+ ROW_DISABLE_FOREIGN_KEY_CHECKS, // ! @@foreign_key_checks
+ ROW_DISABLE_UNIQUE_KEY_CHECKS, // ! @@unique_checks
+ };
+ uint32_t get_flags();
+
+ /* Access to list of tables modified. */
+ class table_iterator
+ {
+ public:
+ /* Returns table, NULL after last. */
+ const TABLE *get_next();
+ private:
+ // ...
+ };
+ table_iterator get_modified_tables() const;
+
+private:
+ /* Context used to provide accessors. */
+ THD *thd;
+
+protected:
+ rpl_event_row_base(THD *_thd) : thd(_thd) { }
+};
+
+
+class rpl_event_row_write : public rpl_event_row_base
+{
+public:
+ const BITMAP *get_write_set() const;
+ const uchar *get_after_image() const;
+};
+
+class rpl_event_row_update : public rpl_event_row_base
+{
+public:
+ const BITMAP *get_read_set() const;
+ const BITMAP *get_write_set() const;
+ const uchar *get_before_image() const;
+ const uchar *get_after_image() const;
+};
+
+class rpl_event_row_delete : public rpl_event_row_base
+{
+public:
+ const BITMAP *get_read_set() const;
+ const uchar *get_before_image() const;
+};
+
+
+/*
+ Event consumer callbacks.
+
+ An event consumer registers with an event generator to receive event
+ notifications from that generator.
+
+ The consumer has callbacks (in the form of virtual functions) for the
+ individual event types the consumer is interested in. Only callbacks that
+ are non-NULL will be invoked. If an event applies to multiple callbacks in a
+ single callback struct, it will only be passed to the most specific non-NULL
+ callback (so events never fire more than once per registration).
+
+ The lifetime of the memory holding the event is only for the duration of the
+ callback invocation, unless otherwise noted.
+
+ Callbacks return 0 for success or error code (ToDo: does this make sense?).
+*/
+
+struct rpl_event_consumer_transaction
+{
+ virtual int trx_start(const rpl_event_transaction_start *) { return 0; }
+ virtual int trx_commit(const rpl_event_transaction_commit *) { return 0; }
+ virtual int trx_rollback(const rpl_event_transaction_rollback *) { return 0; }
+};
+
+/*
+ Consuming statement-based events.
+
+ The statement event generator is stacked on top of the transaction event
+ generator, so we can receive those events as well.
+*/
+struct rpl_event_consumer_statement : public rpl_event_consumer_transaction
+{
+ virtual int stmt_start(const rpl_event_statement_start *) { return 0; }
+ virtual int stmt_end(const rpl_event_statement_end *) { return 0; }
+
+ virtual int stmt_query(const rpl_event_statement_query *) { return 0; }
+
+ /* Data for a file used in LOAD DATA [LOCAL] INFILE. */
+ virtual int stmt_load_data_block(const rpl_event_statement_load_data_block *)
+ { return 0; }
+
+ /*
+ These are specific kinds of statements; if specified they override
+ stmt_query() for the corresponding event.
+ */
+ virtual int stmt_load_query(const rpl_event_statement_load_query *ev)
+ { return stmt_query(ev); }
+};
+
+/*
+ Consuming row-based events.
+
+ The row event generator is stacked on top of the statement event generator.
+*/
+struct rpl_event_consumer_row : public rpl_event_consumer_statement
+{
+ virtual int row_write(const rpl_event_row_write *) { return 0; }
+ virtual int row_update(const rpl_event_row_update *) { return 0; }
+ virtual int row_delete(const rpl_event_row_delete *) { return 0; }
+};
+
+
+/*
+ Registration functions.
+
+ ToDo: Make a way to de-register.
+
+ ToDo: Find a good way for a plugin (eg. PBXT) to publish a generator
+ registration method.
+*/
+
+int rpl_event_transaction_register(const rpl_event_consumer_transaction *cbs);
+int rpl_event_statement_register(const rpl_event_consumer_statement *cbs);
+int rpl_event_row_register(const rpl_event_consumer_row *cbs);
-=-=(Knielsen - Thu, 24 Jun 2010, 14:28)=-=-
High-Level Specification modified.
--- /tmp/wklog.120.old.7341 2010-06-24 14:28:17.000000000 +0000
+++ /tmp/wklog.120.new.7341 2010-06-24 14:28:17.000000000 +0000
@@ -1 +1,159 @@
+Generators and consumers
+------------------------
+
+We have the two concepts:
+
+1. Event _generators_, that produce events describing all changes to data in a
+ server.
+
+2. Event consumers, that receive such events and use them in various ways.
+
+An example of an event generator is the execution of SQL statements, which
+generates events like those used for statement-based replication. Another
+example is PBXT engine-level replication.
+
+An example of an event consumer is the writing of the binlog on a master.
+
+Event generators are not really plugins. Rather, there are specific points in
+the server where events are generated. However, a generator can be part of a
+plugin, for example a PBXT engine-level replication event generator would be
+part of the PBXT storage engine plugin.
+
+Event consumers on the other hand could be a plugin.
+
+One generator can be stacked on top of another. This means that a generator on
+top (for example row-based events) will handle some events itself
+(eg. non-deterministic update in mixed-mode binlogging). Other events that it
+does not want to or cannot handle (for example deterministic delete or DDL)
+will be deferred to the generator below (for example statement-based events).
+
+
+Materialisation (or not)
+------------------------
+
+A central decision is how to represent events that are generated in the API at
+the point of generation.
+
+I want to avoid making the API require that events are materialised. By
+"Materialised" I mean that all (or most) of the data for the event is written
+into memory in a struct/class used inside the server or serialised in a data
+buffer (byte buffer) in a format suitable for network transport or disk
+storage.
+
+Using a non-materialised event means storing just a reference to appropriate
+context that allows to retrieve all information for the event using
+accessors. Ie. typically this would be based on getting the event information
+from the THD pointer.
+
+Some reasons to avoid using materialised events in the API:
+
+ - Replication events have a _lot_ of detailed context information that can be
+ needed in events: user-defined variables, random seed, character sets,
+ table column names and types, etc. etc. If we make the API based on
+ materialisation, then the initial decision about which context information
+ to include with which events will have to be done in the API, while ideally
+ we want this decision to be done by the individual consumer plugin. There
+ will thus be a conflict between what to include (to allow consumers access)
+ and what to exclude (to avoid excessive needless work).
+
+ - Materialising means defining a very specific format, which will tend to
+ make the API less generic and flexible.
+
+ - Unless the materialised format is made _very_ specific (and thus very
+ inflexible), it is unlikely to be directly useful for transport
+ (eg. binlog), so it will need to be re-materialised into a different format
+ anyway, wasting work.
+
+ - If a generator on top handles an event, then we want to avoid wasting work
+ materialising an event in a generator below which would be completely
+ unused. Thus there would be a need for the upper generator to somehow
+ notify the lower generator ahead of event generation time to not fire an
+ event, complicating the API.
+
+Some advantages for materialisation:
+
+ - Using an API based on passing around some well-defined struct event (or
+ byte buffer) will be simpler than the complex class hierarchy proposed here
+ with no requirement for materialisation.
+
+ - Defining a materialised format would allow an easy way to use the same
+ consumer code on a generator that produces events at the source of
+ execution and on a generator that produces events from eg. reading them
+ from an event log.
+
+Note that there can be some middle way, where some data is materialised and
+some is kept as reference to context (eg. THD) only. This however loses most
+of the mentioned advantages for materialisation.
+
+The design proposed here aims for as little materialisation as possible.
+
+
+Default materialisation format
+------------------------------
+
+
+While the proposed API doesn't _require_ materialisation, we can still think
+about providing the _option_ for built-in materialisation. This could be
+useful if such materialisation is made suitable for transport to a different
+server (eg. no endian-dependence etc). If there is a facility for such
+materialisation built-in to the API, it becomes possible to write something
+like a generic binlog plugin or generic network transport plugin. This would
+be really useful for eg. PBXT engine-level replication, as it could be
+implemented without having to re-invent a binlog format.
+
+I added in the proposed API a simple facility to materialise every event as a
+string of bytes. To use this, I still need to add a suitable facility to
+de-materialise the event.
+
+However, it is still an open question whether such a facility will be at all
+useful. It still has some of the problems with materialisation mentioned
+above. And I think it is likely that a good binlog implementation will need
+to do more than just blindly copy opaque events from one endpoint to
+another. For example, it might need different event boundaries (merge and/or
+split events); it might need to augment or modify events, or inject new
+events, etc.
+
+So I think maybe it is better to add such a generic materialisation facility
+on top of the basic event generator API. Such a facility would provide
+materialisation of a replication event stream, not of individual events, so
+would be more flexible in providing a good implementation. It would be
+implemented for all generators. It would separate from both the event
+generator API (so we have flexibility to put a filter class in-between
+generator and materialisation), and could also be separate from the actual
+transport handling stuff like fsync() of binlog files and socket connections
+etc. It would be paired with a corresponding applier API which would handle
+executing events on a slave.
+
+Then we can have a default materialised event format, which is available, but
+not mandatory. So there can still be other formats alongside (like legacy
+MySQL 5.1 binlog event format and maybe Tungsten would have its own format).
+
+
+Encapsulation
+-------------
+
+Another fundamental question about the design is the level of encapsulation
+used for the API.
+
+At the implementation level, a lot of the work is basically to pull out all of
+the needed information from the THD object/context. The API I propose tries to
+_not_ expose the THD to consumers. Instead it provides accessor functions for
+all the bits and pieces relevant to each replication event, while the event
+class itself likely will be more or less just an encapsulated THD.
+
+So an alternative would be to have a generic event that was just (type, THD).
+Then consumers could just pull out whatever information they want from the
+THD. The THD implementation is already exposed to storage engines. This would
+of course greatly reduce the size of the API, eliminating lots of class
+definitions and accessor functions. Though arguably it wouldn't really
+simplify the API, as the complexity would just be in understanding the THD
+class.
+
+Note that we do not have to take any performance hit from using encapsulated
+accessors since compilers can inline them (though if inlining then we do not
+get any ABI stability with respect to THD implementation).
+
+For now, the API is proposed without exposing the THD class. (Similar
+encapsulation could be added in actual implementation to also not expose TABLE
+and similar classes).
-=-=(Knielsen - Thu, 24 Jun 2010, 12:04)=-=-
Dependency created: 107 now depends on 120
-=-=(Knielsen - Thu, 24 Jun 2010, 11:59)=-=-
High Level Description modified.
--- /tmp/wklog.120.old.516 2010-06-24 11:59:24.000000000 +0000
+++ /tmp/wklog.120.new.516 2010-06-24 11:59:24.000000000 +0000
@@ -11,4 +11,4 @@
Event generators can be stacked, and a generator may defer event generation to
the next one down the stack. For example, the row-level replication event
-generator may defer DLL to the statement-level replication event generator.
+generator may defer DDL to the statement-level replication event generator.
-=-=(Knielsen - Mon, 21 Jun 2010, 08:35)=-=-
Research and design thoughts.
Worked 2 hours and estimate 0 hours remain (original estimate increased by 2 hours).
DESCRIPTION:
A part of the replication project, MWL#107.
Events are produced by event Generators. Examples are
- Generation of statement-based replication events
- Generation of row-based events
- Generation of PBXT engine-level replication events
and maybe the reading of events from the relay log on a slave is also an
example of generating events.
Event generators can be stacked, and a generator may defer event generation to
the next one down the stack. For example, the row-level replication event
generator may defer DDL to the statement-level replication event generator.
HIGH-LEVEL SPECIFICATION:
Generators and consumers
------------------------
We have the two concepts:
1. Event _generators_, that produce events describing all changes to data in a
server.
2. Event consumers, that receive such events and use them in various ways.
An example of an event generator is the execution of SQL statements, which
generates events like those used for statement-based replication. Another
example is PBXT engine-level replication.
An example of an event consumer is the writing of the binlog on a master.
Some event generators are not really plugins. Rather, there are specific
points in the server where events are generated. However, a generator can be
part of a plugin, for example a PBXT engine-level replication event generator
would be part of the PBXT storage engine plugin. And for example we could
write a filter plugin, which would be stacked on top of an existing generator
and provide the same event types and interfaces, but filtered in some way (for
example by removing certain events on the master side, or by re-writing events
in certain ways).
Event consumers on the other hand could be a plugin.
One generator can be stacked on top of another. This means that a generator on
top (for example row-based events) will handle some events itself
(eg. non-deterministic update in mixed-mode binlogging). Other events that it
does not want to or cannot handle (for example deterministic delete or DDL)
will be deferred to the generator below (for example statement-based events).
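From the consumer side, stacking can be sketched with inheriting interfaces:
the consumer interface for the generator on top derives from the interface
for the generator below, so events that the upper generator defers downward
still reach a consumer registered at the upper level. All names in this
sketch are illustrative stand-ins, not the actual proposed API:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Illustrative stand-ins for event payloads.
struct stmt_event { std::string query; };
struct row_event  { std::string table; };

// Consumer interface for the lower (statement-level) generator.
struct stmt_consumer
{
  virtual ~stmt_consumer() {}
  virtual int stmt_query(const stmt_event &) { return 0; }
};

// Consumer interface for the generator stacked on top (row-level);
// inheriting the statement interface means deferred events still arrive.
struct row_consumer : public stmt_consumer
{
  virtual int row_write(const row_event &) { return 0; }
};

// A consumer that records everything it receives, from either level.
struct recording_consumer : public row_consumer
{
  std::vector<std::string> log;
  int stmt_query(const stmt_event &ev) override
  { log.push_back("stmt:" + ev.query); return 0; }
  int row_write(const row_event &ev) override
  { log.push_back("row:" + ev.table); return 0; }
};

// The row generator handles row-capable changes itself and defers DDL
// to the statement generator below.
static void generate(row_consumer &c, bool is_ddl)
{
  if (is_ddl)
    c.stmt_query(stmt_event{"CREATE TABLE t1 (a INT)"});
  else
    c.row_write(row_event{"t1"});
}
```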
Materialisation (or not)
------------------------
A central decision is how to represent events that are generated in the API at
the point of generation.
I want to avoid making the API require that events are materialised. By
"Materialised" I mean that all (or most) of the data for the event is written
into memory in a struct/class used inside the server or serialised in a data
buffer (byte buffer) in a format suitable for network transport or disk
storage.
Using a non-materialised event means storing just a reference to appropriate
context that allows to retrieve all information for the event using
accessors. Ie. typically this would be based on getting the event information
from the THD pointer.
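As a toy illustration of the non-materialised approach (with a stand-in
context struct in place of THD, and invented accessor names), the event is
just a thin wrapper whose accessors read the live context on demand, so no
data is copied unless a consumer actually asks for it:

```cpp
#include <cassert>
#include <string>

// Stand-in for the server-side context (the real one would be THD).
struct fake_context
{
  std::string current_query;
  unsigned long long trx_id;
};

// Non-materialised event: stores only a pointer to the context; all
// information is fetched through accessors when (and if) a consumer asks.
class query_event
{
public:
  explicit query_event(const fake_context *ctx) : ctx_(ctx) {}
  const std::string &get_query() const { return ctx_->current_query; }
  unsigned long long get_trx_id() const { return ctx_->trx_id; }
private:
  const fake_context *ctx_;
};
```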
Some reasons to avoid using materialised events in the API:
- Replication events have a _lot_ of detailed context information that can be
needed in events: user-defined variables, random seed, character sets,
table column names and types, etc. etc. If we make the API based on
materialisation, then the initial decision about which context information
to include with which events will have to be done in the API, while ideally
we want this decision to be done by the individual consumer plugin. There
will thus be a conflict between what to include (to allow consumers access)
and what to exclude (to avoid excessive needless work).
- Materialising means defining a very specific format, which will tend to
make the API less generic and flexible.
- Unless the materialised format is made _very_ specific (and thus very
inflexible), it is unlikely to be directly useful for transport
(eg. binlog), so it will need to be re-materialised into a different format
anyway, wasting work.
- If a generator on top handles an event, then we want to avoid wasting work
materialising an event in a generator below which would be completely
unused. Thus there would be a need for the upper generator to somehow
notify the lower generator ahead of event generation time to not fire an
event, complicating the API.
Some advantages for materialisation:
- Using an API based on passing around some well-defined struct event (or
byte buffer) will be simpler than the complex, non-materialising class
hierarchy proposed here.
- Defining a materialised format would allow an easy way to use the same
consumer code on a generator that produces events at the source of
execution and on a generator that produces events from eg. reading them
from an event log.
Note that there can be some middle way, where some data is materialised and
some is kept as reference to context (eg. THD) only. This however loses most
of the mentioned advantages of materialisation.
Also, the non-materialising interface should be a good interface on top of
which to build a materialising interface.
The design proposed here aims for as little materialisation as possible.
Default materialisation format
------------------------------
While the proposed API doesn't _require_ materialisation, we can still think
about providing the _option_ for built-in materialisation. This could be
useful if such materialisation is made suitable for transport to a different
server (eg. no endian-dependence etc). If there is a facility for such
materialisation built-in to the API, it becomes possible to write something
like a generic binlog plugin or generic network transport plugin. This would
be really useful for eg. PBXT engine-level replication, as it could be
implemented without having to re-invent a binlog format.
I added in the proposed API a simple facility to materialise every event as a
string of bytes. To use this, I still need to add a suitable facility to
de-materialise the event.
However, it is still an open question whether such a facility will be at all
useful. It still has some of the problems with materialisation mentioned
above. And I think it is likely that a good binlog implementation will need
to do more than just blindly copy opaque events from one endpoint to
another. For example, it might need different event boundaries (merge and/or
split events); it might need to augment or modify events, or inject new
events, etc.
So I think maybe it is better to add such a generic materialisation facility
on top of the basic event generator API. Such a facility would provide
materialisation of a replication event stream, not of individual events, so
would be more flexible in providing a good implementation. It would be
implemented for all generators. It would be separate from both the event
generator API (so we have flexibility to put a filter class in-between
generator and materialisation), and could also be separate from the actual
transport handling stuff like fsync() of binlog files and socket connections
etc. It would be paired with a corresponding applier API which would handle
executing events on a slave.
Then we can have a default materialised event format, which is available, but
not mandatory. So there can still be other formats alongside (like legacy
MySQL 5.1 binlog event format and maybe Tungsten would have its own format).
Encapsulation
-------------
Another fundamental question about the design is the level of encapsulation
used for the API.
At the implementation level, a lot of the work is basically to pull out all of
the needed information from the THD object/context. The API I propose tries to
_not_ expose the THD to consumers. Instead it provides accessor functions for
all the bits and pieces relevant to each replication event, while the event
class itself likely will be more or less just an encapsulated THD.
So an alternative would be to have a generic event that was just (type, THD).
Then consumers could just pull out whatever information they want from the
THD. The THD implementation is already exposed to storage engines. This would
of course greatly reduce the size of the API, eliminating lots of class
definitions and accessor functions. Though arguably it wouldn't really
simplify the API, as the complexity would just be in understanding the THD
class.
Note that we do not have to take any performance hit from using encapsulated
accessors, since compilers can inline them (though if inlining, we do not
get any ABI stability with respect to the THD implementation).
For now, the API is proposed without exposing the THD class. (Similar
encapsulation could be added in actual implementation to also not expose TABLE
and similar classes).
LOW-LEVEL DESIGN:
A consumer is implemented as a virtual class (interface). There is one virtual
function for every event that can be received. A consumer would derive from
the base class and override methods for the events it wants to receive.
There is one consumer interface for each generator. When a generator A is
stacked on B, the consumer interface for A inherits from the interface for
B. This way, when A defers an event to B, the consumer for A will receive the
corresponding event from B.
There are methods for a consumer to register itself to receive events from
each generator. I still need to find a way for a consumer in one plugin to
register itself with a generator implemented in another plugin (eg. PBXT
engine-level replication). I also need to add a way for consumers to
de-register themselves.
The current design has consumer callbacks return 0 for success and error code
otherwise. I still need to think more about whether this is useful (i.e. what
the semantics of returning an error from a consumer callback should be).
Each event passed to consumers is defined as a class with public accessor
methods to a private context (which is mostly the THD).
My intention is to make all events passed around const, so that the same event
can be passed to each of multiple registered consumers (and to emphasise that
consumers do not have the ability to modify events). It remains to be seen
whether that const-ness will be feasible in practice without very heavy
modification/constification of existing code.
What follows is a partial draft of a possible definition of the API as
concrete C++ class definitions.
-----------------------------------------------------------------------
/*
Virtual base class for generated replication events.
This is the parent of events generated from all kinds of generators. Only
child classes can be instantiated.
This class can be used by code that wants to treat events in a generic way,
without any knowledge of event details. I still need to decide whether such
generic code is sensible.
*/
class rpl_event_base
{
/*
Maybe we will want the ability to materialise an event to a standard
binary format. This could be achieved with a base method like this. The
actual materialisation would be implemented in each deriving class. The
public methods would provide different interfaces for specifying the
buffer or for writing directly into IO_CACHE or file.
*/
/* Return 0 on success, -1 on error, -2 on out-of-buffer. */
int materialise(uchar *buffer, size_t buflen) const;
/*
Returns NULL on error or else malloc()ed buffer with materialised event,
caller must free().
*/
uchar *materialise() const;
/* Same, but using a passed-in memory root. */
uchar *materialise(mem_root *memroot) const;
/*
Materialise to user-supplied writer function (could write directly to file
or the like).
*/
int materialise(int (*writer)(uchar *data, size_t len, void *context)) const;
/*
As for what to do with a materialised event, there are a couple of
possibilities.
One is to have a de_materialise() method somewhere that can construct an
rpl_event_base (really a derived class of course) from a buffer or writer
function. This would require each accessor function to conditionally read
its data from either THD context or buffer (GCC is able to optimise
several such conditionals in multiple accessor function calls into one
conditional), or we can make all accessors virtual if the performance hit
is acceptable.
Another is to have different classes for accessing events read from
materialised event data.
Also, I still need to think about whether it is at all useful to be able
to generically materialise an event at this level. It may be that any
binlog/transport will in any case need to understand more of the format of
events, so that such materialisation/transport is better done at a
different layer.
*/
protected:
/* Implementation which is the basis for materialise(). */
virtual int do_materialise(int (*writer)(uchar *data, size_t len,
void *context)) const = 0;
private:
/* Virtual base class, private constructor to prevent instantiation. */
rpl_event_base();
};
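As a sketch of how the writer-function variant of materialise() might be
used, here is a hypothetical writer callback that appends materialised bytes
to a growable buffer. toy_materialise stands in for a do_materialise()
implementation; neither name is part of the proposed API:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

typedef unsigned char uchar;

// A writer callback in the style expected by materialise(): it receives a
// chunk of bytes plus an opaque context pointer, and returns 0 on success.
// Here the context is a std::vector<uchar> used as an output buffer.
static int vector_writer(uchar *data, size_t len, void *context) {
  std::vector<uchar> *out = static_cast<std::vector<uchar> *>(context);
  out->insert(out->end(), data, data + len);
  return 0;  // 0 = success, as in the proposed API
}

// Toy stand-in for a do_materialise() implementation: emits a type tag
// followed by a payload, invoking the writer once per chunk so the caller
// controls where the bytes actually go (buffer, file, socket, ...).
static int toy_materialise(int (*writer)(uchar *, size_t, void *), void *ctx) {
  uchar tag = 0x01;
  if (writer(&tag, 1, ctx))
    return -1;
  uchar payload[4] = {0xde, 0xad, 0xbe, 0xef};
  return writer(payload, sizeof(payload), ctx);
}
```

The same toy_materialise() could write directly to a file by swapping in a
different writer, which is the flexibility the writer-based overload aims for.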
/*
These are the event types output from the transaction event generator.
This generator is not stacked on anything.
The transaction event generator marks the start and end (commit or rollback)
of transactions. It also gives information about whether the transaction was
a full transaction or autocommitted statement, whether transactional tables
were involved, whether non-transactional tables were involved, and XA
information (ToDo).
*/
/* Base class for transaction events. */
class rpl_event_transaction_base : public rpl_event_base
{
public:
/*
Get the local transaction id. This id is only unique within one server.
It is allocated whenever a new transaction is started.
Can be used to identify events belonging to the same transaction in a
binlog-like stream of events streamed in parallel among multiple
transactions.
*/
uint64_t get_local_trx_id() const { return thd->local_trx_id; };
bool get_is_autocommit() const;
private:
/* The context is the THD. */
THD *thd;
rpl_event_transaction_base(THD *_thd) : thd(_thd) { };
};
/* Transaction start event. */
class rpl_event_transaction_start : public rpl_event_transaction_base
{
};
/* Transaction commit. */
class rpl_event_transaction_commit : public rpl_event_transaction_base
{
public:
/*
The global transaction id is unique cross-server.
It can be used to identify the position from which to start a slave
replicating from a master.
This global ID is only available once the transaction is decided to commit
by the TC manager / primary redundancy service. This TC also allocates the
ID and decides the exact semantics (can there be gaps, etc); however the
format is fixed (cluster_id, running_counter).
*/
struct global_transaction_id
{
uint32_t cluster_id;
uint64_t counter;
};
const global_transaction_id *get_global_transaction_id() const;
};
/* Transaction rollback. */
class rpl_event_transaction_rollback : public rpl_event_transaction_base
{
};
/* Base class for statement events. */
class rpl_event_statement_base : public rpl_event_base
{
public:
LEX_STRING get_current_db() const;
};
class rpl_event_statement_start : public rpl_event_statement_base
{
};
class rpl_event_statement_end : public rpl_event_statement_base
{
public:
int get_errorcode() const;
};
class rpl_event_statement_query : public rpl_event_statement_base
{
public:
LEX_STRING get_query_string();
ulong get_sql_mode();
const CHARSET_INFO *get_character_set_client();
const CHARSET_INFO *get_collation_connection();
const CHARSET_INFO *get_collation_server();
const CHARSET_INFO *get_collation_default_db();
/*
Access to relevant flags that affect query execution.
Use as: if (ev->get_flags() & (uint32)STMT_FOREIGN_KEY_CHECKS) { ... }
*/
enum flag_bits
{
STMT_FOREIGN_KEY_CHECKS= (1 << 0), // @@foreign_key_checks
STMT_UNIQUE_KEY_CHECKS= (1 << 1), // @@unique_checks
STMT_AUTO_IS_NULL= (1 << 2), // @@sql_auto_is_null
};
uint32_t get_flags();
ulong get_auto_increment_offset();
ulong get_auto_increment_increment();
// And so on for: time zone; day/month names; connection id; LAST_INSERT_ID;
// INSERT_ID; random seed; user variables.
//
// We probably also need get_uses_temporary_table(), get_used_user_vars(),
// get_uses_auto_increment() and so on, so a consumer can get more
// information about what kind of context information a query will need when
// executed on a slave.
};
class rpl_event_statement_load_query : public rpl_event_statement_query
{
};
/*
This event is fired with blocks of data for files read (from server-local
file or client connection) for LOAD DATA.
*/
class rpl_event_statement_load_data_block : public rpl_event_statement_base
{
public:
struct block
{
const uchar *ptr;
size_t size;
};
block get_block() const;
};
/* Base class for row-based replication events. */
class rpl_event_row_base : public rpl_event_base
{
public:
/*
Access to relevant handler extra flags and other flags that affect row
operations.
Use as: if (ev->get_flags() & (uint32)ROW_WRITE_CAN_REPLACE) { ... }
*/
enum flag_bits
{
ROW_WRITE_CAN_REPLACE= (1 << 0), // HA_EXTRA_WRITE_CAN_REPLACE
ROW_IGNORE_DUP_KEY= (1 << 1), // HA_EXTRA_IGNORE_DUP_KEY
ROW_IGNORE_NO_KEY= (1 << 2), // HA_EXTRA_IGNORE_NO_KEY
ROW_DISABLE_FOREIGN_KEY_CHECKS= (1 << 3), // ! @@foreign_key_checks
ROW_DISABLE_UNIQUE_KEY_CHECKS= (1 << 4), // ! @@unique_checks
};
uint32_t get_flags();
/* Access to list of tables modified. */
class table_iterator
{
public:
/* Returns table, NULL after last. */
const TABLE *get_next();
private:
// ...
};
table_iterator get_modified_tables() const;
private:
/* Context used to provide accessors. */
THD *thd;
protected:
rpl_event_row_base(THD *_thd) : thd(_thd) { }
};
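The get_next()/NULL iteration contract of table_iterator can be sketched with
a self-contained mock (MockTable stands in for TABLE; none of these names are
part of the proposed API):

```cpp
#include <cassert>
#include <cstddef>

// Minimal mock of the proposed table_iterator protocol: get_next() returns
// the next entry, and NULL after the last one.
struct MockTable { const char *name; };

class mock_table_iterator {
public:
  mock_table_iterator(const MockTable *tables, size_t count)
    : tables(tables), count(count), pos(0) {}
  /* Returns next table, NULL after last -- same contract as get_next(). */
  const MockTable *get_next() {
    return pos < count ? &tables[pos++] : NULL;
  }
private:
  const MockTable *tables;
  size_t count, pos;
};

// Typical consumer-side loop over the modified tables of a row event.
static size_t count_tables(mock_table_iterator it) {
  size_t n = 0;
  while (it.get_next() != NULL)
    n++;
  return n;
}
```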
class rpl_event_row_write : public rpl_event_row_base
{
public:
const BITMAP *get_write_set() const;
const uchar *get_after_image() const;
};
class rpl_event_row_update : public rpl_event_row_base
{
public:
const BITMAP *get_read_set() const;
const BITMAP *get_write_set() const;
const uchar *get_before_image() const;
const uchar *get_after_image() const;
};
class rpl_event_row_delete : public rpl_event_row_base
{
public:
const BITMAP *get_read_set() const;
const uchar *get_before_image() const;
};
/*
Event consumer callbacks.
An event consumer registers with an event generator to receive event
notifications from that generator.
The consumer has callbacks (in the form of virtual functions) for the
individual event types the consumer is interested in. Only overridden
callbacks do anything; the default implementations are no-ops. If an event
applies to multiple callbacks in a single consumer, it will only be passed to
the most specific overridden callback (so events never fire more than once
per registration).
The lifetime of the memory holding the event is only for the duration of the
callback invocation, unless otherwise noted.
Callbacks return 0 for success or error code (ToDo: does this make sense?).
*/
struct rpl_event_consumer_transaction
{
virtual int trx_start(const rpl_event_transaction_start *) { return 0; }
virtual int trx_commit(const rpl_event_transaction_commit *) { return 0; }
virtual int trx_rollback(const rpl_event_transaction_rollback *) { return 0; }
};
/*
Consuming statement-based events.
The statement event generator is stacked on top of the transaction event
generator, so we can receive those events as well.
*/
struct rpl_event_consumer_statement : public rpl_event_consumer_transaction
{
virtual int stmt_start(const rpl_event_statement_start *) { return 0; }
virtual int stmt_end(const rpl_event_statement_end *) { return 0; }
virtual int stmt_query(const rpl_event_statement_query *) { return 0; }
/* Data for a file used in LOAD DATA [LOCAL] INFILE. */
virtual int stmt_load_data_block(const rpl_event_statement_load_data_block *)
{ return 0; }
/*
These are specific kinds of statements; if overridden they take precedence
over stmt_query() for the corresponding event.
*/
virtual int stmt_load_query(const rpl_event_statement_load_query *ev)
{ return stmt_query(ev); }
};
/*
Consuming row-based events.
The row event generator is stacked on top of the statement event generator.
*/
struct rpl_event_consumer_row : public rpl_event_consumer_statement
{
virtual int row_write(const rpl_event_row_write *) { return 0; }
virtual int row_update(const rpl_event_row_update *) { return 0; }
virtual int row_delete(const rpl_event_row_delete *) { return 0; }
};
/*
Registration functions.
ToDo: Make a way to de-register.
ToDo: Find a good way for a plugin (eg. PBXT) to publish a generator
registration method.
*/
int rpl_event_transaction_register(const rpl_event_consumer_transaction *cbs);
int rpl_event_statement_register(const rpl_event_consumer_statement *cbs);
int rpl_event_row_register(const rpl_event_consumer_row *cbs);
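To illustrate the intended usage pattern, here is a minimal mock (stand-in
classes, not the real API above) of how a plugin would derive a consumer,
override only the callbacks it needs, and rely on the no-op defaults for the
rest:

```cpp
#include <cassert>

// Stand-in for one of the proposed event classes.
struct mock_event_row_write { int rows; };

// Stand-in for the proposed consumer interface: virtual callbacks with
// no-op defaults returning 0 (success).
struct mock_consumer_row {
  virtual ~mock_consumer_row() {}
  virtual int row_write(const mock_event_row_write *) { return 0; }
  virtual int row_delete(const mock_event_row_write *) { return 0; }
};

// A binlog-like consumer that only cares about row writes. Events it does
// not override (here, deletes) fall through to the default no-ops, so the
// consumer pays nothing for event types it ignores.
struct my_binlog_consumer : public mock_consumer_row {
  int writes_seen;
  my_binlog_consumer() : writes_seen(0) {}
  virtual int row_write(const mock_event_row_write *ev) {
    writes_seen += ev->rows;
    return 0;  // 0 = success; non-zero would signal an error to the generator
  }
};
```

A generator would hold a base-class pointer to the registered consumer and
invoke callbacks through it, so virtual dispatch selects the plugin's
overrides automatically.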
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
[Maria-developers] Updated (by Knielsen): Replication API for stacked event generators (120)
by worklog-noreply@askmonty.org 28 Jun '10
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: Replication API for stacked event generators
CREATION DATE..: Mon, 07 Jun 2010, 13:13
SUPERVISOR.....: Knielsen
IMPLEMENTOR....: Knielsen
COPIES TO......:
CATEGORY.......: Server-Sprint
TASK ID........: 120 (http://askmonty.org/worklog/?tid=120)
VERSION........: Server-9.x
STATUS.........: Assigned
PRIORITY.......: 60
WORKED HOURS...: 2
ESTIMATE.......: 0 (hours remain)
ORIG. ESTIMATE.: 0
PROGRESS NOTES:
-=-=(Knielsen - Mon, 28 Jun 2010, 07:11)=-=-
High-Level Specification modified.
-=-=(Knielsen - Thu, 24 Jun 2010, 14:29)=-=-
Category updated.
--- /tmp/wklog.120.old.7440 2010-06-24 14:29:38.000000000 +0000
+++ /tmp/wklog.120.new.7440 2010-06-24 14:29:38.000000000 +0000
@@ -1 +1 @@
-Server-RawIdeaBin
+Server-Sprint
-=-=(Knielsen - Thu, 24 Jun 2010, 14:29)=-=-
Status updated.
--- /tmp/wklog.120.old.7440 2010-06-24 14:29:38.000000000 +0000
+++ /tmp/wklog.120.new.7440 2010-06-24 14:29:38.000000000 +0000
@@ -1 +1 @@
-Un-Assigned
+Assigned
-=-=(Knielsen - Thu, 24 Jun 2010, 14:29)=-=-
Low Level Design modified.
-=-=(Knielsen - Thu, 24 Jun 2010, 14:28)=-=-
High-Level Specification modified.
--- /tmp/wklog.120.old.7341 2010-06-24 14:28:17.000000000 +0000
+++ /tmp/wklog.120.new.7341 2010-06-24 14:28:17.000000000 +0000
@@ -1 +1,159 @@
+Generators and consumbers
+-------------------------
+
+We have the two concepts:
+
+1. Event _generators_, that produce events describing all changes to data in a
+ server.
+
+2. Event consumers, that receive such events and use them in various ways.
+
+Examples of event generators is execution of SQL statements, which generates
+events like those used for statement-based replication. Another example is
+PBXT engine-level replication.
+
+An example of an event consumer is the writing of the binlog on a master.
+
+Event generators are not really plugins. Rather, there are specific points in
+the server where events are generated. However, a generator can be part of a
+plugin, for example a PBXT engine-level replication event generator would be
+part of the PBXT storage engine plugin.
+
+Event consumers on the other hand could be a plugin.
+
+One generator can be stacked on top of another. This means that a generator on
+top (for example row-based events) will handle some events itself
+(eg. non-deterministic update in mixed-mode binlogging). Other events that it
+does not want to or cannot handle (for example deterministic delete or DDL)
+will be deferred to the generator below (for example statement-based events).
+
+
+Materialisation (or not)
+------------------------
+
+A central decision is how to represent events that are generated in the API at
+the point of generation.
+
+I want to avoid making the API require that events are materialised. By
+"Materialised" I mean that all (or most) of the data for the event is written
+into memory in a struct/class used inside the server or serialised in a data
+buffer (byte buffer) in a format suitable for network transport or disk
+storage.
+
+Using a non-materialised event means storing just a reference to appropriate
+context that allows retrieving all information for the event using
+accessors. Ie. typically this would be based on getting the event information
+from the THD pointer.
+
+Some reasons to avoid using materialised events in the API:
+
+ - Replication events have a _lot_ of detailed context information that can be
+ needed in events: user-defined variables, random seed, character sets,
+ table column names and types, etc. etc. If we make the API based on
+ materialisation, then the initial decision about which context information
+ to include with which events will have to be done in the API, while ideally
+ we want this decision to be done by the individual consumer plugin. There
+ will thus be a conflict between what to include (to allow consumers access)
+ and what to exclude (to avoid excessive needless work).
+
+ - Materialising means defining a very specific format, which will tend to
+ make the API less generic and flexible.
+
+ - Unless the materialised format is made _very_ specific (and thus very
+ inflexible), it is unlikely to be directly useful for transport
+ (eg. binlog), so it will need to be re-materialised into a different format
+ anyway, wasting work.
+
+ - If a generator on top handles an event, then we want to avoid wasting work
+ materialising an event in a generator below which would be completely
+ unused. Thus there would be a need for the upper generator to somehow
+ notify the lower generator ahead of event generation time to not fire an
+ event, complicating the API.
+
+Some advantages for materialisation:
+
+ - Using an API based on passing around some well-defined struct event (or
+ byte buffer) will be simpler than the complex class hierarchy proposed here
+ with no requirement for materialisation.
+
+ - Defining a materialised format would allow an easy way to use the same
+ consumer code on a generator that produces events at the source of
+ execution and on a generator that produces events from eg. reading them
+ from an event log.
+
+Note that there can be some middle way, where some data is materialised and
+some is kept as reference to context (eg. THD) only. This however loses most
+of the mentioned advantages for materialisation.
+
+The design proposed here aims for as little materialisation as possible.
+
+
+Default materialisation format
+------------------------------
+
+
+While the proposed API doesn't _require_ materialisation, we can still think
+about providing the _option_ for built-in materialisation. This could be
+useful if such materialisation is made suitable for transport to a different
+server (eg. no endian-dependence etc). If there is a facility for such
+materialisation built-in to the API, it becomes possible to write something
+like a generic binlog plugin or generic network transport plugin. This would
+be really useful for eg. PBXT engine-level replication, as it could be
+implemented without having to re-invent a binlog format.
+
+I added in the proposed API a simple facility to materialise every event as a
+string of bytes. To use this, I still need to add a suitable facility to
+de-materialise the event.
+
+However, it is still an open question whether such a facility will be at all
+useful. It still has some of the problems with materialisation mentioned
+above. And I think it is likely that a good binlog implementation will need
+to do more than just blindly copy opaque events from one endpoint to
+another. For example, it might need different event boundaries (merge and/or
+split events); it might need to augment or modify events, or inject new
+events, etc.
+
+So I think maybe it is better to add such a generic materialisation facility
+on top of the basic event generator API. Such a facility would provide
+materialisation of a replication event stream, not of individual events, so
+would be more flexible in providing a good implementation. It would be
+implemented for all generators. It would be separate from both the event
+generator API (so we have flexibility to put a filter class in-between
+generator and materialisation), and could also be separate from the actual
+transport handling stuff like fsync() of binlog files and socket connections
+etc. It would be paired with a corresponding applier API which would handle
+executing events on a slave.
+
+Then we can have a default materialised event format, which is available, but
+not mandatory. So there can still be other formats alongside (like legacy
+MySQL 5.1 binlog event format and maybe Tungsten would have its own format).
+
+
+Encapsulation
+-------------
+
+Another fundamental question about the design is the level of encapsulation
+used for the API.
+
+At the implementation level, a lot of the work is basically to pull out all of
+the needed information from the THD object/context. The API I propose tries to
+_not_ expose the THD to consumers. Instead it provides accessor functions for
+all the bits and pieces relevant to each replication event, while the event
+class itself likely will be more or less just an encapsulated THD.
+
+So an alternative would be to have a generic event that was just (type, THD).
+Then consumers could just pull out whatever information they want from the
+THD. The THD implementation is already exposed to storage engines. This would
+of course greatly reduce the size of the API, eliminating lots of class
+definitions and accessor functions. Though arguably it wouldn't really
+simplify the API, as the complexity would just be in understanding the THD
+class.
+
+Note that we do not have to take any performance hit from using encapsulated
+accessors since compilers can inline them (though if inlining then we do not
+get any ABI stability with respect to THD implementation).
+
+For now, the API is proposed without exposing the THD class. (Similar
+encapsulation could be added in actual implementation to also not expose TABLE
+and similar classes).
-=-=(Knielsen - Thu, 24 Jun 2010, 12:04)=-=-
Dependency created: 107 now depends on 120
-=-=(Knielsen - Thu, 24 Jun 2010, 11:59)=-=-
High Level Description modified.
--- /tmp/wklog.120.old.516 2010-06-24 11:59:24.000000000 +0000
+++ /tmp/wklog.120.new.516 2010-06-24 11:59:24.000000000 +0000
@@ -11,4 +11,4 @@
Event generators can be stacked, and a generator may defer event generation to
the next one down the stack. For example, the row-level replication event
-generator may defer DLL to the statement-level replication event generator.
+generator may defer DDL to the statement-level replication event generator.
-=-=(Knielsen - Mon, 21 Jun 2010, 08:35)=-=-
Research and design thoughts.
Worked 2 hours and estimate 0 hours remain (original estimate increased by 2 hours).
DESCRIPTION:
A part of the replication project, MWL#107.
Events are produced by event Generators. Examples are
- Generation of statement-based replication events
- Generation of row-based events
- Generation of PBXT engine-level replication events
and reading of events from the relay log on a slave may also be an example of
generating events.
Event generators can be stacked, and a generator may defer event generation to
the next one down the stack. For example, the row-level replication event
generator may defer DDL to the statement-level replication event generator.
HIGH-LEVEL SPECIFICATION:
Generators and consumers
-------------------------
We have the two concepts:
1. Event _generators_, that produce events describing all changes to data in a
server.
2. Event consumers, that receive such events and use them in various ways.
An example of an event generator is the execution of SQL statements, which generates
events like those used for statement-based replication. Another example is
PBXT engine-level replication.
An example of an event consumer is the writing of the binlog on a master.
Some event generators are not really plugins. Rather, there are specific
points in the server where events are generated. However, a generator can be
part of a plugin, for example a PBXT engine-level replication event generator
would be part of the PBXT storage engine plugin. And for example we could
write a filter plugin, which would be stacked on top of an existing generator
and provide the same event types and interfaces, but filtered in some way (for
example by removing certain events on the master side, or by re-writing events
in certain ways).
Event consumers on the other hand could be a plugin.
One generator can be stacked on top of another. This means that a generator on
top (for example row-based events) will handle some events itself
(eg. non-deterministic update in mixed-mode binlogging). Other events that it
does not want to or cannot handle (for example deterministic delete or DDL)
will be deferred to the generator below (for example statement-based events).
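The deferral decision described above could be sketched like this (a toy illustration; the names and the mixed-mode rule shown here are assumptions, not the actual server logic):

```cpp
// Hypothetical sketch of generator stacking: the row-based generator on top
// decides per statement whether to emit row events itself or defer the
// statement to the statement-based generator below it.
enum class BinlogFormat { Statement, Row, Mixed };

// True when the upper (row) generator should handle the event itself.
inline bool row_generator_handles(BinlogFormat fmt, bool is_ddl,
                                  bool is_deterministic)
{
  if (is_ddl)
    return false;               // DDL is always deferred to statement events
  switch (fmt)
  {
  case BinlogFormat::Row:       return true;
  case BinlogFormat::Statement: return false;
  case BinlogFormat::Mixed:     return !is_deterministic; // non-deterministic
  }                                                       // updates go row-based
  return false;
}
```

When the upper generator handles an event itself, the lower generator should ideally not even materialise its version of the event, which is one of the arguments against mandatory materialisation above.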
Materialisation (or not)
------------------------
A central decision is how to represent events that are generated in the API at
the point of generation.
I want to avoid making the API require that events are materialised. By
"Materialised" I mean that all (or most) of the data for the event is written
into memory in a struct/class used inside the server or serialised in a data
buffer (byte buffer) in a format suitable for network transport or disk
storage.
Using a non-materialised event means storing just a reference to appropriate
context that allows retrieving all information for the event using
accessors. Ie. typically this would be based on getting the event information
from the THD pointer.
Some reasons to avoid using materialised events in the API:
- Replication events have a _lot_ of detailed context information that can be
needed in events: user-defined variables, random seed, character sets,
table column names and types, etc. etc. If we make the API based on
materialisation, then the initial decision about which context information
to include with which events will have to be done in the API, while ideally
we want this decision to be done by the individual consumer plugin. There
will thus be a conflict between what to include (to allow consumers access)
and what to exclude (to avoid excessive needless work).
- Materialising means defining a very specific format, which will tend to
make the API less generic and flexible.
- Unless the materialised format is made _very_ specific (and thus very
inflexible), it is unlikely to be directly useful for transport
(eg. binlog), so it will need to be re-materialised into a different format
anyway, wasting work.
- If a generator on top handles an event, then we want to avoid wasting work
materialising an event in a generator below which would be completely
unused. Thus there would be a need for the upper generator to somehow
notify the lower generator ahead of event generation time to not fire an
event, complicating the API.
Some advantages for materialisation:
- Using an API based on passing around some well-defined struct event (or
byte buffer) will be simpler than the complex class hierarchy proposed here
with no requirement for materialisation.
- Defining a materialised format would allow an easy way to use the same
consumer code on a generator that produces events at the source of
execution and on a generator that produces events from eg. reading them
from an event log.
Note that there can be some middle way, where some data is materialised and
some is kept as reference to context (eg. THD) only. This however loses most
of the mentioned advantages for materialisation.
Also, the non-materialising interface should be a good interface on top of
which to build a materialising interface.
The design proposed here aims for as little materialisation as possible.
Default materialisation format
------------------------------
While the proposed API doesn't _require_ materialisation, we can still think
about providing the _option_ for built-in materialisation. This could be
useful if such materialisation is made suitable for transport to a different
server (eg. no endian-dependence etc). If there is a facility for such
materialisation built-in to the API, it becomes possible to write something
like a generic binlog plugin or generic network transport plugin. This would
be really useful for eg. PBXT engine-level replication, as it could be
implemented without having to re-invent a binlog format.
I added in the proposed API a simple facility to materialise every event as a
string of bytes. To use this, I still need to add a suitable facility to
de-materialise the event.
However, it is still an open question whether such a facility will be at all
useful. It still has some of the problems with materialisation mentioned
above. And I think it is likely that a good binlog implementation will need
to do more than just blindly copy opaque events from one endpoint to
another. For example, it might need different event boundaries (merge and/or
split events); it might need to augment or modify events, or inject new
events, etc.
So I think maybe it is better to add such a generic materialisation facility
on top of the basic event generator API. Such a facility would provide
materialisation of a replication event stream, not of individual events, so
would be more flexible in providing a good implementation. It would be
implemented for all generators. It would be separate from both the event
generator API (so we have flexibility to put a filter class in-between
generator and materialisation), and could also be separate from the actual
transport handling stuff like fsync() of binlog files and socket connections
etc. It would be paired with a corresponding applier API which would handle
executing events on a slave.
Then we can have a default materialised event format, which is available, but
not mandatory. So there can still be other formats alongside (like legacy
MySQL 5.1 binlog event format and maybe Tungsten would have its own format).
Encapsulation
-------------
Another fundamental question about the design is the level of encapsulation
used for the API.
At the implementation level, a lot of the work is basically to pull out all of
the needed information from the THD object/context. The API I propose tries to
_not_ expose the THD to consumers. Instead it provides accessor functions for
all the bits and pieces relevant to each replication event, while the event
class itself likely will be more or less just an encapsulated THD.
So an alternative would be to have a generic event that was just (type, THD).
Then consumers could just pull out whatever information they want from the
THD. The THD implementation is already exposed to storage engines. This would
of course greatly reduce the size of the API, eliminating lots of class
definitions and accessor functions. Though arguably it wouldn't really
simplify the API, as the complexity would just be in understanding the THD
class.
Note that we do not have to take any performance hit from using encapsulated
accessors since compilers can inline them (though if inlining then we do not
get any ABI stability with respect to THD implementation).
For now, the API is proposed without exposing the THD class. (Similar
encapsulation could be added in actual implementation to also not expose TABLE
and similar classes).
LOW-LEVEL DESIGN:
A consumer is implemented as a virtual class (interface). There is one virtual
function for every event that can be received. A consumer would derive from
the base class and override methods for the events it wants to receive.
There is one consumer interface for each generator. When a generator A is
stacked on B, the consumer interface for A inherits from the interface for
B. This way, when A defers an event to B, the consumer for A will receive the
corresponding event from B.
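As a toy sketch of this inheritance (stub names and no-argument callbacks, not the real draft classes): a consumer deriving from the statement-level interface automatically has the transaction-level callbacks available to override, with empty defaults for events it does not care about:

```cpp
// Minimal sketch of stacked consumer interfaces. The statement consumer
// interface inherits from the transaction one, so a statement consumer also
// receives the transaction events deferred from the generator below.
struct consumer_transaction
{
  virtual ~consumer_transaction() {}
  virtual int trx_commit() { return 0; }      // default: ignore the event
};

struct consumer_statement : public consumer_transaction
{
  virtual int stmt_query() { return 0; }
};

// A concrete consumer overrides only the events it cares about.
struct counting_consumer : public consumer_statement
{
  int commits = 0, queries = 0;
  int trx_commit() override { ++commits; return 0; }
  int stmt_query() override { ++queries; return 0; }
};
```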
There are methods for a consumer to register itself to receive events from
each generator. I still need to find a way for a consumer in one plugin to
register itself with a generator implemented in another plugin (eg. PBXT
engine-level replication). I also need to add a way for consumers to
de-register themselves.
The current design has consumer callbacks return 0 for success and error code
otherwise. I still need to think more about whether this is useful (ie. what
is the semantics of returning an error from a consumer callback).
Each event passed to consumers is defined as a class with public accessor
methods to a private context (which is mostly the THD).
My intention is to make all events passed around const, so that the same event
can be passed to each of multiple registered consumers (and to emphasise that
consumers do not have the ability to modify events). It remains to be seen
whether that const-ness will be feasible in practice without very heavy
modification/constification of existing code.
What follows is a partial draft of a possible definition of the API as
concrete C++ class definitions.
-----------------------------------------------------------------------
/*
Virtual base class for generated replication events.
This is the parent of events generated from all kinds of generators. Only
child classes can be instantiated.
This class can be used by code that wants to treat events in a generic way,
without any knowledge of event details. I still need to decide whether such
generic code is sensible.
*/
class rpl_event_base
{
/*
Maybe we will want the ability to materialise an event to a standard
binary format. This could be achieved with a base method like this. The
actual materialisation would be implemented in each deriving class. The
public methods would provide different interfaces for specifying the
buffer or for writing directly into IO_CACHE or file.
*/
/* Return 0 on success, -1 on error, -2 on out-of-buffer. */
int materialise(uchar *buffer, size_t buflen) const;
/*
Returns NULL on error or else malloc()ed buffer with materialised event,
caller must free().
*/
uchar *materialise() const;
/* Same, but using a passed-in memroot. */
uchar *materialise(mem_root *memroot) const;
/*
Materialise to user-supplied writer function (could write directly to file
or the like).
*/
int materialise(int (*writer)(uchar *data, size_t len, void *context)) const;
/*
As for what to do with a materialised event, there are a couple of
possibilities.
One is to have a de_materialise() method somewhere that can construct an
rpl_event_base (really a derived class of course) from a buffer or writer
function. This would require each accessor function to conditionally read
its data from either THD context or buffer (GCC is able to optimise
several such conditionals in multiple accessor function calls into one
conditional), or we can make all accessors virtual if the performance hit
is acceptable.
Another is to have different classes for accessing events read from
materialised event data.
Also, I still need to think about whether it is at all useful to be able
to generically materialise an event at this level. It may be that any
binlog/transport will in any case need to understand more of the format of
events, so that such materialisation/transport is better done at a
different layer.
*/
protected:
/* Implementation which is the basis for materialise(). */
virtual int do_materialise(int (*writer)(uchar *data, size_t len,
void *context)) const = 0;
private:
/* Virtual base class, private constructor to prevent instantiation. */
rpl_event_base();
};
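To illustrate the writer-function style of materialisation described in the class above (the event payload and the helper names here are made up; only the callback signature follows the draft):

```cpp
#include <cstddef>
#include <string>

typedef unsigned char uchar;

// Sketch of the writer-callback materialisation style: the event serialises
// itself through a user-supplied writer, so the caller decides whether the
// bytes go to a memory buffer, an IO_CACHE, a file, or a socket.

// Example writer: append the bytes to a std::string passed as context.
static int append_to_string(uchar *data, size_t len, void *context)
{
  static_cast<std::string *>(context)->append(reinterpret_cast<char *>(data),
                                              len);
  return 0;                                    // 0 = success
}

// Stand-in for do_materialise() in some derived event class; the "EVNT"
// header and two-byte body are placeholders, not a real event format.
static int toy_materialise(int (*writer)(uchar *, size_t, void *), void *ctx)
{
  uchar header[4] = { 'E', 'V', 'N', 'T' };
  if (int err = writer(header, sizeof(header), ctx))
    return err;
  uchar body[2] = { 0x01, 0x02 };
  return writer(body, sizeof(body), ctx);
}
```

Note the draft passes no explicit context argument to materialise(); how the writer gets its destination state is one of the open details.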
/*
These are the event types output from the transaction event generator.
This generator is not stacked on anything.
The transaction event generator marks the start and end (commit or rollback)
of transactions. It also gives information about whether the transaction was
a full transaction or autocommitted statement, whether transactional tables
were involved, whether non-transactional tables were involved, and XA
information (ToDo).
*/
/* Base class for transaction events. */
class rpl_event_transaction_base : public rpl_event_base
{
public:
/*
Get the local transaction id. This id is only unique within one server.
It is allocated whenever a new transaction is started.
Can be used to identify events belonging to the same transaction in a
binlog-like stream of events streamed in parallel among multiple
transactions.
*/
uint64_t get_local_trx_id() const { return thd->local_trx_id; };
bool get_is_autocommit() const;
private:
/* The context is the THD. */
THD *thd;
rpl_event_transaction_base(THD *_thd) : thd(_thd) { };
};
/* Transaction start event. */
class rpl_event_transaction_start : public rpl_event_transaction_base
{
};
/* Transaction commit. */
class rpl_event_transaction_commit : public rpl_event_transaction_base
{
public:
/*
The global transaction id is unique cross-server.
It can be used to identify the position from which to start a slave
replicating from a master.
This global ID is only available once the transaction is decided to commit
by the TC manager / primary redundancy service. This TC also allocates the
ID and decides the exact semantics (can there be gaps, etc); however the
format is fixed (cluster_id, running_counter).
*/
struct global_transaction_id
{
uint32_t cluster_id;
uint64_t counter;
};
const global_transaction_id *get_global_transaction_id() const;
};
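As a sketch of how a consumer might compare such ids to track a slave position (the ordering rule within one cluster is an assumption; the TC manager decides the real semantics, including whether gaps can occur):

```cpp
#include <cstdint>

// The fixed format from the draft: (cluster_id, running_counter).
struct global_transaction_id
{
  uint32_t cluster_id;
  uint64_t counter;
};

// True if a precedes b. Only ids from the same cluster are assumed to be
// comparable; cross-cluster ordering is left to the TC manager.
inline bool gtid_before(const global_transaction_id &a,
                        const global_transaction_id &b)
{
  return a.cluster_id == b.cluster_id && a.counter < b.counter;
}
```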
/* Transaction rollback. */
class rpl_event_transaction_rollback : public rpl_event_transaction_base
{
};
/* Base class for statement events. */
class rpl_event_statement_base : public rpl_event_base
{
public:
LEX_STRING get_current_db() const;
};
class rpl_event_statement_start : public rpl_event_statement_base
{
};
class rpl_event_statement_end : public rpl_event_statement_base
{
public:
int get_errorcode() const;
};
class rpl_event_statement_query : public rpl_event_statement_base
{
public:
LEX_STRING get_query_string();
ulong get_sql_mode();
const CHARSET_INFO *get_character_set_client();
const CHARSET_INFO *get_collation_connection();
const CHARSET_INFO *get_collation_server();
const CHARSET_INFO *get_collation_default_db();
/*
Access to relevant flags that affect query execution.
Use as if (ev->get_flags() & (uint32)STMT_FOREIGN_KEY_CHECKS) { ... }
*/
enum flag_bits
{
STMT_FOREIGN_KEY_CHECKS, // @@foreign_key_checks
STMT_UNIQUE_KEY_CHECKS, // @@unique_checks
STMT_AUTO_IS_NULL, // @@sql_auto_is_null
};
uint32_t get_flags();
ulong get_auto_increment_offset();
ulong get_auto_increment_increment();
// And so on for: time zone; day/month names; connection id; LAST_INSERT_ID;
// INSERT_ID; random seed; user variables.
//
// We probably also need get_uses_temporary_table(), get_used_user_vars(),
// get_uses_auto_increment() and so on, so a consumer can get more
// information about what kind of context information a query will need when
// executed on a slave.
};
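A sketch of checking these flags, assuming the enum values are bit positions so that a mask is formed as (1 << flag); the draft comment uses the enum value directly as a mask, which only works if the values are themselves defined as masks:

```cpp
#include <cstdint>

// Bit positions mirroring the draft's flag_bits enum (values here are an
// assumption about the intended encoding; get_flags() is stubbed by a
// plain uint32_t value).
enum flag_bits
{
  STMT_FOREIGN_KEY_CHECKS,   // @@foreign_key_checks
  STMT_UNIQUE_KEY_CHECKS,    // @@unique_checks
  STMT_AUTO_IS_NULL,         // @@sql_auto_is_null
};

// Test one flag in a flags word, treating enum values as bit indices.
inline bool flag_set(uint32_t flags, flag_bits bit)
{
  return (flags & (1u << bit)) != 0;
}
```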
class rpl_event_statement_load_query : public rpl_event_statement_query
{
};
/*
This event is fired with blocks of data for files read (from server-local
file or client connection) for LOAD DATA.
*/
class rpl_event_statement_load_data_block : public rpl_event_statement_base
{
public:
struct block
{
const uchar *ptr;
size_t size;
};
block get_block() const;
};
/* Base class for row-based replication events. */
class rpl_event_row_base : public rpl_event_base
{
public:
/*
Access to relevant handler extra flags and other flags that affect row
operations.
Use as if (ev->get_flags() & (uint32)ROW_WRITE_CAN_REPLACE) { ... }
*/
enum flag_bits
{
ROW_WRITE_CAN_REPLACE, // HA_EXTRA_WRITE_CAN_REPLACE
ROW_IGNORE_DUP_KEY, // HA_EXTRA_IGNORE_DUP_KEY
ROW_IGNORE_NO_KEY, // HA_EXTRA_IGNORE_NO_KEY
ROW_DISABLE_FOREIGN_KEY_CHECKS, // ! @@foreign_key_checks
ROW_DISABLE_UNIQUE_KEY_CHECKS, // ! @@unique_checks
};
uint32_t get_flags();
/* Access to list of tables modified. */
class table_iterator
{
public:
/* Returns table, NULL after last. */
const TABLE *get_next();
private:
// ...
};
table_iterator get_modified_tables() const;
private:
/* Context used to provide accessors. */
THD *thd;
protected:
rpl_event_row_base(THD *_thd) : thd(_thd) { }
};
class rpl_event_row_write : public rpl_event_row_base
{
public:
const BITMAP *get_write_set() const;
const uchar *get_after_image() const;
};
class rpl_event_row_update : public rpl_event_row_base
{
public:
const BITMAP *get_read_set() const;
const BITMAP *get_write_set() const;
const uchar *get_before_image() const;
const uchar *get_after_image() const;
};
class rpl_event_row_delete : public rpl_event_row_base
{
public:
const BITMAP *get_read_set() const;
const uchar *get_before_image() const;
};
/*
Event consumer callbacks.
An event consumer registers with an event generator to receive event
notifications from that generator.
The consumer has callbacks (in the form of virtual functions) for the
individual event types the consumer is interested in. Only callbacks that
are non-NULL will be invoked. If an event applies to multiple callbacks in a
single callback struct, it will only be passed to the most specific non-NULL
callback (so events never fire more than once per registration).
The lifetime of the memory holding the event is only for the duration of the
callback invocation, unless otherwise noted.
Callbacks return 0 for success or error code (ToDo: does this make sense?).
*/
struct rpl_event_consumer_transaction
{
virtual int trx_start(const rpl_event_transaction_start *) { return 0; }
virtual int trx_commit(const rpl_event_transaction_commit *) { return 0; }
virtual int trx_rollback(const rpl_event_transaction_rollback *) { return 0; }
};
/*
Consuming statement-based events.
The statement event generator is stacked on top of the transaction event
generator, so we can receive those events as well.
*/
struct rpl_event_consumer_statement : public rpl_event_consumer_transaction
{
virtual int stmt_start(const rpl_event_statement_start *) { return 0; }
virtual int stmt_end(const rpl_event_statement_end *) { return 0; }
virtual int stmt_query(const rpl_event_statement_query *) { return 0; }
/* Data for a file used in LOAD DATA [LOCAL] INFILE. */
virtual int stmt_load_data_block(const rpl_event_statement_load_data_block *)
{ return 0; }
/*
These are specific kinds of statements; if specified they override
stmt_query() for the corresponding event.
*/
virtual int stmt_load_query(const rpl_event_statement_load_query *ev)
{ return stmt_query(ev); }
};
/*
Consuming row-based events.
The row event generator is stacked on top of the statement event generator.
*/
struct rpl_event_consumer_row : public rpl_event_consumer_statement
{
virtual int row_write(const rpl_event_row_write *) { return 0; }
virtual int row_update(const rpl_event_row_update *) { return 0; }
virtual int row_delete(const rpl_event_row_delete *) { return 0; }
};
/*
Registration functions.
ToDo: Make a way to de-register.
ToDo: Find a good way for a plugin (eg. PBXT) to publish a generator
registration method.
*/
int rpl_event_transaction_register(const rpl_event_consumer_transaction *cbs);
int rpl_event_statement_register(const rpl_event_consumer_statement *cbs);
int rpl_event_row_register(const rpl_event_consumer_row *cbs);
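A toy sketch of what registration and dispatch could look like behind such functions (the registry and all names are invented for illustration; note that de-registration is still a ToDo in the draft):

```cpp
#include <vector>

// Simplified stand-ins for the consumer interface and registration API.
struct toy_consumer
{
  virtual ~toy_consumer() {}
  virtual int trx_commit() { return 0; }
};

// The generator keeps a list of registered consumers.
static std::vector<toy_consumer *> registry;

int toy_register(toy_consumer *cbs)
{
  registry.push_back(cbs);
  return 0;
}

// Called by the generator at commit time; the same (const in the real draft)
// event would be passed to every registered consumer. Here dispatch stops at
// the first error, one possible semantics for non-zero callback returns.
int fire_trx_commit()
{
  for (toy_consumer *c : registry)
    if (int err = c->trx_commit())
      return err;
  return 0;
}

// Example consumer counting commits.
struct commit_counter : toy_consumer
{
  int n = 0;
  int trx_commit() override { ++n; return 0; }
};
```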
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v3.5.9)
Hi everyone,
After a long and intense fight with CPack and NSIS, I finally have a
solution that is functional. The one TODO I have before I consider it
really good enough is to be able to set up MariaDB as a service. That
will come later.
The big problem with the installer was how to handle the database files.
If they are just copied to the data dir and used, the uninstaller will
silently delete them. This is *bad*. So I spent a long time trying to
get around this problem and make the uninstaller ask if the user wants
to get rid of these files. I'm now completely convinced this is
impossible with the current CPack :(
I have tried several workarounds, that also wouldn't work before I came
up with this:
The installer will install the data files to data\clean. At the end of
the installer, it checks if there is a file called data\mysql\db.frm
(could have been any other file). If the file is there, the user gets a
message saying the installer has not written the clean database files
to the data directory. If the file isn't there, the installer copies all
the files in data\clean to data.
The uninstaller will of course silently delete all the files in
data\clean. But it will give the user a message that the database files
are not deleted.
So, if you install this package and uninstall it again, the database
files are still on the disk. If you reinstall the package, it will use
the existing data files.
If you upgrade to a newer version, this will be installed in a different
directory (the default directory name contains the version number), and you
can copy the data files from the old directory into it if you want to.
Or you can copy the clean dir somewhere else and modify the ini file to
point at it.
IMHO, this is a reasonable solution that doesn't involve patching CMake
or some other evil scheme I've been considering.
To generate an installer: Run cmake as usual, build in visual studio,
and call "cpack" when the build is done. That's about as simple as
possible :)
Can I check this into the 5.2 branch?
Bo Thorsen.
Monty Program AB.
--
MariaDB: MySQL replacement
Community developed. Feature enhanced. Backward compatible.
Hello 5.3 developers,
We all know that the 5.3 tree has some buildbot failures that
- are unlikely to be the result of any 5.3 work,
- cannot be observed in 5.2,
- are still present nonetheless.
I got suspicious about one failure, and investigated it:
https://bugs.launchpad.net/maria/+bug/597742. Long story short, it was
present in 5.2 at some earlier point but has been fixed there since then.
I think, in order to avoid spending time the way it was spent analyzing the
above-mentioned bug, we should do a 5.2->5.3 merge. 5.2 now produces an almost
green run in buildbot (the exception is plugin_load.test), and AFAIU the
release of 5.2.1 can be interpreted as indication that 5.2's code is not going
to change much anymore.
Any objections to doing the merge?
BR
Sergey
--
Sergey Petrunia, Software Developer
Monty Program AB, http://askmonty.org
Blog: http://s.petrunia.net/blog
Re: [Maria-developers] [Commits] Rev 2817: Make MariaDB compile with VS 2010 in file:///Users/hakan/work/monty_program/maria-5.2/
by Kristian Nielsen 24 Jun '10
Hakan Kuecuekyilmaz <hakan(a)askmonty.org> writes:
> === modified file 'sql/CMakeLists.txt'
> --- a/sql/CMakeLists.txt 2010-06-01 19:52:20 +0000
> +++ b/sql/CMakeLists.txt 2010-06-24 10:44:39 +0000
> @@ -17,8 +17,7 @@
> SET(CMAKE_CXX_FLAGS_DEBUG
> "${CMAKE_CXX_FLAGS_DEBUG} -DSAFEMALLOC -DSAFE_MUTEX -DUSE_SYMDIR /Zi")
> SET(CMAKE_C_FLAGS_DEBUG
> - "${CMAKE_C_FLAGS_DEBUG} -DSAFEMALLOC -DSAFE_MUTEX -DUSE_SYMDIR /Zi")
> -SET(CMAKE_EXE_LINKER_FLAGS_DEBUG "${CMAKE_EXE_LINKER_FLAGS_DEBUG} /MAP /MAPINFO:EXPORTS")
> + "${CMAKE_C_FLAGS_DEBUG} -DSAFEMALLOC -DSAFE_MUTEX -DUSE_SYMDIR /Zi")
Avoid making spurious whitespace-only changes like this (added space at end of line).
> === added file 'win/build-vs10.bat'
> --- a/win/build-vs10.bat 1970-01-01 00:00:00 +0000
> +++ b/win/build-vs10.bat 2010-06-24 10:44:39 +0000
> @@ -0,0 +1,18 @@
> +@echo off
> +
> +REM Copyright (C) 2010 Monty Program AB
> +REM
> +REM This program is free software; you can redistribute it and/or modify
> +REM it under the terms of the GNU General Public License as published by
> +REM the Free Software Foundation; version 2 of the License.
> +REM
> +REM This program is distributed in the hope that it will be useful,
> +REM but WITHOUT ANY WARRANTY; without even the implied warranty of
> +REM MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
> +REM GNU General Public License for more details.
> +REM
> +REM You should have received a copy of the GNU General Public License
> +REM along with this program; if not, write to the Free Software
> +REM Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
> +cmake -G "Visual Studio 10"
> +
>
> === added file 'win/build-vs10_x64.bat'
> --- a/win/build-vs10_x64.bat 1970-01-01 00:00:00 +0000
> +++ b/win/build-vs10_x64.bat 2010-06-24 10:44:39 +0000
> @@ -0,0 +1,18 @@
> +@echo off
> +
> +REM Copyright (C) 2010 Monty Program AB
> +REM
> +REM This program is free software; you can redistribute it and/or modify
> +REM it under the terms of the GNU General Public License as published by
> +REM the Free Software Foundation; version 2 of the License.
> +REM
> +REM This program is distributed in the hope that it will be useful,
> +REM but WITHOUT ANY WARRANTY; without even the implied warranty of
> +REM MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
> +REM GNU General Public License for more details.
> +REM
> +REM You should have received a copy of the GNU General Public License
> +REM along with this program; if not, write to the Free Software
> +REM Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
> +cmake -G "Visual Studio 10 Win64"
> +
You need to add these new files to EXTRA_DIST in Makefile.am.
> === modified file 'win/configure-mariadb.sh'
> --- a/win/configure-mariadb.sh 2009-10-08 19:04:12 +0000
> +++ b/win/configure-mariadb.sh 2010-06-24 10:44:39 +0000
> @@ -15,9 +15,7 @@
> WITH_FEDERATED_STORAGE_ENGINE \
> WITH_MERGE_STORAGE_ENGINE \
> WITH_PARTITION_STORAGE_ENGINE \
> - WITH_MARIA_STORAGE_ENGINE \
> - WITH_PBXT_STORAGE_ENGINE \
> - WITH_XTRADB_STORAGE_ENGINE \
> + WITH_MARIA_STORAGE_ENGINE \
> + WITH_PBXT_STORAGE_ENGINE \
> + WITH_XTRADB_STORAGE_ENGINE \
> WITH_EMBEDDED_SERVER
> -
> -
Why?
- Kristian.
[Maria-developers] Updated (by Knielsen): Replication API for stacked event generators (120)
by worklog-noreply@askmonty.org 24 Jun '10
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: Replication API for stacked event generators
CREATION DATE..: Mon, 07 Jun 2010, 13:13
SUPERVISOR.....: Knielsen
IMPLEMENTOR....: Knielsen
COPIES TO......:
CATEGORY.......: Server-Sprint
TASK ID........: 120 (http://askmonty.org/worklog/?tid=120)
VERSION........: Server-9.x
STATUS.........: Assigned
PRIORITY.......: 60
WORKED HOURS...: 2
ESTIMATE.......: 0 (hours remain)
ORIG. ESTIMATE.: 0
PROGRESS NOTES:
-=-=(Knielsen - Thu, 24 Jun 2010, 14:29)=-=-
Category updated.
--- /tmp/wklog.120.old.7440 2010-06-24 14:29:38.000000000 +0000
+++ /tmp/wklog.120.new.7440 2010-06-24 14:29:38.000000000 +0000
@@ -1 +1 @@
-Server-RawIdeaBin
+Server-Sprint
-=-=(Knielsen - Thu, 24 Jun 2010, 14:29)=-=-
Status updated.
--- /tmp/wklog.120.old.7440 2010-06-24 14:29:38.000000000 +0000
+++ /tmp/wklog.120.new.7440 2010-06-24 14:29:38.000000000 +0000
@@ -1 +1 @@
-Un-Assigned
+Assigned
-=-=(Knielsen - Thu, 24 Jun 2010, 14:29)=-=-
Low Level Design modified.
--- /tmp/wklog.120.old.7355 2010-06-24 14:29:11.000000000 +0000
+++ /tmp/wklog.120.new.7355 2010-06-24 14:29:11.000000000 +0000
@@ -1 +1,385 @@
+A consumer is implemented as a virtual class (interface). There is one virtual
+function for every event that can be received. A consumer would derive from
+the base class and override methods for the events it wants to receive.
+
+There is one consumer interface for each generator. When a generator A is
+stacked on B, the consumer interface for A inherits from the interface for
+B. This way, when A defers an event to B, the consumer for A will receive the
+corresponding event from B.
+
+There are methods for a consumer to register itself to receive events from
+each generator. I still need to find a way for a consumer in one plugin to
+register itself with a generator implemented in another plugin (eg. PBXT
+engine-level replication). I also need to add a way for consumers to
+de-register themselves.
+
+The current design has consumer callbacks return 0 for success and error code
+otherwise. I still need to think more about whether this is useful (ie. what
+is the semantics of returning an error from a consumer callback).
+
+Each event passed to consumers is defined as a class with public accessor
+methods to a private context (which is mostly the THD).
+
+My intention is to make all events passed around const, so that the same event
+can be passed to each of multiple registered consumers (and to emphasise that
+consumers do not have the ability to modify events). It still needs to be seen
+whether that const-ness will be feasible in practice without very heavy
+modification/constification of existing code.
+
+What follows is a partial draft of a possible definition of the API as
+concrete C++ class definitions.
+
+-----------------------------------------------------------------------
+
+/*
+ Virtual base class for generated replication events.
+
+ This is the parent of events generated from all kinds of generators. Only
+ child classes can be instantiated.
+
+ This class can be used by code that wants to treat events in a generic way,
+ without any knowledge of event details. I still need to decide whether such
+ generic code is sensible.
+*/
+class rpl_event_base
+{
+ /*
+ Maybe we will want the ability to materialise an event to a standard
+ binary format. This could be achieved with a base method like this. The
+ actual materialisation would be implemented in each deriving class. The
+ public methods would provide different interfaces for specifying the
+ buffer or for writing directly into IO_CACHE or file.
+ */
+
+ /* Return 0 on success, -1 on error, -2 on out-of-buffer. */
+ int materialise(uchar *buffer, size_t buflen) const;
+ /*
+ Returns NULL on error or else malloc()ed buffer with materialised event,
+ caller must free().
+ */
+ uchar *materialise() const;
+ /* Same but using passed in memroot. */
+ uchar *materialise(mem_root *memroot) const;
+ /*
+ Materialise to user-supplied writer function (could write directly to file
+ or the like).
+ */
+ int materialise(int (*writer)(uchar *data, size_t len, void *context)) const;
+
+ /*
+ As to for what to do with a materialised event, there are a couple of
+ possibilities.
+
+ One is to have a de_materialise() method somewhere that can construct an
+ rpl_event_base (really a derived class of course) from a buffer or writer
+ function. This would require each accessor function to conditionally read
+ its data from either THD context or buffer (GCC is able to optimise
+ several such conditionals in multiple accessor function calls into one
+ conditional), or we can make all accessors virtual if the performance hit
+ is acceptable.
+
+ Another is to have different classes for accessing events read from
+ materialised event data.
+
+ Also, I still need to think about whether it is at all useful to be able
+ to generically materialise an event at this level. It may be that any
+ binlog/transport will in any case need to understand more of the format of
+ events, so that such materialisation/transport is better done at a
+ different layer.
+ */
+
+protected:
+ /* Implementation which is the basis for materialise(). */
+ virtual int do_materialise(int (*writer)(uchar *data, size_t len,
+ void *context)) const = 0;
+
+private:
+ /* Virtual base class, private constructor to prevent instantiation. */
+ rpl_event_base();
+};
+
+
+/*
+ These are the event types output from the transaction event generator.
+
+ This generator is not stacked on anything.
+
+ The transaction event generator marks the start and end (commit or rollback)
+ of transactions. It also gives information about whether the transaction was
+ a full transaction or autocommitted statement, whether transactional tables
+ were involved, whether non-transactional tables were involved, and XA
+ information (ToDo).
+*/
+
+/* Base class for transaction events. */
+class rpl_event_transaction_base : public rpl_event_base
+{
+public:
+ /*
+ Get the local transaction id. This id is only unique within one server.
+ It is allocated whenever a new transaction is started.
+ Can be used to identify events belonging to the same transaction in a
+ binlog-like stream of events streamed in parallel among multiple
+ transactions.
+ */
+ uint64_t get_local_trx_id() const { return thd->local_trx_id; };
+
+ bool get_is_autocommit() const;
+
+private:
+ /* The context is the THD. */
+ THD *thd;
+
+ rpl_event_transaction_base(THD *_thd) : thd(_thd) { };
+};
+
+/* Transaction start event. */
+class rpl_event_transaction_start : public rpl_event_transaction_base
+{
+
+};
+
+/* Transaction commit. */
+class rpl_event_transaction_commit : public rpl_event_transaction_base
+{
+public:
+ /*
+ The global transaction id is unique cross-server.
+
+ It can be used to identify the position from which to start a slave
+ replicating from a master.
+
+ This global ID is only available once the transaction is decided to commit
+ by the TC manager / primary redundancy service. This TC also allocates the
+ ID and decides the exact semantics (can there be gaps, etc); however the
+ format is fixed (cluster_id, running_counter).
+ */
+ struct global_transaction_id
+ {
+ uint32_t cluster_id;
+ uint64_t counter;
+ };
+
+ const global_transaction_id *get_global_transaction_id() const;
+};
+
+/* Transaction rollback. */
+class rpl_event_transaction_rollback : public rpl_event_transaction_base
+{
+
+};
+
+
+/* Base class for statement events. */
+class rpl_event_statement_base : public rpl_event_base
+{
+public:
+ LEX_STRING get_current_db() const;
+};
+
+class rpl_event_statement_start : public rpl_event_statement_base
+{
+
+};
+
+class rpl_event_statement_end : public rpl_event_statement_base
+{
+public:
+ int get_errorcode() const;
+};
+
+class rpl_event_statement_query : public rpl_event_statement_base
+{
+public:
+ LEX_STRING get_query_string();
+ ulong get_sql_mode();
+ const CHARSET_INFO *get_character_set_client();
+ const CHARSET_INFO *get_collation_connection();
+ const CHARSET_INFO *get_collation_server();
+ const CHARSET_INFO *get_collation_default_db();
+
+ /*
+ Access to relevant flags that affect query execution.
+
+ Use as if (ev->get_flags() & (uint32)STMT_FOREIGN_KEY_CHECKS) { ... }
+ */
+ enum flag_bits
+ {
+ STMT_FOREIGN_KEY_CHECKS, // @@foreign_key_checks
+ STMT_UNIQUE_KEY_CHECKS, // @@unique_checks
+ STMT_AUTO_IS_NULL, // @@sql_auto_is_null
+ };
+ uint32_t get_flags();
+
+ ulong get_auto_increment_offset();
+ ulong get_auto_increment_increment();
+
+ // And so on for: time zone; day/month names; connection id; LAST_INSERT_ID;
+ // INSERT_ID; random seed; user variables.
+ //
+ // We probably also need get_uses_temporary_table(), get_used_user_vars(),
+ // get_uses_auto_increment() and so on, so a consumer can get more
+ // information about what kind of context information a query will need when
+ // executed on a slave.
+};
+
+class rpl_event_statement_load_query : public rpl_event_statement_query
+{
+
+};
+
+/*
+ This event is fired with blocks of data for files read (from server-local
+ file or client connection) for LOAD DATA.
+*/
+class rpl_event_statement_load_data_block : public rpl_event_statement_base
+{
+public:
+ struct block
+ {
+ const uchar *ptr;
+ size_t size;
+ };
+ block get_block() const;
+};
+
+/* Base class for row-based replication events. */
+class rpl_event_row_base : public rpl_event_base
+{
+public:
+ /*
+ Access to relevant handler extra flags and other flags that affect row
+ operations.
+
+ Use as if (ev->get_flags() & (uint32)ROW_WRITE_CAN_REPLACE) { ... }
+ */
+ enum flag_bits
+ {
+ ROW_WRITE_CAN_REPLACE, // HA_EXTRA_WRITE_CAN_REPLACE
+ ROW_IGNORE_DUP_KEY, // HA_EXTRA_IGNORE_DUP_KEY
+ ROW_IGNORE_NO_KEY, // HA_EXTRA_IGNORE_NO_KEY
+ ROW_DISABLE_FOREIGN_KEY_CHECKS, // ! @@foreign_key_checks
+ ROW_DISABLE_UNIQUE_KEY_CHECKS, // ! @@unique_checks
+ };
+ uint32_t get_flags();
+
+ /* Access to list of tables modified. */
+ class table_iterator
+ {
+ public:
+ /* Returns table, NULL after last. */
+ const TABLE *get_next();
+ private:
+ // ...
+ };
+ table_iterator get_modified_tables() const;
+
+private:
+ /* Context used to provide accessors. */
+ THD *thd;
+
+protected:
+ rpl_event_row_base(THD *_thd) : thd(_thd) { }
+};
+
+
+class rpl_event_row_write : public rpl_event_row_base
+{
+public:
+ const BITMAP *get_write_set() const;
+ const uchar *get_after_image() const;
+};
+
+class rpl_event_row_update : public rpl_event_row_base
+{
+public:
+ const BITMAP *get_read_set() const;
+ const BITMAP *get_write_set() const;
+ const uchar *get_before_image() const;
+ const uchar *get_after_image() const;
+};
+
+class rpl_event_row_delete : public rpl_event_row_base
+{
+public:
+ const BITMAP *get_read_set() const;
+ const uchar *get_before_image() const;
+};
+
+
+/*
+ Event consumer callbacks.
+
+ An event consumer registers with an event generator to receive event
+ notifications from that generator.
+
+ The consumer has callbacks (in the form of virtual functions) for the
+ individual event types the consumer is interested in. Only callbacks that
+ are non-NULL will be invoked. If an event applies to multiple callbacks in a
+ single callback struct, it will only be passed to the most specific non-NULL
+ callback (so events never fire more than once per registration).
+
+ The lifetime of the memory holding the event is only for the duration of the
+ callback invocation, unless otherwise noted.
+
+ Callbacks return 0 for success or error code (ToDo: does this make sense?).
+*/
+
+struct rpl_event_consumer_transaction
+{
+ virtual int trx_start(const rpl_event_transaction_start *) { return 0; }
+ virtual int trx_commit(const rpl_event_transaction_commit *) { return 0; }
+ virtual int trx_rollback(const rpl_event_transaction_rollback *) { return 0; }
+};
+
+/*
+ Consuming statement-based events.
+
+ The statement event generator is stacked on top of the transaction event
+ generator, so we can receive those events as well.
+*/
+struct rpl_event_consumer_statement : public rpl_event_consumer_transaction
+{
+ virtual int stmt_start(const rpl_event_statement_start *) { return 0; }
+ virtual int stmt_end(const rpl_event_statement_end *) { return 0; }
+
+ virtual int stmt_query(const rpl_event_statement_query *) { return 0; }
+
+ /* Data for a file used in LOAD DATA [LOCAL] INFILE. */
+ virtual int stmt_load_data_block(const rpl_event_statement_load_data_block *)
+ { return 0; }
+
+ /*
+ These are specific kinds of statements; if specified they override
+ stmt_query() for the corresponding event.
+ */
+ virtual int stmt_load_query(const rpl_event_statement_load_query *ev)
+ { return stmt_query(ev); }
+};
+
+/*
+ Consuming row-based events.
+
+ The row event generator is stacked on top of the statement event generator.
+*/
+struct rpl_event_consumer_row : public rpl_event_consumer_statement
+{
+ virtual int row_write(const rpl_event_row_write *) { return 0; }
+ virtual int row_update(const rpl_event_row_update *) { return 0; }
+ virtual int row_delete(const rpl_event_row_delete *) { return 0; }
+};
+
+
+/*
+ Registration functions.
+
+ ToDo: Make a way to de-register.
+
+ ToDo: Find a good way for a plugin (eg. PBXT) to publish a generator
+ registration method.
+*/
+
+int rpl_event_transaction_register(const rpl_event_consumer_transaction *cbs);
+int rpl_event_statement_register(const rpl_event_consumer_statement *cbs);
+int rpl_event_row_register(const rpl_event_consumer_row *cbs);
-=-=(Knielsen - Thu, 24 Jun 2010, 14:28)=-=-
High-Level Specification modified.
--- /tmp/wklog.120.old.7341 2010-06-24 14:28:17.000000000 +0000
+++ /tmp/wklog.120.new.7341 2010-06-24 14:28:17.000000000 +0000
@@ -1 +1,159 @@
+Generators and consumers
+-------------------------
+
+We have the two concepts:
+
+1. Event _generators_, that produce events describing all changes to data in a
+ server.
+
+2. Event consumers, that receive such events and use them in various ways.
+
+Examples of event generators is execution of SQL statements, which generates
+events like those used for statement-based replication. Another example is
+PBXT engine-level replication.
+
+An example of an event consumer is the writing of the binlog on a master.
+
+Event generators are not really plugins. Rather, there are specific points in
+the server where events are generated. However, a generator can be part of a
+plugin, for example a PBXT engine-level replication event generator would be
+part of the PBXT storage engine plugin.
+
+Event consumers on the other hand could be a plugin.
+
+One generator can be stacked on top of another. This means that a generator on
+top (for example row-based events) will handle some events itself
+(eg. non-deterministic update in mixed-mode binlogging). Other events that it
+does not want to or cannot handle (for example deterministic delete or DDL)
+will be deferred to the generator below (for example statement-based events).
+
+
+Materialisation (or not)
+------------------------
+
+A central decision is how to represent events that are generated in the API at
+the point of generation.
+
+I want to avoid making the API require that events are materialised. By
+"Materialised" I mean that all (or most) of the data for the event is written
+into memory in a struct/class used inside the server or serialised in a data
+buffer (byte buffer) in a format suitable for network transport or disk
+storage.
+
+Using a non-materialised event means storing just a reference to appropriate
+context that allows to retrieve all information for the event using
+accessors. Ie. typically this would be based on getting the event information
+from the THD pointer.
+
+Some reasons to avoid using materialised events in the API:
+
+ - Replication events have a _lot_ of detailed context information that can be
+ needed in events: user-defined variables, random seed, character sets,
+ table column names and types, etc. etc. If we make the API based on
+ materialisation, then the initial decision about which context information
+ to include with which events will have to be done in the API, while ideally
+ we want this decision to be done by the individual consumer plugin. There
+ will thus be a conflict between what to include (to allow consumers access)
+ and what to exclude (to avoid excessive needless work).
+
+ - Materialising means defining a very specific format, which will tend to
+ make the API less generic and flexible.
+
+ - Unless the materialised format is made _very_ specific (and thus very
+ inflexible), it is unlikely to be directly useful for transport
+ (eg. binlog), so it will need to be re-materialised into a different format
+ anyway, wasting work.
+
+ - If a generator on top handles an event, then we want to avoid wasting work
+ materialising an event in a generator below which would be completely
+ unused. Thus there would be a need for the upper generator to somehow
+ notify the lower generator ahead of event generation time to not fire an
+ event, complicating the API.
+
+Some advantages for materialisation:
+
+ - Using an API based on passing around some well-defined struct event (or
+ byte buffer) will be simpler than the complex class hierarchy proposed here
+ with no requirement for materialisation.
+
+ - Defining a materialised format would allow an easy way to use the same
+ consumer code on a generator that produces events at the source of
+ execution and on a generator that produces events from eg. reading them
+ from an event log.
+
+Note that there can be some middle way, where some data is materialised and
+some is kept as reference to context (eg. THD) only. This however loses most
+of the mentioned advantages for materialisation.
+
+The design proposed here aims for as little materialisation as possible.
+
+
+Default materialisation format
+------------------------------
+
+
+While the proposed API doesn't _require_ materialisation, we can still think
+about providing the _option_ for built-in materialisation. This could be
+useful if such materialisation is made suitable for transport to a different
+server (eg. no endian-dependence etc). If there is a facility for such
+materialisation built-in to the API, it becomes possible to write something
+like a generic binlog plugin or generic network transport plugin. This would
+be really useful for eg. PBXT engine-level replication, as it could be
+implemented without having to re-invent a binlog format.
+
+I added in the proposed API a simple facility to materialise every event as a
+string of bytes. To use this, I still need to add a suitable facility to
+de-materialise the event.
+
+However, it is still an open question whether such a facility will be at all
+useful. It still has some of the problems with materialisation mentioned
+above. And I think it is likely that a good binlog implementation will need
+to do more than just blindly copy opaque events from one endpoint to
+another. For example, it might need different event boundaries (merge and/or
+split events); it might need to augment or modify events, or inject new
+events, etc.
+
+So I think maybe it is better to add such a generic materialisation facility
+on top of the basic event generator API. Such a facility would provide
+materialisation of a replication event stream, not of individual events, so
+would be more flexible in providing a good implementation. It would be
+implemented for all generators. It would separate from both the event
+generator API (so we have flexibility to put a filter class in-between
+generator and materialisation), and could also be separate from the actual
+transport handling stuff like fsync() of binlog files and socket connections
+etc. It would be paired with a corresponding applier API which would handle
+executing events on a slave.
+
+Then we can have a default materialised event format, which is available, but
+not mandatory. So there can still be other formats alongside (like legacy
+MySQL 5.1 binlog event format and maybe Tungsten would have its own format).
+
+
+Encapsulation
+-------------
+
+Another fundamental question about the design is the level of encapsulation
+used for the API.
+
+At the implementation level, a lot of the work is basically to pull out all of
+the needed information from the THD object/context. The API I propose tries to
+_not_ expose the THD to consumers. Instead it provides accessor functions for
+all the bits and pieces relevant to each replication event, while the event
+class itself likely will be more or less just an encapsulated THD.
+
+So an alternative would be to have a generic event that was just (type, THD).
+Then consumers could just pull out whatever information they want from the
+THD. The THD implementation is already exposed to storage engines. This would
+of course greatly reduce the size of the API, eliminating lots of class
+definitions and accessor functions. Though arguably it wouldn't really
+simplify the API, as the complexity would just be in understanding the THD
+class.
+
+Note that we do not have to take any performance hit from using encapsulated
+accessors since compilers can inline them (though if inlining then we do not
+get any ABI stability with respect to THD implementation).
+
+For now, the API is proposed without exposing the THD class. (Similar
+encapsulation could be added in actual implementation to also not expose TABLE
+and similar classes).
-=-=(Knielsen - Thu, 24 Jun 2010, 12:04)=-=-
Dependency created: 107 now depends on 120
-=-=(Knielsen - Thu, 24 Jun 2010, 11:59)=-=-
High Level Description modified.
--- /tmp/wklog.120.old.516 2010-06-24 11:59:24.000000000 +0000
+++ /tmp/wklog.120.new.516 2010-06-24 11:59:24.000000000 +0000
@@ -11,4 +11,4 @@
Event generators can be stacked, and a generator may defer event generation to
the next one down the stack. For example, the row-level replication event
-generator may defer DLL to the statement-level replication event generator.
+generator may defer DDL to the statement-level replication event generator.
-=-=(Knielsen - Mon, 21 Jun 2010, 08:35)=-=-
Research and design thoughts.
Worked 2 hours and estimate 0 hours remain (original estimate increased by 2 hours).
DESCRIPTION:
A part of the replication project, MWL#107.
Events are produced by event Generators. Examples are
- Generation of statement-based replication events
- Generation of row-based events
- Generation of PBXT engine-level replication events
and maybe reading of events from relay log on slave may also be an example of
generating events.
Event generators can be stacked, and a generator may defer event generation to
the next one down the stack. For example, the row-level replication event
generator may defer DDL to the statement-level replication event generator.
HIGH-LEVEL SPECIFICATION:
Generators and consumers
-------------------------
We have the two concepts:
1. Event _generators_, that produce events describing all changes to data in a
server.
2. Event consumers, that receive such events and use them in various ways.
Examples of event generators is execution of SQL statements, which generates
events like those used for statement-based replication. Another example is
PBXT engine-level replication.
An example of an event consumer is the writing of the binlog on a master.
Event generators are not really plugins. Rather, there are specific points in
the server where events are generated. However, a generator can be part of a
plugin, for example a PBXT engine-level replication event generator would be
part of the PBXT storage engine plugin.
Event consumers on the other hand could be a plugin.
One generator can be stacked on top of another. This means that a generator on
top (for example row-based events) will handle some events itself
(eg. non-deterministic update in mixed-mode binlogging). Other events that it
does not want to or cannot handle (for example deterministic delete or DDL)
will be deferred to the generator below (for example statement-based events).
Materialisation (or not)
------------------------
A central decision is how to represent events that are generated in the API at
the point of generation.
I want to avoid making the API require that events are materialised. By
"Materialised" I mean that all (or most) of the data for the event is written
into memory in a struct/class used inside the server or serialised in a data
buffer (byte buffer) in a format suitable for network transport or disk
storage.
Using a non-materialised event means storing just a reference to appropriate
context that allows to retrieve all information for the event using
accessors. Ie. typically this would be based on getting the event information
from the THD pointer.
Some reasons to avoid using materialised events in the API:
- Replication events have a _lot_ of detailed context information that can be
needed in events: user-defined variables, random seed, character sets,
table column names and types, etc. etc. If we make the API based on
materialisation, then the initial decision about which context information
to include with which events will have to be done in the API, while ideally
we want this decision to be done by the individual consumer plugin. There
will thus be a conflict between what to include (to allow consumers access)
and what to exclude (to avoid excessive needless work).
- Materialising means defining a very specific format, which will tend to
make the API less generic and flexible.
- Unless the materialised format is made _very_ specific (and thus very
inflexible), it is unlikely to be directly useful for transport
(eg. binlog), so it will need to be re-materialised into a different format
anyway, wasting work.
- If a generator on top handles an event, then we want to avoid wasting work
materialising an event in a generator below which would be completely
unused. Thus there would be a need for the upper generator to somehow
notify the lower generator ahead of event generation time to not fire an
event, complicating the API.
Some advantages for materialisation:
- Using an API based on passing around some well-defined struct event (or
byte buffer) will be simpler than the complex class hierarchy proposed here
with no requirement for materialisation.
- Defining a materialised format would allow an easy way to use the same
consumer code on a generator that produces events at the source of
execution and on a generator that produces events from eg. reading them
from an event log.
Note that there can be some middle way, where some data is materialised and
some is kept as reference to context (eg. THD) only. This however loses most
of the mentioned advantages for materialisation.
The design proposed here aims for as little materialisation as possible.
Default materialisation format
------------------------------
While the proposed API doesn't _require_ materialisation, we can still think
about providing the _option_ for built-in materialisation. This could be
useful if such materialisation is made suitable for transport to a different
server (eg. no endian-dependence etc). If there is a facility for such
materialisation built-in to the API, it becomes possible to write something
like a generic binlog plugin or generic network transport plugin. This would
be really useful for eg. PBXT engine-level replication, as it could be
implemented without having to re-invent a binlog format.
I added in the proposed API a simple facility to materialise every event as a
string of bytes. To use this, I still need to add a suitable facility to
de-materialise the event.
However, it is still an open question whether such a facility will be at all
useful. It still has some of the problems with materialisation mentioned
above. And I think it is likely that a good binlog implementation will need
to do more than just blindly copy opaque events from one endpoint to
another. For example, it might need different event boundaries (merge and/or
split events); it might need to augment or modify events, or inject new
events, etc.
So I think maybe it is better to add such a generic materialisation facility
on top of the basic event generator API. Such a facility would provide
materialisation of a replication event stream, not of individual events, so
would be more flexible in providing a good implementation. It would be
implemented for all generators. It would separate from both the event
generator API (so we have flexibility to put a filter class in-between
generator and materialisation), and could also be separate from the actual
transport handling stuff like fsync() of binlog files and socket connections
etc. It would be paired with a corresponding applier API which would handle
executing events on a slave.
Then we can have a default materialised event format, which is available, but
not mandatory. So there can still be other formats alongside (like legacy
MySQL 5.1 binlog event format and maybe Tungsten would have its own format).
Encapsulation
-------------
Another fundamental question about the design is the level of encapsulation
used for the API.
At the implementation level, a lot of the work is basically to pull out all of
the needed information from the THD object/context. The API I propose tries to
_not_ expose the THD to consumers. Instead it provides accessor functions for
all the bits and pieces relevant to each replication event, while the event
class itself likely will be more or less just an encapsulated THD.
So an alternative would be to have a generic event that was just (type, THD).
Then consumers could just pull out whatever information they want from the
THD. The THD implementation is already exposed to storage engines. This would
of course greatly reduce the size of the API, eliminating lots of class
definitions and accessor functions. Though arguably it wouldn't really
simplify the API, as the complexity would just be in understanding the THD
class.
Note that we do not have to take any performance hit from using encapsulated
accessors, since compilers can inline them (though with inlining we do not
get any ABI stability with respect to the THD implementation).
For now, the API is proposed without exposing the THD class. (Similar
encapsulation could be added in the actual implementation to also avoid
exposing TABLE and similar classes).
LOW-LEVEL DESIGN:
A consumer is implemented as a virtual class (interface). There is one virtual
function for every event that can be received. A consumer would derive from
the base class and override methods for the events it wants to receive.
There is one consumer interface for each generator. When a generator A is
stacked on B, the consumer interface for A inherits from the interface for
B. This way, when A defers an event to B, the consumer for A will receive the
corresponding event from B.
There are methods for a consumer to register itself to receive events from
each generator. I still need to find a way for a consumer in one plugin to
register itself with a generator implemented in another plugin (eg. PBXT
engine-level replication). I also need to add a way for consumers to
de-register themselves.
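The stacking of consumer interfaces can be sketched like this (stand-in event and consumer types, not the real API): a consumer that registers with the statement generator derives from the statement interface, and because that interface inherits from the transaction interface, the same consumer also receives the events the statement generator defers to the generator below:

```cpp
#include <cassert>
#include <string>
#include <vector>

/* Hypothetical stand-ins for the generated event classes. */
struct trx_commit_event {};
struct stmt_query_event { std::string query; };

/* Consumer interface for the transaction generator (bottom of the stack). */
struct consumer_transaction
{
  virtual ~consumer_transaction() {}
  virtual int trx_commit(const trx_commit_event *) { return 0; }
};

/*
  Consumer interface for the statement generator, which is stacked on the
  transaction generator: it inherits the transaction callbacks, so a
  statement consumer also receives deferred transaction events.
*/
struct consumer_statement : public consumer_transaction
{
  virtual int stmt_query(const stmt_query_event *) { return 0; }
};

/* A concrete consumer overrides only the callbacks it cares about. */
struct my_binlogger : public consumer_statement
{
  std::vector<std::string> log;
  int trx_commit(const trx_commit_event *) override
  { log.push_back("COMMIT"); return 0; }
  int stmt_query(const stmt_query_event *ev) override
  { log.push_back(ev->query); return 0; }
};
```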
The current design has consumer callbacks return 0 for success and an error
code otherwise. I still need to think more about whether this is useful
(i.e. what the semantics of returning an error from a consumer callback
should be).
Each event passed to consumers is defined as a class with public accessor
methods to a private context (which is mostly the THD).
My intention is to make all events passed around const, so that the same event
can be passed to each of multiple registered consumers (and to emphasise that
consumers do not have the ability to modify events). It remains to be seen
whether that const-ness will be feasible in practice without very heavy
modification/constification of existing code.
What follows is a partial draft of a possible definition of the API as
concrete C++ class definitions.
-----------------------------------------------------------------------
/*
  Virtual base class for generated replication events.

  This is the parent of events generated from all kinds of generators. Only
  child classes can be instantiated.

  This class can be used by code that wants to treat events in a generic way,
  without any knowledge of event details. I still need to decide whether such
  generic code is sensible.
*/
class rpl_event_base
{
  /*
    Maybe we will want the ability to materialise an event to a standard
    binary format. This could be achieved with a base method like this. The
    actual materialisation would be implemented in each deriving class. The
    public methods would provide different interfaces for specifying the
    buffer or for writing directly into an IO_CACHE or file.
  */

  /* Return 0 on success, -1 on error, -2 on out-of-buffer. */
  int materialise(uchar *buffer, size_t buflen) const;
  /*
    Returns NULL on error, else a malloc()ed buffer with the materialised
    event; the caller must free() it.
  */
  uchar *materialise() const;
  /* Same, but allocating from a passed-in memroot. */
  uchar *materialise(mem_root *memroot) const;
  /*
    Materialise to a user-supplied writer function (which could write
    directly to a file or the like).
  */
  int materialise(int (*writer)(uchar *data, size_t len, void *context)) const;

  /*
    As for what to do with a materialised event, there are a couple of
    possibilities.

    One is to have a de_materialise() method somewhere that can construct an
    rpl_event_base (really a derived class of course) from a buffer or reader
    function. This would require each accessor function to conditionally read
    its data from either THD context or buffer (GCC is able to optimise
    several such conditionals in multiple accessor function calls into one
    conditional), or we can make all accessors virtual if the performance hit
    is acceptable.

    Another is to have different classes for accessing events read from
    materialised event data.

    Also, I still need to think about whether it is at all useful to be able
    to generically materialise an event at this level. It may be that any
    binlog/transport will in any case need to understand more of the format of
    events, so that such materialisation/transport is better done at a
    different layer.
  */

protected:
  /* Implementation which is the basis for materialise(). */
  virtual int do_materialise(int (*writer)(uchar *data, size_t len,
                                           void *context)) const = 0;

  /* Abstract base class; protected constructor so that only derived classes
     can be instantiated. */
  rpl_event_base();
};
/*
  These are the event types output from the transaction event generator.

  This generator is not stacked on anything.

  The transaction event generator marks the start and end (commit or rollback)
  of transactions. It also gives information about whether the transaction was
  a full transaction or an autocommitted statement, whether transactional
  tables were involved, whether non-transactional tables were involved, and XA
  information (ToDo).
*/

/* Base class for transaction events. */
class rpl_event_transaction_base : public rpl_event_base
{
public:
  /*
    Get the local transaction id. This id is only unique within one server.
    It is allocated whenever a new transaction is started.
    Can be used to identify events belonging to the same transaction in a
    binlog-like stream of events streamed in parallel among multiple
    transactions.
  */
  uint64_t get_local_trx_id() const { return thd->local_trx_id; }

  bool get_is_autocommit() const;

private:
  /* The context is the THD. */
  THD *thd;

protected:
  rpl_event_transaction_base(THD *_thd) : thd(_thd) { }
};
/* Transaction start event. */
class rpl_event_transaction_start : public rpl_event_transaction_base
{
};

/* Transaction commit. */
class rpl_event_transaction_commit : public rpl_event_transaction_base
{
public:
  /*
    The global transaction id is unique cross-server.

    It can be used to identify the position from which to start a slave
    replicating from a master.

    This global ID is only available once the transaction is decided to commit
    by the TC manager / primary redundancy service. This TC also allocates the
    ID and decides the exact semantics (can there be gaps, etc.); however the
    format is fixed (cluster_id, running_counter).
  */
  struct global_transaction_id
  {
    uint32_t cluster_id;
    uint64_t counter;
  };

  const global_transaction_id *get_global_transaction_id() const;
};

/* Transaction rollback. */
class rpl_event_transaction_rollback : public rpl_event_transaction_base
{
};
/* Base class for statement events. */
class rpl_event_statement_base : public rpl_event_base
{
public:
  LEX_STRING get_current_db() const;
};

class rpl_event_statement_start : public rpl_event_statement_base
{
};

class rpl_event_statement_end : public rpl_event_statement_base
{
public:
  int get_errorcode() const;
};

class rpl_event_statement_query : public rpl_event_statement_base
{
public:
  LEX_STRING get_query_string();
  ulong get_sql_mode();
  const CHARSET_INFO *get_character_set_client();
  const CHARSET_INFO *get_collation_connection();
  const CHARSET_INFO *get_collation_server();
  const CHARSET_INFO *get_collation_default_db();

  /*
    Access to relevant flags that affect query execution.

    Use as: if (ev->get_flags() & (uint32)STMT_FOREIGN_KEY_CHECKS) { ... }
  */
  enum flag_bits
  {
    STMT_FOREIGN_KEY_CHECKS, // @@foreign_key_checks
    STMT_UNIQUE_KEY_CHECKS,  // @@unique_checks
    STMT_AUTO_IS_NULL,       // @@sql_auto_is_null
  };
  uint32_t get_flags();

  ulong get_auto_increment_offset();
  ulong get_auto_increment_increment();

  // And so on for: time zone; day/month names; connection id; LAST_INSERT_ID;
  // INSERT_ID; random seed; user variables.
  //
  // We probably also need get_uses_temporary_table(), get_used_user_vars(),
  // get_uses_auto_increment() and so on, so a consumer can get more
  // information about what kind of context information a query will need when
  // executed on a slave.
};

class rpl_event_statement_load_query : public rpl_event_statement_query
{
};

/*
  This event is fired with blocks of data for files read (from a server-local
  file or the client connection) for LOAD DATA.
*/
class rpl_event_statement_load_data_block : public rpl_event_statement_base
{
public:
  struct block
  {
    const uchar *ptr;
    size_t size;
  };
  block get_block() const;
};
/* Base class for row-based replication events. */
class rpl_event_row_base : public rpl_event_base
{
public:
  /*
    Access to relevant handler extra flags and other flags that affect row
    operations.

    Use as: if (ev->get_flags() & (uint32)ROW_WRITE_CAN_REPLACE) { ... }
  */
  enum flag_bits
  {
    ROW_WRITE_CAN_REPLACE,           // HA_EXTRA_WRITE_CAN_REPLACE
    ROW_IGNORE_DUP_KEY,              // HA_EXTRA_IGNORE_DUP_KEY
    ROW_IGNORE_NO_KEY,               // HA_EXTRA_IGNORE_NO_KEY
    ROW_DISABLE_FOREIGN_KEY_CHECKS,  // ! @@foreign_key_checks
    ROW_DISABLE_UNIQUE_KEY_CHECKS,   // ! @@unique_checks
  };
  uint32_t get_flags();

  /* Access to the list of tables modified. */
  class table_iterator
  {
  public:
    /* Returns next table, NULL after the last. */
    const TABLE *get_next();
  private:
    // ...
  };
  table_iterator get_modified_tables() const;

private:
  /* Context used to provide accessors. */
  THD *thd;

protected:
  rpl_event_row_base(THD *_thd) : thd(_thd) { }
};

class rpl_event_row_write : public rpl_event_row_base
{
public:
  const BITMAP *get_write_set() const;
  const uchar *get_after_image() const;
};

class rpl_event_row_update : public rpl_event_row_base
{
public:
  const BITMAP *get_read_set() const;
  const BITMAP *get_write_set() const;
  const uchar *get_before_image() const;
  const uchar *get_after_image() const;
};

class rpl_event_row_delete : public rpl_event_row_base
{
public:
  const BITMAP *get_read_set() const;
  const uchar *get_before_image() const;
};
/*
  Event consumer callbacks.

  An event consumer registers with an event generator to receive event
  notifications from that generator.

  The consumer has callbacks (in the form of virtual functions) for the
  individual event types it is interested in; callbacks it does not override
  default to doing nothing. If an event applies to multiple callbacks in a
  single callback struct, it is only passed to the most specific overridden
  callback (so events never fire more than once per registration).

  The lifetime of the memory holding the event is only for the duration of the
  callback invocation, unless otherwise noted.

  Callbacks return 0 for success or an error code (ToDo: does this make
  sense?).
*/

struct rpl_event_consumer_transaction
{
  virtual int trx_start(const rpl_event_transaction_start *) { return 0; }
  virtual int trx_commit(const rpl_event_transaction_commit *) { return 0; }
  virtual int trx_rollback(const rpl_event_transaction_rollback *) { return 0; }
};

/*
  Consuming statement-based events.

  The statement event generator is stacked on top of the transaction event
  generator, so we can receive those events as well.
*/
struct rpl_event_consumer_statement : public rpl_event_consumer_transaction
{
  virtual int stmt_start(const rpl_event_statement_start *) { return 0; }
  virtual int stmt_end(const rpl_event_statement_end *) { return 0; }

  virtual int stmt_query(const rpl_event_statement_query *) { return 0; }

  /* Data for a file used in LOAD DATA [LOCAL] INFILE. */
  virtual int stmt_load_data_block(const rpl_event_statement_load_data_block *)
  { return 0; }

  /*
    These are specific kinds of statements; if overridden they replace
    stmt_query() for the corresponding event.
  */
  virtual int stmt_load_query(const rpl_event_statement_load_query *ev)
  { return stmt_query(ev); }
};

/*
  Consuming row-based events.

  The row event generator is stacked on top of the statement event generator.
*/
struct rpl_event_consumer_row : public rpl_event_consumer_statement
{
  virtual int row_write(const rpl_event_row_write *) { return 0; }
  virtual int row_update(const rpl_event_row_update *) { return 0; }
  virtual int row_delete(const rpl_event_row_delete *) { return 0; }
};

/*
  Registration functions.

  ToDo: Make a way to de-register.

  ToDo: Find a good way for a plugin (eg. PBXT) to publish a generator
  registration method.
*/

int rpl_event_transaction_register(const rpl_event_consumer_transaction *cbs);
int rpl_event_statement_register(const rpl_event_consumer_statement *cbs);
int rpl_event_row_register(const rpl_event_consumer_row *cbs);
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v3.5.9)
[Maria-developers] Updated (by Knielsen): Replication API for stacked event generators (120)
by worklog-noreply@askmonty.org 24 Jun '10
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: Replication API for stacked event generators
CREATION DATE..: Mon, 07 Jun 2010, 13:13
SUPERVISOR.....: Knielsen
IMPLEMENTOR....: Knielsen
COPIES TO......:
CATEGORY.......: Server-Sprint
TASK ID........: 120 (http://askmonty.org/worklog/?tid=120)
VERSION........: Server-9.x
STATUS.........: Assigned
PRIORITY.......: 60
WORKED HOURS...: 2
ESTIMATE.......: 0 (hours remain)
ORIG. ESTIMATE.: 0
PROGRESS NOTES:
-=-=(Knielsen - Thu, 24 Jun 2010, 14:29)=-=-
Category updated.
--- /tmp/wklog.120.old.7440 2010-06-24 14:29:38.000000000 +0000
+++ /tmp/wklog.120.new.7440 2010-06-24 14:29:38.000000000 +0000
@@ -1 +1 @@
-Server-RawIdeaBin
+Server-Sprint
-=-=(Knielsen - Thu, 24 Jun 2010, 14:29)=-=-
Status updated.
--- /tmp/wklog.120.old.7440 2010-06-24 14:29:38.000000000 +0000
+++ /tmp/wklog.120.new.7440 2010-06-24 14:29:38.000000000 +0000
@@ -1 +1 @@
-Un-Assigned
+Assigned
-=-=(Knielsen - Thu, 24 Jun 2010, 14:29)=-=-
Low Level Design modified.
-=-=(Knielsen - Thu, 24 Jun 2010, 14:28)=-=-
High-Level Specification modified.
--- /tmp/wklog.120.old.7341 2010-06-24 14:28:17.000000000 +0000
+++ /tmp/wklog.120.new.7341 2010-06-24 14:28:17.000000000 +0000
@@ -1 +1,159 @@
+Generators and consumers
+-------------------------
+
+We have the two concepts:
+
+1. Event _generators_, that produce events describing all changes to data in a
+ server.
+
+2. Event consumers, that receive such events and use them in various ways.
+
+Examples of event generators is execution of SQL statements, which generates
+events like those used for statement-based replication. Another example is
+PBXT engine-level replication.
+
+An example of an event consumer is the writing of the binlog on a master.
+
+Event generators are not really plugins. Rather, there are specific points in
+the server where events are generated. However, a generator can be part of a
+plugin, for example a PBXT engine-level replication event generator would be
+part of the PBXT storage engine plugin.
+
+Event consumers on the other hand could be a plugin.
+
+One generator can be stacked on top of another. This means that a generator on
+top (for example row-based events) will handle some events itself
+(eg. non-deterministic update in mixed-mode binlogging). Other events that it
+does not want to or cannot handle (for example deterministic delete or DDL)
+will be deferred to the generator below (for example statement-based events).
+
+
+Materialisation (or not)
+------------------------
+
+A central decision is how to represent events that are generated in the API at
+the point of generation.
+
+I want to avoid making the API require that events are materialised. By
+"Materialised" I mean that all (or most) of the data for the event is written
+into memory in a struct/class used inside the server or serialised in a data
+buffer (byte buffer) in a format suitable for network transport or disk
+storage.
+
+Using a non-materialised event means storing just a reference to appropriate
+context that allows to retrieve all information for the event using
+accessors. Ie. typically this would be based on getting the event information
+from the THD pointer.
+
+Some reasons to avoid using materialised events in the API:
+
+ - Replication events have a _lot_ of detailed context information that can be
+ needed in events: user-defined variables, random seed, character sets,
+ table column names and types, etc. etc. If we make the API based on
+ materialisation, then the initial decision about which context information
+ to include with which events will have to be done in the API, while ideally
+ we want this decision to be done by the individual consumer plugin. There
+ will thus be a conflict between what to include (to allow consumers access)
+ and what to exclude (to avoid excessive needless work).
+
+ - Materialising means defining a very specific format, which will tend to
+ make the API less generic and flexible.
+
+ - Unless the materialised format is made _very_ specific (and thus very
+ inflexible), it is unlikely to be directly useful for transport
+ (eg. binlog), so it will need to be re-materialised into a different format
+ anyway, wasting work.
+
+ - If a generator on top handles an event, then we want to avoid wasting work
+ materialising an event in a generator below which would be completely
+ unused. Thus there would be a need for the upper generator to somehow
+ notify the lower generator ahead of event generation time to not fire an
+ event, complicating the API.
+
+Some advantages for materialisation:
+
+ - Using an API based on passing around some well-defined struct event (or
+ byte buffer) will be simpler than the complex class hierarchy proposed here
+ with no requirement for materialisation.
+
+ - Defining a materialised format would allow an easy way to use the same
+ consumer code on a generator that produces events at the source of
+ execution and on a generator that produces events from eg. reading them
+ from an event log.
+
+Note that there can be some middle way, where some data is materialised and
+some is kept as reference to context (eg. THD) only. This however loses most
+of the mentioned advantages for materialisation.
+
+The design proposed here aims for as little materialisation as possible.
+
+
+Default materialisation format
+------------------------------
+
+
+While the proposed API doesn't _require_ materialisation, we can still think
+about providing the _option_ for built-in materialisation. This could be
+useful if such materialisation is made suitable for transport to a different
+server (eg. no endian-dependence etc.). If there is a facility for such
+materialisation built-in to the API, it becomes possible to write something
+like a generic binlog plugin or generic network transport plugin. This would
+be really useful for eg. PBXT engine-level replication, as it could be
+implemented without having to re-invent a binlog format.
+
+I added in the proposed API a simple facility to materialise every event as a
+string of bytes. To use this, I still need to add a suitable facility to
+de-materialise the event.
+
+However, it is still an open question whether such a facility will be at all
+useful. It still has some of the problems with materialisation mentioned
+above. And I think it is likely that a good binlog implementation will need
+to do more than just blindly copy opaque events from one endpoint to
+another. For example, it might need different event boundaries (merge and/or
+split events); it might need to augment or modify events, or inject new
+events, etc.
+
+So I think maybe it is better to add such a generic materialisation facility
+on top of the basic event generator API. Such a facility would provide
+materialisation of a replication event stream, not of individual events, so
+would be more flexible in providing a good implementation. It would be
+implemented for all generators. It would be separate from both the event
+generator API (so we have flexibility to put a filter class in-between
+generator and materialisation), and could also be separate from the actual
+transport handling stuff like fsync() of binlog files and socket connections
+etc. It would be paired with a corresponding applier API which would handle
+executing events on a slave.
+
+Then we can have a default materialised event format, which is available, but
+not mandatory. So there can still be other formats alongside (like legacy
+MySQL 5.1 binlog event format and maybe Tungsten would have its own format).
+
+
+Encapsulation
+-------------
+
+Another fundamental question about the design is the level of encapsulation
+used for the API.
+
+At the implementation level, a lot of the work is basically to pull out all of
+the needed information from the THD object/context. The API I propose tries to
+_not_ expose the THD to consumers. Instead it provides accessor functions for
+all the bits and pieces relevant to each replication event, while the event
+class itself likely will be more or less just an encapsulated THD.
+
+So an alternative would be to have a generic event that was just (type, THD).
+Then consumers could just pull out whatever information they want from the
+THD. The THD implementation is already exposed to storage engines. This would
+of course greatly reduce the size of the API, eliminating lots of class
+definitions and accessor functions. Though arguably it wouldn't really
+simplify the API, as the complexity would just be in understanding the THD
+class.
+
+Note that we do not have to take any performance hit from using encapsulated
+accessors since compilers can inline them (though if inlining then we do not
+get any ABI stability with respect to THD implementation).
+
+For now, the API is proposed without exposing the THD class. (Similar
+encapsulation could be added in actual implementation to also not expose TABLE
+and similar classes).
-=-=(Knielsen - Thu, 24 Jun 2010, 12:04)=-=-
Dependency created: 107 now depends on 120
-=-=(Knielsen - Thu, 24 Jun 2010, 11:59)=-=-
High Level Description modified.
--- /tmp/wklog.120.old.516 2010-06-24 11:59:24.000000000 +0000
+++ /tmp/wklog.120.new.516 2010-06-24 11:59:24.000000000 +0000
@@ -11,4 +11,4 @@
Event generators can be stacked, and a generator may defer event generation to
the next one down the stack. For example, the row-level replication event
-generator may defer DLL to the statement-level replication event generator.
+generator may defer DDL to the statement-level replication event generator.
-=-=(Knielsen - Mon, 21 Jun 2010, 08:35)=-=-
Research and design thoughts.
Worked 2 hours and estimate 0 hours remain (original estimate increased by 2 hours).
DESCRIPTION:
A part of the replication project, MWL#107.
Events are produced by event Generators. Examples are
- Generation of statement-based replication events
- Generation of row-based events
- Generation of PBXT engine-level replication events
and maybe reading of events from relay log on slave may also be an example of
generating events.
Event generators can be stacked, and a generator may defer event generation to
the next one down the stack. For example, the row-level replication event
generator may defer DDL to the statement-level replication event generator.
HIGH-LEVEL SPECIFICATION:
Generators and consumers
-------------------------
We have the two concepts:
1. Event _generators_, that produce events describing all changes to data in a
server.
2. Event consumers, that receive such events and use them in various ways.
Examples of event generators is execution of SQL statements, which generates
events like those used for statement-based replication. Another example is
PBXT engine-level replication.
An example of an event consumer is the writing of the binlog on a master.
Event generators are not really plugins. Rather, there are specific points in
the server where events are generated. However, a generator can be part of a
plugin, for example a PBXT engine-level replication event generator would be
part of the PBXT storage engine plugin.
Event consumers on the other hand could be a plugin.
One generator can be stacked on top of another. This means that a generator on
top (for example row-based events) will handle some events itself
(eg. non-deterministic update in mixed-mode binlogging). Other events that it
does not want to or cannot handle (for example deterministic delete or DDL)
will be deferred to the generator below (for example statement-based events).
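
To make the stacking concrete, here is a minimal hypothetical sketch (the
names, the Event struct and the bool flag are illustrative stand-ins, not the
actual server interfaces) of an upper generator deferring an event to the one
below it:

```cpp
#include <string>

// Illustrative stand-in for a generated event; not a server type.
struct Event { std::string origin; };

// Lower, statement-level generator: can always produce an event.
static Event statement_generator(const std::string &sql)
{
  (void)sql;                            // sketch: payload omitted
  return Event{"statement"};
}

// Upper, row-level generator: handles what it can itself and defers
// the rest (eg. DDL) to the statement-level generator below it.
static Event row_generator(const std::string &sql, bool is_ddl)
{
  if (is_ddl)
    return statement_generator(sql);    // defer down the stack
  return Event{"row"};                  // handle at this level
}
```

A consumer registered with the row-level generator would then see a
statement event for DDL and row events for everything else.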
Materialisation (or not)
------------------------
A central decision is how to represent events that are generated in the API at
the point of generation.
I want to avoid making the API require that events are materialised. By
"Materialised" I mean that all (or most) of the data for the event is written
into memory in a struct/class used inside the server or serialised in a data
buffer (byte buffer) in a format suitable for network transport or disk
storage.
Using a non-materialised event means storing just a reference to appropriate
context that allows retrieving all information for the event using
accessors. Ie. typically this would be based on getting the event information
from the THD pointer.
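
As a hedged sketch of what "non-materialised" means in practice (the
two-field THD here is an invented stand-in, not the real server class): the
event owns no data, only a pointer to the context, and accessors read through
it on demand:

```cpp
#include <string>

// Invented two-field stand-in for the server's THD context.
struct THD
{
  std::string query_string;
  unsigned long sql_mode;
};

// A non-materialised event: nothing is copied at generation time; the
// event is just a typed view over the context, read through accessors.
class stmt_query_event_sketch
{
public:
  explicit stmt_query_event_sketch(const THD *thd_arg) : thd(thd_arg) {}
  const std::string &get_query_string() const { return thd->query_string; }
  unsigned long get_sql_mode() const { return thd->sql_mode; }
private:
  const THD *thd;  // reference to context, not a copy of the data
};
```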
Some reasons to avoid using materialised events in the API:
- Replication events have a _lot_ of detailed context information that can be
needed in events: user-defined variables, random seed, character sets,
table column names and types, etc. etc. If we make the API based on
materialisation, then the initial decision about which context information
to include with which events will have to be done in the API, while ideally
we want this decision to be done by the individual consumer plugin. There
will thus be a conflict between what to include (to allow consumers access)
and what to exclude (to avoid excessive needless work).
- Materialising means defining a very specific format, which will tend to
make the API less generic and flexible.
- Unless the materialised format is made _very_ specific (and thus very
inflexible), it is unlikely to be directly useful for transport
(eg. binlog), so it will need to be re-materialised into a different format
anyway, wasting work.
- If a generator on top handles an event, then we want to avoid wasting work
materialising an event in a generator below which would be completely
unused. Thus there would be a need for the upper generator to somehow
notify the lower generator ahead of event generation time to not fire an
event, complicating the API.
Some advantages for materialisation:
- Using an API based on passing around some well-defined struct event (or
byte buffer) will be simpler than the complex class hierarchy proposed here
with no requirement for materialisation.
- Defining a materialised format would allow an easy way to use the same
consumer code on a generator that produces events at the source of
execution and on a generator that produces events from eg. reading them
from an event log.
Note that there can be some middle way, where some data is materialised and
some is kept as a reference to context (eg. THD) only. This however loses most
of the mentioned advantages for materialisation.
The design proposed here aims for as little materialisation as possible.
Default materialisation format
------------------------------
While the proposed API doesn't _require_ materialisation, we can still think
about providing the _option_ for built-in materialisation. This could be
useful if such materialisation is made suitable for transport to a different
server (eg. no endian-dependence etc). If there is a facility for such
materialisation built-in to the API, it becomes possible to write something
like a generic binlog plugin or generic network transport plugin. This would
be really useful for eg. PBXT engine-level replication, as it could be
implemented without having to re-invent a binlog format.
I added in the proposed API a simple facility to materialise every event as a
string of bytes. To use this, I still need to add a suitable facility to
de-materialise the event.
However, it is still an open question whether such a facility will be at all
useful. It still has some of the problems with materialisation mentioned
above. And I think it is likely that a good binlog implementation will need
to do more than just blindly copy opaque events from one endpoint to
another. For example, it might need different event boundaries (merge and/or
split events); it might need to augment or modify events, or inject new
events, etc.
So I think maybe it is better to add such a generic materialisation facility
on top of the basic event generator API. Such a facility would provide
materialisation of a replication event stream, not of individual events, so
would be more flexible in providing a good implementation. It would be
implemented for all generators. It would be separate from both the event
generator API (so we have flexibility to put a filter class in-between
generator and materialisation), and could also be separate from the actual
transport handling stuff like fsync() of binlog files and socket connections
etc. It would be paired with a corresponding applier API which would handle
executing events on a slave.
Then we can have a default materialised event format, which is available, but
not mandatory. So there can still be other formats alongside (like legacy
MySQL 5.1 binlog event format and maybe Tungsten would have its own format).
Encapsulation
-------------
Another fundamental question about the design is the level of encapsulation
used for the API.
At the implementation level, a lot of the work is basically to pull out all of
the needed information from the THD object/context. The API I propose tries to
_not_ expose the THD to consumers. Instead it provides accessor functions for
all the bits and pieces relevant to each replication event, while the event
class itself likely will be more or less just an encapsulated THD.
So an alternative would be to have a generic event that was just (type, THD).
Then consumers could just pull out whatever information they want from the
THD. The THD implementation is already exposed to storage engines. This would
of course greatly reduce the size of the API, eliminating lots of class
definitions and accessor functions. Though arguably it wouldn't really
simplify the API, as the complexity would just be in understanding the THD
class.
Note that we do not have to take any performance hit from using encapsulated
accessors since compilers can inline them (though if inlining then we do not
get any ABI stability with respect to THD implementation).
For now, the API is proposed without exposing the THD class. (Similar
encapsulation could be added in actual implementation to also not expose TABLE
and similar classes).
LOW-LEVEL DESIGN:
A consumer is implemented as a virtual class (interface). There is one virtual
function for every event that can be received. A consumer would derive from
the base class and override methods for the events it wants to receive.
There is one consumer interface for each generator. When a generator A is
stacked on B, the consumer interface for A inherits from the interface for
B. This way, when A defers an event to B, the consumer for A will receive the
corresponding event from B.
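
The interface inheritance can be sketched like this (the class and event
names are simplified stand-ins for the rpl_event_consumer_* structs defined
later in this draft):

```cpp
// Simplified stand-ins for two event kinds; not the real classes.
struct trx_commit_event {};
struct stmt_query_event {};

// Consumer interface for the lower generator B (transactions).
// Default callbacks succeed and ignore the event.
struct consumer_B
{
  virtual int trx_commit(const trx_commit_event *) { return 0; }
  virtual ~consumer_B() {}
};

// Consumer interface for the upper generator A (statements) inherits
// from B's interface: when A defers an event to B, a consumer
// registered with A still receives it through the inherited callback.
struct consumer_A : public consumer_B
{
  virtual int stmt_query(const stmt_query_event *) { return 0; }
};
```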
There are methods for a consumer to register itself to receive events from
each generator. I still need to find a way for a consumer in one plugin to
register itself with a generator implemented in another plugin (eg. PBXT
engine-level replication). I also need to add a way for consumers to
de-register themselves.
The current design has consumer callbacks return 0 for success and error code
otherwise. I still need to think more about whether this is useful (ie. what
is the semantics of returning an error from a consumer callback).
Each event passed to consumers is defined as a class with public accessor
methods to a private context (which is mostly the THD).
My intention is to make all events passed around const, so that the same event
can be passed to each of multiple registered consumers (and to emphasise that
consumers do not have the ability to modify events). It still needs to be seen
whether that const-ness will be feasible in practice without very heavy
modification/constification of existing code.
What follows is a partial draft of a possible definition of the API as
concrete C++ class definitions.
-----------------------------------------------------------------------
/*
Virtual base class for generated replication events.
This is the parent of events generated from all kinds of generators. Only
child classes can be instantiated.
This class can be used by code that wants to treat events in a generic way,
without any knowledge of event details. I still need to decide whether such
generic code is sensible.
*/
class rpl_event_base
{
/*
Maybe we will want the ability to materialise an event to a standard
binary format. This could be achieved with a base method like this. The
actual materialisation would be implemented in each deriving class. The
public methods would provide different interfaces for specifying the
buffer or for writing directly into IO_CACHE or file.
*/
/* Return 0 on success, -1 on error, -2 on out-of-buffer. */
int materialise(uchar *buffer, size_t buflen) const;
/*
Returns NULL on error or else malloc()ed buffer with materialised event,
caller must free().
*/
uchar *materialise() const;
/* Same but using passed in memroot. */
uchar *materialise(mem_root *memroot) const;
/*
Materialise to user-supplied writer function (could write directly to file
or the like).
*/
int materialise(int (*writer)(uchar *data, size_t len, void *context)) const;
/*
As for what to do with a materialised event, there are a couple of
possibilities.
One is to have a de_materialise() method somewhere that can construct an
rpl_event_base (really a derived class of course) from a buffer or writer
function. This would require each accessor function to conditionally read
its data from either THD context or buffer (GCC is able to optimise
several such conditionals in multiple accessor function calls into one
conditional), or we can make all accessors virtual if the performance hit
is acceptable.
Another is to have different classes for accessing events read from
materialised event data.
Also, I still need to think about whether it is at all useful to be able
to generically materialise an event at this level. It may be that any
binlog/transport will in any case need to understand more of the format of
events, so that such materialisation/transport is better done at a
different layer.
*/
protected:
/* Implementation which is the basis for materialise(). */
virtual int do_materialise(int (*writer)(uchar *data, size_t len,
void *context)) const = 0;
private:
/* Virtual base class, private constructor to prevent instantiation. */
rpl_event_base();
};
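
As a usage sketch for the writer-based materialise() variant (the event type
and payload here are invented; note also that the draft signature above does
not yet carry a context argument for the writer, so one is assumed here):

```cpp
#include <cstddef>
#include <string>

typedef unsigned char uchar;

// Writer callback matching the materialise() writer shape: append each
// chunk to a std::string passed through the opaque context pointer.
static int append_to_string(uchar *data, size_t len, void *context)
{
  static_cast<std::string *>(context)->append(
      reinterpret_cast<char *>(data), len);
  return 0;  // 0 = success, per the proposed convention
}

// Invented event that pushes a fixed payload through the writer,
// mimicking what a derived class's do_materialise() might do.
struct fake_event
{
  int materialise(int (*writer)(uchar *, size_t, void *),
                  void *context) const
  {
    uchar payload[3]= { 't', 'r', 'x' };
    return writer(payload, sizeof(payload), context);
  }
};
```

The same writer could just as well write into an IO_CACHE or a file, which is
the point of keeping the destination behind a callback.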
/*
These are the event types output from the transaction event generator.
This generator is not stacked on anything.
The transaction event generator marks the start and end (commit or rollback)
of transactions. It also gives information about whether the transaction was
a full transaction or autocommitted statement, whether transactional tables
were involved, whether non-transactional tables were involved, and XA
information (ToDo).
*/
/* Base class for transaction events. */
class rpl_event_transaction_base : public rpl_event_base
{
public:
/*
Get the local transaction id. This id is only unique within one server.
It is allocated whenever a new transaction is started.
Can be used to identify events belonging to the same transaction in a
binlog-like stream of events streamed in parallel among multiple
transactions.
*/
uint64_t get_local_trx_id() const { return thd->local_trx_id; };
bool get_is_autocommit() const;
private:
/* The context is the THD. */
THD *thd;
rpl_event_transaction_base(THD *_thd) : thd(_thd) { };
};
/* Transaction start event. */
class rpl_event_transaction_start : public rpl_event_transaction_base
{
};
/* Transaction commit. */
class rpl_event_transaction_commit : public rpl_event_transaction_base
{
public:
/*
The global transaction id is unique cross-server.
It can be used to identify the position from which to start a slave
replicating from a master.
This global ID is only available once the transaction is decided to commit
by the TC manager / primary redundancy service. This TC also allocates the
ID and decides the exact semantics (can there be gaps, etc); however the
format is fixed (cluster_id, running_counter).
*/
struct global_transaction_id
{
uint32_t cluster_id;
uint64_t counter;
};
const global_transaction_id *get_global_transaction_id() const;
};
/* Transaction rollback. */
class rpl_event_transaction_rollback : public rpl_event_transaction_base
{
};
/* Base class for statement events. */
class rpl_event_statement_base : public rpl_event_base
{
public:
LEX_STRING get_current_db() const;
};
class rpl_event_statement_start : public rpl_event_statement_base
{
};
class rpl_event_statement_end : public rpl_event_statement_base
{
public:
int get_errorcode() const;
};
class rpl_event_statement_query : public rpl_event_statement_base
{
public:
LEX_STRING get_query_string();
ulong get_sql_mode();
const CHARSET_INFO *get_character_set_client();
const CHARSET_INFO *get_collation_connection();
const CHARSET_INFO *get_collation_server();
const CHARSET_INFO *get_collation_default_db();
/*
Access to relevant flags that affect query execution.
Use as if (ev->get_flags() & (1U << STMT_FOREIGN_KEY_CHECKS)) { ... }
*/
enum flag_bits
{
STMT_FOREIGN_KEY_CHECKS, // @@foreign_key_checks
STMT_UNIQUE_KEY_CHECKS, // @@unique_checks
STMT_AUTO_IS_NULL, // @@sql_auto_is_null
};
uint32_t get_flags();
ulong get_auto_increment_offset();
ulong get_auto_increment_increment();
// And so on for: time zone; day/month names; connection id; LAST_INSERT_ID;
// INSERT_ID; random seed; user variables.
//
// We probably also need get_uses_temporary_table(), get_used_user_vars(),
// get_uses_auto_increment() and so on, so a consumer can get more
// information about what kind of context information a query will need when
// executed on a slave.
};
class rpl_event_statement_load_query : public rpl_event_statement_query
{
};
/*
This event is fired with blocks of data for files read (from server-local
file or client connection) for LOAD DATA.
*/
class rpl_event_statement_load_data_block : public rpl_event_statement_base
{
public:
struct block
{
const uchar *ptr;
size_t size;
};
block get_block() const;
};
/* Base class for row-based replication events. */
class rpl_event_row_base : public rpl_event_base
{
public:
/*
Access to relevant handler extra flags and other flags that affect row
operations.
Use as if (ev->get_flags() & (1U << ROW_WRITE_CAN_REPLACE)) { ... }
*/
enum flag_bits
{
ROW_WRITE_CAN_REPLACE, // HA_EXTRA_WRITE_CAN_REPLACE
ROW_IGNORE_DUP_KEY, // HA_EXTRA_IGNORE_DUP_KEY
ROW_IGNORE_NO_KEY, // HA_EXTRA_IGNORE_NO_KEY
ROW_DISABLE_FOREIGN_KEY_CHECKS, // ! @@foreign_key_checks
ROW_DISABLE_UNIQUE_KEY_CHECKS, // ! @@unique_checks
};
uint32_t get_flags();
/* Access to list of tables modified. */
class table_iterator
{
public:
/* Returns table, NULL after last. */
const TABLE *get_next();
private:
// ...
};
table_iterator get_modified_tables() const;
private:
/* Context used to provide accessors. */
THD *thd;
protected:
rpl_event_row_base(THD *_thd) : thd(_thd) { }
};
class rpl_event_row_write : public rpl_event_row_base
{
public:
const BITMAP *get_write_set() const;
const uchar *get_after_image() const;
};
class rpl_event_row_update : public rpl_event_row_base
{
public:
const BITMAP *get_read_set() const;
const BITMAP *get_write_set() const;
const uchar *get_before_image() const;
const uchar *get_after_image() const;
};
class rpl_event_row_delete : public rpl_event_row_base
{
public:
const BITMAP *get_read_set() const;
const uchar *get_before_image() const;
};
/*
Event consumer callbacks.
An event consumer registers with an event generator to receive event
notifications from that generator.
The consumer has callbacks (in the form of virtual functions) for the
individual event types the consumer is interested in. Only callbacks that
are non-NULL will be invoked. If an event applies to multiple callbacks in a
single callback struct, it will only be passed to the most specific non-NULL
callback (so events never fire more than once per registration).
The lifetime of the memory holding the event is only for the duration of the
callback invocation, unless otherwise noted.
Callbacks return 0 for success or error code (ToDo: does this make sense?).
*/
struct rpl_event_consumer_transaction
{
virtual int trx_start(const rpl_event_transaction_start *) { return 0; }
virtual int trx_commit(const rpl_event_transaction_commit *) { return 0; }
virtual int trx_rollback(const rpl_event_transaction_rollback *) { return 0; }
};
/*
Consuming statement-based events.
The statement event generator is stacked on top of the transaction event
generator, so we can receive those events as well.
*/
struct rpl_event_consumer_statement : public rpl_event_consumer_transaction
{
virtual int stmt_start(const rpl_event_statement_start *) { return 0; }
virtual int stmt_end(const rpl_event_statement_end *) { return 0; }
virtual int stmt_query(const rpl_event_statement_query *) { return 0; }
/* Data for a file used in LOAD DATA [LOCAL] INFILE. */
virtual int stmt_load_data_block(const rpl_event_statement_load_data_block *)
{ return 0; }
/*
These are specific kinds of statements; if specified they override
stmt_query() for the corresponding event.
*/
virtual int stmt_load_query(const rpl_event_statement_load_query *ev)
{ return stmt_query(ev); }
};
/*
Consuming row-based events.
The row event generator is stacked on top of the statement event generator.
*/
struct rpl_event_consumer_row : public rpl_event_consumer_statement
{
virtual int row_write(const rpl_event_row_write *) { return 0; }
virtual int row_update(const rpl_event_row_update *) { return 0; }
virtual int row_delete(const rpl_event_row_delete *) { return 0; }
};
/*
Registration functions.
ToDo: Make a way to de-register.
ToDo: Find a good way for a plugin (eg. PBXT) to publish a generator
registration method.
*/
int rpl_event_transaction_register(const rpl_event_consumer_transaction *cbs);
int rpl_event_statement_register(const rpl_event_consumer_statement *cbs);
int rpl_event_row_register(const rpl_event_consumer_row *cbs);
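
A registration-style usage sketch (all types here are stand-ins; the real
rpl_event_row_register() is only declared above, so the generator side is
mimicked by invoking callbacks through the base interface):

```cpp
// Stand-ins for the event classes; illustrative only.
struct rpl_event_row_write_stub {};
struct rpl_event_row_delete_stub {};

// Minimal mirror of rpl_event_consumer_row: defaults succeed and
// ignore the event, so a consumer overrides only what it needs.
struct consumer_row_stub
{
  virtual int row_write(const rpl_event_row_write_stub *) { return 0; }
  virtual int row_delete(const rpl_event_row_delete_stub *) { return 0; }
  virtual ~consumer_row_stub() {}
};

// A consumer that counts row writes and leaves everything else alone.
struct write_counter : public consumer_row_stub
{
  int writes;
  write_counter() : writes(0) {}
  virtual int row_write(const rpl_event_row_write_stub *)
  {
    writes++;
    return 0;  // 0 = success
  }
};
```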
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v3.5.9)
[Maria-developers] Updated (by Knielsen): Replication API for stacked event generators (120)
by worklog-noreply@askmonty.org 24 Jun '10
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: Replication API for stacked event generators
CREATION DATE..: Mon, 07 Jun 2010, 13:13
SUPERVISOR.....: Knielsen
IMPLEMENTOR....:
COPIES TO......:
CATEGORY.......: Server-RawIdeaBin
TASK ID........: 120 (http://askmonty.org/worklog/?tid=120)
VERSION........: Server-9.x
STATUS.........: Un-Assigned
PRIORITY.......: 60
WORKED HOURS...: 2
ESTIMATE.......: 0 (hours remain)
ORIG. ESTIMATE.: 0
PROGRESS NOTES:
-=-=(Knielsen - Thu, 24 Jun 2010, 14:29)=-=-
Low Level Design modified.
--- /tmp/wklog.120.old.7355 2010-06-24 14:29:11.000000000 +0000
+++ /tmp/wklog.120.new.7355 2010-06-24 14:29:11.000000000 +0000
@@ -1 +1,385 @@
+A consumer is implemented as a virtual class (interface). There is one virtual
+function for every event that can be received. A consumer would derive from
+the base class and override methods for the events it wants to receive.
+
+There is one consumer interface for each generator. When a generator A is
+stacked on B, the consumer interface for A inherits from the interface for
+B. This way, when A defers an event to B, the consumer for A will receive the
+corresponding event from B.
+
+There are methods for a consumer to register itself to receive events from
+each generator. I still need to find a way for a consumer in one plugin to
+register itself with a generator implemented in another plugin (eg. PBXT
+engine-level replication). I also need to add a way for consumers to
+de-register themselves.
+
+The current design has consumer callbacks return 0 for success and error code
+otherwise. I still need to think more about whether this is useful (ie. what
+is the semantics of returning an error from a consumer callback).
+
+Each event passed to consumers is defined as a class with public accessor
+methods to a private context (which is mostly the THD).
+
+My intention is to make all events passed around const, so that the same event
+can be passed to each of multiple registered consumers (and to emphasise that
+consumers do not have the ability to modify events). It still needs to be seen
+whether that const-ness will be feasible in practice without very heavy
+modification/constification of existing code.
+
+What follows is a partial draft of a possible definition of the API as
+concrete C++ class definitions.
+
+-----------------------------------------------------------------------
+
+/*
+ Virtual base class for generated replication events.
+
+ This is the parent of events generated from all kinds of generators. Only
+ child classes can be instantiated.
+
+ This class can be used by code that wants to treat events in a generic way,
+ without any knowledge of event details. I still need to decide whether such
+ generic code is sensible.
+*/
+class rpl_event_base
+{
+ /*
+ Maybe we will want the ability to materialise an event to a standard
+ binary format. This could be achieved with a base method like this. The
+ actual materialisation would be implemented in each deriving class. The
+ public methods would provide different interfaces for specifying the
+ buffer or for writing directly into IO_CACHE or file.
+ */
+
+ /* Return 0 on success, -1 on error, -2 on out-of-buffer. */
+ int materialise(uchar *buffer, size_t buflen) const;
+ /*
+ Returns NULL on error or else malloc()ed buffer with materialised event,
+ caller must free().
+ */
+ uchar *materialise() const;
+ /* Same but using passed in memroot. */
+ uchar *materialise(mem_root *memroot) const;
+ /*
+ Materialise to user-supplied writer function (could write directly to file
+ or the like).
+ */
+ int materialise(int (*writer)(uchar *data, size_t len, void *context)) const;
+
+ /*
+  As for what to do with a materialised event, there are a couple of
+ possibilities.
+
+ One is to have a de_materialise() method somewhere that can construct an
+ rpl_event_base (really a derived class of course) from a buffer or writer
+ function. This would require each accessor function to conditionally read
+ its data from either THD context or buffer (GCC is able to optimise
+ several such conditionals in multiple accessor function calls into one
+ conditional), or we can make all accessors virtual if the performance hit
+ is acceptable.
+
+ Another is to have different classes for accessing events read from
+ materialised event data.
+
+ Also, I still need to think about whether it is at all useful to be able
+ to generically materialise an event at this level. It may be that any
+  binlog/transport will in any case need to understand more of the format of
+ events, so that such materialisation/transport is better done at a
+ different layer.
+ */
+
+protected:
+ /* Implementation which is the basis for materialise(). */
+ virtual int do_materialise(int (*writer)(uchar *data, size_t len,
+ void *context)) const = 0;
+
+private:
+ /* Virtual base class, private constructor to prevent instantiation. */
+ rpl_event_base();
+};
+
+
+/*
+ These are the event types output from the transaction event generator.
+
+ This generator is not stacked on anything.
+
+ The transaction event generator marks the start and end (commit or rollback)
+ of transactions. It also gives information about whether the transaction was
+ a full transaction or autocommitted statement, whether transactional tables
+ were involved, whether non-transactional tables were involved, and XA
+ information (ToDo).
+*/
+
+/* Base class for transaction events. */
+class rpl_event_transaction_base : public rpl_event_base
+{
+public:
+ /*
+    Get the local transaction id. This id is only unique within one server.
+ It is allocated whenever a new transaction is started.
+ Can be used to identify events belonging to the same transaction in a
+ binlog-like stream of events streamed in parallel among multiple
+ transactions.
+ */
+ uint64_t get_local_trx_id() const { return thd->local_trx_id; };
+
+ bool get_is_autocommit() const;
+
+private:
+ /* The context is the THD. */
+ THD *thd;
+
+ rpl_event_transaction_base(THD *_thd) : thd(_thd) { };
+};
+
+/* Transaction start event. */
+class rpl_event_transaction_start : public rpl_event_transaction_base
+{
+
+};
+
+/* Transaction commit. */
+class rpl_event_transaction_commit : public rpl_event_transaction_base
+{
+public:
+ /*
+ The global transaction id is unique cross-server.
+
+ It can be used to identify the position from which to start a slave
+ replicating from a master.
+
+ This global ID is only available once the transaction is decided to commit
+ by the TC manager / primary redundancy service. This TC also allocates the
+ ID and decides the exact semantics (can there be gaps, etc); however the
+ format is fixed (cluster_id, running_counter).
+ */
+ struct global_transaction_id
+ {
+ uint32_t cluster_id;
+ uint64_t counter;
+ };
+
+ const global_transaction_id *get_global_transaction_id() const;
+};
+
+/* Transaction rollback. */
+class rpl_event_transaction_rollback : public rpl_event_transaction_base
+{
+
+};
+
+
+/* Base class for statement events. */
+class rpl_event_statement_base : public rpl_event_base
+{
+public:
+ LEX_STRING get_current_db() const;
+};
+
+class rpl_event_statement_start : public rpl_event_statement_base
+{
+
+};
+
+class rpl_event_statement_end : public rpl_event_statement_base
+{
+public:
+ int get_errorcode() const;
+};
+
+class rpl_event_statement_query : public rpl_event_statement_base
+{
+public:
+ LEX_STRING get_query_string();
+ ulong get_sql_mode();
+ const CHARSET_INFO *get_character_set_client();
+ const CHARSET_INFO *get_collation_connection();
+ const CHARSET_INFO *get_collation_server();
+ const CHARSET_INFO *get_collation_default_db();
+
+ /*
+ Access to relevant flags that affect query execution.
+
+    Use as if (ev->get_flags() & (1U << STMT_FOREIGN_KEY_CHECKS)) { ... }
+ */
+ enum flag_bits
+ {
+ STMT_FOREIGN_KEY_CHECKS, // @@foreign_key_checks
+ STMT_UNIQUE_KEY_CHECKS, // @@unique_checks
+ STMT_AUTO_IS_NULL, // @@sql_auto_is_null
+ };
+ uint32_t get_flags();
+
+ ulong get_auto_increment_offset();
+ ulong get_auto_increment_increment();
+
+ // And so on for: time zone; day/month names; connection id; LAST_INSERT_ID;
+ // INSERT_ID; random seed; user variables.
+ //
+ // We probably also need get_uses_temporary_table(), get_used_user_vars(),
+ // get_uses_auto_increment() and so on, so a consumer can get more
+ // information about what kind of context information a query will need when
+ // executed on a slave.
+};
+
+class rpl_event_statement_load_query : public rpl_event_statement_query
+{
+
+};
+
+/*
+ This event is fired with blocks of data for files read (from server-local
+ file or client connection) for LOAD DATA.
+*/
+class rpl_event_statement_load_data_block : public rpl_event_statement_base
+{
+public:
+ struct block
+ {
+ const uchar *ptr;
+ size_t size;
+ };
+ block get_block() const;
+};
+
+/* Base class for row-based replication events. */
+class rpl_event_row_base : public rpl_event_base
+{
+public:
+ /*
+ Access to relevant handler extra flags and other flags that affect row
+ operations.
+
+ Use as: if (ev->get_flags() & (1U << ROW_WRITE_CAN_REPLACE)) { ... }
+ */
+ enum flag_bits
+ {
+ ROW_WRITE_CAN_REPLACE, // HA_EXTRA_WRITE_CAN_REPLACE
+ ROW_IGNORE_DUP_KEY, // HA_EXTRA_IGNORE_DUP_KEY
+ ROW_IGNORE_NO_KEY, // HA_EXTRA_IGNORE_NO_KEY
+ ROW_DISABLE_FOREIGN_KEY_CHECKS, // ! @@foreign_key_checks
+ ROW_DISABLE_UNIQUE_KEY_CHECKS, // ! @@unique_checks
+ };
+ uint32_t get_flags();
+
+ /* Access to list of tables modified. */
+ class table_iterator
+ {
+ public:
+ /* Returns table, NULL after last. */
+ const TABLE *get_next();
+ private:
+ // ...
+ };
+ table_iterator get_modified_tables() const;
+
+private:
+ /* Context used to provide accessors. */
+ THD *thd;
+
+protected:
+ rpl_event_row_base(THD *_thd) : thd(_thd) { }
+};
+
+
+class rpl_event_row_write : public rpl_event_row_base
+{
+public:
+ const BITMAP *get_write_set() const;
+ const uchar *get_after_image() const;
+};
+
+class rpl_event_row_update : public rpl_event_row_base
+{
+public:
+ const BITMAP *get_read_set() const;
+ const BITMAP *get_write_set() const;
+ const uchar *get_before_image() const;
+ const uchar *get_after_image() const;
+};
+
+class rpl_event_row_delete : public rpl_event_row_base
+{
+public:
+ const BITMAP *get_read_set() const;
+ const uchar *get_before_image() const;
+};
+
+
+/*
+ Event consumer callbacks.
+
+ An event consumer registers with an event generator to receive event
+ notifications from that generator.
+
+ The consumer has callbacks (in the form of virtual functions) for the
+ individual event types the consumer is interested in. Only callbacks that
+ are non-NULL will be invoked. If an event applies to multiple callbacks in a
+ single callback struct, it will only be passed to the most specific non-NULL
+ callback (so events never fire more than once per registration).
+
+ The lifetime of the memory holding the event is only for the duration of the
+ callback invocation, unless otherwise noted.
+
+ Callbacks return 0 for success or error code (ToDo: does this make sense?).
+*/
+
+struct rpl_event_consumer_transaction
+{
+ virtual int trx_start(const rpl_event_transaction_start *) { return 0; }
+ virtual int trx_commit(const rpl_event_transaction_commit *) { return 0; }
+ virtual int trx_rollback(const rpl_event_transaction_rollback *) { return 0; }
+};
+
+/*
+ Consuming statement-based events.
+
+ The statement event generator is stacked on top of the transaction event
+ generator, so we can receive those events as well.
+*/
+struct rpl_event_consumer_statement : public rpl_event_consumer_transaction
+{
+ virtual int stmt_start(const rpl_event_statement_start *) { return 0; }
+ virtual int stmt_end(const rpl_event_statement_end *) { return 0; }
+
+ virtual int stmt_query(const rpl_event_statement_query *) { return 0; }
+
+ /* Data for a file used in LOAD DATA [LOCAL] INFILE. */
+ virtual int stmt_load_data_block(const rpl_event_statement_load_data_block *)
+ { return 0; }
+
+ /*
+ These are specific kinds of statements; if specified they override
+ stmt_query() for the corresponding event.
+ */
+ virtual int stmt_load_query(const rpl_event_statement_load_query *ev)
+ { return stmt_query(ev); }
+};
+
+/*
+ Consuming row-based events.
+
+ The row event generator is stacked on top of the statement event generator.
+*/
+struct rpl_event_consumer_row : public rpl_event_consumer_statement
+{
+ virtual int row_write(const rpl_event_row_write *) { return 0; }
+ virtual int row_update(const rpl_event_row_update *) { return 0; }
+ virtual int row_delete(const rpl_event_row_delete *) { return 0; }
+};
+
+
+/*
+ Registration functions.
+
+ ToDo: Make a way to de-register.
+
+ ToDo: Find a good way for a plugin (eg. PBXT) to publish a generator
+ registration method.
+*/
+
+int rpl_event_transaction_register(const rpl_event_consumer_transaction *cbs);
+int rpl_event_statement_register(const rpl_event_consumer_statement *cbs);
+int rpl_event_row_register(const rpl_event_consumer_row *cbs);
-=-=(Knielsen - Thu, 24 Jun 2010, 14:28)=-=-
High-Level Specification modified.
--- /tmp/wklog.120.old.7341 2010-06-24 14:28:17.000000000 +0000
+++ /tmp/wklog.120.new.7341 2010-06-24 14:28:17.000000000 +0000
@@ -1 +1,159 @@
+Generators and consumers
+-------------------------
+
+We have two concepts:
+
+1. Event _generators_, that produce events describing all changes to data in a
+ server.
+
+2. Event consumers, that receive such events and use them in various ways.
+
+An example of an event generator is the execution of SQL statements, which generates
+events like those used for statement-based replication. Another example is
+PBXT engine-level replication.
+
+An example of an event consumer is the writing of the binlog on a master.
+
+Event generators are not really plugins. Rather, there are specific points in
+the server where events are generated. However, a generator can be part of a
+plugin, for example a PBXT engine-level replication event generator would be
+part of the PBXT storage engine plugin.
+
+Event consumers, on the other hand, could be plugins.
+
+One generator can be stacked on top of another. This means that a generator on
+top (for example row-based events) will handle some events itself
+(eg. non-deterministic update in mixed-mode binlogging). Other events that it
+does not want to or cannot handle (for example deterministic delete or DDL)
+will be deferred to the generator below (for example statement-based events).
+
+
+Materialisation (or not)
+------------------------
+
+A central decision is how to represent events that are generated in the API at
+the point of generation.
+
+I want to avoid making the API require that events are materialised. By
+"Materialised" I mean that all (or most) of the data for the event is written
+into memory in a struct/class used inside the server or serialised in a data
+buffer (byte buffer) in a format suitable for network transport or disk
+storage.
+
+Using a non-materialised event means storing just a reference to appropriate
+context that allows retrieving all information for the event using
+accessors. I.e. typically this would be based on getting the event information
+from the THD pointer.
+
+Some reasons to avoid using materialised events in the API:
+
+ - Replication events have a _lot_ of detailed context information that can be
+ needed in events: user-defined variables, random seed, character sets,
+ table column names and types, etc. etc. If we make the API based on
+ materialisation, then the initial decision about which context information
+ to include with which events will have to be done in the API, while ideally
+ we want this decision to be done by the individual consumer plugin. There
+ will thus be a conflict between what to include (to allow consumers access)
+ and what to exclude (to avoid excessive needless work).
+
+ - Materialising means defining a very specific format, which will tend to
+ make the API less generic and flexible.
+
+ - Unless the materialised format is made _very_ specific (and thus very
+ inflexible), it is unlikely to be directly useful for transport
+ (eg. binlog), so it will need to be re-materialised into a different format
+ anyway, wasting work.
+
+ - If a generator on top handles an event, then we want to avoid wasting work
+ materialising an event in a generator below which would be completely
+ unused. Thus there would be a need for the upper generator to somehow
+ notify the lower generator ahead of event generation time to not fire an
+ event, complicating the API.
+
+Some advantages for materialisation:
+
+ - Using an API based on passing around some well-defined struct event (or
+ byte buffer) will be simpler than the complex class hierarchy proposed here
+ with no requirement for materialisation.
+
+ - Defining a materialised format would allow an easy way to use the same
+ consumer code on a generator that produces events at the source of
+ execution and on a generator that produces events from eg. reading them
+ from an event log.
+
+Note that there can be some middle way, where some data is materialised and
+some is kept as reference to context (eg. THD) only. This however loses most
+of the mentioned advantages for materialisation.
+
+The design proposed here aims for as little materialisation as possible.
+
+
+Default materialisation format
+------------------------------
+
+
+While the proposed API doesn't _require_ materialisation, we can still think
+about providing the _option_ for built-in materialisation. This could be
+useful if such materialisation is made suitable for transport to a different
+server (eg. no endian-dependence etc). If there is a facility for such
+materialisation built-in to the API, it becomes possible to write something
+like a generic binlog plugin or generic network transport plugin. This would
+be really useful for eg. PBXT engine-level replication, as it could be
+implemented without having to re-invent a binlog format.
+
+I added in the proposed API a simple facility to materialise every event as a
+string of bytes. To use this, I still need to add a suitable facility to
+de-materialise the event.
+
+However, it is still an open question whether such a facility will be at all
+useful. It still has some of the problems with materialisation mentioned
+above. And I think it is likely that a good binlog implementation will need
+to do more than just blindly copy opaque events from one endpoint to
+another. For example, it might need different event boundaries (merge and/or
+split events); it might need to augment or modify events, or inject new
+events, etc.
+
+So I think maybe it is better to add such a generic materialisation facility
+on top of the basic event generator API. Such a facility would provide
+materialisation of a replication event stream, not of individual events, so
+would be more flexible in providing a good implementation. It would be
+implemented for all generators. It would be separate from both the event
+generator API (so we have flexibility to put a filter class in-between
+generator and materialisation), and could also be separate from the actual
+transport handling stuff like fsync() of binlog files and socket connections
+etc. It would be paired with a corresponding applier API which would handle
+executing events on a slave.
+
+Then we can have a default materialised event format, which is available, but
+not mandatory. So there can still be other formats alongside (like legacy
+MySQL 5.1 binlog event format and maybe Tungsten would have its own format).
+
+
+Encapsulation
+-------------
+
+Another fundamental question about the design is the level of encapsulation
+used for the API.
+
+At the implementation level, a lot of the work is basically to pull out all of
+the needed information from the THD object/context. The API I propose tries to
+_not_ expose the THD to consumers. Instead it provides accessor functions for
+all the bits and pieces relevant to each replication event, while the event
+class itself likely will be more or less just an encapsulated THD.
+
+So an alternative would be to have a generic event that was just (type, THD).
+Then consumers could just pull out whatever information they want from the
+THD. The THD implementation is already exposed to storage engines. This would
+of course greatly reduce the size of the API, eliminating lots of class
+definitions and accessor functions. Though arguably it wouldn't really
+simplify the API, as the complexity would just be in understanding the THD
+class.
+
+Note that we do not have to take any performance hit from using encapsulated
+accessors since compilers can inline them (though if inlining then we do not
+get any ABI stability with respect to THD implementation).
+
+For now, the API is proposed without exposing the THD class. (Similar
+encapsulation could be added in actual implementation to also not expose TABLE
+and similar classes).
-=-=(Knielsen - Thu, 24 Jun 2010, 12:04)=-=-
Dependency created: 107 now depends on 120
-=-=(Knielsen - Thu, 24 Jun 2010, 11:59)=-=-
High Level Description modified.
--- /tmp/wklog.120.old.516 2010-06-24 11:59:24.000000000 +0000
+++ /tmp/wklog.120.new.516 2010-06-24 11:59:24.000000000 +0000
@@ -11,4 +11,4 @@
Event generators can be stacked, and a generator may defer event generation to
the next one down the stack. For example, the row-level replication event
-generator may defer DLL to the statement-level replication event generator.
+generator may defer DDL to the statement-level replication event generator.
-=-=(Knielsen - Mon, 21 Jun 2010, 08:35)=-=-
Research and design thoughts.
Worked 2 hours and estimate 0 hours remain (original estimate increased by 2 hours).
DESCRIPTION:
A part of the replication project, MWL#107.
Events are produced by event Generators. Examples are
- Generation of statement-based replication events
- Generation of row-based events
- Generation of PBXT engine-level replication events
and maybe reading of events from relay log on slave may also be an example of
generating events.
Event generators can be stacked, and a generator may defer event generation to
the next one down the stack. For example, the row-level replication event
generator may defer DDL to the statement-level replication event generator.
HIGH-LEVEL SPECIFICATION:
Generators and consumers
-------------------------
We have two concepts:
1. Event _generators_, that produce events describing all changes to data in a
server.
2. Event consumers, that receive such events and use them in various ways.
An example of an event generator is the execution of SQL statements, which generates
events like those used for statement-based replication. Another example is
PBXT engine-level replication.
An example of an event consumer is the writing of the binlog on a master.
Event generators are not really plugins. Rather, there are specific points in
the server where events are generated. However, a generator can be part of a
plugin, for example a PBXT engine-level replication event generator would be
part of the PBXT storage engine plugin.
Event consumers, on the other hand, could be plugins.
One generator can be stacked on top of another. This means that a generator on
top (for example row-based events) will handle some events itself
(eg. non-deterministic update in mixed-mode binlogging). Other events that it
does not want to or cannot handle (for example deterministic delete or DDL)
will be deferred to the generator below (for example statement-based events).
Materialisation (or not)
------------------------
A central decision is how to represent events that are generated in the API at
the point of generation.
I want to avoid making the API require that events are materialised. By
"Materialised" I mean that all (or most) of the data for the event is written
into memory in a struct/class used inside the server or serialised in a data
buffer (byte buffer) in a format suitable for network transport or disk
storage.
Using a non-materialised event means storing just a reference to appropriate
context that allows retrieving all information for the event using
accessors. I.e. typically this would be based on getting the event information
from the THD pointer.
Some reasons to avoid using materialised events in the API:
- Replication events have a _lot_ of detailed context information that can be
needed in events: user-defined variables, random seed, character sets,
table column names and types, etc. etc. If we make the API based on
materialisation, then the initial decision about which context information
to include with which events will have to be done in the API, while ideally
we want this decision to be done by the individual consumer plugin. There
will thus be a conflict between what to include (to allow consumers access)
and what to exclude (to avoid excessive needless work).
- Materialising means defining a very specific format, which will tend to
make the API less generic and flexible.
- Unless the materialised format is made _very_ specific (and thus very
inflexible), it is unlikely to be directly useful for transport
(eg. binlog), so it will need to be re-materialised into a different format
anyway, wasting work.
- If a generator on top handles an event, then we want to avoid wasting work
materialising an event in a generator below which would be completely
unused. Thus there would be a need for the upper generator to somehow
notify the lower generator ahead of event generation time to not fire an
event, complicating the API.
Some advantages for materialisation:
- Using an API based on passing around some well-defined struct event (or
byte buffer) will be simpler than the complex class hierarchy proposed here
with no requirement for materialisation.
- Defining a materialised format would allow an easy way to use the same
consumer code on a generator that produces events at the source of
execution and on a generator that produces events from eg. reading them
from an event log.
Note that there can be some middle way, where some data is materialised and
some is kept as reference to context (eg. THD) only. This however loses most
of the mentioned advantages for materialisation.
The design proposed here aims for as little materialisation as possible.
Default materialisation format
------------------------------
While the proposed API doesn't _require_ materialisation, we can still think
about providing the _option_ for built-in materialisation. This could be
useful if such materialisation is made suitable for transport to a different
server (eg. no endian-dependence etc). If there is a facility for such
materialisation built-in to the API, it becomes possible to write something
like a generic binlog plugin or generic network transport plugin. This would
be really useful for eg. PBXT engine-level replication, as it could be
implemented without having to re-invent a binlog format.
I added in the proposed API a simple facility to materialise every event as a
string of bytes. To use this, I still need to add a suitable facility to
de-materialise the event.
However, it is still an open question whether such a facility will be at all
useful. It still has some of the problems with materialisation mentioned
above. And I think it is likely that a good binlog implementation will need
to do more than just blindly copy opaque events from one endpoint to
another. For example, it might need different event boundaries (merge and/or
split events); it might need to augment or modify events, or inject new
events, etc.
So I think maybe it is better to add such a generic materialisation facility
on top of the basic event generator API. Such a facility would provide
materialisation of a replication event stream, not of individual events, so
would be more flexible in providing a good implementation. It would be
implemented for all generators. It would be separate from both the event
generator API (so we have flexibility to put a filter class in-between
generator and materialisation), and could also be separate from the actual
transport handling stuff like fsync() of binlog files and socket connections
etc. It would be paired with a corresponding applier API which would handle
executing events on a slave.
Then we can have a default materialised event format, which is available, but
not mandatory. So there can still be other formats alongside (like legacy
MySQL 5.1 binlog event format and maybe Tungsten would have its own format).
Encapsulation
-------------
Another fundamental question about the design is the level of encapsulation
used for the API.
At the implementation level, a lot of the work is basically to pull out all of
the needed information from the THD object/context. The API I propose tries to
_not_ expose the THD to consumers. Instead it provides accessor functions for
all the bits and pieces relevant to each replication event, while the event
class itself likely will be more or less just an encapsulated THD.
So an alternative would be to have a generic event that was just (type, THD).
Then consumers could just pull out whatever information they want from the
THD. The THD implementation is already exposed to storage engines. This would
of course greatly reduce the size of the API, eliminating lots of class
definitions and accessor functions. Though arguably it wouldn't really
simplify the API, as the complexity would just be in understanding the THD
class.
Note that we do not have to take any performance hit from using encapsulated
accessors since compilers can inline them (though if inlining then we do not
get any ABI stability with respect to THD implementation).
For now, the API is proposed without exposing the THD class. (Similar
encapsulation could be added in actual implementation to also not expose TABLE
and similar classes).
LOW-LEVEL DESIGN:
A consumer is implemented as a virtual class (interface). There is one virtual
function for every event that can be received. A consumer would derive from
the base class and override methods for the events it wants to receive.
There is one consumer interface for each generator. When a generator A is
stacked on B, the consumer interface for A inherits from the interface for
B. This way, when A defers an event to B, the consumer for A will receive the
corresponding event from B.
There are methods for a consumer to register itself to receive events from
each generator. I still need to find a way for a consumer in one plugin to
register itself with a generator implemented in another plugin (eg. PBXT
engine-level replication). I also need to add a way for consumers to
de-register themselves.
The current design has consumer callbacks return 0 for success and error code
otherwise. I still need to think more about whether this is useful (ie. what
is the semantics of returning an error from a consumer callback).
Each event passed to consumers is defined as a class with public accessor
methods to a private context (which is mostly the THD).
My intention is to make all events passed around const, so that the same event
can be passed to each of multiple registered consumers (and to emphasise that
consumers do not have the ability to modify events). It still needs to be seen
whether that const-ness will be feasible in practice without very heavy
modification/constification of existing code.
What follows is a partial draft of a possible definition of the API as
concrete C++ class definitions.
-----------------------------------------------------------------------
/*
Virtual base class for generated replication events.
This is the parent of events generated from all kinds of generators. Only
child classes can be instantiated.
This class can be used by code that wants to treat events in a generic way,
without any knowledge of event details. I still need to decide whether such
generic code is sensible.
*/
class rpl_event_base
{
/*
Maybe we will want the ability to materialise an event to a standard
binary format. This could be achieved with a base method like this. The
actual materialisation would be implemented in each deriving class. The
public methods would provide different interfaces for specifying the
buffer or for writing directly into IO_CACHE or file.
*/
/* Return 0 on success, -1 on error, -2 on out-of-buffer. */
int materialise(uchar *buffer, size_t buflen) const;
/*
Returns NULL on error or else malloc()ed buffer with materialised event,
caller must free().
*/
uchar *materialise() const;
/* Same, but using a passed-in memroot. */
uchar *materialise(mem_root *memroot) const;
/*
Materialise to user-supplied writer function (could write directly to file
or the like).
*/
int materialise(int (*writer)(uchar *data, size_t len, void *context)) const;
/*
As for what to do with a materialised event, there are a couple of
possibilities.
One is to have a de_materialise() method somewhere that can construct an
rpl_event_base (really a derived class of course) from a buffer or writer
function. This would require each accessor function to conditionally read
its data from either THD context or buffer (GCC is able to optimise
several such conditionals in multiple accessor function calls into one
conditional), or we can make all accessors virtual if the performance hit
is acceptable.
Another is to have different classes for accessing events read from
materialised event data.
Also, I still need to think about whether it is at all useful to be able
to generically materialise an event at this level. It may be that any
binlog/transport will in any case need to understand more of the format of
events, so that such materialisation/transport is better done at a
different layer.
*/
protected:
/* Implementation which is the basis for materialise(). */
virtual int do_materialise(int (*writer)(uchar *data, size_t len,
void *context)) const = 0;
private:
/* Virtual base class, private constructor to prevent instantiation. */
rpl_event_base();
};
/*
These are the event types output from the transaction event generator.
This generator is not stacked on anything.
The transaction event generator marks the start and end (commit or rollback)
of transactions. It also gives information about whether the transaction was
a full transaction or autocommitted statement, whether transactional tables
were involved, whether non-transactional tables were involved, and XA
information (ToDo).
*/
/* Base class for transaction events. */
class rpl_event_transaction_base : public rpl_event_base
{
public:
/*
Get the local transaction id. This id is only unique within one server.
It is allocated whenever a new transaction is started.
Can be used to identify events belonging to the same transaction in a
binlog-like stream of events streamed in parallel among multiple
transactions.
*/
uint64_t get_local_trx_id() const { return thd->local_trx_id; };
bool get_is_autocommit() const;
private:
/* The context is the THD. */
THD *thd;
rpl_event_transaction_base(THD *_thd) : thd(_thd) { };
};
/* Transaction start event. */
class rpl_event_transaction_start : public rpl_event_transaction_base
{
};
/* Transaction commit. */
class rpl_event_transaction_commit : public rpl_event_transaction_base
{
public:
/*
The global transaction id is unique cross-server.
It can be used to identify the position from which to start a slave
replicating from a master.
This global ID is only available once the TC manager / primary redundancy
service has decided to commit the transaction. The TC also allocates the
ID and decides the exact semantics (can there be gaps, etc); however the
format is fixed (cluster_id, running_counter).
*/
struct global_transaction_id
{
uint32_t cluster_id;
uint64_t counter;
};
const global_transaction_id *get_global_transaction_id() const;
};
/* Transaction rollback. */
class rpl_event_transaction_rollback : public rpl_event_transaction_base
{
};
/* Base class for statement events. */
class rpl_event_statement_base : public rpl_event_base
{
public:
LEX_STRING get_current_db() const;
};
class rpl_event_statement_start : public rpl_event_statement_base
{
};
class rpl_event_statement_end : public rpl_event_statement_base
{
public:
int get_errorcode() const;
};
class rpl_event_statement_query : public rpl_event_statement_base
{
public:
LEX_STRING get_query_string();
ulong get_sql_mode();
const CHARSET_INFO *get_character_set_client();
const CHARSET_INFO *get_collation_connection();
const CHARSET_INFO *get_collation_server();
const CHARSET_INFO *get_collation_default_db();
/*
Access to relevant flags that affect query execution.
Use as: if (ev->get_flags() & (1U << STMT_FOREIGN_KEY_CHECKS)) { ... }
*/
enum flag_bits
{
STMT_FOREIGN_KEY_CHECKS, // @@foreign_key_checks
STMT_UNIQUE_KEY_CHECKS, // @@unique_checks
STMT_AUTO_IS_NULL, // @@sql_auto_is_null
};
uint32_t get_flags();
ulong get_auto_increment_offset();
ulong get_auto_increment_increment();
// And so on for: time zone; day/month names; connection id; LAST_INSERT_ID;
// INSERT_ID; random seed; user variables.
//
// We probably also need get_uses_temporary_table(), get_used_user_vars(),
// get_uses_auto_increment() and so on, so a consumer can get more
// information about what kind of context information a query will need when
// executed on a slave.
};
class rpl_event_statement_load_query : public rpl_event_statement_query
{
};
/*
This event is fired with blocks of data for files read (from server-local
file or client connection) for LOAD DATA.
*/
class rpl_event_statement_load_data_block : public rpl_event_statement_base
{
public:
struct block
{
const uchar *ptr;
size_t size;
};
block get_block() const;
};
/* Base class for row-based replication events. */
class rpl_event_row_base : public rpl_event_base
{
public:
/*
Access to relevant handler extra flags and other flags that affect row
operations.
Use as: if (ev->get_flags() & (1U << ROW_WRITE_CAN_REPLACE)) { ... }
*/
enum flag_bits
{
ROW_WRITE_CAN_REPLACE, // HA_EXTRA_WRITE_CAN_REPLACE
ROW_IGNORE_DUP_KEY, // HA_EXTRA_IGNORE_DUP_KEY
ROW_IGNORE_NO_KEY, // HA_EXTRA_IGNORE_NO_KEY
ROW_DISABLE_FOREIGN_KEY_CHECKS, // ! @@foreign_key_checks
ROW_DISABLE_UNIQUE_KEY_CHECKS, // ! @@unique_checks
};
uint32_t get_flags();
/* Access to list of tables modified. */
class table_iterator
{
public:
/* Returns table, NULL after last. */
const TABLE *get_next();
private:
// ...
};
table_iterator get_modified_tables() const;
private:
/* Context used to provide accessors. */
THD *thd;
protected:
rpl_event_row_base(THD *_thd) : thd(_thd) { }
};
class rpl_event_row_write : public rpl_event_row_base
{
public:
const BITMAP *get_write_set() const;
const uchar *get_after_image() const;
};
class rpl_event_row_update : public rpl_event_row_base
{
public:
const BITMAP *get_read_set() const;
const BITMAP *get_write_set() const;
const uchar *get_before_image() const;
const uchar *get_after_image() const;
};
class rpl_event_row_delete : public rpl_event_row_base
{
public:
const BITMAP *get_read_set() const;
const uchar *get_before_image() const;
};
/*
Event consumer callbacks.
An event consumer registers with an event generator to receive event
notifications from that generator.
The consumer has callbacks (in the form of virtual functions) for the
individual event types the consumer is interested in. Only callbacks that
are non-NULL will be invoked. If an event applies to multiple callbacks in a
single callback struct, it will only be passed to the most specific non-NULL
callback (so events never fire more than once per registration).
The lifetime of the memory holding the event is only for the duration of the
callback invocation, unless otherwise noted.
Callbacks return 0 for success or error code (ToDo: does this make sense?).
*/
struct rpl_event_consumer_transaction
{
virtual int trx_start(const rpl_event_transaction_start *) { return 0; }
virtual int trx_commit(const rpl_event_transaction_commit *) { return 0; }
virtual int trx_rollback(const rpl_event_transaction_rollback *) { return 0; }
};
/*
Consuming statement-based events.
The statement event generator is stacked on top of the transaction event
generator, so we can receive those events as well.
*/
struct rpl_event_consumer_statement : public rpl_event_consumer_transaction
{
virtual int stmt_start(const rpl_event_statement_start *) { return 0; }
virtual int stmt_end(const rpl_event_statement_end *) { return 0; }
virtual int stmt_query(const rpl_event_statement_query *) { return 0; }
/* Data for a file used in LOAD DATA [LOCAL] INFILE. */
virtual int stmt_load_data_block(const rpl_event_statement_load_data_block *)
{ return 0; }
/*
These are specific kinds of statements; if specified they override
stmt_query() for the corresponding event.
*/
virtual int stmt_load_query(const rpl_event_statement_load_query *ev)
{ return stmt_query(ev); }
};
/*
Consuming row-based events.
The row event generator is stacked on top of the statement event generator.
*/
struct rpl_event_consumer_row : public rpl_event_consumer_statement
{
virtual int row_write(const rpl_event_row_write *) { return 0; }
virtual int row_update(const rpl_event_row_update *) { return 0; }
virtual int row_delete(const rpl_event_row_delete *) { return 0; }
};
/*
Registration functions.
ToDo: Make a way to de-register.
ToDo: Find a good way for a plugin (eg. PBXT) to publish a generator
registration method.
*/
int rpl_event_transaction_register(const rpl_event_consumer_transaction *cbs);
int rpl_event_statement_register(const rpl_event_consumer_statement *cbs);
int rpl_event_row_register(const rpl_event_consumer_row *cbs);
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v3.5.9)