----------------------------------------------------------------------- WORKLOG TASK -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=- TASK...........: Store in binlog text of statements that caused RBR events CREATION DATE..: Sat, 15 Aug 2009, 23:48 SUPERVISOR.....: Monty IMPLEMENTOR....: COPIES TO......: Knielsen, Serg CATEGORY.......: Server-Sprint TASK ID........: 47 (http://askmonty.org/worklog/?tid=47) VERSION........: Server-9.x STATUS.........: Code-Review PRIORITY.......: 60 WORKED HOURS...: 41 ESTIMATE.......: 0 (hours remain) ORIG. ESTIMATE.: 35 PROGRESS NOTES: -=-=(Knielsen - Mon, 07 Jun 2010, 07:13)=-=- Help debug some test failures seen in Buildbot. Worked 6 hours and estimate 0 hours remain (original estimate increased by 6 hours). -=-=(Knielsen - Mon, 31 May 2010, 06:49)=-=- Help Alexi debug+fix some test problems in the patch. Worked 4 hours and estimate 0 hours remain (original estimate unchanged). -=-=(Knielsen - Tue, 25 May 2010, 08:29)=-=- Help debug strange problem in mysqlbinlog.test. Worked 1 hour and estimate 4 hours remain (original estimate unchanged). -=-=(Knielsen - Mon, 17 May 2010, 08:45)=-=- Merge with latest trunk and run Buildbot tests. Worked 1 hour and estimate 5 hours remain (original estimate unchanged). -=-=(Knielsen - Wed, 05 May 2010, 13:53)=-=- Review of fixes to first review done. No new issues found. Worked 2 hours and estimate 6 hours remain (original estimate unchanged). -=-=(Knielsen - Fri, 23 Apr 2010, 12:51)=-=- Status updated. --- /tmp/wklog.47.old.28747 2010-04-23 12:51:36.000000000 +0000 +++ /tmp/wklog.47.new.28747 2010-04-23 12:51:36.000000000 +0000 @@ -1 +1 @@ -In-Progress +Code-Review -=-=(Knielsen - Tue, 06 Apr 2010, 15:26)=-=- Code review (mailed to maria-developers@). Worked 7 hours and estimate 8 hours remain (original estimate unchanged). -=-=(Knielsen - Tue, 06 Apr 2010, 15:25)=-=- Status updated. --- /tmp/wklog.47.old.12734 2010-04-06 15:25:54.000000000 +0000 +++ /tmp/wklog.47.new.12734 2010-04-06 15:25:54.000000000 +0000 @@ -1 +1 @@ -Code-Review +In-Progress -=-=(Knielsen - Mon, 29 Mar 2010, 10:59)=-=- Status updated. --- /tmp/wklog.47.old.27790 2010-03-29 10:59:53.000000000 +0000 +++ /tmp/wklog.47.new.27790 2010-03-29 10:59:53.000000000 +0000 @@ -1 +1 @@ -In-Progress +Code-Review -=-=(Alexi - Thu, 18 Feb 2010, 19:29)=-=- Worked 20 hours (alexi) Worked 20 hours and estimate 15 hours remain (original estimate unchanged). ------------------------------------------------------------ -=-=(View All Progress Notes, 33 total)=-=- http://askmonty.org/worklog/index.pl?tid=47&nolimit=1 DESCRIPTION: Store in binlog (and show in mysqlbinlog output) texts of statements that caused RBR events This is needed for (list from Monty): - Easier to understand why updates happened - Would make it easier to find out where in application things went wrong (as you can search for exact strings) - Allow one to filter things based on comments in the statement. The cost of this can be that the binlog will be approximately 2x in size (especially insert of big blob's would be a bit painful), so this should be an optional feature. HIGH-LEVEL SPECIFICATION: Content ~~~~~~~ 1. Annotate_rows_log_event 2. Server option: --binlog-annotate-rows-events 3. Server option: --replicate-annotate-rows-events 4. mysqlbinlog option: --print-annotate-rows-events 5. mysqlbinlog output 1. Annotate_rows_log_event [ ANNOTATE_ROWS_EVENT ] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Describes the query which caused the corresponding rows events. Has empty post-header and contains the query text in its data part. Example: ************************ ANNOTATE_ROWS_EVENT ************************ 00000220 | B6 A0 2C 4B | time_when = 1261215926 00000224 | 33 | event_type = 51 00000225 | 64 00 00 00 | server_id = 100 00000229 | 36 00 00 00 | event_len = 54 0000022D | 56 02 00 00 | log_pos = 00000256 00000231 | 00 00 | flags = <none> ------------------------ 00000233 | 49 4E 53 45 | query = "INSERT INTO t1 VALUES (1), (2), (3)" 00000237 | 52 54 20 49 | 0000023B | 4E 54 4F 20 | 0000023F | 74 31 20 56 | 00000243 | 41 4C 55 45 | 00000247 | 53 20 28 31 | 0000024B | 29 2C 20 28 | 0000024F | 32 29 2C 20 | 00000253 | 28 33 29 | ************************ In binary log, Annotate_rows event follows the (possible) 'BEGIN' Query event and precedes the first of Table map events which accompany the corresponding rows events. (See example in the "mysqlbinlog output" section below.) 2. Server option: --binlog-annotate-rows-events ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Tells the master to write Annotate_rows events to the binary log. * Variable Name: binlog_annotate_rows_events * Scope: Global & Session * Access Type: Dynamic * Data Type: bool * Default Value: OFF NOTE. Session values allows to annotate only some selected statements: ... SET SESSION binlog_annotate_rows_events=ON; ... statements to be annotated ... SET SESSION binlog_annotate_rows_events=OFF; ... statements not to be annotated ... 3. Server option: --replicate-annotate-rows-events ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Tells the slave to reproduce Annotate_rows events recieved from the master in its own binary log (sensible only in pair with log-slave-updates option). * Variable Name: replicate_annotate_rows_events * Scope: Global * Access Type: Read only * Data Type: bool * Default Value: OFF NOTE. Why do we additionally need this 'replicate' option? Why not to make the slave to reproduce this events when its binlog-annotate-rows-events global value is ON? Well, because, for example, we may want to configure the slave which should reproduce Annotate_rows events but has global binlog-annotate-rows-events = OFF meaning this to be the default value for the client threads (see also "How slave treats replicate-annotate-rows-events option" in LLD part). 4. mysqlbinlog option: --print-annotate-rows-events ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ With this option, mysqlbinlog prints the content of Annotate_rows events (if the binary log does contain them). Without this option (i.e. by default), mysqlbinlog skips Annotate_rows events. 5. mysqlbinlog output ~~~~~~~~~~~~~~~~~~~~~ With --print-annotate-rows-events, mysqlbinlog outputs Annotate_rows events in a form like this: ... # at 1646 #091219 12:45:26 server id 100 end_log_pos 1714 Query thread_id=1 exec_time=0 error_code=0 SET TIMESTAMP=1261215926/*!*/; BEGIN /*!*/; # at 1714 # at 1812 # at 1853 # at 1894 # at 1938 #091219 12:45:26 server id 100 end_log_pos 1812 Query: `DELETE t1, t2 FROM t1 INNER JOIN t2 INNER JOIN t3 WHERE t1.a=t2.a AND t2.a=t3.a` #091219 12:45:26 server id 100 end_log_pos 1853 Table_map: `test`.`t1` mapped to number 16 #091219 12:45:26 server id 100 end_log_pos 1894 Table_map: `test`.`t2` mapped to number 17 #091219 12:45:26 server id 100 end_log_pos 1938 Delete_rows: table id 16 #091219 12:45:26 server id 100 end_log_pos 1982 Delete_rows: table id 17 flags: STMT_END_F ... LOW-LEVEL DESIGN: Content ~~~~~~~ 1. Annotate_rows event number 2. Outline of Annotate_rows event behavior 3. How Master writes Annotate_rows events to the binary log 4. How slave treats replicate-annotate-rows-events option 5. How slave IO thread requests Annotate_rows events 6. How master executes the request 7. How slave SQL thread processes Annotate_rows events 8. General remarks 1. Annotate_rows event number ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ To avoid possible event numbers conflict with MySQL/Sun, we leave a gap between the last MySQL event number and the Annotate_rows event number: enum Log_event_type { ... INCIDENT_EVENT= 26, // New MySQL event numbers are to be added here MYSQL_EVENTS_END, MARIA_EVENTS_BEGIN= 51, // New Maria event numbers start from here ANNOTATE_ROWS_EVENT= 51, ENUM_END_EVENT }; together with the corresponding extension of 'post_header_len' array in the Format description event. (This extension does not affect the compatibility of the binary log). Here is how Format description event looks like with this extension: ************************ FORMAT_DESCRIPTION_EVENT ************************ 00000004 | A1 A0 2C 4B | time_when = 1261215905 00000008 | 0F | event_type = 15 00000009 | 64 00 00 00 | server_id = 100 0000000D | 7F 00 00 00 | event_len = 127 00000011 | 83 00 00 00 | log_pos = 00000083 00000015 | 01 00 | flags = LOG_EVENT_BINLOG_IN_USE_F ------------------------ 00000017 | 04 00 | binlog_ver = 4 00000019 | 35 2E 32 2E | server_ver = 5.2.0-MariaDB-alpha-debug-log ..... ... 0000004B | A1 A0 2C 4B | time_created = 1261215905 0000004F | 13 | common_header_len = 19 ------------------------ post_header_len ------------------------ 00000050 | 38 | 56 - START_EVENT_V3 [1] ..... ... 00000069 | 02 | 2 - INCIDENT_EVENT [26] 0000006A | 00 | 0 - RESERVED [27] ..... ... 00000081 | 00 | 0 - RESERVED [50] 00000082 | 00 | 0 - ANNOTATE_ROWS_EVENT [51] ************************ 2. Outline of Annotate_rows event behavior ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Each Annotate_rows_log_event object has two private members describing the corresponding query: char *m_query_txt; uint m_query_len; When the object is created for writing to a binary log, this query is taken from 'thd' (for short, below we omit the 'Annotate_rows_log_event::' prefix as well as other implementation details): Annotate_rows_log_event(THD *thd) { m_query_txt = thd->query(); m_query_len = thd->query_length(); } When the object is read from a binary log, the query is taken from the buffer containing the binary log representation of the event (this buffer is allocated in Log_event object from which all Log events are derived): Annotate_rows_log_event(char *buf, uint event_len, Format_description_log_event *desc) { m_query_len = event_len - desc->common_header_len; m_query_txt = buf + desc->common_header_len; } The events are written to the binary log by the Log_event::write() member which calls virtual write_data_header() and write_data_body() members ("data header" and "post header" are synonym in replication terminology). In our case, data header is empty and data body is just the query: bool write_data_body(IO_CACHE *file) { return my_b_safe_write(file, (uchar*) m_query_txt, m_query_len); } Printing the event is just printing the query: void Annotate_rows_log_event::print(FILE *file, PRINT_EVENT_INFO *pinfo) { my_b_printf(&pinfo->head_cache, "\tQuery: `%s`\n", m_query_txt); } 3. How Master writes Annotate_rows events to the binary log ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The event is written to the binary log just before the group of Table_map events which precede corresponding Rows events (one query may generate several Table map events in the binary log, but the corresponding Annotate_rows event must be written only once before the first Table map event; hence the boolean variable 'with_annotate' below): int write_locked_table_maps(THD *thd) { ... bool with_annotate= thd->variables.binlog_annotate_rows_events; ... for (uint i= 0; i < ... <number of tables> ...; ++i) { ... thd->binlog_write_table_map(table, ..., with_annotate); with_annotate= 0; // write Annotate_event not more than once ... } ... } int THD::binlog_write_table_map(TABLE *table, ..., bool with_annotate) { ... Table_map_log_event the_event(...); ... if (with_annotate) { Annotate_rows_log_event anno(this); mysql_bin_log.write(&anno); } mysql_bin_log.write(&the_event); ... } 4. How slave treats replicate-annotate-rows-events option ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The replicate-annotate-rows-events option is treated just as the session value of the binlog_annotate_rows_events variable for the slave IO and SQL threads. This setting is done during initialization of these threads: pthread_handler_t handle_slave_io(void *arg) { THD *thd= new THD; ... init_slave_thread(thd, SLAVE_THD_IO); ... } pthread_handler_t handle_slave_sql(void *arg) { THD *thd= new THD; ... init_slave_thread(thd, SLAVE_THD_SQL); ... } int init_slave_thread(THD* thd, SLAVE_THD_TYPE thd_type) { ... thd->variables.binlog_annotate_rows_events= opt_replicate_annotate_rows_events; ... } 5. How slave IO thread requests Annotate_rows events ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ If the replicate-annotate-rows-events option is not set on a slave, there is no need for master to send Annotate_rows events to this slave. The slave (or mysqlbinlog in remote case), before requesting binlog dump via the COM_BINLOG_DUMP command, informs the master whether it should send these events by executing the newly added COM_BINLOG_DUMP_OPTIONS_EXT server command: case COM_BINLOG_DUMP_OPTIONS_EXT: thd->binlog_dump_flags_ext= packet[0]; my_ok(thd); break; Note. We add this new command and don't use COM_BINLOG_DUMP to avoid possible conflicts with MySQL/Sun. 6. How master executes the request ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ case COM_BINLOG_DUMP: { ... flags= uint2korr(packet + 4); ... mysql_binlog_send(thd, ..., flags); ... } void mysql_binlog_send(THD* thd, ..., ushort flags) { ... Log_event::read_log_event(&log, packet, ...); ... if ((*packet)[EVENT_TYPE_OFFSET + 1] != ANNOTATE_ROWS_EVENT || flags & BINLOG_SEND_ANNOTATE_ROWS_EVENT) { my_net_write(net, packet->ptr(), packet->length()); } ... } 7. How slave SQL thread processes Annotate_rows events ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The slave processes each recieved event by "applying" it, i.e. by calling the Log_event::apply_event() function which in turn calls the virtual do_apply_event() member specific for each type of the event. int exec_relay_log_event(THD* thd, Relay_log_info* rli) { ... Log_event *ev = next_event(rli); ... apply_event_and_update_pos(ev, ...); if (ev->get_type_code() != FORMAT_DESCRIPTION_EVENT) delete ev; ... } int apply_event_and_update_pos(Log_event *ev, ...) { ... ev->apply_event(...); ... } int Log_event::apply_event(...) { return do_apply_event(...); } What does it mean to "apply" an Annotate_rows event? It means to set current thd query to that of the described by the event, i.e. to the query which caused the subsequent Rows events (see "How Master writes Annotate_rows events to the binary log" to follow what happens further when the subsequent Rows events are applied): int Annotate_rows_log_event::do_apply_event(...) { thd->set_query(m_query_txt, m_query_len); } NOTE. I am not sure, but possibly current values of thd->query and thd->query_length should be saved before calling set_query() and to be restored on the Annotate_rows_log_event object deletion. Is it really needed ? After calling this do_apply_event() function we may not delete the Annotate_rows_log_event object immediatedly (see exec_relay_log_event() above) because thd->query now points to the string inside this object. We may keep the pointer to this object in the Relay_log_info: class Relay_log_info { public: ... void set_annotate_event(Annotate_rows_log_event*); Annotate_rows_log_event* get_annotate_event(); void free_annotate_event(); ... private: Annotate_rows_log_event* m_annotate_event; }; The saved Annotate_rows object should be deleted when all corresponding Rows events will be processed: int exec_relay_log_event(THD* thd, Relay_log_info* rli) { ... Log_event *ev= next_event(rli); ... apply_event_and_update_pos(ev, ...); if (rli->get_annotate_event() && is_last_rows_event(ev)) rli->free_annotate_event(); else if (ev->get_type_code() == ANNOTATE_ROWS_EVENT) rli->set_annotate_event((Annotate_rows_log_event*) ev); else if (ev->get_type_code() != FORMAT_DESCRIPTION_EVENT) delete ev; ... } where bool is_last_rows_event(Log_event* ev) { Log_event_type type= ev->get_type_code(); if (IS_ROWS_EVENT_TYPE(type)) { Rows_log_event* rows= (Rows_log_event*)ev; return rows->get_flags(Rows_log_event::STMT_END_F); } return 0; } #define IS_ROWS_EVENT_TYPE(type) ((type) == WRITE_ROWS_EVENT || \ (type) == UPDATE_ROWS_EVENT || \ (type) == DELETE_ROWS_EVENT) 8. General remarks ~~~~~~~~~~~~~~~~~~ Kristian noticed that introducing new log event type should be coordinated somehow with MySQL/Sun: Kristian: The numeric code for this event must be assigned carefully. It should be coordinated with MySQL/Sun, otherwise we can get into a situation where MySQL uses the same numeric code for one event that MariaDB uses for ANNOTATE_ROWS_EVENT, which would make merging the two impossible. Alex: I reserved about 20 numbers not to have possible conflicts with MySQL. Kristian: Still, I think it would be appropriate to send a polite email to internals@lists.mysql.com about this and suggesting to reserve the event number. ESTIMATED WORK TIME ESTIMATED COMPLETION DATE ----------------------------------------------------------------------- WorkLog (v3.5.9)