[Maria-developers] Progress (by Alexi): Store in binlog text of statements that caused RBR events (47)
by worklog-noreply@askmonty.org 24 Jun '10
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: Store in binlog text of statements that caused RBR events
CREATION DATE..: Sat, 15 Aug 2009, 23:48
SUPERVISOR.....: Monty
IMPLEMENTOR....:
COPIES TO......: Knielsen, Serg
CATEGORY.......: Server-Sprint
TASK ID........: 47 (http://askmonty.org/worklog/?tid=47)
VERSION........: Server-9.x
STATUS.........: Code-Review
PRIORITY.......: 60
WORKED HOURS...: 67
ESTIMATE.......: 0 (hours remain)
ORIG. ESTIMATE.: 35
PROGRESS NOTES:
-=-=(Alexi - Thu, 24 Jun 2010, 09:47)=-=-
Making rpl- and binlog-tests stable w.r.t. adding new binlog events.
Worked 25 hours and estimate 0 hours remain (original estimate increased by 25 hours).
-=-=(Knielsen - Mon, 21 Jun 2010, 08:32)=-=-
Final review.
Assist with some problems applying the patch.
Worked 1 hour and estimate 0 hours remain (original estimate increased by 1 hour).
-=-=(Guest - Thu, 17 Jun 2010, 00:38)=-=-
Dependency deleted: 39 no longer depends on 47
-=-=(Knielsen - Mon, 07 Jun 2010, 07:13)=-=-
Help debug some test failures seen in Buildbot.
Worked 6 hours and estimate 0 hours remain (original estimate increased by 6 hours).
-=-=(Knielsen - Mon, 31 May 2010, 06:49)=-=-
Help Alexi debug+fix some test problems in the patch.
Worked 4 hours and estimate 0 hours remain (original estimate unchanged).
-=-=(Knielsen - Tue, 25 May 2010, 08:29)=-=-
Help debug strange problem in mysqlbinlog.test.
Worked 1 hour and estimate 4 hours remain (original estimate unchanged).
-=-=(Knielsen - Mon, 17 May 2010, 08:45)=-=-
Merge with latest trunk and run Buildbot tests.
Worked 1 hour and estimate 5 hours remain (original estimate unchanged).
-=-=(Knielsen - Wed, 05 May 2010, 13:53)=-=-
Review of fixes to first review done. No new issues found.
Worked 2 hours and estimate 6 hours remain (original estimate unchanged).
-=-=(Knielsen - Fri, 23 Apr 2010, 12:51)=-=-
Status updated.
--- /tmp/wklog.47.old.28747 2010-04-23 12:51:36.000000000 +0000
+++ /tmp/wklog.47.new.28747 2010-04-23 12:51:36.000000000 +0000
@@ -1 +1 @@
-In-Progress
+Code-Review
-=-=(Knielsen - Tue, 06 Apr 2010, 15:26)=-=-
Code review (mailed to maria-developers@).
Worked 7 hours and estimate 8 hours remain (original estimate unchanged).
------------------------------------------------------------
-=-=(View All Progress Notes, 36 total)=-=-
http://askmonty.org/worklog/index.pl?tid=47&nolimit=1
DESCRIPTION:
Store in binlog (and show in mysqlbinlog output) texts of statements that
caused RBR events
This is needed for (list from Monty):
- Easier to understand why updates happened
- Would make it easier to find out where in application things went
wrong (as you can search for exact strings)
- Allow one to filter things based on comments in the statement.
The cost of this is that the binlog may grow to approximately 2x its size
(inserts of big blobs in particular would be a bit painful), so this should
be an optional feature.
HIGH-LEVEL SPECIFICATION:
Content
~~~~~~~
1. Annotate_rows_log_event
2. Server option: --binlog-annotate-rows-events
3. Server option: --replicate-annotate-rows-events
4. mysqlbinlog option: --print-annotate-rows-events
5. mysqlbinlog output
1. Annotate_rows_log_event [ ANNOTATE_ROWS_EVENT ]
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Describes the query which caused the corresponding rows events. Has an empty
post-header and contains the query text in its data part. Example:
************************
ANNOTATE_ROWS_EVENT
************************
00000220 | B6 A0 2C 4B | time_when = 1261215926
00000224 | 33 | event_type = 51
00000225 | 64 00 00 00 | server_id = 100
00000229 | 36 00 00 00 | event_len = 54
0000022D | 56 02 00 00 | log_pos = 00000256
00000231 | 00 00 | flags = <none>
------------------------
00000233 | 49 4E 53 45 | query = "INSERT INTO t1 VALUES (1), (2), (3)"
00000237 | 52 54 20 49 |
0000023B | 4E 54 4F 20 |
0000023F | 74 31 20 56 |
00000243 | 41 4C 55 45 |
00000247 | 53 20 28 31 |
0000024B | 29 2C 20 28 |
0000024F | 32 29 2C 20 |
00000253 | 28 33 29 |
************************
In the binary log, the Annotate_rows event follows the (possible) 'BEGIN' Query
event and precedes the first of the Table map events which accompany the
corresponding rows events. (See example in the "mysqlbinlog output" section below.)
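Schematically, the event sequence for a single annotated statement is therefore:

  (BEGIN) Query event
  Annotate_rows event
  Table_map event(s), one per table touched by the statement
  Rows events (Write_rows / Update_rows / Delete_rows, the last one carrying STMT_END_F)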
2. Server option: --binlog-annotate-rows-events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Tells the master to write Annotate_rows events to the binary log.
* Variable Name: binlog_annotate_rows_events
* Scope: Global & Session
* Access Type: Dynamic
* Data Type: bool
* Default Value: OFF
NOTE. The session value allows annotating only selected statements:
...
SET SESSION binlog_annotate_rows_events=ON;
... statements to be annotated ...
SET SESSION binlog_annotate_rows_events=OFF;
... statements not to be annotated ...
3. Server option: --replicate-annotate-rows-events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Tells the slave to reproduce Annotate_rows events received from the master
in its own binary log (sensible only in combination with the log-slave-updates option).
* Variable Name: replicate_annotate_rows_events
* Scope: Global
* Access Type: Read only
* Data Type: bool
* Default Value: OFF
NOTE. Why do we additionally need this 'replicate' option? Why not make the
slave reproduce these events whenever its binlog-annotate-rows-events global
value is ON? Because, for example, we may want to configure a slave that
reproduces Annotate_rows events but keeps the global
binlog-annotate-rows-events = OFF as the default value for its client threads
(see also "How slave treats replicate-annotate-rows-events option" in the LLD
part).
4. mysqlbinlog option: --print-annotate-rows-events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
With this option, mysqlbinlog prints the content of Annotate_rows events (if
the binary log contains them). Without this option (i.e. by default),
mysqlbinlog skips Annotate_rows events.
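For example, assuming a binary log file named master-bin.000001 (a
hypothetical name), the annotations could be displayed with:

  mysqlbinlog --print-annotate-rows-events master-bin.000001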
5. mysqlbinlog output
~~~~~~~~~~~~~~~~~~~~~
With --print-annotate-rows-events, mysqlbinlog outputs Annotate_rows events
in a form like this:
...
# at 1646
#091219 12:45:26 server id 100 end_log_pos 1714 Query thread_id=1
exec_time=0 error_code=0
SET TIMESTAMP=1261215926/*!*/;
BEGIN
/*!*/;
# at 1714
# at 1812
# at 1853
# at 1894
# at 1938
#091219 12:45:26 server id 100 end_log_pos 1812 Query: `DELETE t1, t2 FROM
t1 INNER JOIN t2 INNER JOIN t3 WHERE t1.a=t2.a AND t2.a=t3.a`
#091219 12:45:26 server id 100 end_log_pos 1853 Table_map: `test`.`t1`
mapped to number 16
#091219 12:45:26 server id 100 end_log_pos 1894 Table_map: `test`.`t2`
mapped to number 17
#091219 12:45:26 server id 100 end_log_pos 1938 Delete_rows: table id 16
#091219 12:45:26 server id 100 end_log_pos 1982 Delete_rows: table id 17
flags: STMT_END_F
...
LOW-LEVEL DESIGN:
Content
~~~~~~~
1. Annotate_rows event number
2. Outline of Annotate_rows event behavior
3. How Master writes Annotate_rows events to the binary log
4. How slave treats replicate-annotate-rows-events option
5. How slave IO thread requests Annotate_rows events
6. How master executes the request
7. How slave SQL thread processes Annotate_rows events
8. General remarks
1. Annotate_rows event number
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
To avoid possible event-number conflicts with MySQL/Sun, we leave a gap
between the last MySQL event number and the Annotate_rows event number:
enum Log_event_type
{ ...
INCIDENT_EVENT= 26,
// New MySQL event numbers are to be added here
MYSQL_EVENTS_END,
MARIA_EVENTS_BEGIN= 51,
// New Maria event numbers start from here
ANNOTATE_ROWS_EVENT= 51,
ENUM_END_EVENT
};
together with the corresponding extension of the 'post_header_len' array in the
Format description event. (This extension does not affect the compatibility
of the binary log.) Here is how the Format description event looks with
this extension:
************************
FORMAT_DESCRIPTION_EVENT
************************
00000004 | A1 A0 2C 4B | time_when = 1261215905
00000008 | 0F | event_type = 15
00000009 | 64 00 00 00 | server_id = 100
0000000D | 7F 00 00 00 | event_len = 127
00000011 | 83 00 00 00 | log_pos = 00000083
00000015 | 01 00 | flags = LOG_EVENT_BINLOG_IN_USE_F
------------------------
00000017 | 04 00 | binlog_ver = 4
00000019 | 35 2E 32 2E | server_ver = 5.2.0-MariaDB-alpha-debug-log
..... ...
0000004B | A1 A0 2C 4B | time_created = 1261215905
0000004F | 13 | common_header_len = 19
------------------------
post_header_len
------------------------
00000050 | 38 | 56 - START_EVENT_V3 [1]
..... ...
00000069 | 02 | 2 - INCIDENT_EVENT [26]
0000006A | 00 | 0 - RESERVED [27]
..... ...
00000081 | 00 | 0 - RESERVED [50]
00000082 | 00 | 0 - ANNOTATE_ROWS_EVENT [51]
************************
2. Outline of Annotate_rows event behavior
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Each Annotate_rows_log_event object has two private members describing the
corresponding query:
char *m_query_txt;
uint m_query_len;
When the object is created for writing to a binary log, this query is taken
from 'thd' (for brevity, below we omit the 'Annotate_rows_log_event::' prefix
as well as other implementation details):
Annotate_rows_log_event(THD *thd)
{
m_query_txt = thd->query();
m_query_len = thd->query_length();
}
When the object is read from a binary log, the query is taken from the buffer
containing the binary log representation of the event (this buffer is allocated
in the Log_event object from which all log events are derived):
Annotate_rows_log_event(char *buf, uint event_len,
Format_description_log_event *desc)
{
m_query_len = event_len - desc->common_header_len;
m_query_txt = buf + desc->common_header_len;
}
The events are written to the binary log by the Log_event::write() member,
which calls the virtual write_data_header() and write_data_body() members
("data header" and "post header" are synonyms in replication terminology).
In our case, the data header is empty and the data body is just the query:
bool write_data_body(IO_CACHE *file)
{
return my_b_safe_write(file, (uchar*) m_query_txt, m_query_len);
}
Printing the event is just printing the query:
void Annotate_rows_log_event::print(FILE *file, PRINT_EVENT_INFO *pinfo)
{
my_b_printf(&pinfo->head_cache, "\tQuery: `%s`\n", m_query_txt);
}
3. How Master writes Annotate_rows events to the binary log
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The event is written to the binary log just before the group of Table_map
events which precede the corresponding Rows events (one query may generate
several Table map events in the binary log, but the corresponding
Annotate_rows event must be written only once, before the first Table map
event; hence the boolean variable 'with_annotate' below):
int write_locked_table_maps(THD *thd)
{ ...
bool with_annotate= thd->variables.binlog_annotate_rows_events;
...
for (uint i= 0; i < ... <number of tables> ...; ++i)
{ ...
thd->binlog_write_table_map(table, ..., with_annotate);
with_annotate= 0; // write Annotate_event not more than once
...
}
...
}
int THD::binlog_write_table_map(TABLE *table, ..., bool with_annotate)
{ ...
Table_map_log_event the_event(...);
...
if (with_annotate)
{
Annotate_rows_log_event anno(this);
mysql_bin_log.write(&anno);
}
mysql_bin_log.write(&the_event);
...
}
4. How slave treats replicate-annotate-rows-events option
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The replicate-annotate-rows-events option is treated just as the session
value of the binlog_annotate_rows_events variable for the slave IO and
SQL threads. This setting is done during initialization of these threads:
pthread_handler_t handle_slave_io(void *arg)
{
THD *thd= new THD;
...
init_slave_thread(thd, SLAVE_THD_IO);
...
}
pthread_handler_t handle_slave_sql(void *arg)
{
THD *thd= new THD;
...
init_slave_thread(thd, SLAVE_THD_SQL);
...
}
int init_slave_thread(THD* thd, SLAVE_THD_TYPE thd_type)
{ ...
thd->variables.binlog_annotate_rows_events=
opt_replicate_annotate_rows_events;
...
}
5. How slave IO thread requests Annotate_rows events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
If the replicate-annotate-rows-events option is not set on a slave, there
is no need for the master to send Annotate_rows events to this slave. The slave
(or mysqlbinlog in the remote case), before requesting a binlog dump via the
COM_BINLOG_DUMP command, informs the master whether it should send these
events by executing the newly added COM_BINLOG_DUMP_OPTIONS_EXT server
command:
case COM_BINLOG_DUMP_OPTIONS_EXT:
thd->binlog_dump_flags_ext= packet[0];
my_ok(thd);
break;
Note. We add this new command and don't use COM_BINLOG_DUMP to avoid possible
conflicts with MySQL/Sun.
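A rough sketch of the requesting side (illustrative only; apart from
COM_BINLOG_DUMP_OPTIONS_EXT and BINLOG_SEND_ANNOTATE_ROWS_EVENT, the names and
the exact packet layout are assumptions, not part of this specification):

  // Sent by the slave IO thread (or mysqlbinlog --read-from-remote-server)
  // before COM_BINLOG_DUMP; the single byte carries the extra dump flags.
  uchar options_buf[1];
  options_buf[0]= opt_replicate_annotate_rows_events ?
                  BINLOG_SEND_ANNOTATE_ROWS_EVENT : 0;
  simple_command(mysql, COM_BINLOG_DUMP_OPTIONS_EXT, options_buf, 1, 0);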
6. How master executes the request
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
case COM_BINLOG_DUMP:
{ ...
flags= uint2korr(packet + 4);
...
mysql_binlog_send(thd, ..., flags);
...
}
void mysql_binlog_send(THD* thd, ..., ushort flags)
{ ...
Log_event::read_log_event(&log, packet, ...);
...
if ((*packet)[EVENT_TYPE_OFFSET + 1] != ANNOTATE_ROWS_EVENT ||
flags & BINLOG_SEND_ANNOTATE_ROWS_EVENT)
{
my_net_write(net, packet->ptr(), packet->length());
}
...
}
7. How slave SQL thread processes Annotate_rows events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The slave processes each received event by "applying" it, i.e. by
calling the Log_event::apply_event() function, which in turn calls
the virtual do_apply_event() member specific to each type of
event.
int exec_relay_log_event(THD* thd, Relay_log_info* rli)
{ ...
Log_event *ev = next_event(rli);
...
apply_event_and_update_pos(ev, ...);
if (ev->get_type_code() != FORMAT_DESCRIPTION_EVENT)
delete ev;
...
}
int apply_event_and_update_pos(Log_event *ev, ...)
{ ...
ev->apply_event(...);
...
}
int Log_event::apply_event(...)
{
return do_apply_event(...);
}
What does it mean to "apply" an Annotate_rows event? It means to set current
thd query to that of the described by the event, i.e. to the query which
caused the subsequent Rows events (see "How Master writes Annotate_rows
events to the binary log" to follow what happens further when the subsequent
Rows events are applied):
int Annotate_rows_log_event::do_apply_event(...)
{
thd->set_query(m_query_txt, m_query_len);
}
NOTE. I am not sure, but possibly the current values of thd->query and
thd->query_length should be saved before calling set_query() and
restored when the Annotate_rows_log_event object is deleted.
Is this really needed?
After calling this do_apply_event() function we may not delete the
Annotate_rows_log_event object immediately (see exec_relay_log_event()
above) because thd->query now points to the string inside this object.
We can keep a pointer to this object in the Relay_log_info:
class Relay_log_info
{
public:
...
void set_annotate_event(Annotate_rows_log_event*);
Annotate_rows_log_event* get_annotate_event();
void free_annotate_event();
...
private:
Annotate_rows_log_event* m_annotate_event;
};
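A minimal sketch of how these members could be implemented (illustrative only,
not the actual patch):

  void Relay_log_info::set_annotate_event(Annotate_rows_log_event *event)
  {
    free_annotate_event();          // drop any previously saved annotation
    m_annotate_event= event;
  }

  Annotate_rows_log_event* Relay_log_info::get_annotate_event()
  {
    return m_annotate_event;
  }

  void Relay_log_info::free_annotate_event()
  {
    if (m_annotate_event)
    {
      delete m_annotate_event;      // the event was not deleted in
      m_annotate_event= NULL;       // exec_relay_log_event(), so delete it here
    }
  }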
The saved Annotate_rows object should be deleted when all corresponding
Rows events have been processed:
int exec_relay_log_event(THD* thd, Relay_log_info* rli)
{ ...
Log_event *ev= next_event(rli);
...
apply_event_and_update_pos(ev, ...);
if (rli->get_annotate_event() && is_last_rows_event(ev))
rli->free_annotate_event();
else if (ev->get_type_code() == ANNOTATE_ROWS_EVENT)
rli->set_annotate_event((Annotate_rows_log_event*) ev);
else if (ev->get_type_code() != FORMAT_DESCRIPTION_EVENT)
delete ev;
...
}
where
bool is_last_rows_event(Log_event* ev)
{
Log_event_type type= ev->get_type_code();
if (IS_ROWS_EVENT_TYPE(type))
{
Rows_log_event* rows= (Rows_log_event*)ev;
return rows->get_flags(Rows_log_event::STMT_END_F);
}
return 0;
}
#define IS_ROWS_EVENT_TYPE(type) ((type) == WRITE_ROWS_EVENT || \
(type) == UPDATE_ROWS_EVENT || \
(type) == DELETE_ROWS_EVENT)
8. General remarks
~~~~~~~~~~~~~~~~~~
Kristian noticed that introducing a new log event type should be coordinated
somehow with MySQL/Sun:
Kristian: The numeric code for this event must be assigned carefully.
It should be coordinated with MySQL/Sun, otherwise we can get into a
situation where MySQL uses the same numeric code for one event that
MariaDB uses for ANNOTATE_ROWS_EVENT, which would make merging the two
impossible.
Alex: I reserved about 20 numbers to avoid possible conflicts
with MySQL.
Kristian: Still, I think it would be appropriate to send a polite email
to internals@lists.mysql.com about this, suggesting that the event number
be reserved.
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v3.5.9)
24 Jun '10
Hi all,
For the next release of 5.2, we wanted to be sure that it comes with the
proper engines. In previous releases it had to be built manually to
include XtraDB, and didn't have FederatedX at all.
I spent some time looking at the plugin loading code, which turned out
to be unnecessary, because a standard build of the zip file now actually
includes both FederatedX and XtraDB.
So the Windows code is good to go :-)
Bo Thorsen.
Monty Program AB.
--
MariaDB: MySQL replacement
Community developed. Feature enhanced. Backward compatible.
[Maria-developers] Updated (by Sanja): Subquery optimization: Avoid recalculating subquery if external fields values found in subquery cache (66)
by worklog-noreply@askmonty.org 23 Jun '10
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: Subquery optimization: Avoid recalculating subquery if external fields
values found in subquery cache
CREATION DATE..: Wed, 25 Nov 2009, 22:25
SUPERVISOR.....: Monty
IMPLEMENTOR....: Sanja
COPIES TO......:
CATEGORY.......: Client-Sprint
TASK ID........: 66 (http://askmonty.org/worklog/?tid=66)
VERSION........: 9.x
STATUS.........: Assigned
PRIORITY.......: 60
WORKED HOURS...: 0
ESTIMATE.......: 0 (hours remain)
ORIG. ESTIMATE.: 0
PROGRESS NOTES:
-=-=(Sanja - Thu, 24 Jun 2010, 06:01)=-=-
Low Level Design modified.
--- /tmp/wklog.66.old.19228 2010-06-24 06:01:35.000000000 +0000
+++ /tmp/wklog.66.new.19228 2010-06-24 06:01:35.000000000 +0000
@@ -1,10 +1,10 @@
-* Target version: base on mysql-5.2 code
+* Target version: base on mysql-5.3 code
All items on which subquery depend could be collected in
-st_select_lex::mark_as_dependent (direct of indirect reference?)
+st_select_lex::register_dependency_item (indirect reference)
Temporary table index should be created by all fields except result field
-(TMP_TABLE_PARAM::keyinfo).
+(TABLE::add_tmp_key).
How to fill the temptable
-------------------------
@@ -49,7 +49,8 @@
if (res || !sjm->in_equality->val_int())
DBUG_RETURN(NESTED_LOOP_NO_MORE_ROWS);
-The code in this WL will use the same approach
+The code in this WL will use the same approach except eqality which will be
+created according to field type (some types do not need it)
Extracting the value of the subquery predicate
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -58,3 +59,11 @@
temporary table's field and then subquery_predicate->val_int() will invoke
$I->val_int(), subquery_predicate->val_str() will invoke $I->val_str() and so
forth.
+
+Caching the subquery in Item tree
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+We use Item::transform to put caching Item (Item_cache_wrapper) before the
+subquery (Item_subquery* or Item_in_optimizer). For this we add new transformer
+method ::cache_insert_transformer.
+
-=-=(Guest - Sun, 13 Jun 2010, 16:51)=-=-
Dependency deleted: 91 no longer depends on 66
-=-=(Igor - Wed, 10 Mar 2010, 21:29)=-=-
High Level Description modified.
--- /tmp/wklog.66.old.32188 2010-03-10 21:29:16.000000000 +0000
+++ /tmp/wklog.66.new.32188 2010-03-10 21:29:16.000000000 +0000
@@ -1,3 +1,10 @@
+The goal of this task is to optimize evaluation of subqueries and subquery
+predicates by storing the results of a correlated subquery together with
+correlation parameters in a cache and reusing those results for the same sets of
+parameters.
+
+Here's what is to be done in this task in more details:
+
Collect all outer items/references (left part of the subquiery and outer
references inside the subquery) in key string. Compare the string (which
represents certain value set of the references) against values in hash table and
-=-=(Igor - Wed, 10 Mar 2010, 21:13)=-=-
Dependency created: 91 now depends on 66
-=-=(Igor - Wed, 10 Mar 2010, 21:12)=-=-
Category updated.
--- /tmp/wklog.66.old.31558 2010-03-10 21:12:50.000000000 +0000
+++ /tmp/wklog.66.new.31558 2010-03-10 21:12:50.000000000 +0000
@@ -1 +1 @@
-Server-BackLog
+Client-Sprint
-=-=(Igor - Wed, 10 Mar 2010, 21:12)=-=-
Version updated.
--- /tmp/wklog.66.old.31558 2010-03-10 21:12:50.000000000 +0000
+++ /tmp/wklog.66.new.31558 2010-03-10 21:12:50.000000000 +0000
@@ -1 +1 @@
-Server-5.3
+9.x
-=-=(Monty - Fri, 29 Jan 2010, 19:07)=-=-
Version updated.
--- /tmp/wklog.66.old.5893 2010-01-29 19:07:10.000000000 +0200
+++ /tmp/wklog.66.new.5893 2010-01-29 19:07:10.000000000 +0200
@@ -1 +1 @@
-Server-5.2
+Server-5.3
-=-=(Psergey - Wed, 20 Jan 2010, 14:50)=-=-
High-Level Specification modified.
--- /tmp/wklog.66.old.26873 2010-01-20 14:50:41.000000000 +0200
+++ /tmp/wklog.66.new.26873 2010-01-20 14:50:41.000000000 +0200
@@ -4,7 +4,6 @@
To check/discuss:
-----------------
-* Do we put subquery cache on all levels of subqueries or on highest level only
* Will there be any means to measure subquery cache hit rate?
* MySQL-6.0 has a one-element predicate result cache. It is called "left
expression cache", grep for left_expr_cache in sql/item_subselect.*
@@ -41,7 +40,12 @@
- subquery_item_result is 'bool' for subquery predicates, and is of
some scalar or ROW(scalar1,...scalarN) type for scalar-context subquery.
-We dont support cases when outer_expr or correlation_references are blobs.
+We don't support cases when outer_expr or correlation_references are blobs.
+
+All subquery predicates are cached. That is, if one subquery predicate is
+located within another, both of them will have caches (one option to reduce
+cache memory usage was to use cache only for the upper-most select. we decided
+against it).
2. Data structure used for the cache
------------------------------------
-=-=(Psergey - Wed, 20 Jan 2010, 13:07)=-=-
High-Level Specification modified.
--- /tmp/wklog.66.old.17649 2010-01-20 13:07:07.000000000 +0200
+++ /tmp/wklog.66.new.17649 2010-01-20 13:07:07.000000000 +0200
@@ -3,7 +3,13 @@
To check/discuss:
- To put subquery cache on all levels of subqueries or on highest level only.
+-----------------
+* Do we put subquery cache on all levels of subqueries or on highest level only
+* Will there be any means to measure subquery cache hit rate?
+* MySQL-6.0 has a one-element predicate result cache. It is called "left
+ expression cache", grep for left_expr_cache in sql/item_subselect.*
+ When this WL is merged with 6.0's optimizations, these two caches will
+ need to be unified somehow.
<contents>
-=-=(Psergey - Mon, 18 Jan 2010, 16:40)=-=-
Low Level Design modified.
--- /tmp/wklog.66.old.24899 2010-01-18 16:40:16.000000000 +0200
+++ /tmp/wklog.66.new.24899 2010-01-18 16:40:16.000000000 +0200
@@ -1,3 +1,5 @@
+* Target version: base on mysql-5.2 code
+
All items on which subquery depend could be collected in
st_select_lex::mark_as_dependent (direct of indirect reference?)
------------------------------------------------------------
-=-=(View All Progress Notes, 19 total)=-=-
http://askmonty.org/worklog/index.pl?tid=66&nolimit=1
DESCRIPTION:
The goal of this task is to optimize evaluation of subqueries and subquery
predicates by storing the results of a correlated subquery together with
correlation parameters in a cache and reusing those results for the same sets of
parameters.
Here's what is to be done in this task, in more detail:
Collect all outer items/references (the left part of the subquery and outer
references inside the subquery) into a key string. Compare the string (which
represents a certain value set of the references) against values in a hash table
and return the cached result of the subquery if this combination of reference
values has already been used.
For example, in the following subquery:
(L1, L2) IN (SELECT A, B FROM T WHERE T.F1>OUTER_FIELD)
the set of references to look up in the subquery cache is (L1, L2, OUTER_FIELD).
The subquery cache should be implemented as a simple LRU attached to the subquery.
The size of the subquery cache (in number of results, or perhaps in amount of
memory used) is limited by a session variable (query parameter?).
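To illustrate the intended behaviour, here is a minimal standalone C++ sketch of
such a cache, keyed on the packed values of the outer references. It is an
illustration only, not server code; SubqueryCache and the string-packed key are
stand-ins for whatever the server-side implementation ends up using.
// Standalone illustration only -- not MariaDB server code.
#include <iostream>
#include <map>
#include <string>
#include <vector>

// One cache per subquery: key = packed values of (L1, L2, OUTER_FIELD, ...),
// value = the subquery (predicate) result computed for those values.
class SubqueryCache {
public:
  // Return true and fill *result if this combination of outer values
  // has been seen before.
  bool lookup(const std::vector<std::string>& outer_values, bool* result) const {
    auto it= cache_.find(outer_values);
    if (it == cache_.end())
      return false;
    *result= it->second;
    return true;
  }
  void store(const std::vector<std::string>& outer_values, bool result) {
    cache_[outer_values]= result;
  }
private:
  std::map<std::vector<std::string>, bool> cache_;
};

int main() {
  SubqueryCache cache;
  std::vector<std::string> key= {"1", "2", "42"};  // values of L1, L2, OUTER_FIELD
  bool res;
  if (!cache.lookup(key, &res)) {  // miss: execute the subquery ...
    res= true;                     // ... pretend it returned TRUE
    cache.store(key, res);
  }
  std::cout << "cached predicate value: " << res << "\n";
  return 0;
}
Note that in this WL the storage behind the cache is planned to be a temporary
table rather than an in-memory map; see the HIGH-LEVEL SPECIFICATION below.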
HIGH-LEVEL SPECIFICATION:
Attach a subquery cache to each Item_subquery. The interface should allow using
either a hash table or a temporary table internally.
To check/discuss:
-----------------
* Will there be any means to measure subquery cache hit rate?
* MySQL-6.0 has a one-element predicate result cache. It is called "left
expression cache", grep for left_expr_cache in sql/item_subselect.*
When this WL is merged with 6.0's optimizations, these two caches will
need to be unified somehow.
<contents>
1. Scope of the task
2. Data structure used for the cache
3. Cache size
4. Interplay with other subquery optimizations
5. User interface
</contents>
1. Scope of the task
--------------------
This WL should handle all subquery predicates, i.e. it should handle these
cases:
outer_expr IN (SELECT correlated_select)
outer_expr $CMP$ ALL/ANY (SELECT correlated_select)
EXISTS (SELECT correlated_select)
scalar-context subquery: (SELECT correlated_select)
The cache will maintain
(outer_expr, correlation_references)-> subquery_item_result
mapping, where
- correlation_references is a list of tablename.column_name entries that are
referred to from the correlated_select, where tablename is a table outside the
subquery.
- subquery_item_result is 'bool' for subquery predicates, and is of
some scalar or ROW(scalar1,...scalarN) type for scalar-context subquery.
We don't support cases when outer_expr or correlation_references are blobs.
All subquery predicates are cached. That is, if one subquery predicate is
located within another, both of them will have caches (one option to reduce
cache memory usage was to use the cache only for the upper-most select; we
decided against it).
2. Data structure used for the cache
------------------------------------
There are two data structures available in the codebase that will allow fast
equality lookups:
1. HASH (mysys/hash.c) tables
2. Temporary tables (the ones that are used for e.g. GROUP BY)
Neither of them has any support for element eviction on overflow (using LRU or
some other policy).
The query cache and MyISAM/Maria's key/page cache do support some eviction
mechanism, but code-wise it is not readily reusable; one would need to factor
it out (or copy it).
We choose to use #2, and not to have any eviction policy. See subsequent
sections for details and reasoning behind the decision.
3. Cache size
-------------
Typically, a cache has some maximum size and a policy which is used to
select a cache entry for removal when the cache becomes full (e.g. find
and remove the least [recently] used entry).
For this WL entry we will use a cache of infinite size. The reasoning behind
this is that:
- it is easy to do: we have temporary tables that can grow to an arbitrarily
large size while still providing the same insert/lookup interface.
- it suits us: unless the subquery is resolved with one index lookup,
hitting the cache would be many times cheaper than re-running the
subquery, so the cache is worth having.
4. Interplay with other subquery optimizations
----------------------------------------------
* This WL entry should not care about IN->EXISTS transformation: caching for
IN subquery and result of its conversion to EXISTS would work in the same
way.
* This optimization is orthogonal to the <=>ANY -> MIN/MAX rewrite (it will
work/be useful irrespective of whether the rewrite has been performed or
not).
* TODO: compare this with materialization for uncorrelated IN-subqueries. Is
this basically the same?
A: no, it is not:
- IN-Materialization has to perform full materialization before it can
do the first subquery evaluation. This WL's code has almost no startup
costs.
- This optimization has temp.table of (corr_reference, predicate_value),
while IN-materialization will have (corr_reference) only.
5. User interface
-----------------
* There will be an @@optimizer_switch flag to turn this optimization on and
off (TODO: name of the flag?)
* TODO: how do we show this in EXPLAIN [EXTENDED]? The easiest is to
print something in the warning text of EXPLAIN EXTENDED that would indicate
use of the cache.
* temporary table sizing (max size for heap table, whether to use MyISAM or
Maria) will be controlled with common temp.table control variables.
LOW-LEVEL DESIGN:
* Target version: base on mysql-5.3 code
All items on which the subquery depends could be collected in
st_select_lex::register_dependency_item (indirect reference).
The temporary table index should be created over all fields except the result
field (TABLE::add_tmp_key).
How to fill the temptable
-------------------------
We can reuse the approach from SJ-Materialization. Its code is in
end_sj_materialize() and is supposed to be quite trivial.
How to make lookups into temptable
----------------------------------
We'll reuse the approach used by SJ-Materialization in 6.0.
Setup process
~~~~~~~~~~~~~
Setup is performed in the same way as in setup_sj_materialization();
see the code that starts with these lines:
/*
Create/initialize everything we will need to index lookups into the
temptable.
*/
and ends at this line:
Remove the injected semi-join IN-equalities from join_tab conds. This
<questionable>
We'll also need to check equalities, i.e. do an equivalent of this:
if (!(sjm->in_equality= create_subq_in_equalities(thd, sjm,
emb_sj_nest->sj_subq_pred)))
DBUG_RETURN(TRUE); /* purecov: inspected */
Question: or perhaps that is not necessary?
</questionable>
Doing the lookup
~~~~~~~~~~~~~~~~
SJ-Materialization does lookup in sub_select_sjm(), with this code:
/* Do index lookup in the materialized table */
if ((res= join_read_key2(join_tab, sjm->table, sjm->tab_ref)) == 1)
DBUG_RETURN(NESTED_LOOP_ERROR); /* purecov: inspected */
if (res || !sjm->in_equality->val_int())
DBUG_RETURN(NESTED_LOOP_NO_MORE_ROWS);
The code in this WL will use the same approach, except for the equality, which
will be created according to the field type (some types do not need it).
Extracting the value of the subquery predicate
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The goal of making the lookup is to get the value of the subquery predicate.
This is done by creating an Item_field $I which refers to the appropriate
temporary table field; then subquery_predicate->val_int() will invoke
$I->val_int(), subquery_predicate->val_str() will invoke $I->val_str(), and so
forth.
Caching the subquery in Item tree
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
We use Item::transform to put a caching Item (Item_cache_wrapper) in front of the
subquery (Item_subquery* or Item_in_optimizer). For this we add a new transformer
method ::cache_insert_transformer.
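The following toy C++ sketch illustrates the transform idea only. Node,
SubqueryNode and CacheWrapperNode are simplified stand-ins for the real Item,
Item_subselect and Item_cache_wrapper classes, and the function-pointer
signature is not the server's actual Item::transform interface.
// Toy illustration of the transform idea -- not MariaDB server code.
#include <iostream>

struct Node {
  virtual ~Node() {}
  // Let the transformer replace this node (leaf nodes only in this toy).
  virtual Node* transform(Node* (*transformer)(Node*)) { return transformer(this); }
  virtual void print() const= 0;
};

struct SubqueryNode : Node {
  void print() const override { std::cout << "(SELECT ...)"; }
};

struct CacheWrapperNode : Node {
  explicit CacheWrapperNode(Node* inner) : inner(inner) {}
  Node* inner;
  void print() const override {
    std::cout << "cache[";
    inner->print();
    std::cout << "]";
  }
};

// Analogue of ::cache_insert_transformer: wrap every subquery node.
Node* cache_insert_transformer(Node* item) {
  if (dynamic_cast<SubqueryNode*>(item))
    return new CacheWrapperNode(item);
  return item;
}

int main() {
  Node* expr= new SubqueryNode;  // pretend this sits in a WHERE clause
  expr= expr->transform(cache_insert_transformer);
  expr->print();                 // prints: cache[(SELECT ...)]
  std::cout << "\n";
  delete expr;                   // (the wrapped node is leaked; fine for a demo)
  return 0;
}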
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v3.5.9)
[Maria-developers] Updated (by Igor): index_merge: non-ROR intersection (21)
by worklog-noreply@askmonty.org 23 Jun '10
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: index_merge: non-ROR intersection
CREATION DATE..: Thu, 21 May 2009, 21:32
SUPERVISOR.....: Monty
IMPLEMENTOR....: Igor
COPIES TO......: Rhuddleston, Sanja, Knielsen, Serg, Monty, Timour, Igor, Psergey
CATEGORY.......: Server-Sprint
TASK ID........: 21 (http://askmonty.org/worklog/?tid=21)
VERSION........: Server-5.3
STATUS.........: Assigned
PRIORITY.......: 60
WORKED HOURS...: 25
ESTIMATE.......: 175 (hours remain)
ORIG. ESTIMATE.: 175
PROGRESS NOTES:
-=-=(Igor - Thu, 24 Jun 2010, 05:49)=-=-
Version updated.
--- /tmp/wklog.21.old.18774 2010-06-24 05:49:41.000000000 +0000
+++ /tmp/wklog.21.new.18774 2010-06-24 05:49:41.000000000 +0000
@@ -1 +1 @@
-Server-9.x
+Server-5.3
-=-=(Igor - Thu, 24 Jun 2010, 05:49)=-=-
Supervisor updated.
--- /tmp/wklog.21.old.18770 2010-06-24 05:49:14.000000000 +0000
+++ /tmp/wklog.21.new.18770 2010-06-24 05:49:14.000000000 +0000
@@ -1 +1 @@
-Knielsen
+Monty
-=-=(Igor - Thu, 24 Jun 2010, 05:48)=-=-
Category updated.
--- /tmp/wklog.21.old.18765 2010-06-24 05:48:53.000000000 +0000
+++ /tmp/wklog.21.new.18765 2010-06-24 05:48:53.000000000 +0000
@@ -1 +1 @@
-Server-RawIdeaBin
+Server-Sprint
-=-=(Igor - Thu, 24 Jun 2010, 05:48)=-=-
Status updated.
--- /tmp/wklog.21.old.18761 2010-06-24 05:48:43.000000000 +0000
+++ /tmp/wklog.21.new.18761 2010-06-24 05:48:43.000000000 +0000
@@ -1 +1 @@
-Un-Assigned
+Assigned
-=-=(Igor - Thu, 24 Jun 2010, 05:48)=-=-
Observers changed: Knielsen,Monty,Psergey,Sanja,Igor,Rhuddleston,Timour,Serg
-=-=(Guest - Thu, 24 Jun 2010, 05:44)=-=-
I spent 25 hours in June 2010 performing the following work for this task.
1. Compared three possible algorithms for implementing the index intersection operation
mentioned in the HLS by their labor/time consumption. Chose the algorithm that uses a
modified Unique class (1.3) as the cheapest, i.e. requiring the least effort/time for its
development.
2. Developed a design for a modification of the Unique class to support the operation of
index intersection.
3. Modified the merge_buffers procedure used by the Unique class so that it can be used not
only for the union operation, but for the intersect operation as well.
Worked 25 hours and estimate 175 hours remain (original estimate increased by 200 hours).
-=-=(Guest - Mon, 20 Jul 2009, 17:13)=-=-
Dependency deleted: 30 no longer depends on 21
-=-=(Psergey - Wed, 03 Jun 2009, 12:09)=-=-
Dependency created: 30 now depends on 21
-=-=(Guest - Wed, 03 Jun 2009, 01:17)=-=-
High Level Description modified.
--- /tmp/wklog.21.old.30002 2009-06-03 01:17:32.000000000 +0300
+++ /tmp/wklog.21.new.30002 2009-06-03 01:17:32.000000000 +0300
@@ -7,13 +7,13 @@
The current optimization works with:
-WHERE key1_part1=1 AND key1_part2=2 OR key2_part1=3
+WHERE key1_part1=1 AND key1_part2=2 AND key2_part1=3
but not with:
-WHERE key1_part1=1 OR key2_part1=3
+WHERE key1_part1=1 AND key2_part1=3
or
-WHERE key_part1<10 or key2_part1<100
+WHERE key_part1<10 AND key2_part1<100
This WL entry is to lift this limitation by developing algorithms that do
intersection on non-ROR (rowid ordered retrieval) scans.
-=-=(Monty - Wed, 03 Jun 2009, 01:06)=-=-
High Level Description modified.
--- /tmp/wklog.21.old.29694 2009-06-03 01:06:50.000000000 +0300
+++ /tmp/wklog.21.new.29694 2009-06-03 01:06:50.000000000 +0300
@@ -12,6 +12,8 @@
but not with:
WHERE key1_part1=1 OR key2_part1=3
+or
+WHERE key_part1<10 or key2_part1<100
This WL entry is to lift this limitation by developing algorithms that do
intersection on non-ROR (rowid ordered retrieval) scans.
------------------------------------------------------------
-=-=(View All Progress Notes, 13 total)=-=-
http://askmonty.org/worklog/index.pl?tid=21&nolimit=1
DESCRIPTION:
At the moment index_merge supports intersection only for rowid-ordered streams.
This translates into a limitation that index_merge/intersect can only be
constructed for equality conditions (t.keypart1=const1 AND t.keypart2=const2
AND ... ) and the equalities should cover all index components.
For example, assuming that key1 has 2 parts and key2 has 1 part.
The current optimization works with:
WHERE key1_part1=1 AND key1_part2=2 AND key2_part1=3
but not with:
WHERE key1_part1=1 AND key2_part1=3
or
WHERE key_part1<10 AND key2_part1<100
This WL entry is to lift this limitation by developing algorithms that do
intersection on non-ROR (rowid ordered retrieval) scans.
HIGH-LEVEL SPECIFICATION:
<contents>
1. Execution
1.1 Temptable
1.1.1 Improvement
1.2 Produce/merge sorted streams
1.3 Extend Unique class to handle intersection
1.4 Strategies that do not seem to be useful
1.4.1 Remove matches after having produced an ordered stream
1.4.2 Sparse rowid bitmaps
2. Optimization
</contents>
1. Execution
============
The primary task is to find means to compute an intersection of N unordered
streams. Besides general memory/cpu cost of computation, we consider:
- whether the produced rowid stream is ordered. If it is, it can be piped
into index_merge/intersect (as opposed to sort-intersect)
- whether the strategy can take advantage of the fact that some input streams
are already rowid-ordered
- startup cost (cost of producing the first output record)
We see the following possible strategies:
1.1 Temptable
-------------
[This is our strategy of choice at the moment.]
Use a temporary heap-grow-out-to-MyISAM table with a primary key:
create table temp_table (
rowid binary($rowid_size),
count int,
primary key (rowid)
);
Then use this algorithm:
i1= {index with the least E(#records)};
for each record R in range_scan(i1)
temp_table.insert(R.rowid, count=1);
for each index idx except i1
{
for each record R in scan(idx) // (INNER-LOOP)
{
if (temp_table has R)
temptable[R].count++;
}
}
// The following loop can do an ordered or unordered scan;
// if we want an ordered scan, we should probably arrange for the
// 'count' column to be part of the index.
for each record R in temp_table
{
if (R.count == number_of_streams)
emit(R.rowid);
}
The algorithm has an option to emit an ordered rowid stream.
In the above form, the cost to produce the first record is high. It's easy to
adjust the algorithm to make it low: we just need to start scanning all
indexes at once and finish as soon as we get a full match, i.e. as soon as the
temptable[R].count++
operation results in the counter being equal to the number of merged scans.
1.1.1 Improvement
~~~~~~~~~~~~~~~~~
When running INNER-LOOP, we could count how many times we've done the
"count++" operation. If it has been done #records-in-temptable times, that
means that all further records will not have matches and we can finish the
scan, i.e. break out of the INNER-LOOP.
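As an illustration, here is a self-contained C++ sketch of the algorithm above
together with the early-exit improvement from 1.1.1. It is only a model: an
in-memory hash map stands in for the heap/MyISAM temporary table, and rowids
are plain strings.
// Standalone sketch of the temptable algorithm from 1.1 plus 1.1.1.
#include <iostream>
#include <string>
#include <unordered_map>
#include <vector>

typedef std::string Rowid;  // real rowids are engine-specific byte strings

std::vector<Rowid> intersect(const std::vector<std::vector<Rowid>>& scans) {
  std::vector<Rowid> result;
  if (scans.empty())
    return result;

  // Seed from the scan expected to return the fewest records (scans[0] here).
  std::unordered_map<Rowid, size_t> temp_table;
  for (const Rowid& r : scans[0])
    temp_table[r]= 1;

  for (size_t i= 1; i < scans.size(); i++) {
    size_t matches= 0;                      // counts "count++" ops in this scan
    for (const Rowid& r : scans[i]) {       // INNER-LOOP
      auto it= temp_table.find(r);
      // Only count a match if the row also matched all previous scans;
      // this also guards against duplicate rowids within one scan.
      if (it != temp_table.end() && it->second == i) {
        it->second++;
        // 1.1.1: every entry has already matched this scan, so the rest
        // of the scan cannot change the outcome.
        if (++matches == temp_table.size())
          break;
      }
    }
  }

  for (const auto& entry : temp_table)
    if (entry.second == scans.size())       // matched in every stream
      result.push_back(entry.first);
  return result;
}

int main() {
  std::vector<std::vector<Rowid>> scans=
    {{"r1", "r2", "r3"}, {"r2", "r3", "r5"}, {"r3", "r2"}};
  for (const Rowid& r : intersect(scans))
    std::cout << r << "\n";                 // prints r2 and r3 (in some order)
  return 0;
}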
1.2 Produce/merge sorted streams
--------------------------------
For each of the merged scans, use a filesort-like action to end up with an
ordered stream of rowids. Then merge the ordered streams.
By a filesort-like action we mean:
- Run over the index, collecting rowids in a buffer.
- When the buffer is full, sort it and dump it into a temporary file.
After the above we'll end up with a number of sorted buffers on disk. We can
use the mergebuff() function (part of filesort's functions) to produce one
ordered sequence (i.e. an array, which may be partially on disk) of rowids.
Merging of ordered streams with the help of a priority queue is already
implemented in QUICK_ROR_INTERSECT_SELECT. We'll need to substitute the
child_quick->get_next()
call with a call that reads a rowid from an ordered sequence.
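A self-contained C++ sketch of this approach: an in-memory sort stands in for
the filesort/mergebuff() step, and the priority-queue merge stands in for the
QUICK_ROR_INTERSECT_SELECT machinery. It assumes each scan delivers a given
rowid at most once.
// Standalone sketch of 1.2: sort each scan's rowids, then k-way merge.
#include <algorithm>
#include <functional>
#include <iostream>
#include <queue>
#include <string>
#include <utility>
#include <vector>

typedef std::string Rowid;

std::vector<Rowid> sort_merge_intersect(std::vector<std::vector<Rowid>> streams) {
  for (auto& s : streams)                   // the "filesort-like action"
    std::sort(s.begin(), s.end());

  // Min-heap of (current rowid, stream index).
  typedef std::pair<Rowid, size_t> HeapEntry;
  std::priority_queue<HeapEntry, std::vector<HeapEntry>,
                      std::greater<HeapEntry>> heap;
  std::vector<size_t> pos(streams.size(), 0);
  for (size_t i= 0; i < streams.size(); i++)
    if (!streams[i].empty())
      heap.push(HeapEntry(streams[i][0], i));

  std::vector<Rowid> result;
  Rowid current;
  size_t seen_in= 0;
  while (!heap.empty()) {
    HeapEntry top= heap.top();
    heap.pop();
    if (seen_in == 0 || top.first != current) {  // a new rowid group starts
      current= top.first;
      seen_in= 0;
    }
    if (++seen_in == streams.size())             // present in every stream
      result.push_back(current);
    size_t i= top.second;
    if (++pos[i] < streams[i].size())            // advance stream i
      heap.push(HeapEntry(streams[i][pos[i]], i));
  }
  return result;
}

int main() {
  std::vector<std::vector<Rowid>> streams=
    {{"r3", "r1", "r2"}, {"r2", "r3"}, {"r3", "r2", "r4"}};
  for (const Rowid& r : sort_merge_intersect(streams))
    std::cout << r << "\n";                      // prints r2 then r3
  return 0;
}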
1.3 Extend Unique class to handle intersection
----------------------------------------------
There is no point in using a Unique object as a device that accumulates the
rowids of a single scan and then produces them in sorted order; one could do
the same faster by accumulating an array of rowids and then sorting it.
It is possible, though, to use a Unique object to collect/merge data from all scans.
The idea is as follows:
- Unique should store <rowid, n_scans> pairs
- Duplicates are pairs with the same rowid
- Unique should try to avoid creating duplicates:
- don't add a duplicate into the in-memory part; instead, combine the two
elements by adding their n_scans counters.
- combine duplicates when it sees them in the Unique.get() call
- The data we get from Unique.get() should be filtered: all records that have
n_scans != number_of_scans_being_merged should be discarded.
If we're lucky enough to have started and finished a scan on some index (denote it
as S) without flushing the Unique in the process, then:
- there is no point in adding any new records to the Unique, because their
absence from the Unique means that they have no match in S and hence will
not get into the result of the intersection.
- we only need to update the counters to be able to tell whether the elements
that are already in the Unique will have matches in all scans.
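A simplified, standalone stand-in for the modified Unique sketched above:
duplicates are combined by summing their n_scans counters, and get() keeps only
rowids seen in every merged scan. The real Unique's disk spilling and buffer
merging are ignored here; all class and method names are illustrative.
// Simplified stand-in for the modified Unique -- not server code.
#include <iostream>
#include <map>
#include <string>
#include <vector>

typedef std::string Rowid;

class UniqueIntersect {
public:
  explicit UniqueIntersect(size_t n_scans_total) : n_scans_total_(n_scans_total) {}

  // Insert into the "in-memory part": duplicates are combined right away
  // by adding their n_scans counters.
  void add(const Rowid& rowid) { counts_[rowid]++; }

  // Analogue of Unique.get(): produce the filtered, ordered result.
  std::vector<Rowid> get() const {
    std::vector<Rowid> result;
    for (const auto& entry : counts_)
      if (entry.second == n_scans_total_)  // drop n_scans != number of scans
        result.push_back(entry.first);
    return result;                         // std::map keeps rowids ordered
  }

private:
  size_t n_scans_total_;
  std::map<Rowid, size_t> counts_;
};

int main() {
  UniqueIntersect u(2);          // merging two index scans
  u.add("r1"); u.add("r2");      // rowids from scan 1
  u.add("r2"); u.add("r3");      // rowids from scan 2
  for (const Rowid& r : u.get())
    std::cout << r << "\n";      // prints r2
  return 0;
}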
1.4 Strategies that do not seem to be useful
--------------------------------------------
We keep them here so we don't consider them over and over.
1.4.1 Remove matches after having produced an ordered stream
------------------------------------------------------------
We can dump everything into a rowid stream and get it sorted. Then we read it,
and if we see a rowid repeated $n_merged_scans times, it belongs to the
intersection (pass to output), otherwise it doesn't (skip).
This doesn't have any advantages over the produce/merge sorted streams
approach.
1.4.2 Sparse rowid bitmaps
--------------------------
Use Falcon-style rowid bitmaps. The problem with that is that Falcon's
bitmaps assume there will always be enough memory to accommodate them.
PostgreSQL makes bitmaps "loose" when they exceed a certain size by remembering
disk pages rather than ids of individual records. It's hard for us to do
something similar because our rowids are opaque entities whose meaning depends
on the storage engine.
This seems to require too much change to be worth it.
2. Optimization
===============
SEL_TREE objects already represent intersections. The problems with
optimizations are:
- Cost formula(s)
- When N keys/conditions are present:
"cond(key1) AND cond(key2) AND ... AND cond(keyN)",
somehow avoid considering (2^n - n) possible options.
- Avoid producing (or even considering) apparently suboptimal plans:
= Don't generate a merge of indexes (I_1, ... I_n) where columns of I_n are
a subset of columns covered by all other indexes.
= (TODO any other rules?)
- Correlation across selectivities. If there is a condition
"cond(key1) AND cond(key2) AND ... AND cond(keyN)",
can we consider satisfaction of AND-parts to be independent?
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v3.5.9)
[Maria-developers] Updated (by Igor): index_merge: non-ROR intersection (21)
by worklog-noreply@askmonty.org 23 Jun '10
by worklog-noreply@askmonty.org 23 Jun '10
23 Jun '10
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: index_merge: non-ROR intersection
CREATION DATE..: Thu, 21 May 2009, 21:32
SUPERVISOR.....: Monty
IMPLEMENTOR....: Igor
COPIES TO......: Rhuddleston, Sanja, Knielsen, Serg, Monty, Timour, Igor, Psergey
CATEGORY.......: Server-Sprint
TASK ID........: 21 (http://askmonty.org/worklog/?tid=21)
VERSION........: Server-5.3
STATUS.........: Assigned
PRIORITY.......: 60
WORKED HOURS...: 25
ESTIMATE.......: 175 (hours remain)
ORIG. ESTIMATE.: 175
PROGRESS NOTES:
-=-=(Igor - Thu, 24 Jun 2010, 05:49)=-=-
Version updated.
--- /tmp/wklog.21.old.18774 2010-06-24 05:49:41.000000000 +0000
+++ /tmp/wklog.21.new.18774 2010-06-24 05:49:41.000000000 +0000
@@ -1 +1 @@
-Server-9.x
+Server-5.3
-=-=(Igor - Thu, 24 Jun 2010, 05:49)=-=-
Supervisor updated.
--- /tmp/wklog.21.old.18770 2010-06-24 05:49:14.000000000 +0000
+++ /tmp/wklog.21.new.18770 2010-06-24 05:49:14.000000000 +0000
@@ -1 +1 @@
-Knielsen
+Monty
-=-=(Igor - Thu, 24 Jun 2010, 05:48)=-=-
Category updated.
--- /tmp/wklog.21.old.18765 2010-06-24 05:48:53.000000000 +0000
+++ /tmp/wklog.21.new.18765 2010-06-24 05:48:53.000000000 +0000
@@ -1 +1 @@
-Server-RawIdeaBin
+Server-Sprint
-=-=(Igor - Thu, 24 Jun 2010, 05:48)=-=-
Status updated.
--- /tmp/wklog.21.old.18761 2010-06-24 05:48:43.000000000 +0000
+++ /tmp/wklog.21.new.18761 2010-06-24 05:48:43.000000000 +0000
@@ -1 +1 @@
-Un-Assigned
+Assigned
-=-=(Igor - Thu, 24 Jun 2010, 05:48)=-=-
Observers changed: Knielsen,Monty,Psergey,Sanja,Igor,Rhuddleston,Timour,Serg
-=-=(Guest - Thu, 24 Jun 2010, 05:44)=-=-
I spent 25 hours in the month June 2010 to perform the following work for this task.
1. Compared tree possible algorithms to implement the operation of index intersection mentioned in
HLS by their labor/time consumption. Chose the algorithm that uses a modified Unique class (1.3) as
the most cheap requiring the least amount of efforts/time for its development.
2. Developed a design for a modification of the Unique class to support the operation of index
intersection.
3. Modified the merge_buffers procedure used by the Unique class to make it possible to use it not
only for the the operation of union, but for the operation of intersect as well.
Worked 25 hours and estimate 175 hours remain (original estimate increased by 200 hours).
-=-=(Guest - Mon, 20 Jul 2009, 17:13)=-=-
Dependency deleted: 30 no longer depends on 21
-=-=(Psergey - Wed, 03 Jun 2009, 12:09)=-=-
Dependency created: 30 now depends on 21
-=-=(Guest - Wed, 03 Jun 2009, 01:17)=-=-
High Level Description modified.
--- /tmp/wklog.21.old.30002 2009-06-03 01:17:32.000000000 +0300
+++ /tmp/wklog.21.new.30002 2009-06-03 01:17:32.000000000 +0300
@@ -7,13 +7,13 @@
The current optimization works with:
-WHERE key1_part1=1 AND key1_part2=2 OR key2_part1=3
+WHERE key1_part1=1 AND key1_part2=2 AND key2_part1=3
but not with:
-WHERE key1_part1=1 OR key2_part1=3
+WHERE key1_part1=1 AND key2_part1=3
or
-WHERE key_part1<10 or key2_part1<100
+WHERE key_part1<10 AND key2_part1<100
This WL entry is to lift this limitation by developing algorithms that do
intersection on non-ROR (rowid ordered retrieval) scans.
-=-=(Monty - Wed, 03 Jun 2009, 01:06)=-=-
High Level Description modified.
--- /tmp/wklog.21.old.29694 2009-06-03 01:06:50.000000000 +0300
+++ /tmp/wklog.21.new.29694 2009-06-03 01:06:50.000000000 +0300
@@ -12,6 +12,8 @@
but not with:
WHERE key1_part1=1 OR key2_part1=3
+or
+WHERE key_part1<10 or key2_part1<100
This WL entry is to lift this limitation by developing algorithms that do
intersection on non-ROR (rowid ordered retrieval) scans.
------------------------------------------------------------
-=-=(View All Progress Notes, 13 total)=-=-
http://askmonty.org/worklog/index.pl?tid=21&nolimit=1
DESCRIPTION:
At the moment index_merge supports intersection only for rowid-ordered streams.
This translates into a limitation that index_merge/intersect can only be
constructed for equality conditions (t.keypart1=const1 AND t.keypart2=const2
AND ... ) and the equalities should cover all index components.
For example, assuming that key1 has 2 parts and key2 has 1 part.
The current optimization works with:
WHERE key1_part1=1 AND key1_part2=2 AND key2_part1=3
but not with:
WHERE key1_part1=1 AND key2_part1=3
or
WHERE key_part1<10 AND key2_part1<100
This WL entry is to lift this limitation by developing algorithms that do
intersection on non-ROR (rowid ordered retrieval) scans.
HIGH-LEVEL SPECIFICATION:
<contents>
1. Execution
1.1 Temptable
1.1.1 Improvement
1.2 Produce/merge sorted streams
1.3 Extend Unique class to handle intersection
1.4 Strategies that do not seem to be useful
1.4.1 Remove matches after having produced an ordered stream
1.4.2 Sparse rowid bitmaps
2. Optimization
</contents>
1. Execution
============
The primary task is to find means to compute an intersection of N unordered
streams. Besides general memory/cpu cost of computation, we consider:
- whether the produced rowid stream is ordered. If it is, it can be piped
into index_merge/intersect (as opposed to sort-intersect)
- whether the strategy can take advantage of the fact that some input streams
are already rowid-ordered
- startup cost (cost of producing the first output record)
We see the following possible strategies:
1.1 Temptable
-------------
[ This is our strategy of choice at the moment]
Use a temporary heap-grow-out-to-myisam table with a primary key:
create table temp_table (
rowid binary($rowid_size),
count n,
primary key(rowid);
);
Then use this algorithm:
i1= {index with the least E(#records)};
for each record R in range_scan(i1)
temp_table.insert(R.rowid, count=1);
for each index idx except i1
{
for each R record in scan(idx) // (INNER-LOOP)
{
if (temp_table has R)
temptable[R].count++;
}
}
// The following loop can do ordered or unordered scan
// if we want it to be ordered scan, we probably better arrange so that
// 'count' column is part of the index.
for each record R in temp_table
{
if (R.count == number_of_streams)
emit(R.rowid);
}
The algorithm has an option to emit an ordered rowid stream.
In the above form, the cost to produce the first record is high. It's easy to
adjust the algorithm to make it low - we'll need to just start scanning all
indexes at once, and finish as soon as we got a full match, i.e. the
temptable[R].count++
operation resulted in the counter being equal to the number of merged scans.
1.1.1 Improvement
~~~~~~~~~~~~~~~~~
When running INNER-LOOP, we could count how many times we've done the
"count++" operation. If it has been done #records-in-temptable times, that
means that all further records will not have matches and we can finish the
scan, i.e. break out of the INNER-LOOP.
1.2 Produce/merge sorted streams
--------------------------------
For each of the merged scan, use filesort-like action to end up with an
ordered stream of rowids. Then merge the ordered streams.
By filesort-like action we mean
- Run over index, collect rowids in a buffer.
- When the buffer is full, sort it and dump into a temporary file.
After the above we'll end up with a number of sorted buffers on disk. We can
use mergebuff() function (it is part of filesort's functions) to produce one
ordered sequence (i.e. array, which may be partially on disk) of rowids.
Merging of ordered streams with help of priority queue is already implemented
in QUICK_ROR_INTERSECT_SELECT. We'll need to substitute the
child_quick->get_next()
call with a call to read rowid from an ordered sequence.
1.3 Extend Unique class to handle intersection
----------------------------------------------
There is no point to use Unique object as a device that accumulates rowids of
a single scan then produces them in sorted order. One could do the same faster
with accumulating an array of rowids and then sorting it.
It's possible to use Unique object to collect/merge data from all scans though.
The idea is as follows:
- Unique should store <rowid, n_scans> pairs
- Duplicates are pairs with the same rowid
- Unique should try to avoid creating duplicates:
- don't add a duplicate into the in-memory part, instead combine two elements
together by adding their n_scans elements.
- combine duplicates when it sees them in Unique.get() call
- The data we get from Unique.get() should be filtered, all records that have
n_scans != number_of_scans_being_merged should be discarded.
If we're lucky to have started and finished a scan on some index (denote it
as S) without flushing the Unique in the process, then:
- there is no point in adding any new records into the Unique because their
absence in the Unique means that they don't have match in S and hence will
not get into the result of intersection.
- we need to only update the counters to be able to tell if the elements that
are already in the Unique will have matches in all scans.
1.4 Strategies that do not seem to be useful
--------------------------------------------
keeping them here so we don't consider them over and over
1.4.1 Remove matches after having produced an ordered stream
------------------------------------------------------------
We can dump everything into a rowid stream and get it sorted. Then we read it,
and if we see a rowid repeated $n_merged_scans times, it belongs to the
intersection (pass to output), otherwise it doesn't (skip).
This doesn't have any advantages over the produce/merge sorted streams
approach.
1.4.2 Sparse rowid bitmaps
--------------------------
Use Falcon-style rowid bitmaps. The problem with that is that Falcon's
bitmaps assume there will always be enough memory to accommodate them.
PostgreSQL makes bitmaps "loose" when they exceed certain size by remembering
disk pages, not ids of individual records. It's hard for us to do something
similar because our rowids are opaque entities whose meaning depends on the
storage engines.
This seems to require too much change to be worth it.
2. Optimization
===============
SEL_TREE objects already represent intersections. The problems with
optimizations are:
- Cost formula(s)
- When N keys/conditions are present:
"cond(key1) AND cond(key2) AND ... AND cond(keyN)",
somehow avoid considering (2^n - n) possible options.
- Avoid producing (or even considering) apparently suboptimal plans:
= Don't generate a merge of indexes (I_1, ... I_n) where columns of I_n are
a subset of columns covered by all other indexes.
= (TODO any other rules?)
- Correlation across selectivities. If there is a condition
"cond(key1) AND cond(key2) AND ... AND cond(keyN)",
can we consider satisfaction of AND-parts to be independent?
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v3.5.9)
1
0
[Maria-developers] Updated (by Igor): index_merge: non-ROR intersection (21)
by worklog-noreply@askmonty.org 23 Jun '10
by worklog-noreply@askmonty.org 23 Jun '10
23 Jun '10
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: index_merge: non-ROR intersection
CREATION DATE..: Thu, 21 May 2009, 21:32
SUPERVISOR.....: Monty
IMPLEMENTOR....: Igor
COPIES TO......: Rhuddleston, Sanja, Knielsen, Serg, Monty, Timour, Igor, Psergey
CATEGORY.......: Server-Sprint
TASK ID........: 21 (http://askmonty.org/worklog/?tid=21)
VERSION........: Server-5.3
STATUS.........: Assigned
PRIORITY.......: 60
WORKED HOURS...: 25
ESTIMATE.......: 175 (hours remain)
ORIG. ESTIMATE.: 175
PROGRESS NOTES:
-=-=(Igor - Thu, 24 Jun 2010, 05:49)=-=-
Version updated.
--- /tmp/wklog.21.old.18774 2010-06-24 05:49:41.000000000 +0000
+++ /tmp/wklog.21.new.18774 2010-06-24 05:49:41.000000000 +0000
@@ -1 +1 @@
-Server-9.x
+Server-5.3
-=-=(Igor - Thu, 24 Jun 2010, 05:49)=-=-
Supervisor updated.
--- /tmp/wklog.21.old.18770 2010-06-24 05:49:14.000000000 +0000
+++ /tmp/wklog.21.new.18770 2010-06-24 05:49:14.000000000 +0000
@@ -1 +1 @@
-Knielsen
+Monty
-=-=(Igor - Thu, 24 Jun 2010, 05:48)=-=-
Category updated.
--- /tmp/wklog.21.old.18765 2010-06-24 05:48:53.000000000 +0000
+++ /tmp/wklog.21.new.18765 2010-06-24 05:48:53.000000000 +0000
@@ -1 +1 @@
-Server-RawIdeaBin
+Server-Sprint
-=-=(Igor - Thu, 24 Jun 2010, 05:48)=-=-
Status updated.
--- /tmp/wklog.21.old.18761 2010-06-24 05:48:43.000000000 +0000
+++ /tmp/wklog.21.new.18761 2010-06-24 05:48:43.000000000 +0000
@@ -1 +1 @@
-Un-Assigned
+Assigned
-=-=(Igor - Thu, 24 Jun 2010, 05:48)=-=-
Observers changed: Knielsen,Monty,Psergey,Sanja,Igor,Rhuddleston,Timour,Serg
-=-=(Guest - Thu, 24 Jun 2010, 05:44)=-=-
I spent 25 hours in the month June 2010 to perform the following work for this task.
1. Compared tree possible algorithms to implement the operation of index intersection mentioned in
HLS by their labor/time consumption. Chose the algorithm that uses a modified Unique class (1.3) as
the most cheap requiring the least amount of efforts/time for its development.
2. Developed a design for a modification of the Unique class to support the operation of index
intersection.
3. Modified the merge_buffers procedure used by the Unique class to make it possible to use it not
only for the the operation of union, but for the operation of intersect as well.
Worked 25 hours and estimate 175 hours remain (original estimate increased by 200 hours).
-=-=(Guest - Mon, 20 Jul 2009, 17:13)=-=-
Dependency deleted: 30 no longer depends on 21
-=-=(Psergey - Wed, 03 Jun 2009, 12:09)=-=-
Dependency created: 30 now depends on 21
-=-=(Guest - Wed, 03 Jun 2009, 01:17)=-=-
High Level Description modified.
--- /tmp/wklog.21.old.30002 2009-06-03 01:17:32.000000000 +0300
+++ /tmp/wklog.21.new.30002 2009-06-03 01:17:32.000000000 +0300
@@ -7,13 +7,13 @@
The current optimization works with:
-WHERE key1_part1=1 AND key1_part2=2 OR key2_part1=3
+WHERE key1_part1=1 AND key1_part2=2 AND key2_part1=3
but not with:
-WHERE key1_part1=1 OR key2_part1=3
+WHERE key1_part1=1 AND key2_part1=3
or
-WHERE key_part1<10 or key2_part1<100
+WHERE key_part1<10 AND key2_part1<100
This WL entry is to lift this limitation by developing algorithms that do
intersection on non-ROR (rowid ordered retrieval) scans.
-=-=(Monty - Wed, 03 Jun 2009, 01:06)=-=-
High Level Description modified.
--- /tmp/wklog.21.old.29694 2009-06-03 01:06:50.000000000 +0300
+++ /tmp/wklog.21.new.29694 2009-06-03 01:06:50.000000000 +0300
@@ -12,6 +12,8 @@
but not with:
WHERE key1_part1=1 OR key2_part1=3
+or
+WHERE key_part1<10 or key2_part1<100
This WL entry is to lift this limitation by developing algorithms that do
intersection on non-ROR (rowid ordered retrieval) scans.
------------------------------------------------------------
-=-=(View All Progress Notes, 13 total)=-=-
http://askmonty.org/worklog/index.pl?tid=21&nolimit=1
DESCRIPTION:
At the moment index_merge supports intersection only for rowid-ordered streams.
This translates into a limitation that index_merge/intersect can only be
constructed for equality conditions (t.keypart1=const1 AND t.keypart2=const2
AND ... ) and the equalities should cover all index components.
For example, assuming that key1 has 2 parts and key2 has 1 part.
The current optimization works with:
WHERE key1_part1=1 AND key1_part2=2 AND key2_part1=3
but not with:
WHERE key1_part1=1 AND key2_part1=3
or
WHERE key_part1<10 AND key2_part1<100
This WL entry is to lift this limitation by developing algorithms that do
intersection on non-ROR (rowid ordered retrieval) scans.
HIGH-LEVEL SPECIFICATION:
<contents>
1. Execution
1.1 Temptable
1.1.1 Improvement
1.2 Produce/merge sorted streams
1.3 Extend Unique class to handle intersection
1.4 Strategies that do not seem to be useful
1.4.1 Remove matches after having produced an ordered stream
1.4.2 Sparse rowid bitmaps
2. Optimization
</contents>
1. Execution
============
The primary task is to find means to compute an intersection of N unordered
streams. Besides general memory/cpu cost of computation, we consider:
- whether the produced rowid stream is ordered. If it is, it can be piped
into index_merge/intersect (as opposed to sort-intersect)
- whether the strategy can take advantage of the fact that some input streams
are already rowid-ordered
- startup cost (cost of producing the first output record)
We see the following possible strategies:
1.1 Temptable
-------------
[ This is our strategy of choice at the moment]
Use a temporary heap-grow-out-to-myisam table with a primary key:
create table temp_table (
rowid binary($rowid_size),
count n,
primary key(rowid);
);
Then use this algorithm:
i1= {index with the least E(#records)};
for each record R in range_scan(i1)
temp_table.insert(R.rowid, count=1);
for each index idx except i1
{
for each R record in scan(idx) // (INNER-LOOP)
{
if (temp_table has R)
temptable[R].count++;
}
}
// The following loop can do ordered or unordered scan
// if we want it to be ordered scan, we probably better arrange so that
// 'count' column is part of the index.
for each record R in temp_table
{
if (R.count == number_of_streams)
emit(R.rowid);
}
The algorithm has an option to emit an ordered rowid stream.
In the above form, the cost to produce the first record is high. It's easy to
adjust the algorithm to make it low - we'll need to just start scanning all
indexes at once, and finish as soon as we got a full match, i.e. the
temptable[R].count++
operation resulted in the counter being equal to the number of merged scans.
1.1.1 Improvement
~~~~~~~~~~~~~~~~~
When running INNER-LOOP, we could count how many times we've done the
"count++" operation. If it has been done #records-in-temptable times, that
means that all further records will not have matches and we can finish the
scan, i.e. break out of the INNER-LOOP.
1.2 Produce/merge sorted streams
--------------------------------
For each of the merged scan, use filesort-like action to end up with an
ordered stream of rowids. Then merge the ordered streams.
By filesort-like action we mean
- Run over index, collect rowids in a buffer.
- When the buffer is full, sort it and dump into a temporary file.
After the above we'll end up with a number of sorted buffers on disk. We can
use mergebuff() function (it is part of filesort's functions) to produce one
ordered sequence (i.e. array, which may be partially on disk) of rowids.
Merging of ordered streams with help of priority queue is already implemented
in QUICK_ROR_INTERSECT_SELECT. We'll need to substitute the
child_quick->get_next()
call with a call to read rowid from an ordered sequence.
1.3 Extend Unique class to handle intersection
----------------------------------------------
There is no point to use Unique object as a device that accumulates rowids of
a single scan then produces them in sorted order. One could do the same faster
with accumulating an array of rowids and then sorting it.
It's possible to use Unique object to collect/merge data from all scans though.
The idea is as follows:
- Unique should store <rowid, n_scans> pairs
- Duplicates are pairs with the same rowid
- Unique should try to avoid creating duplicates:
- don't add a duplicate into the in-memory part, instead combine two elements
together by adding their n_scans elements.
- combine duplicates when it sees them in Unique.get() call
- The data we get from Unique.get() should be filtered, all records that have
n_scans != number_of_scans_being_merged should be discarded.
If we're lucky to have started and finished a scan on some index (denote it
as S) without flushing the Unique in the process, then:
- there is no point in adding any new records into the Unique because their
absence in the Unique means that they don't have match in S and hence will
not get into the result of intersection.
- we need to only update the counters to be able to tell if the elements that
are already in the Unique will have matches in all scans.
1.4 Strategies that do not seem to be useful
--------------------------------------------
keeping them here so we don't consider them over and over
1.4.1 Remove matches after having produced an ordered stream
------------------------------------------------------------
We can dump everything into a rowid stream and get it sorted. Then we read it,
and if we see a rowid repeated $n_merged_scans times, it belongs to the
intersection (pass to output), otherwise it doesn't (skip).
This doesn't have any advantages over the produce/merge sorted streams
approach.
1.4.2 Sparse rowid bitmaps
--------------------------
Use Falcon-style rowid bitmaps. The problem with that is that Falcon's
bitmaps assume there will always be enough memory to accommodate them.
PostgreSQL makes bitmaps "loose" when they exceed certain size by remembering
disk pages, not ids of individual records. It's hard for us to do something
similar because our rowids are opaque entities whose meaning depends on the
storage engines.
This seems to require too much change to be worth it.
2. Optimization
===============
SEL_TREE objects already represent intersections. The problems with
optimizations are:
- Cost formula(s)
- When N keys/conditions are present:
"cond(key1) AND cond(key2) AND ... AND cond(keyN)",
somehow avoid considering (2^n - n) possible options.
- Avoid producing (or even considering) apparently suboptimal plans:
= Don't generate a merge of indexes (I_1, ... I_n) where columns of I_n are
a subset of columns covered by all other indexes.
= (TODO any other rules?)
- Correlation across selectivities. If there is a condition
"cond(key1) AND cond(key2) AND ... AND cond(keyN)",
can we consider satisfaction of AND-parts to be independent?
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v3.5.9)
[Maria-developers] Updated (by Igor): index_merge: non-ROR intersection (21)
by worklog-noreply@askmonty.org 23 Jun '10
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: index_merge: non-ROR intersection
CREATION DATE..: Thu, 21 May 2009, 21:32
SUPERVISOR.....: Monty
IMPLEMENTOR....: Igor
COPIES TO......: Rhuddleston, Sanja, Knielsen, Serg, Monty, Timour, Igor, Psergey
CATEGORY.......: Server-Sprint
TASK ID........: 21 (http://askmonty.org/worklog/?tid=21)
VERSION........: Server-5.3
STATUS.........: Assigned
PRIORITY.......: 60
WORKED HOURS...: 25
ESTIMATE.......: 175 (hours remain)
ORIG. ESTIMATE.: 175
PROGRESS NOTES:
-=-=(Igor - Thu, 24 Jun 2010, 05:49)=-=-
Version updated.
--- /tmp/wklog.21.old.18774 2010-06-24 05:49:41.000000000 +0000
+++ /tmp/wklog.21.new.18774 2010-06-24 05:49:41.000000000 +0000
@@ -1 +1 @@
-Server-9.x
+Server-5.3
-=-=(Igor - Thu, 24 Jun 2010, 05:49)=-=-
Supervisor updated.
--- /tmp/wklog.21.old.18770 2010-06-24 05:49:14.000000000 +0000
+++ /tmp/wklog.21.new.18770 2010-06-24 05:49:14.000000000 +0000
@@ -1 +1 @@
-Knielsen
+Monty
-=-=(Igor - Thu, 24 Jun 2010, 05:48)=-=-
Category updated.
--- /tmp/wklog.21.old.18765 2010-06-24 05:48:53.000000000 +0000
+++ /tmp/wklog.21.new.18765 2010-06-24 05:48:53.000000000 +0000
@@ -1 +1 @@
-Server-RawIdeaBin
+Server-Sprint
-=-=(Igor - Thu, 24 Jun 2010, 05:48)=-=-
Status updated.
--- /tmp/wklog.21.old.18761 2010-06-24 05:48:43.000000000 +0000
+++ /tmp/wklog.21.new.18761 2010-06-24 05:48:43.000000000 +0000
@@ -1 +1 @@
-Un-Assigned
+Assigned
-=-=(Igor - Thu, 24 Jun 2010, 05:48)=-=-
Observers changed: Knielsen,Monty,Psergey,Sanja,Igor,Rhuddleston,Timour,Serg
-=-=(Guest - Thu, 24 Jun 2010, 05:44)=-=-
I spent 25 hours in the month June 2010 to perform the following work for this task.
1. Compared three possible algorithms to implement the operation of index intersection mentioned in
the HLS by their labor/time consumption. Chose the algorithm that uses a modified Unique class (1.3) as
the cheapest, requiring the least amount of effort/time for its development.
2. Developed a design for a modification of the Unique class to support the operation of index
intersection.
3. Modified the merge_buffers procedure used by the Unique class to make it possible to use it not
only for the operation of union, but for the operation of intersection as well.
Worked 25 hours and estimate 175 hours remain (original estimate increased by 200 hours).
-=-=(Guest - Mon, 20 Jul 2009, 17:13)=-=-
Dependency deleted: 30 no longer depends on 21
-=-=(Psergey - Wed, 03 Jun 2009, 12:09)=-=-
Dependency created: 30 now depends on 21
-=-=(Guest - Wed, 03 Jun 2009, 01:17)=-=-
High Level Description modified.
--- /tmp/wklog.21.old.30002 2009-06-03 01:17:32.000000000 +0300
+++ /tmp/wklog.21.new.30002 2009-06-03 01:17:32.000000000 +0300
@@ -7,13 +7,13 @@
The current optimization works with:
-WHERE key1_part1=1 AND key1_part2=2 OR key2_part1=3
+WHERE key1_part1=1 AND key1_part2=2 AND key2_part1=3
but not with:
-WHERE key1_part1=1 OR key2_part1=3
+WHERE key1_part1=1 AND key2_part1=3
or
-WHERE key_part1<10 or key2_part1<100
+WHERE key_part1<10 AND key2_part1<100
This WL entry is to lift this limitation by developing algorithms that do
intersection on non-ROR (rowid ordered retrieval) scans.
-=-=(Monty - Wed, 03 Jun 2009, 01:06)=-=-
High Level Description modified.
--- /tmp/wklog.21.old.29694 2009-06-03 01:06:50.000000000 +0300
+++ /tmp/wklog.21.new.29694 2009-06-03 01:06:50.000000000 +0300
@@ -12,6 +12,8 @@
but not with:
WHERE key1_part1=1 OR key2_part1=3
+or
+WHERE key_part1<10 or key2_part1<100
This WL entry is to lift this limitation by developing algorithms that do
intersection on non-ROR (rowid ordered retrieval) scans.
------------------------------------------------------------
-=-=(View All Progress Notes, 13 total)=-=-
http://askmonty.org/worklog/index.pl?tid=21&nolimit=1
DESCRIPTION:
At the moment index_merge supports intersection only for rowid-ordered streams.
This translates into a limitation that index_merge/intersect can only be
constructed for equality conditions (t.keypart1=const1 AND t.keypart2=const2
AND ...) and the equalities must cover all index components.
For example, assume that key1 has 2 parts and key2 has 1 part.
The current optimization works with:
WHERE key1_part1=1 AND key1_part2=2 AND key2_part1=3
but not with:
WHERE key1_part1=1 AND key2_part1=3
or
WHERE key_part1<10 AND key2_part1<100
This WL entry is to lift this limitation by developing algorithms that do
intersection on non-ROR (rowid ordered retrieval) scans.
HIGH-LEVEL SPECIFICATION:
<contents>
1. Execution
1.1 Temptable
1.1.1 Improvement
1.2 Produce/merge sorted streams
1.3 Extend Unique class to handle intersection
1.4 Strategies that do not seem to be useful
1.4.1 Remove matches after having produced an ordered stream
1.4.2 Sparse rowid bitmaps
2. Optimization
</contents>
1. Execution
============
The primary task is to find a means to compute an intersection of N unordered
streams. Besides the general memory/CPU cost of the computation, we consider:
- whether the produced rowid stream is ordered. If it is, it can be piped
  into index_merge/intersect (as opposed to sort-intersect)
- whether the strategy can take advantage of the fact that some input streams
  are already rowid-ordered
- startup cost (the cost of producing the first output record)
We see the following possible strategies:
1.1 Temptable
-------------
[This is our strategy of choice at the moment.]
Use a temporary heap-grow-out-to-MyISAM table with a primary key:
  create table temp_table (
    rowid binary($rowid_size),
    count int,
    primary key (rowid)
  );
Then use this algorithm:
  i1= {index with the least E(#records)};
  for each record R in range_scan(i1)
    temp_table.insert(R.rowid, count=1);
  for each index idx except i1
  {
    for each record R in scan(idx)          // (INNER-LOOP)
    {
      if (temp_table has R)
        temptable[R].count++;
    }
  }
  // The following loop can do an ordered or unordered scan;
  // if we want an ordered scan, we should probably arrange for the
  // 'count' column to be part of the index.
  for each record R in temp_table
  {
    if (R.count == number_of_streams)
      emit(R.rowid);
  }
The algorithm has an option to emit an ordered rowid stream.
In the above form, the cost to produce the first record is high. It's easy to
adjust the algorithm to make it low: we just need to start scanning all
indexes at once and finish as soon as we get a full match, i.e. as soon as the
  temptable[R].count++
operation results in the counter being equal to the number of merged scans.
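To make the flow concrete, here is a minimal, self-contained C++ sketch of
the temptable strategy (an assumption-laden model, not server code:
std::unordered_map stands in for the heap-grow-out-to-MyISAM table, and each
scan is passed in as a plain vector of rowid strings):

  #include <string>
  #include <unordered_map>
  #include <vector>

  // scans[0] must be the scan on the index with the least E(#records).
  std::vector<std::string> temptable_intersect(
      const std::vector<std::vector<std::string>> &scans)
  {
    if (scans.empty())
      return {};

    std::unordered_map<std::string, size_t> counts;  // stand-in for temp_table

    // Seed the table from the first (smallest) scan with count=1.
    for (const std::string &rowid : scans[0])
      counts.emplace(rowid, 1);

    // For every other scan, only bump counters of rowids already present.
    for (size_t i = 1; i < scans.size(); i++)
      for (const std::string &rowid : scans[i])
      {
        auto it = counts.find(rowid);
        if (it != counts.end())
          it->second++;                               // temptable[R].count++
      }

    // Emit the rowids matched by every scan (unordered here; ordering would
    // need the 'count'-column-in-index trick mentioned above).
    std::vector<std::string> result;
    for (const auto &entry : counts)
      if (entry.second == scans.size())
        result.push_back(entry.first);
    return result;
  }

The real implementation would read rowids through the handler interface and
let the temporary table spill to MyISAM when it no longer fits in memory.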
1.1.1 Improvement
~~~~~~~~~~~~~~~~~
When running the INNER-LOOP, we can count how many times we've done the
"count++" operation. Once it has been done #records-in-temptable times, all
further records cannot have matches, so we can finish the scan, i.e. break
out of the INNER-LOOP.
1.2 Produce/merge sorted streams
--------------------------------
For each of the merged scans, use a filesort-like action to end up with an
ordered stream of rowids, then merge the ordered streams.
By a filesort-like action we mean:
- Run over the index, collect rowids in a buffer.
- When the buffer is full, sort it and dump it into a temporary file.
After the above we'll end up with a number of sorted buffers on disk. We can
use the mergebuff() function (part of filesort's functions) to produce one
ordered sequence (i.e. an array, which may be partially on disk) of rowids.
Merging of ordered streams with the help of a priority queue is already
implemented in QUICK_ROR_INTERSECT_SELECT. We'll need to substitute the
child_quick->get_next()
call with a call that reads a rowid from the ordered sequence.
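A rough sketch of that merge step (not the QUICK_ROR_INTERSECT_SELECT code;
the sorted runs are modeled here as in-memory vectors rather than on-disk
buffers): the ordered streams are merged through a priority queue, and a
rowid is emitted once it has been seen in every stream.

  #include <queue>
  #include <string>
  #include <utility>
  #include <vector>

  // 'streams' holds one already-sorted rowid sequence per merged scan.
  std::vector<std::string> merge_sorted_intersect(
      const std::vector<std::vector<std::string>> &streams)
  {
    using Elem = std::pair<std::string, size_t>;       // <rowid, stream index>
    auto cmp = [](const Elem &a, const Elem &b) { return a.first > b.first; };
    std::priority_queue<Elem, std::vector<Elem>, decltype(cmp)> pq(cmp);

    std::vector<size_t> pos(streams.size(), 0);
    for (size_t i = 0; i < streams.size(); i++)
      if (!streams[i].empty())
        pq.push({streams[i][0], i});

    std::vector<std::string> result;
    std::string current;
    size_t seen = 0;
    while (!pq.empty())
    {
      Elem top = pq.top();
      pq.pop();
      if (seen == 0 || top.first != current)           // start a new rowid run
      {
        current = top.first;
        seen = 0;
      }
      if (++seen == streams.size())                    // present in every stream
        result.push_back(current);
      if (++pos[top.second] < streams[top.second].size())
        pq.push({streams[top.second][pos[top.second]], top.second});
    }
    return result;
  }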
1.3 Extend Unique class to handle intersection
----------------------------------------------
There is no point in using a Unique object as a device that accumulates the
rowids of a single scan and then produces them in sorted order; one could do
the same faster by accumulating an array of rowids and then sorting it.
It is possible, though, to use a Unique object to collect and merge data from
all scans.
The idea is as follows:
- Unique should store <rowid, n_scans> pairs
- Duplicates are pairs with the same rowid
- Unique should try to avoid creating duplicates:
  - don't add a duplicate into the in-memory part; instead, combine the two
    elements by adding their n_scans values
  - combine duplicates when it sees them in the Unique.get() call
- The data we get from Unique.get() should be filtered: all records that have
  n_scans != number_of_scans_being_merged should be discarded.
If we're lucky enough to have started and finished a scan on some index
(denote it as S) without flushing the Unique in the process, then:
- there is no point in adding any new records into the Unique, because their
  absence from the Unique means they have no match in S and hence will not
  get into the result of the intersection;
- we only need to update the counters to be able to tell whether the elements
  that are already in the Unique will have matches in all scans.
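A toy model of this idea (a sketch under assumptions, not the actual Unique
class: everything is kept in memory and flushing/merging of disk runs is
ignored) might look like this:

  #include <map>
  #include <string>
  #include <vector>

  class IntersectUnique
  {
    std::map<std::string, unsigned> tree_;   // in-memory <rowid, n_scans> tree
  public:
    // Combine duplicates on insertion instead of storing them twice.
    void unique_add(const std::string &rowid, unsigned n_scans = 1)
    {
      tree_[rowid] += n_scans;
    }

    // Filter on get(): only rowids whose counter equals the number of
    // merged scans can be in the intersection.
    std::vector<std::string> get(unsigned n_scans_merged) const
    {
      std::vector<std::string> out;
      for (const auto &p : tree_)
        if (p.second == n_scans_merged)
          out.push_back(p.first);            // already produced in rowid order
      return out;
    }
  };

The on-disk part would presumably be handled by the modified merge_buffers
procedure mentioned in the progress notes.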
1.4 Strategies that do not seem to be useful
--------------------------------------------
We keep these here so that we don't consider them over and over.
1.4.1 Remove matches after having produced an ordered stream
------------------------------------------------------------
We can dump everything into a rowid stream and get it sorted. Then we read it,
and if we see a rowid repeated $n_merged_scans times, it belongs to the
intersection (pass to output), otherwise it doesn't (skip).
This doesn't have any advantages over the produce/merge sorted streams
approach.
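For reference, a short sketch of what this would amount to (the fully sorted
combined stream is modeled as a plain vector, and $n_merged_scans becomes an
ordinary parameter):

  #include <string>
  #include <vector>

  // Pass a rowid to the output only if it repeats n_merged_scans times in
  // the sorted stream; otherwise skip it.
  std::vector<std::string> filter_sorted_stream(
      const std::vector<std::string> &sorted_rowids, size_t n_merged_scans)
  {
    std::vector<std::string> out;
    size_t run = 0;
    for (size_t i = 0; i < sorted_rowids.size(); i++)
    {
      run = (i > 0 && sorted_rowids[i] == sorted_rowids[i - 1]) ? run + 1 : 1;
      if (run == n_merged_scans)
        out.push_back(sorted_rowids[i]);     // belongs to the intersection
    }
    return out;
  }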
1.4.2 Sparse rowid bitmaps
--------------------------
Use Falcon-style rowid bitmaps. The problem with that is that Falcon's
bitmaps assume there will always be enough memory to accommodate them.
PostgreSQL makes bitmaps "lossy" when they exceed a certain size by
remembering disk pages rather than the ids of individual records. It's hard
for us to do something similar because our rowids are opaque entities whose
meaning depends on the storage engine.
This seems to require too much change to be worth it.
2. Optimization
===============
SEL_TREE objects already represent intersections. The problems with
optimizations are:
- Cost formula(s)
- When N keys/conditions are present:
  "cond(key1) AND cond(key2) AND ... AND cond(keyN)",
  somehow avoid considering all (2^N - N) possible options.
- Avoid producing (or even considering) apparently suboptimal plans:
= Don't generate a merge of indexes (I_1, ... I_n) where columns of I_n are
a subset of columns covered by all other indexes.
= (TODO any other rules?)
- Correlation across selectivities. If there is a condition
"cond(key1) AND cond(key2) AND ... AND cond(keyN)",
can we consider satisfaction of AND-parts to be independent?
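As a toy illustration of the search-space and pruning points above (a sketch
under assumptions, not optimizer code: each index is represented only by its
set of column names), the function below enumerates the combinations of two
or more indexes and drops those where some index adds no columns beyond what
the other indexes in the combination already cover:

  #include <algorithm>
  #include <cstddef>
  #include <set>
  #include <string>
  #include <vector>

  using Columns = std::set<std::string>;

  std::vector<std::vector<size_t>> candidate_merges(
      const std::vector<Columns> &index_columns)   // columns of each index
  {
    const size_t n = index_columns.size();
    std::vector<std::vector<size_t>> candidates;

    for (size_t mask = 1; mask < (size_t(1) << n); mask++)
    {
      std::vector<size_t> members;
      for (size_t i = 0; i < n; i++)
        if (mask & (size_t(1) << i))
          members.push_back(i);
      if (members.size() < 2)                      // single-index scans are not merges
        continue;

      // Pruning rule: skip the merge if some index's columns are a subset
      // of the columns covered by the remaining indexes.
      bool redundant = false;
      for (size_t i : members)
      {
        Columns others;
        for (size_t j : members)
          if (j != i)
            others.insert(index_columns[j].begin(), index_columns[j].end());
        if (std::includes(others.begin(), others.end(),
                          index_columns[i].begin(), index_columns[i].end()))
        {
          redundant = true;
          break;
        }
      }
      if (!redundant)
        candidates.push_back(members);
    }
    return candidates;
  }

With N candidate indexes this loop still visits 2^N subsets, which is why
the optimizer would need heuristics rather than exhaustive enumeration.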
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v3.5.9)