[Maria-developers] Updated (by Igor): index_merge: non-ROR intersection (21)
by worklog-noreply@askmonty.org 24 Jun '10
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: index_merge: non-ROR intersection
CREATION DATE..: Thu, 21 May 2009, 21:32
SUPERVISOR.....: Knielsen
IMPLEMENTOR....:
COPIES TO......: Rhuddleston, Sanja, Knielsen, Serg, Monty, Timour, Igor, Psergey
CATEGORY.......: Server-RawIdeaBin
TASK ID........: 21 (http://askmonty.org/worklog/?tid=21)
VERSION........: Server-9.x
STATUS.........: Un-Assigned
PRIORITY.......: 60
WORKED HOURS...: 25
ESTIMATE.......: 175 (hours remain)
ORIG. ESTIMATE.: 175
PROGRESS NOTES:
-=-=(Igor - Thu, 24 Jun 2010, 05:48)=-=-
Observers changed: Knielsen,Monty,Psergey,Sanja,Igor,Rhuddleston,Timour,Serg
-=-=(Guest - Thu, 24 Jun 2010, 05:44)=-=-
I spent 25 hours in June 2010 performing the following work for this task.
1. Compared three possible algorithms for implementing the index intersection operation mentioned in
the HLS by their labor/time cost. Chose the algorithm that uses a modified Unique class (1.3) as
the cheapest one, requiring the least development effort/time.
2. Developed a design for a modification of the Unique class to support the index intersection
operation.
3. Modified the merge_buffers procedure used by the Unique class so that it can be used not
only for the union operation but for the intersect operation as well.
Worked 25 hours and estimate that 175 hours remain (original estimate increased by 200 hours).
-=-=(Guest - Mon, 20 Jul 2009, 17:13)=-=-
Dependency deleted: 30 no longer depends on 21
-=-=(Psergey - Wed, 03 Jun 2009, 12:09)=-=-
Dependency created: 30 now depends on 21
-=-=(Guest - Wed, 03 Jun 2009, 01:17)=-=-
High Level Description modified.
--- /tmp/wklog.21.old.30002 2009-06-03 01:17:32.000000000 +0300
+++ /tmp/wklog.21.new.30002 2009-06-03 01:17:32.000000000 +0300
@@ -7,13 +7,13 @@
The current optimization works with:
-WHERE key1_part1=1 AND key1_part2=2 OR key2_part1=3
+WHERE key1_part1=1 AND key1_part2=2 AND key2_part1=3
but not with:
-WHERE key1_part1=1 OR key2_part1=3
+WHERE key1_part1=1 AND key2_part1=3
or
-WHERE key_part1<10 or key2_part1<100
+WHERE key_part1<10 AND key2_part1<100
This WL entry is to lift this limitation by developing algorithms that do
intersection on non-ROR (rowid ordered retrieval) scans.
-=-=(Monty - Wed, 03 Jun 2009, 01:06)=-=-
High Level Description modified.
--- /tmp/wklog.21.old.29694 2009-06-03 01:06:50.000000000 +0300
+++ /tmp/wklog.21.new.29694 2009-06-03 01:06:50.000000000 +0300
@@ -12,6 +12,8 @@
but not with:
WHERE key1_part1=1 OR key2_part1=3
+or
+WHERE key_part1<10 or key2_part1<100
This WL entry is to lift this limitation by developing algorithms that do
intersection on non-ROR (rowid ordered retrieval) scans.
-=-=(Monty - Wed, 03 Jun 2009, 01:05)=-=-
High Level Description modified.
--- /tmp/wklog.21.old.29638 2009-06-03 01:05:01.000000000 +0300
+++ /tmp/wklog.21.new.29638 2009-06-03 01:05:01.000000000 +0300
@@ -3,5 +3,15 @@
constructed for equality conditions (t.keypart1=const1 AND t.keypart2=const2
AND ... ) and the equalities should cover all index components.
+For example, assuming that key1 has 2 parts and key2 has 1 part.
+
+The current optimization works with:
+
+WHERE key1_part1=1 AND key1_part2=2 OR key2_part1=3
+
+but not with:
+
+WHERE key1_part1=1 OR key2_part1=3
+
This WL entry is to lift this limitation by developing algorithms that do
-intersection on non-ROR scans.
+intersection on non-ROR (rowid ordered retrieval) scans.
-=-=(Guest - Tue, 26 May 2009, 14:04)=-=-
High-Level Specification modified.
--- /tmp/wklog.21.old.1802 2009-05-26 14:04:57.000000000 +0300
+++ /tmp/wklog.21.new.1802 2009-05-26 14:04:57.000000000 +0300
@@ -1,4 +1,3 @@
-
<contents>
1. Execution
1.1 Temptable
@@ -30,6 +29,8 @@
1.1 Temptable
-------------
+[ This is our strategy of choice at the moment]
+
Use a temporary heap-grow-out-to-myisam table with a primary key:
create table temp_table (
@@ -168,3 +169,8 @@
a subset of columns covered by all other indexes.
= (TODO any other rules?)
+- Correlation across selectivities. If there is a condition
+
+ "cond(key1) AND cond(key2) AND ... AND cond(keyN)",
+
+ can we consider satisfaction of AND-parts to be independent?
-=-=(Psergey - Thu, 21 May 2009, 21:33)=-=-
High-Level Specification modified.
--- /tmp/wklog.21.old.25705 2009-05-21 21:33:02.000000000 +0300
+++ /tmp/wklog.21.new.25705 2009-05-21 21:33:02.000000000 +0300
@@ -1 +1,170 @@
+<contents>
+1. Execution
+1.1 Temptable
+1.1.1 Improvement
+1.2 Produce/merge sorted streams
+1.3 Extend Unique class to handle intersection
+1.4 Strategies that do not seem to be useful
+1.4.1 Remove matches after having produced an ordered stream
+1.4.2 Sparse rowid bitmaps
+2. Optimization
+
+</contents>
+
+1. Execution
+============
+
+The primary task is to find means to compute an intersection of N unordered
+streams. Besides general memory/cpu cost of computation, we consider:
+
+- whether the produced rowid stream is ordered. If it is, it can be piped
+ into index_merge/intersect (as opposed to sort-intersect)
+
+- whether the strategy can take advantage of the fact that some input streams
+ are already rowid-ordered
+
+- startup cost (cost of producing the first output record)
+
+We see the following possible strategies:
+
+1.1 Temptable
+-------------
+Use a temporary heap-grow-out-to-myisam table with a primary key:
+
+create table temp_table (
+ rowid binary($rowid_size),
+ count n,
+ primary key(rowid);
+);
+
+Then use this algorithm:
+
+ i1= {index with the least E(#records)};
+
+ for each record R in range_scan(i1)
+ temp_table.insert(R.rowid, count=1);
+
+ for each index idx except i1
+ {
+ for each R record in scan(idx) // (INNER-LOOP)
+ {
+ if (temp_table has R)
+ temptable[R].count++;
+ }
+ }
+
+ // The following loop can do ordered or unordered scan
+ // if we want it to be ordered scan, we probably better arrange so that
+ // 'count' column is part of the index.
+ for each record R in temp_table
+ {
+ if (R.count == number_of_streams)
+ emit(R.rowid);
+ }
+
+The algorithm has an option to emit an ordered rowid stream.
+
+In the above form, the cost to produce the first record is high. It's easy to
+adjust the algorithm to make it low - we'll need to just start scanning all
+indexes at once, and finish as soon as we got a full match, i.e. the
+
+ temptable[R].count++
+
+operation resulted in the counter being equal to the number of merged scans.
+
+1.1.1 Improvement
+~~~~~~~~~~~~~~~~~
+When running INNER-LOOP, we could count how many times we've done the
+"count++" operation. If it has been done #records-in-temptable times, that
+means that all further records will not have matches and we can finish the
+scan, i.e. break out of the INNER-LOOP.
+
+1.2 Produce/merge sorted streams
+--------------------------------
+For each of the merged scan, use filesort-like action to end up with an
+ordered stream of rowids. Then merge the ordered streams.
+
+By filesort-like action we mean
+ - Run over index, collect rowids in a buffer.
+ - When the buffer is full, sort it and dump into a temporary file.
+After the above we'll end up with a number of sorted buffers on disk. We can
+use mergebuff() function (it is part of filesort's functions) to produce one
+ordered sequence (i.e. array, which may be partially on disk) of rowids.
+
+Merging of ordered streams with help of priority queue is already implemented
+in QUICK_ROR_INTERSECT_SELECT. We'll need to substitute the
+
+ child_quick->get_next()
+
+call with a call to read rowid from an ordered sequence.
+
+1.3 Extend Unique class to handle intersection
+----------------------------------------------
+There is no point to use Unique object as a device that accumulates rowids of
+a single scan then produces them in sorted order. One could do the same faster
+with accumulating an array of rowids and then sorting it.
+
+It's possible to use Unique object to collect/merge data from all scans though.
+The idea is as follows:
+
+- Unique should store <rowid, n_scans> pairs
+- Duplicates are pairs with the same rowid
+- Unique should try to avoid creating duplicates:
+ - don't add a duplicate into the in-memory part, instead combine two elements
+ together by adding their n_scans elements.
+ - combine duplicates when it sees them in Unique.get() call
+- The data we get from Unique.get() should be filtered, all records that have
+ n_scans != number_of_scans_being_merged should be discarded.
+
+If we're lucky to have started and finished a scan on some index (denote it
+as S) without flushing the Unique in the process, then:
+- there is no point in adding any new records into the Unique because their
+ absence in the Unique means that they don't have match in S and hence will
+ not get into the result of intersection.
+- we need to only update the counters to be able to tell if the elements that
+ are already in the Unique will have matches in all scans.
+
+1.4 Strategies that do not seem to be useful
+--------------------------------------------
+
+keeping them here so we don't consider them over and over
+
+1.4.1 Remove matches after having produced an ordered stream
+------------------------------------------------------------
+We can dump everything into a rowid stream and get it sorted. Then we read it,
+and if we see a rowid repeated $n_merged_scans times, it belongs to the
+intersection (pass to output), otherwise it doesn't (skip).
+This doesn't have any advantages over the produce/merge sorted streams
+approach.
+
+1.4.2 Sparse rowid bitmaps
+--------------------------
+Use Falcon-style rowid bitmaps. The problem with that is that Falcon's
+bitmaps assume there will always be enough memory to accommodate them.
+
+PostgreSQL makes bitmaps "loose" when they exceed certain size by remembering
+disk pages, not ids of individual records. It's hard for us to do something
+similar because our rowids are opaque entities whose meaning depends on the
+storage engines.
+
+This seems to require too much change to be worth it.
+
+2. Optimization
+===============
+
+SEL_TREE objects already represent intersections. The problems with
+optimizations are:
+
+- Cost formula(s)
+- When N keys/conditions are present:
+
+ "cond(key1) AND cond(key2) AND ... AND cond(keyN)",
+
+ somehow avoid considering (2^n - n) possible options.
+
+- Avoid producing (or even considering) apparently suboptimal plans:
+ = Don't generate a merge of indexes (I_1, ... I_n) where columns of I_n are
+ a subset of columns covered by all other indexes.
+ = (TODO any other rules?)
+
DESCRIPTION:
At the moment index_merge supports intersection only for rowid-ordered streams.
This translates into a limitation that index_merge/intersect can only be
constructed for equality conditions (t.keypart1=const1 AND t.keypart2=const2
AND ...), and the equalities must cover all index components.
For example, assume that key1 has 2 parts and key2 has 1 part.
The current optimization works with:
WHERE key1_part1=1 AND key1_part2=2 AND key2_part1=3
but not with:
WHERE key1_part1=1 AND key2_part1=3
or
WHERE key_part1<10 AND key2_part1<100
This WL entry is to lift this limitation by developing algorithms that do
intersection on non-ROR (rowid ordered retrieval) scans.
HIGH-LEVEL SPECIFICATION:
<contents>
1. Execution
1.1 Temptable
1.1.1 Improvement
1.2 Produce/merge sorted streams
1.3 Extend Unique class to handle intersection
1.4 Strategies that do not seem to be useful
1.4.1 Remove matches after having produced an ordered stream
1.4.2 Sparse rowid bitmaps
2. Optimization
</contents>
1. Execution
============
The primary task is to find a means of computing the intersection of N unordered
streams. Besides the general memory/CPU cost of the computation, we consider:
- whether the produced rowid stream is ordered. If it is, it can be piped
  into index_merge/intersect (as opposed to sort-intersect);
- whether the strategy can take advantage of the fact that some input streams
  are already rowid-ordered;
- startup cost (the cost of producing the first output record).
We see the following possible strategies:
1.1 Temptable
-------------
[ This is our strategy of choice at the moment]
Use a temporary heap-grow-out-to-myisam table with a primary key:
create table temp_table (
  rowid binary($rowid_size),
  count n,
  primary key(rowid)
);
Then use this algorithm:

  i1= {index with the least E(#records)};

  for each record R in range_scan(i1)
    temp_table.insert(R.rowid, count=1);

  for each index idx except i1
  {
    for each record R in scan(idx)  // (INNER-LOOP)
    {
      if (temp_table has R)
        temptable[R].count++;
    }
  }

  // The following loop can do an ordered or an unordered scan;
  // if we want an ordered scan, we should probably arrange for the
  // 'count' column to be part of the index.
  for each record R in temp_table
  {
    if (R.count == number_of_streams)
      emit(R.rowid);
  }
The algorithm has an option to emit an ordered rowid stream.

In the above form, the cost to produce the first record is high. It is easy to
adjust the algorithm to make it low: start scanning all indexes at once, and
finish as soon as we get a full match, i.e. as soon as a

  temptable[R].count++

operation leaves the counter equal to the number of merged scans.
1.1.1 Improvement
~~~~~~~~~~~~~~~~~
When running the INNER-LOOP, we could count how many times we have done the
"count++" operation. Once it has been done #records-in-temptable times, every
row of the temp table has already been matched by the current scan, so no
further record of the scan can produce a match and we can finish the scan,
i.e. break out of the INNER-LOOP.
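
For illustration, here is a minimal C++ sketch of the temptable strategy,
including the early-termination check of 1.1.1. It is only a model: the names
(rowid_t, intersect_via_temptable) are invented, std::unordered_map stands in
for the heap-grow-out-to-myisam table, and each scan is modelled as a vector
of distinct rowids.

  #include <cstdint>
  #include <unordered_map>
  #include <vector>

  using rowid_t = uint64_t;  // stand-in for the engine's opaque rowid

  // scans[0] should be the scan with the least E(#records).
  std::vector<rowid_t> intersect_via_temptable(
      const std::vector<std::vector<rowid_t>>& scans)
  {
    if (scans.empty()) return {};

    // temp_table(rowid primary key, count): insert the first scan, count=1.
    std::unordered_map<rowid_t, size_t> temp_table;
    for (rowid_t r : scans[0])
      temp_table.emplace(r, 1);

    for (size_t i = 1; i < scans.size(); ++i)
    {
      size_t matches = 0;                    // count++ ops done in this scan
      for (rowid_t r : scans[i])             // INNER-LOOP
      {
        auto it = temp_table.find(r);
        if (it == temp_table.end())
          continue;
        ++it->second;
        if (++matches == temp_table.size())  // 1.1.1: nothing left to match
          break;
      }
    }

    // Emit rowids seen in all scans (sort 'result' for an ordered stream).
    std::vector<rowid_t> result;
    for (const auto& [rowid, count] : temp_table)
      if (count == scans.size())
        result.push_back(rowid);
    return result;
  }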
1.2 Produce/merge sorted streams
--------------------------------
For each of the merged scans, use a filesort-like pass to end up with an
ordered stream of rowids. Then merge the ordered streams.

By a filesort-like pass we mean:
- Run over the index, collecting rowids in a buffer.
- When the buffer is full, sort it and dump it into a temporary file.
After the above we end up with a number of sorted buffers on disk. We can use
the mergebuff() function (part of filesort) to produce one ordered sequence
of rowids (i.e. an array, which may be partially on disk).

Merging ordered streams with the help of a priority queue is already
implemented in QUICK_ROR_INTERSECT_SELECT. We will need to substitute the

  child_quick->get_next()

call with a call that reads a rowid from an ordered sequence.
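
A rough sketch of the merge step, under the simplifying assumption that the
filesort-like pass has already reduced each scan to a single sorted in-memory
run of distinct rowids (in reality there would be several runs per scan,
combined by mergebuff()); merge_sorted_streams is an invented name:

  #include <cstdint>
  #include <functional>
  #include <queue>
  #include <vector>

  using rowid_t = uint64_t;

  // Priority-queue merge of N sorted rowid streams; a rowid that appears
  // in every stream belongs to the intersection. The output is
  // rowid-ordered, so it can be piped further.
  std::vector<rowid_t> merge_sorted_streams(
      const std::vector<std::vector<rowid_t>>& streams)
  {
    using Entry = std::pair<rowid_t, size_t>;  // (rowid, stream number)
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> pq;
    std::vector<size_t> pos(streams.size(), 0);

    for (size_t s = 0; s < streams.size(); ++s)
      if (!streams[s].empty())
        pq.emplace(streams[s][0], s);

    std::vector<rowid_t> result;
    rowid_t current = 0;
    size_t seen = 0;  // how many streams have produced 'current' so far

    while (!pq.empty())
    {
      auto [rowid, s] = pq.top();
      pq.pop();
      if (seen == 0 || rowid != current) { current = rowid; seen = 1; }
      else                               { ++seen; }
      if (seen == streams.size())
        result.push_back(current);       // present in every stream
      if (++pos[s] < streams[s].size())
        pq.emplace(streams[s][pos[s]], s);
    }
    return result;
  }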
1.3 Extend Unique class to handle intersection
----------------------------------------------
There is no point in using a Unique object as a device that accumulates the
rowids of a single scan and then produces them in sorted order: one could do
the same faster by accumulating an array of rowids and then sorting it.

It is possible, though, to use a Unique object to collect/merge data from all
scans. The idea is as follows:

- Unique should store <rowid, n_scans> pairs
- Duplicates are pairs with the same rowid
- Unique should try to avoid creating duplicates:
  - don't add a duplicate into the in-memory part; instead, combine the two
    elements by adding their n_scans values.
  - combine duplicates when it sees them in the Unique.get() call
- The data we get from Unique.get() should be filtered: all records that have
  n_scans != number_of_scans_being_merged should be discarded.

If we are lucky enough to have started and finished a scan on some index
(denote it S) without flushing the Unique in the process, then:
- there is no point in adding any new records into the Unique, because their
  absence from the Unique means that they have no match in S and hence will
  not get into the result of the intersection;
- we only need to update the counters, to be able to tell whether the elements
  that are already in the Unique will have matches in all scans.
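
A toy model of this bookkeeping (not the server's Unique class; all names are
invented, and std::map stands in for the tree of the in-memory part). put()
combines duplicates on insert, freeze() switches to counter-only updates after
a scan has completed without a flush, and get() applies the final filter:

  #include <cstdint>
  #include <map>
  #include <vector>

  using rowid_t = uint64_t;

  class IntersectUnique
  {
    std::map<rowid_t, size_t> elems_;  // <rowid, n_scans>, kept ordered
    bool frozen_ = false;              // a scan completed without a flush

  public:
    // Add one rowid from the current scan. Once frozen, we only update
    // counters: a rowid absent from the set cannot be in the intersection.
    void put(rowid_t r)
    {
      auto it = elems_.find(r);
      if (it != elems_.end())
        ++it->second;                  // combine duplicate: add n_scans
      else if (!frozen_)
        elems_.emplace(r, 1);
    }

    void freeze() { frozen_ = true; }

    // Produce the sorted, filtered result: discard every element whose
    // n_scans differs from the number of scans being merged.
    std::vector<rowid_t> get(size_t n_scans_merged) const
    {
      std::vector<rowid_t> out;
      for (const auto& [rowid, n] : elems_)
        if (n == n_scans_merged)
          out.push_back(rowid);        // std::map iterates in rowid order
      return out;
    }
  };

One would call put() for every rowid of every scan, freeze() after the first
scan that completes without flushing, and get(N) at the end.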
1.4 Strategies that do not seem to be useful
--------------------------------------------
We keep them here so that we do not consider them over and over.
1.4.1 Remove matches after having produced an ordered stream
------------------------------------------------------------
We can dump everything into a rowid stream and get it sorted. Then we read it,
and if we see a rowid repeated $n_merged_scans times, it belongs to the
intersection (pass it to the output); otherwise it does not (skip it).
This has no advantages over the produce/merge sorted streams approach.
1.4.2 Sparse rowid bitmaps
--------------------------
Use Falcon-style rowid bitmaps. The problem is that Falcon's bitmaps assume
there will always be enough memory to accommodate them.

PostgreSQL makes its bitmaps "lossy" when they exceed a certain size, by
remembering disk pages instead of ids of individual records. It is hard for us
to do something similar, because our rowids are opaque entities whose meaning
depends on the storage engine.

This seems to require too much change to be worth it.
2. Optimization
===============
SEL_TREE objects already represent intersections. The problems with
optimization are:

- Cost formula(s).

- When N keys/conditions are present:

    "cond(key1) AND cond(key2) AND ... AND cond(keyN)",

  somehow avoid considering the (2^n - n) possible options.

- Avoid producing (or even considering) apparently suboptimal plans:
  = Don't generate a merge of indexes (I_1, ... I_n) where the columns of I_n
    are a subset of the columns covered by all the other indexes.
  = (TODO: any other rules?)

- Correlation across selectivities. If there is a condition

    "cond(key1) AND cond(key2) AND ... AND cond(keyN)",

  can we consider satisfaction of the AND-parts to be independent?
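
To make the combinatorics concrete, here is a hypothetical sketch of the one
pruning rule stated above: enumerate candidate merges of two or more indexes
and drop those in which some index's columns are a subset of the columns
covered by the rest. Each index is described by a bitmask of the table columns
it covers; candidate_merges is an invented name:

  #include <bit>
  #include <cstdint>
  #include <vector>

  // For a handful of indexes, return the bitmasks of index subsets that
  // survive the subset-coverage pruning rule.
  std::vector<uint32_t> candidate_merges(const std::vector<uint32_t>& index_cols)
  {
    const size_t n = index_cols.size();
    std::vector<uint32_t> candidates;

    for (uint32_t set = 0; set < (1u << n); ++set)
    {
      if (std::popcount(set) < 2)       // an intersection needs >= 2 indexes
        continue;

      bool redundant = false;
      for (size_t i = 0; i < n && !redundant; ++i)
      {
        if (!(set & (1u << i))) continue;
        uint32_t others = 0;            // columns covered by the other indexes
        for (size_t j = 0; j < n; ++j)
          if (j != i && (set & (1u << j)))
            others |= index_cols[j];
        // I_i adds nothing if its columns are a subset of the others'.
        if ((index_cols[i] & ~others) == 0)
          redundant = true;
      }
      if (!redundant)
        candidates.push_back(set);
    }
    return candidates;
  }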
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v3.5.9)

[Maria-developers] Progress (by Guest): index_merge: non-ROR intersection (21)
by worklog-noreply@askmonty.org 24 Jun '10
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: index_merge: non-ROR intersection
CREATION DATE..: Thu, 21 May 2009, 21:32
SUPERVISOR.....: Knielsen
IMPLEMENTOR....:
COPIES TO......: Psergey
CATEGORY.......: Server-RawIdeaBin
TASK ID........: 21 (http://askmonty.org/worklog/?tid=21)
VERSION........: Server-9.x
STATUS.........: Un-Assigned
PRIORITY.......: 60
WORKED HOURS...: 25
ESTIMATE.......: 175 (hours remain)
ORIG. ESTIMATE.: 175
PROGRESS NOTES:
-=-=(Guest - Thu, 24 Jun 2010, 05:44)=-=-
I spent 25 hours in June 2010 performing the following work for this task.
1. Compared three possible algorithms for implementing the index intersection operation mentioned in
the HLS by their labor/time cost. Chose the algorithm that uses a modified Unique class (1.3) as
the cheapest one, requiring the least development effort/time.
2. Developed a design for a modification of the Unique class to support the index intersection
operation.
3. Modified the merge_buffers procedure used by the Unique class so that it can be used not
only for the union operation but for the intersect operation as well.
Worked 25 hours and estimate that 175 hours remain (original estimate increased by 200 hours).
-=-=(Guest - Mon, 20 Jul 2009, 17:13)=-=-
Dependency deleted: 30 no longer depends on 21
-=-=(Psergey - Wed, 03 Jun 2009, 12:09)=-=-
Dependency created: 30 now depends on 21
-=-=(Guest - Wed, 03 Jun 2009, 01:17)=-=-
High Level Description modified.
--- /tmp/wklog.21.old.30002 2009-06-03 01:17:32.000000000 +0300
+++ /tmp/wklog.21.new.30002 2009-06-03 01:17:32.000000000 +0300
@@ -7,13 +7,13 @@
The current optimization works with:
-WHERE key1_part1=1 AND key1_part2=2 OR key2_part1=3
+WHERE key1_part1=1 AND key1_part2=2 AND key2_part1=3
but not with:
-WHERE key1_part1=1 OR key2_part1=3
+WHERE key1_part1=1 AND key2_part1=3
or
-WHERE key_part1<10 or key2_part1<100
+WHERE key_part1<10 AND key2_part1<100
This WL entry is to lift this limitation by developing algorithms that do
intersection on non-ROR (rowid ordered retrieval) scans.
-=-=(Monty - Wed, 03 Jun 2009, 01:06)=-=-
High Level Description modified.
--- /tmp/wklog.21.old.29694 2009-06-03 01:06:50.000000000 +0300
+++ /tmp/wklog.21.new.29694 2009-06-03 01:06:50.000000000 +0300
@@ -12,6 +12,8 @@
but not with:
WHERE key1_part1=1 OR key2_part1=3
+or
+WHERE key_part1<10 or key2_part1<100
This WL entry is to lift this limitation by developing algorithms that do
intersection on non-ROR (rowid ordered retrieval) scans.
-=-=(Monty - Wed, 03 Jun 2009, 01:05)=-=-
High Level Description modified.
--- /tmp/wklog.21.old.29638 2009-06-03 01:05:01.000000000 +0300
+++ /tmp/wklog.21.new.29638 2009-06-03 01:05:01.000000000 +0300
@@ -3,5 +3,15 @@
constructed for equality conditions (t.keypart1=const1 AND t.keypart2=const2
AND ... ) and the equalities should cover all index components.
+For example, assuming that key1 has 2 parts and key2 has 1 part.
+
+The current optimization works with:
+
+WHERE key1_part1=1 AND key1_part2=2 OR key2_part1=3
+
+but not with:
+
+WHERE key1_part1=1 OR key2_part1=3
+
This WL entry is to lift this limitation by developing algorithms that do
-intersection on non-ROR scans.
+intersection on non-ROR (rowid ordered retrieval) scans.
-=-=(Guest - Tue, 26 May 2009, 14:04)=-=-
High-Level Specification modified.
--- /tmp/wklog.21.old.1802 2009-05-26 14:04:57.000000000 +0300
+++ /tmp/wklog.21.new.1802 2009-05-26 14:04:57.000000000 +0300
@@ -1,4 +1,3 @@
-
<contents>
1. Execution
1.1 Temptable
@@ -30,6 +29,8 @@
1.1 Temptable
-------------
+[ This is our strategy of choice at the moment]
+
Use a temporary heap-grow-out-to-myisam table with a primary key:
create table temp_table (
@@ -168,3 +169,8 @@
a subset of columns covered by all other indexes.
= (TODO any other rules?)
+- Correlation across selectivities. If there is a condition
+
+ "cond(key1) AND cond(key2) AND ... AND cond(keyN)",
+
+ can we consider satisfaction of AND-parts to be independent?
-=-=(Psergey - Thu, 21 May 2009, 21:33)=-=-
High-Level Specification modified.
--- /tmp/wklog.21.old.25705 2009-05-21 21:33:02.000000000 +0300
+++ /tmp/wklog.21.new.25705 2009-05-21 21:33:02.000000000 +0300
@@ -1 +1,170 @@
+<contents>
+1. Execution
+1.1 Temptable
+1.1.1 Improvement
+1.2 Produce/merge sorted streams
+1.3 Extend Unique class to handle intersection
+1.4 Strategies that do not seem to be useful
+1.4.1 Remove matches after having produced an ordered stream
+1.4.2 Sparse rowid bitmaps
+2. Optimization
+
+</contents>
+
+1. Execution
+============
+
+The primary task is to find means to compute an intersection of N unordered
+streams. Besides general memory/cpu cost of computation, we consider:
+
+- whether the produced rowid stream is ordered. If it is, it can be piped
+ into index_merge/intersect (as opposed to sort-intersect)
+
+- whether the strategy can take advantage of the fact that some input streams
+ are already rowid-ordered
+
+- startup cost (cost of producing the first output record)
+
+We see the following possible strategies:
+
+1.1 Temptable
+-------------
+Use a temporary heap-grow-out-to-myisam table with a primary key:
+
+create table temp_table (
+ rowid binary($rowid_size),
+ count n,
+ primary key(rowid);
+);
+
+Then use this algorithm:
+
+ i1= {index with the least E(#records)};
+
+ for each record R in range_scan(i1)
+ temp_table.insert(R.rowid, count=1);
+
+ for each index idx except i1
+ {
+ for each R record in scan(idx) // (INNER-LOOP)
+ {
+ if (temp_table has R)
+ temptable[R].count++;
+ }
+ }
+
+ // The following loop can do ordered or unordered scan
+ // if we want it to be ordered scan, we probably better arrange so that
+ // 'count' column is part of the index.
+ for each record R in temp_table
+ {
+ if (R.count == number_of_streams)
+ emit(R.rowid);
+ }
+
+The algorithm has an option to emit an ordered rowid stream.
+
+In the above form, the cost to produce the first record is high. It's easy to
+adjust the algorithm to make it low - we'll need to just start scanning all
+indexes at once, and finish as soon as we got a full match, i.e. the
+
+ temptable[R].count++
+
+operation resulted in the counter being equal to the number of merged scans.
+
+1.1.1 Improvement
+~~~~~~~~~~~~~~~~~
+When running INNER-LOOP, we could count how many times we've done the
+"count++" operation. If it has been done #records-in-temptable times, that
+means that all further records will not have matches and we can finish the
+scan, i.e. break out of the INNER-LOOP.
+
+1.2 Produce/merge sorted streams
+--------------------------------
+For each of the merged scan, use filesort-like action to end up with an
+ordered stream of rowids. Then merge the ordered streams.
+
+By filesort-like action we mean
+ - Run over index, collect rowids in a buffer.
+ - When the buffer is full, sort it and dump into a temporary file.
+After the above we'll end up with a number of sorted buffers on disk. We can
+use mergebuff() function (it is part of filesort's functions) to produce one
+ordered sequence (i.e. array, which may be partially on disk) of rowids.
+
+Merging of ordered streams with help of priority queue is already implemented
+in QUICK_ROR_INTERSECT_SELECT. We'll need to substitute the
+
+ child_quick->get_next()
+
+call with a call to read rowid from an ordered sequence.
+
+1.3 Extend Unique class to handle intersection
+----------------------------------------------
+There is no point to use Unique object as a device that accumulates rowids of
+a single scan then produces them in sorted order. One could do the same faster
+with accumulating an array of rowids and then sorting it.
+
+It's possible to use Unique object to collect/merge data from all scans though.
+The idea is as follows:
+
+- Unique should store <rowid, n_scans> pairs
+- Duplicates are pairs with the same rowid
+- Unique should try to avoid creating duplicates:
+ - don't add a duplicate into the in-memory part, instead combine two elements
+ together by adding their n_scans elements.
+ - combine duplicates when it sees them in Unique.get() call
+- The data we get from Unique.get() should be filtered, all records that have
+ n_scans != number_of_scans_being_merged should be discarded.
+
+If we're lucky to have started and finished a scan on some index (denote it
+as S) without flushing the Unique in the process, then:
+- there is no point in adding any new records into the Unique because their
+ absence in the Unique means that they don't have match in S and hence will
+ not get into the result of intersection.
+- we need to only update the counters to be able to tell if the elements that
+ are already in the Unique will have matches in all scans.
+
+1.4 Strategies that do not seem to be useful
+--------------------------------------------
+
+keeping them here so we don't consider them over and over
+
+1.4.1 Remove matches after having produced an ordered stream
+------------------------------------------------------------
+We can dump everything into a rowid stream and get it sorted. Then we read it,
+and if we see a rowid repeated $n_merged_scans times, it belongs to the
+intersection (pass to output), otherwise it doesn't (skip).
+This doesn't have any advantages over the produce/merge sorted streams
+approach.
+
+1.4.2 Sparse rowid bitmaps
+--------------------------
+Use Falcon-style rowid bitmaps. The problem with that is that Falcon's
+bitmaps assume there will always be enough memory to accommodate them.
+
+PostgreSQL makes bitmaps "loose" when they exceed certain size by remembering
+disk pages, not ids of individual records. It's hard for us to do something
+similar because our rowids are opaque entities whose meaning depends on the
+storage engines.
+
+This seems to require too much change to be worth it.
+
+2. Optimization
+===============
+
+SEL_TREE objects already represent intersections. The problems with
+optimizations are:
+
+- Cost formula(s)
+- When N keys/conditions are present:
+
+ "cond(key1) AND cond(key2) AND ... AND cond(keyN)",
+
+ somehow avoid considering (2^n - n) possible options.
+
+- Avoid producing (or even considering) apparently suboptimal plans:
+ = Don't generate a merge of indexes (I_1, ... I_n) where columns of I_n are
+ a subset of columns covered by all other indexes.
+ = (TODO any other rules?)
+
DESCRIPTION:
At the moment index_merge supports intersection only for rowid-ordered streams.
This translates into a limitation that index_merge/intersect can only be
constructed for equality conditions (t.keypart1=const1 AND t.keypart2=const2
AND ... ) and the equalities should cover all index components.
For example, assume that key1 has 2 parts and key2 has 1 part.
The current optimization works with:
WHERE key1_part1=1 AND key1_part2=2 AND key2_part1=3
but not with:
WHERE key1_part1=1 AND key2_part1=3
or
WHERE key_part1<10 AND key2_part1<100
This WL entry is to lift this limitation by developing algorithms that do
intersection on non-ROR (rowid ordered retrieval) scans.
HIGH-LEVEL SPECIFICATION:
<contents>
1. Execution
1.1 Temptable
1.1.1 Improvement
1.2 Produce/merge sorted streams
1.3 Extend Unique class to handle intersection
1.4 Strategies that do not seem to be useful
1.4.1 Remove matches after having produced an ordered stream
1.4.2 Sparse rowid bitmaps
2. Optimization
</contents>
1. Execution
============
The primary task is to find means to compute an intersection of N unordered
streams. Besides general memory/cpu cost of computation, we consider:
- whether the produced rowid stream is ordered. If it is, it can be piped
into index_merge/intersect (as opposed to sort-intersect)
- whether the strategy can take advantage of the fact that some input streams
are already rowid-ordered
- startup cost (cost of producing the first output record)
We see the following possible strategies:
1.1 Temptable
-------------
[This is our strategy of choice at the moment]
Use a temporary heap-grow-out-to-myisam table with a primary key:

create table temp_table (
  rowid binary($rowid_size),
  count n,
  primary key(rowid)
);
Then use this algorithm:
  i1= {index with the least E(#records)};

  for each record R in range_scan(i1)
    temp_table.insert(R.rowid, count=1);

  for each index idx except i1
  {
    for each record R in scan(idx) // (INNER-LOOP)
    {
      if (temp_table has R)
        temptable[R].count++;
    }
  }

  // The following loop can do an ordered or unordered scan;
  // if we want an ordered scan, we had better arrange for the
  // 'count' column to be part of the index.
  for each record R in temp_table
  {
    if (R.count == number_of_streams)
      emit(R.rowid);
  }
The algorithm has an option to emit an ordered rowid stream.
In the above form, the cost to produce the first record is high. It's easy to
adjust the algorithm to make it low: we just need to start scanning all
indexes at once and finish as soon as we get a full match, i.e. the

  temptable[R].count++

operation results in the counter becoming equal to the number of merged scans.
1.1.1 Improvement
~~~~~~~~~~~~~~~~~
When running the INNER-LOOP, we can count how many times we've done the
"count++" operation. If it has been done #records-in-temptable times, then no
further records can have matches and we can finish the scan, i.e. break out
of the INNER-LOOP.
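
As a rough illustration (not server code), here is the counting scheme of 1.1
together with the 1.1.1 early-termination check, sketched in portable C++. A
std::unordered_map stands in for the heap-grow-out-to-myisam temptable;
Rowid and intersect_via_temptable are made-up names for this sketch:

  #include <cstddef>
  #include <cstdint>
  #include <unordered_map>
  #include <vector>

  using Rowid= uint64_t;    // stand-in for an opaque engine rowid

  // scans[i] holds the rowids produced by the i-th index range scan;
  // scans[0] must be the scan with the least E(#records).
  std::vector<Rowid> intersect_via_temptable(
      const std::vector<std::vector<Rowid> > &scans)
  {
    std::unordered_map<Rowid, size_t> temp_table;
    for (Rowid r : scans[0])
      temp_table.emplace(r, 1);

    for (size_t i= 1; i < scans.size(); i++)
    {
      size_t n_matches= 0;                    // 1.1.1: count "count++" ops
      for (Rowid r : scans[i])                // INNER-LOOP
      {
        auto it= temp_table.find(r);
        if (it != temp_table.end() && it->second == i)
        {
          it->second++;
          if (++n_matches == temp_table.size())
            break;                            // no further record can match
        }
      }
    }

    std::vector<Rowid> result;                // unordered; sort it if an
    for (auto &p : temp_table)                // ordered rowid stream is needed
      if (p.second == scans.size())
        result.push_back(p.first);
    return result;
  }

The it->second == i guard makes the counter equal to the number of distinct
scans a rowid matched, so the final filter stays correct even if one scan
yields the same rowid twice.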
1.2 Produce/merge sorted streams
--------------------------------
For each of the merged scans, use a filesort-like action to end up with an
ordered stream of rowids. Then merge the ordered streams.
By a filesort-like action we mean:
  - Run over the index, collecting rowids in a buffer.
  - When the buffer is full, sort it and dump it into a temporary file.
After the above we'll end up with a number of sorted buffers on disk. We can
use the mergebuff() function (part of filesort's machinery) to produce one
ordered sequence (i.e. an array, which may be partially on disk) of rowids.
Merging ordered streams with the help of a priority queue is already
implemented in QUICK_ROR_INTERSECT_SELECT. We'll need to substitute the

  child_quick->get_next()

call with a call that reads the next rowid from an ordered sequence.
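
A hedged sketch of this merge step in C++, with in-memory vectors standing in
for the sorted on-disk buffers (in the server the input would come from the
mergebuff() output and the priority-queue loop would mirror
QUICK_ROR_INTERSECT_SELECT); merge_intersect is an illustrative name:

  #include <cstddef>
  #include <cstdint>
  #include <functional>
  #include <queue>
  #include <utility>
  #include <vector>

  using Rowid= uint64_t;

  std::vector<Rowid> merge_intersect(
      const std::vector<std::vector<Rowid> > &runs)
  {
    typedef std::pair<Rowid, size_t> Entry;       // <rowid, run index>
    std::priority_queue<Entry, std::vector<Entry>,
                        std::greater<Entry> > pq; // min-heap on rowid
    std::vector<size_t> pos(runs.size(), 0);
    for (size_t i= 0; i < runs.size(); i++)
      if (!runs[i].empty())
        pq.push(Entry(runs[i][0], i));

    std::vector<Rowid> result;
    Rowid cur= 0;
    size_t n_seen= 0;                         // how many runs contained 'cur'
    while (!pq.empty())
    {
      Entry top= pq.top();
      pq.pop();
      if (n_seen && top.first == cur)
        n_seen++;
      else
      {
        if (n_seen == runs.size())            // present in every run:
          result.push_back(cur);              // part of the intersection
        cur= top.first;
        n_seen= 1;
      }
      if (++pos[top.second] < runs[top.second].size())
        pq.push(Entry(runs[top.second][pos[top.second]], top.second));
    }
    if (n_seen == runs.size())
      result.push_back(cur);
    return result;                            // emitted in rowid order
  }

Note that the output comes out rowid-ordered for free, which is what makes
this strategy usable as input to further index_merge/intersect steps.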
1.3 Extend Unique class to handle intersection
----------------------------------------------
There is no point in using a Unique object as a device that accumulates the
rowids of a single scan and then produces them in sorted order: one could do
the same faster by accumulating an array of rowids and then sorting it.
It's possible to use a Unique object to collect/merge data from all scans,
though. The idea is as follows:
- Unique should store <rowid, n_scans> pairs
- Duplicates are pairs with the same rowid
- Unique should try to avoid creating duplicates:
  - don't add a duplicate into the in-memory part; instead, combine the two
    elements by adding up their n_scans values.
  - combine duplicates when it sees them in the Unique.get() call
- The data we get from Unique.get() should be filtered: all records that have
  n_scans != number_of_scans_being_merged should be discarded.
If we are lucky enough to have started and finished a scan on some index
(denote it as S) without flushing the Unique in the process, then:
- there is no point in adding any new records into the Unique, because their
  absence from the Unique means that they have no match in S and hence will
  not get into the result of the intersection.
- we only need to update the counters, to be able to tell whether the elements
  that are already in the Unique have matches in all scans.
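
A minimal sketch of such a modified Unique, assuming everything fits in
memory (the real class also flushes runs to disk and would have to combine
duplicates during buffer merging as well); Intersect_unique is an
illustrative name, not the server class:

  #include <cstddef>
  #include <cstdint>
  #include <map>
  #include <vector>

  using Rowid= uint64_t;

  class Intersect_unique
  {
    std::map<Rowid, size_t> elems;  // the <rowid, n_scans> pairs
    bool frozen;                    // a scan completed without any flush
  public:
    Intersect_unique() : frozen(false) {}

    void put(Rowid r)
    {
      auto it= elems.find(r);
      if (it != elems.end())
        it->second++;               // combine the duplicate instead of
      else if (!frozen)             // storing it
        elems.emplace(r, 1);
      // once frozen, an absent rowid cannot be in the intersection,
      // so new rowids are not added at all
    }
    void freeze() { frozen= true; } // call after such a scan on S

    std::vector<Rowid> get(size_t n_merged_scans) const
    {
      std::vector<Rowid> result;    // filter: keep only full matches
      for (auto &p : elems)
        if (p.second == n_merged_scans)
          result.push_back(p.first);
      return result;                // in rowid order, like Unique.get()
    }
  };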
1.4 Strategies that do not seem to be useful
--------------------------------------------
We keep these here so we don't consider them over and over.
1.4.1 Remove matches after having produced an ordered stream
------------------------------------------------------------
We can dump everything into a rowid stream and get it sorted. Then we read it,
and if we see a rowid repeated $n_merged_scans times, it belongs to the
intersection (pass it to the output); otherwise it doesn't (skip it).
This doesn't have any advantages over the produce/merge sorted streams
approach.
1.4.2 Sparse rowid bitmaps
--------------------------
Use Falcon-style rowid bitmaps. The problem with that is that Falcon's
bitmaps assume there will always be enough memory to accommodate them.
PostgreSQL makes bitmaps "loose" when they exceed a certain size by remembering
disk pages, not ids of individual records. It's hard for us to do something
similar because our rowids are opaque entities whose meaning depends on the
storage engine.
This seems to require too much change to be worth it.
2. Optimization
===============
SEL_TREE objects already represent intersections. The problems with
optimizations are:
- Cost formula(s)
- When N keys/conditions are present:
    "cond(key1) AND cond(key2) AND ... AND cond(keyN)",
  somehow avoid considering (2^n - n) possible options.
- Avoid producing (or even considering) apparently suboptimal plans:
  = Don't generate a merge of indexes (I_1, ... I_n) where columns of I_n are
    a subset of columns covered by all other indexes (see the sketch below).
  = (TODO any other rules?)
- Correlation across selectivities. If there is a condition
    "cond(key1) AND cond(key2) AND ... AND cond(keyN)",
  can we consider satisfaction of AND-parts to be independent?
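
For the column-subset rule above, a hedged sketch of the redundancy test;
Columns and is_redundant are illustrative names, with sets of column names
standing in for real key part descriptors:

  #include <cstddef>
  #include <set>
  #include <string>
  #include <vector>

  typedef std::set<std::string> Columns;

  // True if index merge[n] covers no column beyond what the other
  // indexes of the candidate intersection already cover together.
  bool is_redundant(const std::vector<Columns> &merge, size_t n)
  {
    Columns others;                 // union of the other indexes' columns
    for (size_t i= 0; i < merge.size(); i++)
      if (i != n)
        others.insert(merge[i].begin(), merge[i].end());
    for (const std::string &col : merge[n])
      if (others.find(col) == others.end())
        return false;               // I_n contributes a new column
    return true;                    // subset: prune I_n from the merge
  }

Dropping redundant indexes this way shrinks the option space the optimizer
would otherwise have to enumerate.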
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v3.5.9)

Hi!
I have a problem with automatic commit sending, so I'm sending the diff here
(sorry, I will fix it tomorrow).
I have also thought about renaming sql/sql_expression_cache.* to
sql/item_expression_cache and moving the item there as well, but I am not
sure it is better.
Also, I am not sure that Item_cache_wrapper is the best name, but
Item_expression_cache_wrapper is IMHO too long.
I re-made 5.3-mwl-66, so it needs re-branching (not pulling) if you want
to look at it.

[Maria-developers] Please review: MWL#121: DS-MRR support for clustered primary keys
by Sergey Petrunya 22 Jun '10
Hello Igor,
Please find below the combined patch for MWL#121. It is ready for review.
diff -urN --exclude='.*' maria-5.3-dsmrr-for-cpk-clean/mysql-test/r/innodb_mrr_cpk.result maria-5.3-dsmrr-for-cpk-noc/mysql-test/r/innodb_mrr_cpk.result
--- maria-5.3-dsmrr-for-cpk-clean/mysql-test/r/innodb_mrr_cpk.result 1970-01-01 03:00:00.000000000 +0300
+++ maria-5.3-dsmrr-for-cpk-noc/mysql-test/r/innodb_mrr_cpk.result 2010-06-22 23:28:02.000000000 +0400
@@ -0,0 +1,148 @@
+drop table if exists t0,t1,t2,t3;
+set @save_join_cache_level=@@join_cache_level;
+set join_cache_level=6;
+set @save_storage_engine=@@storage_engine;
+set storage_engine=innodb;
+create table t0(a int);
+insert into t0 values (0),(1),(2),(3),(4),(5),(6),(7),(8),(9);
+create table t1(a char(8), b char(8), filler char(100), primary key(a));
+show create table t1;
+Table Create Table
+t1 CREATE TABLE `t1` (
+ `a` char(8) NOT NULL DEFAULT '',
+ `b` char(8) DEFAULT NULL,
+ `filler` char(100) DEFAULT NULL,
+ PRIMARY KEY (`a`)
+) ENGINE=InnoDB DEFAULT CHARSET=latin1
+insert into t1 select
+concat('a-', 1000 + A.a + B.a*10 + C.a*100, '=A'),
+concat('b-', 1000 + A.a + B.a*10 + C.a*100, '=B'),
+'filler'
+from t0 A, t0 B, t0 C;
+create table t2 (a char(8));
+insert into t2 values ('a-1010=A'), ('a-1030=A'), ('a-1020=A');
+This should use join buffer:
+explain select * from t1, t2 where t1.a=t2.a;
+id select_type table type possible_keys key key_len ref rows Extra
+1 SIMPLE t2 ALL NULL NULL NULL NULL 3
+1 SIMPLE t1 eq_ref PRIMARY PRIMARY 8 test.t2.a 1 Using join buffer
+This output must be sorted by value of t1.a:
+select * from t1, t2 where t1.a=t2.a;
+a b filler a
+a-1010=A b-1010=B filler a-1010=A
+a-1020=A b-1020=B filler a-1020=A
+a-1030=A b-1030=B filler a-1030=A
+drop table t1, t2;
+create table t1(
+a char(8) character set utf8, b int, filler char(100),
+primary key(a,b)
+);
+insert into t1 select
+concat('a-', 1000 + A.a + B.a*10 + C.a*100, '=A'),
+1000 + A.a + B.a*10 + C.a*100,
+'filler'
+from t0 A, t0 B, t0 C;
+create table t2 (a char(8) character set utf8, b int);
+insert into t2 values ('a-1010=A', 1010), ('a-1030=A', 1030), ('a-1020=A', 1020);
+explain select * from t1, t2 where t1.a=t2.a and t1.b=t2.b;
+id select_type table type possible_keys key key_len ref rows Extra
+1 SIMPLE t2 ALL NULL NULL NULL NULL 3
+1 SIMPLE t1 eq_ref PRIMARY PRIMARY 28 test.t2.a,test.t2.b 1 Using join buffer
+select * from t1, t2 where t1.a=t2.a and t1.b=t2.b;
+a b filler a b
+a-1010=A 1010 filler a-1010=A 1010
+a-1020=A 1020 filler a-1020=A 1020
+a-1030=A 1030 filler a-1030=A 1030
+insert into t2 values ('a-1030=A', 1030), ('a-1020=A', 1020);
+explain select * from t1, t2 where t1.a=t2.a and t1.b=t2.b;
+id select_type table type possible_keys key key_len ref rows Extra
+1 SIMPLE t2 ALL NULL NULL NULL NULL 5
+1 SIMPLE t1 eq_ref PRIMARY PRIMARY 28 test.t2.a,test.t2.b 1 Using join buffer
+select * from t1, t2 where t1.a=t2.a and t1.b=t2.b;
+a b filler a b
+a-1010=A 1010 filler a-1010=A 1010
+a-1020=A 1020 filler a-1020=A 1020
+a-1020=A 1020 filler a-1020=A 1020
+a-1030=A 1030 filler a-1030=A 1030
+a-1030=A 1030 filler a-1030=A 1030
+drop table t1, t2;
+create table t1(
+a varchar(8) character set utf8, b int, filler char(100),
+primary key(a,b)
+);
+insert into t1 select
+concat('a-', 1000 + A.a + B.a*10 + C.a*100, '=A'),
+1000 + A.a + B.a*10 + C.a*100,
+'filler'
+from t0 A, t0 B, t0 C;
+create table t2 (a char(8) character set utf8, b int);
+insert into t2 values ('a-1010=A', 1010), ('a-1030=A', 1030), ('a-1020=A', 1020);
+explain select * from t1, t2 where t1.a=t2.a and t1.b=t2.b;
+id select_type table type possible_keys key key_len ref rows Extra
+1 SIMPLE t2 ALL NULL NULL NULL NULL 3
+1 SIMPLE t1 eq_ref PRIMARY PRIMARY 30 test.t2.a,test.t2.b 1 Using index condition(BKA); Using join buffer
+select * from t1, t2 where t1.a=t2.a and t1.b=t2.b;
+a b filler a b
+a-1010=A 1010 filler a-1010=A 1010
+a-1020=A 1020 filler a-1020=A 1020
+a-1030=A 1030 filler a-1030=A 1030
+explain select * from t1, t2 where t1.a=t2.a;
+id select_type table type possible_keys key key_len ref rows Extra
+1 SIMPLE t2 ALL NULL NULL NULL NULL 3
+1 SIMPLE t1 ref PRIMARY PRIMARY 26 test.t2.a 1 Using index condition(BKA); Using join buffer
+select * from t1, t2 where t1.a=t2.a;
+a b filler a b
+a-1010=A 1010 filler a-1010=A 1010
+a-1020=A 1020 filler a-1020=A 1020
+a-1030=A 1030 filler a-1030=A 1030
+drop table t1, t2;
+create table t1 (a int, b int, c int, filler char(100), primary key(a,b,c));
+insert into t1 select A.a, B.a, C.a, 'filler' from t0 A, t0 B, t0 C;
+insert into t1 values (11, 11, 11, 'filler');
+insert into t1 values (11, 11, 12, 'filler');
+insert into t1 values (11, 11, 13, 'filler');
+insert into t1 values (11, 22, 1234, 'filler');
+insert into t1 values (11, 33, 124, 'filler');
+insert into t1 values (11, 33, 125, 'filler');
+create table t2 (a int, b int);
+insert into t2 values (11,33), (11,22), (11,11);
+explain select * from t1, t2 where t1.a=t2.a and t1.b=t2.b;
+id select_type table type possible_keys key key_len ref rows Extra
+1 SIMPLE t2 ALL NULL NULL NULL NULL 3
+1 SIMPLE t1 ref PRIMARY PRIMARY 8 test.t2.a,test.t2.b 1 Using join buffer
+select * from t1, t2 where t1.a=t2.a and t1.b=t2.b;
+a b c filler a b
+11 11 11 filler 11 11
+11 11 12 filler 11 11
+11 11 13 filler 11 11
+11 22 1234 filler 11 22
+11 33 124 filler 11 33
+11 33 125 filler 11 33
+set join_cache_level=0;
+select * from t1, t2 where t1.a=t2.a and t1.b=t2.b;
+a b c filler a b
+11 33 124 filler 11 33
+11 33 125 filler 11 33
+11 22 1234 filler 11 22
+11 11 11 filler 11 11
+11 11 12 filler 11 11
+11 11 13 filler 11 11
+set join_cache_level=6;
+explain select * from t1, t2 where t1.a=t2.a and t2.b + t1.b > 100;
+id select_type table type possible_keys key key_len ref rows Extra
+1 SIMPLE t2 ALL NULL NULL NULL NULL 3
+1 SIMPLE t1 ref PRIMARY PRIMARY 4 test.t2.a 1 Using index condition(BKA); Using join buffer
+select * from t1, t2 where t1.a=t2.a and t2.b + t1.b > 100;
+a b c filler a b
+set optimizer_switch='index_condition_pushdown=off';
+explain select * from t1, t2 where t1.a=t2.a and t2.b + t1.b > 100;
+id select_type table type possible_keys key key_len ref rows Extra
+1 SIMPLE t2 ALL NULL NULL NULL NULL 3
+1 SIMPLE t1 ref PRIMARY PRIMARY 4 test.t2.a 1 Using where; Using join buffer
+select * from t1, t2 where t1.a=t2.a and t2.b + t1.b > 100;
+a b c filler a b
+set optimizer_switch='index_condition_pushdown=on';
+drop table t1,t2;
+set @@join_cache_level= @save_join_cache_level;
+set storage_engine=@save_storage_engine;
+drop table t0;
diff -urN --exclude='.*' maria-5.3-dsmrr-for-cpk-clean/mysql-test/r/innodb_mrr_cpk.result.moved maria-5.3-dsmrr-for-cpk-noc/mysql-test/r/innodb_mrr_cpk.result.moved
--- maria-5.3-dsmrr-for-cpk-clean/mysql-test/r/innodb_mrr_cpk.result.moved 1970-01-01 03:00:00.000000000 +0300
+++ maria-5.3-dsmrr-for-cpk-noc/mysql-test/r/innodb_mrr_cpk.result.moved 2010-06-22 19:23:18.000000000 +0400
@@ -0,0 +1,122 @@
+drop table if exists t0,t1,t2,t3;
+set @save_join_cache_level=@@join_cache_level;
+set join_cache_level=6;
+set @save_storage_engine=@@storage_engine;
+set storage_engine=innodb;
+create table t0(a int);
+insert into t0 values (0),(1),(2),(3),(4),(5),(6),(7),(8),(9);
+create table t1(a char(8), b char(8), filler char(100), primary key(a));
+show create table t1;
+Table Create Table
+t1 CREATE TABLE `t1` (
+ `a` char(8) NOT NULL DEFAULT '',
+ `b` char(8) DEFAULT NULL,
+ `filler` char(100) DEFAULT NULL,
+ PRIMARY KEY (`a`)
+) ENGINE=InnoDB DEFAULT CHARSET=latin1
+insert into t1 select
+concat('a-', 1000 + A.a + B.a*10 + C.a*100, '=A'),
+concat('b-', 1000 + A.a + B.a*10 + C.a*100, '=B'),
+'filler'
+from t0 A, t0 B, t0 C;
+create table t2 (a char(8));
+insert into t2 values ('a-1010=A'), ('a-1030=A'), ('a-1020=A');
+This should use join buffer:
+explain select * from t1, t2 where t1.a=t2.a;
+id select_type table type possible_keys key key_len ref rows Extra
+1 SIMPLE t2 ALL NULL NULL NULL NULL 3
+1 SIMPLE t1 eq_ref PRIMARY PRIMARY 8 test.t2.a 1 Using join buffer
+This output must be sorted by value of t1.a:
+select * from t1, t2 where t1.a=t2.a;
+a b filler a
+a-1010=A b-1010=B filler a-1010=A
+a-1020=A b-1020=B filler a-1020=A
+a-1030=A b-1030=B filler a-1030=A
+drop table t1, t2;
+create table t1(
+a char(8) character set utf8, b int, filler char(100),
+primary key(a,b)
+);
+insert into t1 select
+concat('a-', 1000 + A.a + B.a*10 + C.a*100, '=A'),
+1000 + A.a + B.a*10 + C.a*100,
+'filler'
+from t0 A, t0 B, t0 C;
+create table t2 (a char(8) character set utf8, b int);
+insert into t2 values ('a-1010=A', 1010), ('a-1030=A', 1030), ('a-1020=A', 1020);
+explain select * from t1, t2 where t1.a=t2.a and t1.b=t2.b;
+id select_type table type possible_keys key key_len ref rows Extra
+1 SIMPLE t2 ALL NULL NULL NULL NULL 3
+1 SIMPLE t1 eq_ref PRIMARY PRIMARY 28 test.t2.a,test.t2.b 1 Using join buffer
+select * from t1, t2 where t1.a=t2.a and t1.b=t2.b;
+a b filler a b
+a-1010=A 1010 filler a-1010=A 1010
+a-1020=A 1020 filler a-1020=A 1020
+a-1030=A 1030 filler a-1030=A 1030
+drop table t1, t2;
+create table t1(
+a varchar(8) character set utf8, b int, filler char(100),
+primary key(a,b)
+);
+insert into t1 select
+concat('a-', 1000 + A.a + B.a*10 + C.a*100, '=A'),
+1000 + A.a + B.a*10 + C.a*100,
+'filler'
+from t0 A, t0 B, t0 C;
+create table t2 (a char(8) character set utf8, b int);
+insert into t2 values ('a-1010=A', 1010), ('a-1030=A', 1030), ('a-1020=A', 1020);
+explain select * from t1, t2 where t1.a=t2.a and t1.b=t2.b;
+id select_type table type possible_keys key key_len ref rows Extra
+1 SIMPLE t2 ALL NULL NULL NULL NULL 3
+1 SIMPLE t1 eq_ref PRIMARY PRIMARY 30 test.t2.a,test.t2.b 1 Using index condition(BKA); Using join buffer
+select * from t1, t2 where t1.a=t2.a and t1.b=t2.b;
+a b filler a b
+a-1010=A 1010 filler a-1010=A 1010
+a-1020=A 1020 filler a-1020=A 1020
+a-1030=A 1030 filler a-1030=A 1030
+explain select * from t1, t2 where t1.a=t2.a;
+id select_type table type possible_keys key key_len ref rows Extra
+1 SIMPLE t2 ALL NULL NULL NULL NULL 3
+1 SIMPLE t1 ref PRIMARY PRIMARY 26 test.t2.a 1 Using index condition(BKA); Using join buffer
+select * from t1, t2 where t1.a=t2.a;
+a b filler a b
+a-1010=A 1010 filler a-1010=A 1010
+a-1020=A 1020 filler a-1020=A 1020
+a-1030=A 1030 filler a-1030=A 1030
+drop table t1, t2;
+create table t1 (a int, b int, c int, filler char(100), primary key(a,b,c));
+insert into t1 select A.a, B.a, C.a, 'filler' from t0 A, t0 B, t0 C;
+insert into t1 values (11, 11, 11, 'filler');
+insert into t1 values (11, 11, 12, 'filler');
+insert into t1 values (11, 11, 13, 'filler');
+insert into t1 values (11, 22, 1234, 'filler');
+insert into t1 values (11, 33, 124, 'filler');
+insert into t1 values (11, 33, 125, 'filler');
+create table t2 (a int, b int);
+insert into t2 values (11,33), (11,22), (11,11);
+explain select * from t1, t2 where t1.a=t2.a and t1.b=t2.b;
+id select_type table type possible_keys key key_len ref rows Extra
+1 SIMPLE t2 ALL NULL NULL NULL NULL 3
+1 SIMPLE t1 ref PRIMARY PRIMARY 8 test.t2.a,test.t2.b 1 Using join buffer
+select * from t1, t2 where t1.a=t2.a and t1.b=t2.b;
+a b c filler a b
+11 11 11 filler 11 11
+11 11 12 filler 11 11
+11 11 13 filler 11 11
+11 22 1234 filler 11 22
+11 33 124 filler 11 33
+11 33 125 filler 11 33
+set join_cache_level=0;
+select * from t1, t2 where t1.a=t2.a and t1.b=t2.b;
+a b c filler a b
+11 33 124 filler 11 33
+11 33 125 filler 11 33
+11 22 1234 filler 11 22
+11 11 11 filler 11 11
+11 11 12 filler 11 11
+11 11 13 filler 11 11
+set join_cache_level=6;
+drop table t1,t2;
+set @@join_cache_level= @save_join_cache_level;
+set storage_engine=@save_storage_engine;
+drop table t0;
diff -urN --exclude='.*' maria-5.3-dsmrr-for-cpk-clean/mysql-test/t/innodb_mrr_cpk.test maria-5.3-dsmrr-for-cpk-noc/mysql-test/t/innodb_mrr_cpk.test
--- maria-5.3-dsmrr-for-cpk-clean/mysql-test/t/innodb_mrr_cpk.test 1970-01-01 03:00:00.000000000 +0300
+++ maria-5.3-dsmrr-for-cpk-noc/mysql-test/t/innodb_mrr_cpk.test 2010-06-22 23:28:02.000000000 +0400
@@ -0,0 +1,137 @@
+#
+# Tests for DS-MRR over clustered primary key. The only engine that supports
+# this is InnoDB/XtraDB.
+#
+# Basic idea about testing
+# - DS-MRR/CPK works only with BKA
+# - Should also test index condition pushdown
+# - Should also test whatever uses RANGE_SEQ_IF::skip_record() for filtering
+# - Also test access using prefix of primary key
+#
+# - Forget about cost model, BKA's multi_range_read_info() call passes 10 for
+# #rows, the call is there at all only for applicability check
+#
+-- source include/have_innodb.inc
+
+--disable_warnings
+drop table if exists t0,t1,t2,t3;
+--enable_warnings
+
+set @save_join_cache_level=@@join_cache_level;
+set join_cache_level=6;
+
+set @save_storage_engine=@@storage_engine;
+set storage_engine=innodb;
+
+create table t0(a int);
+insert into t0 values (0),(1),(2),(3),(4),(5),(6),(7),(8),(9);
+create table t1(a char(8), b char(8), filler char(100), primary key(a));
+show create table t1;
+
+insert into t1 select
+ concat('a-', 1000 + A.a + B.a*10 + C.a*100, '=A'),
+ concat('b-', 1000 + A.a + B.a*10 + C.a*100, '=B'),
+ 'filler'
+from t0 A, t0 B, t0 C;
+
+create table t2 (a char(8));
+insert into t2 values ('a-1010=A'), ('a-1030=A'), ('a-1020=A');
+
+--echo This should use join buffer:
+explain select * from t1, t2 where t1.a=t2.a;
+
+--echo This output must be sorted by value of t1.a:
+select * from t1, t2 where t1.a=t2.a;
+drop table t1, t2;
+
+# Try multi-column indexes
+create table t1(
+ a char(8) character set utf8, b int, filler char(100),
+ primary key(a,b)
+);
+
+insert into t1 select
+ concat('a-', 1000 + A.a + B.a*10 + C.a*100, '=A'),
+ 1000 + A.a + B.a*10 + C.a*100,
+ 'filler'
+from t0 A, t0 B, t0 C;
+
+create table t2 (a char(8) character set utf8, b int);
+insert into t2 values ('a-1010=A', 1010), ('a-1030=A', 1030), ('a-1020=A', 1020);
+explain select * from t1, t2 where t1.a=t2.a and t1.b=t2.b;
+select * from t1, t2 where t1.a=t2.a and t1.b=t2.b;
+
+# Try with dataset that causes identical lookup keys:
+insert into t2 values ('a-1030=A', 1030), ('a-1020=A', 1020);
+explain select * from t1, t2 where t1.a=t2.a and t1.b=t2.b;
+select * from t1, t2 where t1.a=t2.a and t1.b=t2.b;
+
+drop table t1, t2;
+
+create table t1(
+ a varchar(8) character set utf8, b int, filler char(100),
+ primary key(a,b)
+);
+
+insert into t1 select
+ concat('a-', 1000 + A.a + B.a*10 + C.a*100, '=A'),
+ 1000 + A.a + B.a*10 + C.a*100,
+ 'filler'
+from t0 A, t0 B, t0 C;
+
+create table t2 (a char(8) character set utf8, b int);
+insert into t2 values ('a-1010=A', 1010), ('a-1030=A', 1030), ('a-1020=A', 1020);
+explain select * from t1, t2 where t1.a=t2.a and t1.b=t2.b;
+select * from t1, t2 where t1.a=t2.a and t1.b=t2.b;
+
+#
+# Try scanning on a CPK prefix
+#
+explain select * from t1, t2 where t1.a=t2.a;
+select * from t1, t2 where t1.a=t2.a;
+drop table t1, t2;
+
+#
+# The above example is not very interesting, as CPK prefix has
+# only one match. Create a dataset where scan on CPK prefix
+# would produce multiple matches:
+#
+create table t1 (a int, b int, c int, filler char(100), primary key(a,b,c));
+insert into t1 select A.a, B.a, C.a, 'filler' from t0 A, t0 B, t0 C;
+
+insert into t1 values (11, 11, 11, 'filler');
+insert into t1 values (11, 11, 12, 'filler');
+insert into t1 values (11, 11, 13, 'filler');
+insert into t1 values (11, 22, 1234, 'filler');
+insert into t1 values (11, 33, 124, 'filler');
+insert into t1 values (11, 33, 125, 'filler');
+
+create table t2 (a int, b int);
+insert into t2 values (11,33), (11,22), (11,11);
+
+explain select * from t1, t2 where t1.a=t2.a and t1.b=t2.b;
+select * from t1, t2 where t1.a=t2.a and t1.b=t2.b;
+
+# Check a real resultset for comaprison:
+set join_cache_level=0;
+select * from t1, t2 where t1.a=t2.a and t1.b=t2.b;
+set join_cache_level=6;
+
+
+#
+# Check that Index Condition Pushdown (BKA) actually works:
+#
+explain select * from t1, t2 where t1.a=t2.a and t2.b + t1.b > 100;
+select * from t1, t2 where t1.a=t2.a and t2.b + t1.b > 100;
+
+set optimizer_switch='index_condition_pushdown=off';
+explain select * from t1, t2 where t1.a=t2.a and t2.b + t1.b > 100;
+select * from t1, t2 where t1.a=t2.a and t2.b + t1.b > 100;
+set optimizer_switch='index_condition_pushdown=on';
+
+drop table t1,t2;
+
+set @@join_cache_level= @save_join_cache_level;
+set storage_engine=@save_storage_engine;
+drop table t0;
+
diff -urN --exclude='.*' maria-5.3-dsmrr-for-cpk-clean/mysql-test/t/innodb_mrr_cpk.test.moved maria-5.3-dsmrr-for-cpk-noc/mysql-test/t/innodb_mrr_cpk.test.moved
--- maria-5.3-dsmrr-for-cpk-clean/mysql-test/t/innodb_mrr_cpk.test.moved 1970-01-01 03:00:00.000000000 +0300
+++ maria-5.3-dsmrr-for-cpk-noc/mysql-test/t/innodb_mrr_cpk.test.moved 2010-06-22 19:23:18.000000000 +0400
@@ -0,0 +1,128 @@
+#
+# Tests for DS-MRR over clustered primary key. The only engine that supports
+# this is InnoDB/XtraDB.
+#
+# Basic idea about testing
+# - DS-MRR/CPK works only with BKA
+# - Should also test index condition pushdown
+# - Should also test whatever uses RANGE_SEQ_IF::skip_record() for filtering
+# - Also test access using prefix of primary key
+#
+# - Forget about cost model, BKA's multi_range_read_info() call passes 10 for
+# #rows, the call is there at all only for applicability check
+#
+-- source include/have_innodb.inc
+
+--disable_warnings
+drop table if exists t0,t1,t2,t3;
+--enable_warnings
+
+set @save_join_cache_level=@@join_cache_level;
+set join_cache_level=6;
+
+set @save_storage_engine=@@storage_engine;
+set storage_engine=innodb;
+
+create table t0(a int);
+insert into t0 values (0),(1),(2),(3),(4),(5),(6),(7),(8),(9);
+create table t1(a char(8), b char(8), filler char(100), primary key(a));
+show create table t1;
+
+insert into t1 select
+ concat('a-', 1000 + A.a + B.a*10 + C.a*100, '=A'),
+ concat('b-', 1000 + A.a + B.a*10 + C.a*100, '=B'),
+ 'filler'
+from t0 A, t0 B, t0 C;
+
+create table t2 (a char(8));
+insert into t2 values ('a-1010=A'), ('a-1030=A'), ('a-1020=A');
+
+--echo This should use join buffer:
+explain select * from t1, t2 where t1.a=t2.a;
+
+--echo This output must be sorted by value of t1.a:
+select * from t1, t2 where t1.a=t2.a;
+drop table t1, t2;
+
+# Try multi-column indexes
+create table t1(
+ a char(8) character set utf8, b int, filler char(100),
+ primary key(a,b)
+);
+
+insert into t1 select
+ concat('a-', 1000 + A.a + B.a*10 + C.a*100, '=A'),
+ 1000 + A.a + B.a*10 + C.a*100,
+ 'filler'
+from t0 A, t0 B, t0 C;
+
+create table t2 (a char(8) character set utf8, b int);
+insert into t2 values ('a-1010=A', 1010), ('a-1030=A', 1030), ('a-1020=A', 1020);
+explain select * from t1, t2 where t1.a=t2.a and t1.b=t2.b;
+select * from t1, t2 where t1.a=t2.a and t1.b=t2.b;
+drop table t1, t2;
+
+create table t1(
+ a varchar(8) character set utf8, b int, filler char(100),
+ primary key(a,b)
+);
+
+insert into t1 select
+ concat('a-', 1000 + A.a + B.a*10 + C.a*100, '=A'),
+ 1000 + A.a + B.a*10 + C.a*100,
+ 'filler'
+from t0 A, t0 B, t0 C;
+
+create table t2 (a char(8) character set utf8, b int);
+insert into t2 values ('a-1010=A', 1010), ('a-1030=A', 1030), ('a-1020=A', 1020);
+explain select * from t1, t2 where t1.a=t2.a and t1.b=t2.b;
+select * from t1, t2 where t1.a=t2.a and t1.b=t2.b;
+
+#
+# Try scanning on a CPK prefix
+#
+explain select * from t1, t2 where t1.a=t2.a;
+select * from t1, t2 where t1.a=t2.a;
+drop table t1, t2;
+
+#
+# The above example is not very interesting, as CPK prefix has
+# only one match. Create a dataset where scan on CPK prefix
+# would produce multiple matches:
+#
+create table t1 (a int, b int, c int, filler char(100), primary key(a,b,c));
+insert into t1 select A.a, B.a, C.a, 'filler' from t0 A, t0 B, t0 C;
+
+insert into t1 values (11, 11, 11, 'filler');
+insert into t1 values (11, 11, 12, 'filler');
+insert into t1 values (11, 11, 13, 'filler');
+insert into t1 values (11, 22, 1234, 'filler');
+insert into t1 values (11, 33, 124, 'filler');
+insert into t1 values (11, 33, 125, 'filler');
+
+create table t2 (a int, b int);
+insert into t2 values (11,33), (11,22), (11,11);
+
+explain select * from t1, t2 where t1.a=t2.a and t1.b=t2.b;
+select * from t1, t2 where t1.a=t2.a and t1.b=t2.b;
+
+set join_cache_level=0;
+select * from t1, t2 where t1.a=t2.a and t1.b=t2.b;
+set join_cache_level=6;
+
+drop table t1,t2;
+
+#
+# Check that Index Condition Pushdown (BKA) actually works:
+#
+
+# TODO
+
+#
+# Check that record-check-func is done:
+#
+
+set @@join_cache_level= @save_join_cache_level;
+set storage_engine=@save_storage_engine;
+drop table t0;
+
diff -urN --exclude='.*' maria-5.3-dsmrr-for-cpk-clean/r/innodb_mrr_cpk.result maria-5.3-dsmrr-for-cpk-noc/r/innodb_mrr_cpk.result
--- maria-5.3-dsmrr-for-cpk-clean/r/innodb_mrr_cpk.result 1970-01-01 03:00:00.000000000 +0300
+++ maria-5.3-dsmrr-for-cpk-noc/r/innodb_mrr_cpk.result 2010-06-22 19:23:14.000000000 +0400
@@ -0,0 +1,122 @@
+drop table if exists t0,t1,t2,t3;
+set @save_join_cache_level=@@join_cache_level;
+set join_cache_level=6;
+set @save_storage_engine=@@storage_engine;
+set storage_engine=innodb;
+create table t0(a int);
+insert into t0 values (0),(1),(2),(3),(4),(5),(6),(7),(8),(9);
+create table t1(a char(8), b char(8), filler char(100), primary key(a));
+show create table t1;
+Table Create Table
+t1 CREATE TABLE `t1` (
+ `a` char(8) NOT NULL DEFAULT '',
+ `b` char(8) DEFAULT NULL,
+ `filler` char(100) DEFAULT NULL,
+ PRIMARY KEY (`a`)
+) ENGINE=InnoDB DEFAULT CHARSET=latin1
+insert into t1 select
+concat('a-', 1000 + A.a + B.a*10 + C.a*100, '=A'),
+concat('b-', 1000 + A.a + B.a*10 + C.a*100, '=B'),
+'filler'
+from t0 A, t0 B, t0 C;
+create table t2 (a char(8));
+insert into t2 values ('a-1010=A'), ('a-1030=A'), ('a-1020=A');
+This should use join buffer:
+explain select * from t1, t2 where t1.a=t2.a;
+id select_type table type possible_keys key key_len ref rows Extra
+1 SIMPLE t2 ALL NULL NULL NULL NULL 3
+1 SIMPLE t1 eq_ref PRIMARY PRIMARY 8 test.t2.a 1 Using join buffer
+This output must be sorted by value of t1.a:
+select * from t1, t2 where t1.a=t2.a;
+a b filler a
+a-1010=A b-1010=B filler a-1010=A
+a-1020=A b-1020=B filler a-1020=A
+a-1030=A b-1030=B filler a-1030=A
+drop table t1, t2;
+create table t1(
+a char(8) character set utf8, b int, filler char(100),
+primary key(a,b)
+);
+insert into t1 select
+concat('a-', 1000 + A.a + B.a*10 + C.a*100, '=A'),
+1000 + A.a + B.a*10 + C.a*100,
+'filler'
+from t0 A, t0 B, t0 C;
+create table t2 (a char(8) character set utf8, b int);
+insert into t2 values ('a-1010=A', 1010), ('a-1030=A', 1030), ('a-1020=A', 1020);
+explain select * from t1, t2 where t1.a=t2.a and t1.b=t2.b;
+id select_type table type possible_keys key key_len ref rows Extra
+1 SIMPLE t2 ALL NULL NULL NULL NULL 3
+1 SIMPLE t1 eq_ref PRIMARY PRIMARY 28 test.t2.a,test.t2.b 1 Using join buffer
+select * from t1, t2 where t1.a=t2.a and t1.b=t2.b;
+a b filler a b
+a-1010=A 1010 filler a-1010=A 1010
+a-1020=A 1020 filler a-1020=A 1020
+a-1030=A 1030 filler a-1030=A 1030
+drop table t1, t2;
+create table t1(
+a varchar(8) character set utf8, b int, filler char(100),
+primary key(a,b)
+);
+insert into t1 select
+concat('a-', 1000 + A.a + B.a*10 + C.a*100, '=A'),
+1000 + A.a + B.a*10 + C.a*100,
+'filler'
+from t0 A, t0 B, t0 C;
+create table t2 (a char(8) character set utf8, b int);
+insert into t2 values ('a-1010=A', 1010), ('a-1030=A', 1030), ('a-1020=A', 1020);
+explain select * from t1, t2 where t1.a=t2.a and t1.b=t2.b;
+id select_type table type possible_keys key key_len ref rows Extra
+1 SIMPLE t2 ALL NULL NULL NULL NULL 3
+1 SIMPLE t1 eq_ref PRIMARY PRIMARY 30 test.t2.a,test.t2.b 1 Using index condition(BKA); Using join buffer
+select * from t1, t2 where t1.a=t2.a and t1.b=t2.b;
+a b filler a b
+a-1010=A 1010 filler a-1010=A 1010
+a-1020=A 1020 filler a-1020=A 1020
+a-1030=A 1030 filler a-1030=A 1030
+explain select * from t1, t2 where t1.a=t2.a;
+id select_type table type possible_keys key key_len ref rows Extra
+1 SIMPLE t2 ALL NULL NULL NULL NULL 3
+1 SIMPLE t1 ref PRIMARY PRIMARY 26 test.t2.a 1 Using index condition(BKA); Using join buffer
+select * from t1, t2 where t1.a=t2.a;
+a b filler a b
+a-1010=A 1010 filler a-1010=A 1010
+a-1020=A 1020 filler a-1020=A 1020
+a-1030=A 1030 filler a-1030=A 1030
+drop table t1, t2;
+create table t1 (a int, b int, c int, filler char(100), primary key(a,b,c));
+insert into t1 select A.a, B.a, C.a, 'filler' from t0 A, t0 B, t0 C;
+insert into t1 values (11, 11, 11, 'filler');
+insert into t1 values (11, 11, 12, 'filler');
+insert into t1 values (11, 11, 13, 'filler');
+insert into t1 values (11, 22, 1234, 'filler');
+insert into t1 values (11, 33, 124, 'filler');
+insert into t1 values (11, 33, 125, 'filler');
+create table t2 (a int, b int);
+insert into t2 values (11,33), (11,22), (11,11);
+explain select * from t1, t2 where t1.a=t2.a and t1.b=t2.b;
+id select_type table type possible_keys key key_len ref rows Extra
+1 SIMPLE t2 ALL NULL NULL NULL NULL 3
+1 SIMPLE t1 ref PRIMARY PRIMARY 8 test.t2.a,test.t2.b 1 Using join buffer
+select * from t1, t2 where t1.a=t2.a and t1.b=t2.b;
+a b c filler a b
+11 11 11 filler 11 11
+11 11 12 filler 11 11
+11 11 13 filler 11 11
+11 22 1234 filler 11 22
+11 33 124 filler 11 33
+11 33 125 filler 11 33
+set join_cache_level=0;
+select * from t1, t2 where t1.a=t2.a and t1.b=t2.b;
+a b c filler a b
+11 33 124 filler 11 33
+11 33 125 filler 11 33
+11 22 1234 filler 11 22
+11 11 11 filler 11 11
+11 11 12 filler 11 11
+11 11 13 filler 11 11
+set join_cache_level=6;
+drop table t1,t2;
+set @@join_cache_level= @save_join_cache_level;
+set storage_engine=@save_storage_engine;
+drop table t0;
diff -urN --exclude='.*' maria-5.3-dsmrr-for-cpk-clean/sql/handler.h maria-5.3-dsmrr-for-cpk-noc/sql/handler.h
--- maria-5.3-dsmrr-for-cpk-clean/sql/handler.h 2010-06-22 19:10:46.000000000 +0400
+++ maria-5.3-dsmrr-for-cpk-noc/sql/handler.h 2010-06-22 23:28:40.000000000 +0400
@@ -1168,9 +1168,9 @@
COST_VECT *cost);
/*
- The below two are not used (and not handled) in this milestone of this WL
- entry because there seems to be no use for them at this stage of
- implementation.
+ Indicates that all scanned ranges will be singlepoint (aka equality) ranges.
+ The ranges may not use the full key but all of them will use the same number
+ of key parts.
*/
#define HA_MRR_SINGLE_POINT 1
#define HA_MRR_FIXED_KEY 2
@@ -1752,9 +1752,10 @@
uint n_ranges, uint *bufsz,
uint *flags, COST_VECT *cost);
virtual ha_rows multi_range_read_info(uint keyno, uint n_ranges, uint keys,
- uint *bufsz, uint *flags, COST_VECT *cost);
+ uint key_parts, uint *bufsz,
+ uint *flags, COST_VECT *cost);
virtual int multi_range_read_init(RANGE_SEQ_IF *seq, void *seq_init_param,
- uint n_ranges, uint mode,
+ uint n_ranges, uint mode,
HANDLER_BUFFER *buf);
virtual int multi_range_read_next(char **range_info);
virtual int read_range_first(const key_range *start_key,
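To illustrate the flag semantics: HA_MRR_SINGLE_POINT is a bit flag, so testing
whether a caller requested it needs bitwise AND; OR-ing against a non-zero
constant is always true. A standalone demo (not server code; only the flag
values mirror the hunk above):

#include <cstdio>

static const unsigned HA_MRR_SINGLE_POINT= 1;   /* values as in handler.h */
static const unsigned HA_MRR_FIXED_KEY= 2;

int main()
{
  unsigned flags= HA_MRR_FIXED_KEY;       /* SINGLE_POINT deliberately unset */
  std::printf("flags | SINGLE_POINT = %u (always non-zero, useless as a test)\n",
              flags | HA_MRR_SINGLE_POINT);
  std::printf("flags & SINGLE_POINT = %u (zero here: the flag really is unset)\n",
              flags & HA_MRR_SINGLE_POINT);
  return 0;
}

This is also why the DBUG_ASSERT in the multi_range_read.cc hunk below checks
the flag with & rather than |.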
diff -urN --exclude='.*' maria-5.3-dsmrr-for-cpk-clean/sql/multi_range_read.cc maria-5.3-dsmrr-for-cpk-noc/sql/multi_range_read.cc
--- maria-5.3-dsmrr-for-cpk-clean/sql/multi_range_read.cc 2010-06-22 19:10:46.000000000 +0400
+++ maria-5.3-dsmrr-for-cpk-noc/sql/multi_range_read.cc 2010-06-22 23:28:40.000000000 +0400
@@ -1,4 +1,5 @@
#include "mysql_priv.h"
+#include <my_bit.h>
#include "sql_select.h"
/****************************************************************************
@@ -136,10 +137,16 @@
*/
ha_rows handler::multi_range_read_info(uint keyno, uint n_ranges, uint n_rows,
- uint *bufsz, uint *flags, COST_VECT *cost)
+ uint key_parts, uint *bufsz,
+ uint *flags, COST_VECT *cost)
{
- *bufsz= 0; /* Default implementation doesn't need a buffer */
+ /*
+    Currently we expect this function to be called only in preparation of a
+    scan with the HA_MRR_SINGLE_POINT property.
+  */
+  DBUG_ASSERT(*flags & HA_MRR_SINGLE_POINT);
+ *bufsz= 0; /* Default implementation doesn't need a buffer */
*flags |= HA_MRR_USE_DEFAULT_IMPL;
cost->zero();
@@ -316,25 +323,39 @@
{
use_default_impl= TRUE;
const int retval=
- h->handler::multi_range_read_init(seq_funcs, seq_init_param,
- n_ranges, mode, buf);
+ h->handler::multi_range_read_init(seq_funcs, seq_init_param, n_ranges,
+ mode, buf);
DBUG_RETURN(retval);
}
- rowids_buf= buf->buffer;
+ mrr_buf= buf->buffer;
is_mrr_assoc= !test(mode & HA_MRR_NO_ASSOCIATION);
if (is_mrr_assoc)
status_var_increment(table->in_use->status_var.ha_multi_range_read_init_count);
- rowids_buf_end= buf->buffer_end;
+ mrr_buf_end= buf->buffer_end;
+
+ if ((doing_cpk_scan= check_cpk_scan(h->active_index, mode)))
+ {
+ /* It's a DS-MRR/CPK scan */
+ cpk_tuple_length= 0; /* dummy value telling it needs to be inited */
+ cpk_have_range= FALSE;
+ use_default_impl= FALSE;
+ h->mrr_iter= seq_funcs->init(seq_init_param, n_ranges, mode);
+ h->mrr_funcs= *seq_funcs;
+ dsmrr_fill_buffer_cpk();
+ if (dsmrr_eof)
+ buf->end_of_used_area= mrr_buf_last;
+ DBUG_RETURN(0); /* nothing could go wrong while filling the buffer */
+ }
+
+ /* In regular DS-MRR, buffer stores {rowid, range_id} pairs */
elem_size= h->ref_length + (int)is_mrr_assoc * sizeof(void*);
- rowids_buf_last= rowids_buf +
- ((rowids_buf_end - rowids_buf)/ elem_size)*
- elem_size;
- rowids_buf_end= rowids_buf_last;
+ mrr_buf_last= mrr_buf + ((mrr_buf_end - mrr_buf)/ elem_size)* elem_size;
+ mrr_buf_end= mrr_buf_last;
- /*
+ /*
There can be two cases:
- This is the first call since index_init(), h2==NULL
Need to setup h2 then.
@@ -406,8 +427,8 @@
goto error;
}
- if (h2->handler::multi_range_read_init(seq_funcs, seq_init_param, n_ranges,
- mode, buf) ||
+ if (h2->handler::multi_range_read_init(seq_funcs, seq_init_param, n_ranges,
+ mode, buf) ||
dsmrr_fill_buffer())
{
goto error;
@@ -417,7 +438,7 @@
adjust *buf to indicate that the remaining buffer space will not be used.
*/
if (dsmrr_eof)
- buf->end_of_used_area= rowids_buf_last;
+ buf->end_of_used_area= mrr_buf_last;
/*
h->inited == INDEX may occur when 'range checked for each record' is
@@ -473,6 +494,9 @@
rowid and return.
The function assumes that rowids buffer is empty when it is invoked.
+
+ dsmrr_eof is set to indicate whether we've exhausted the list of ranges we're
+ scanning.
@param h Table handler
@@ -487,8 +511,8 @@
int res;
DBUG_ENTER("DsMrr_impl::dsmrr_fill_buffer");
- rowids_buf_cur= rowids_buf;
- while ((rowids_buf_cur < rowids_buf_end) &&
+ mrr_buf_cur= mrr_buf;
+ while ((mrr_buf_cur < mrr_buf_end) &&
!(res= h2->handler::multi_range_read_next(&range_info)))
{
KEY_MULTI_RANGE *curr_range= &h2->handler::mrr_cur_range;
@@ -498,13 +522,13 @@
/* Put rowid, or {rowid, range_id} pair into the buffer */
h2->position(table->record[0]);
- memcpy(rowids_buf_cur, h2->ref, h2->ref_length);
- rowids_buf_cur += h2->ref_length;
+ memcpy(mrr_buf_cur, h2->ref, h2->ref_length);
+ mrr_buf_cur += h2->ref_length;
if (is_mrr_assoc)
{
- memcpy(rowids_buf_cur, &range_info, sizeof(void*));
- rowids_buf_cur += sizeof(void*);
+ memcpy(mrr_buf_cur, &range_info, sizeof(void*));
+ mrr_buf_cur += sizeof(void*);
}
}
@@ -514,16 +538,224 @@
/* Sort the buffer contents by rowid */
uint elem_size= h->ref_length + (int)is_mrr_assoc * sizeof(void*);
- uint n_rowids= (rowids_buf_cur - rowids_buf) / elem_size;
+ uint n_rowids= (mrr_buf_cur - mrr_buf) / elem_size;
- my_qsort2(rowids_buf, n_rowids, elem_size, (qsort2_cmp)rowid_cmp,
+ my_qsort2(mrr_buf, n_rowids, elem_size, (qsort2_cmp)rowid_cmp,
(void*)h);
- rowids_buf_last= rowids_buf_cur;
- rowids_buf_cur= rowids_buf;
+ mrr_buf_last= mrr_buf_cur;
+ mrr_buf_cur= mrr_buf;
DBUG_RETURN(0);
}
+/*
+ my_qsort2-compatible function to compare key tuples
+*/
+
+int DsMrr_impl::key_tuple_cmp(void* arg, uchar* key1, uchar* key2)
+{
+ DsMrr_impl *dsmrr= (DsMrr_impl*)arg;
+ TABLE *table= dsmrr->h->table;
+
+ KEY_PART_INFO *part= table->key_info[table->s->primary_key].key_part;
+ uchar *key1_end= key1 + dsmrr->cpk_tuple_length;
+
+ while (key1 < key1_end)
+ {
+ Field* f = part->field;
+ int len = part->store_length;
+ int res = f->cmp(key1, key2);
+ if (res)
+ return res;
+ key1 += len;
+ key2 += len;
+ part++;
+ }
+ return 0;
+}
+
+
+/*
+ DS-MRR/CPK: Fill the buffer with (lookup_tuple, range_id) pairs and sort
+
+ SYNOPSIS
+ DsMrr_impl::dsmrr_fill_buffer_cpk()
+
+ DESCRIPTION
+ DS-MRR/CPK: Fill the buffer with (lookup_tuple, range_id) pairs and sort
+
+ dsmrr_eof is set to indicate whether we've exhausted the list of ranges
+ we're scanning.
+*/
+
+void DsMrr_impl::dsmrr_fill_buffer_cpk()
+{
+ int res;
+ KEY_MULTI_RANGE cur_range;
+ DBUG_ENTER("DsMrr_impl::dsmrr_fill_buffer_cpk");
+
+ mrr_buf_cur= mrr_buf;
+ while ((mrr_buf_cur < mrr_buf_end) &&
+ !(res= h->mrr_funcs.next(h->mrr_iter, &cur_range)))
+ {
+ DBUG_ASSERT(cur_range.range_flag & EQ_RANGE);
+ DBUG_ASSERT(!cpk_tuple_length ||
+ cpk_tuple_length == cur_range.start_key.length);
+ if (!cpk_tuple_length)
+ {
+ cpk_tuple_length= cur_range.start_key.length;
+ cpk_is_unique_scan= test(table->key_info[h->active_index].key_parts ==
+ my_count_bits(cur_range.start_key.keypart_map));
+ uint elem_size= cpk_tuple_length + (int)is_mrr_assoc * sizeof(void*);
+ mrr_buf_last= mrr_buf + ((mrr_buf_end - mrr_buf)/elem_size) * elem_size;
+ mrr_buf_end= mrr_buf_last;
+ }
+
+ /* Put key, or {key, range_id} pair into the buffer */
+ memcpy(mrr_buf_cur, cur_range.start_key.key, cpk_tuple_length);
+ mrr_buf_cur += cpk_tuple_length;
+
+ if (is_mrr_assoc)
+ {
+ memcpy(mrr_buf_cur, &cur_range.ptr, sizeof(void*));
+ mrr_buf_cur += sizeof(void*);
+ }
+ }
+
+ dsmrr_eof= test(res);
+
+ /* Sort the buffer contents by rowid */
+ uint elem_size= cpk_tuple_length + (int)is_mrr_assoc * sizeof(void*);
+ uint n_rowids= (mrr_buf_cur - mrr_buf) / elem_size;
+
+ my_qsort2(mrr_buf, n_rowids, elem_size,
+ (qsort2_cmp)DsMrr_impl::key_tuple_cmp, (void*)this);
+ mrr_buf_last= mrr_buf_cur;
+ mrr_buf_cur= mrr_buf;
+ DBUG_VOID_RETURN;
+}
+
+
+/*
+ DS-MRR/CPK: multi_range_read_next() function
+
+  SYNOPSIS
+ DsMrr_impl::dsmrr_next_cpk()
+ range_info OUT identifier of range that the returned record belongs to
+
+ DESCRIPTION
+ DS-MRR/CPK: multi_range_read_next() function.
+ This is similar to DsMrr_impl::dsmrr_next(), the differences are that
+ - we get records with index_read(), not with rnd_pos()
+ - we may get multiple records for one key (=element of the buffer)
+ - unlike dsmrr_fill_buffer(), dsmrr_fill_buffer_cpk() never fails.
+
+ RETURN
+ 0 OK, next record was successfully read
+ HA_ERR_END_OF_FILE End of records
+ Other Some other error
+*/
+
+int DsMrr_impl::dsmrr_next_cpk(char **range_info)
+{
+ int res;
+
+ while (cpk_have_range)
+ {
+
+ if (h->mrr_funcs.skip_record &&
+ h->mrr_funcs.skip_record(h->mrr_iter, cpk_saved_range_info, NULL))
+ {
+ cpk_have_range= FALSE;
+ break;
+ }
+
+ res= h->index_next_same(table->record[0], mrr_buf_cur, cpk_tuple_length);
+
+ if (h->mrr_funcs.skip_index_tuple &&
+ h->mrr_funcs.skip_index_tuple(h->mrr_iter, cpk_saved_range_info))
+ continue;
+
+ if (res != HA_ERR_END_OF_FILE)
+ {
+ if (is_mrr_assoc)
+ memcpy(range_info, &cpk_saved_range_info, sizeof(void*));
+ return res;
+ }
+
+ /* No more records in this range. Exit this loop and go get another range */
+ cpk_have_range= FALSE;
+ }
+
+ do
+ {
+ /* First, make sure we have a range at start of the buffer */
+ if (mrr_buf_cur == mrr_buf_last)
+ {
+ if (dsmrr_eof)
+ {
+ res= HA_ERR_END_OF_FILE;
+ goto end;
+ }
+ dsmrr_fill_buffer_cpk();
+ }
+ if (mrr_buf_cur == mrr_buf_last)
+ {
+ res= HA_ERR_END_OF_FILE;
+ goto end;
+ }
+
+ /* Ok, got the range. Try making a lookup. */
+ uchar *lookup_tuple= mrr_buf_cur;
+ mrr_buf_cur += cpk_tuple_length;
+ if (is_mrr_assoc)
+ {
+ memcpy(&cpk_saved_range_info, mrr_buf_cur, sizeof(void*));
+      mrr_buf_cur += sizeof(void*);
+ }
+
+ if (h->mrr_funcs.skip_record &&
+ h->mrr_funcs.skip_record(h->mrr_iter, cpk_saved_range_info, NULL))
+ continue;
+
+ res= h->index_read(table->record[0], lookup_tuple, cpk_tuple_length,
+ HA_READ_KEY_EXACT);
+
+ /*
+ Check pushed index condition. Performance-wise, it does not make any
+ sense to put this call here (the above call has already accessed the full
+ record). That's the best I could do, though, because:
+ - ha_innobase doesn't support IndexConditionPushdown on clustered PK
+ - MRR interface doesn't allow the storage engine to refuse a pushed index
+ condition.
+ Having this call here is not fully harmless: EXPLAIN shows "pushed index
+ condition", which is technically true but doesn't bring the benefits that
+ one might expect.
+ */
+ if (h->mrr_funcs.skip_index_tuple &&
+ h->mrr_funcs.skip_index_tuple(h->mrr_iter, cpk_saved_range_info))
+ continue;
+
+ if (res && res != HA_ERR_END_OF_FILE)
+ goto end;
+
+ if (!res)
+ {
+ memcpy(range_info, &cpk_saved_range_info, sizeof(void*));
+ /*
+ Attempt reading more rows from this range only if there actually can
+ be multiple matches:
+ */
+ cpk_have_range= !cpk_is_unique_scan;
+ break;
+ }
+ } while (true);
+
+end:
+ return res;
+}
+
+
/**
DS-MRR implementation: multi_range_read_next() function
*/
@@ -536,10 +768,13 @@
if (use_default_impl)
return h->handler::multi_range_read_next(range_info);
+
+ if (doing_cpk_scan)
+ return dsmrr_next_cpk(range_info);
do
{
- if (rowids_buf_cur == rowids_buf_last)
+ if (mrr_buf_cur == mrr_buf_last)
{
if (dsmrr_eof)
{
@@ -552,17 +787,17 @@
}
/* return eof if there are no rowids in the buffer after re-fill attempt */
- if (rowids_buf_cur == rowids_buf_last)
+ if (mrr_buf_cur == mrr_buf_last)
{
res= HA_ERR_END_OF_FILE;
goto end;
}
- rowid= rowids_buf_cur;
+ rowid= mrr_buf_cur;
if (is_mrr_assoc)
- memcpy(&cur_range_info, rowids_buf_cur + h->ref_length, sizeof(uchar**));
+ memcpy(&cur_range_info, mrr_buf_cur + h->ref_length, sizeof(uchar**));
- rowids_buf_cur += h->ref_length + sizeof(void*) * test(is_mrr_assoc);
+ mrr_buf_cur += h->ref_length + sizeof(void*) * test(is_mrr_assoc);
if (h2->mrr_funcs.skip_record &&
h2->mrr_funcs.skip_record(h2->mrr_iter, (char *) cur_range_info, rowid))
continue;
@@ -582,7 +817,8 @@
/**
DS-MRR implementation: multi_range_read_info() function
*/
-ha_rows DsMrr_impl::dsmrr_info(uint keyno, uint n_ranges, uint rows,
+ha_rows DsMrr_impl::dsmrr_info(uint keyno, uint n_ranges, uint rows,
+ uint key_parts,
uint *bufsz, uint *flags, COST_VECT *cost)
{
ha_rows res;
@@ -590,8 +826,8 @@
uint def_bufsz= *bufsz;
/* Get cost/flags/mem_usage of default MRR implementation */
- res= h->handler::multi_range_read_info(keyno, n_ranges, rows, &def_bufsz,
- &def_flags, cost);
+ res= h->handler::multi_range_read_info(keyno, n_ranges, rows, key_parts,
+ &def_bufsz, &def_flags, cost);
DBUG_ASSERT(!res);
if ((*flags & HA_MRR_USE_DEFAULT_IMPL) ||
@@ -683,7 +919,33 @@
return FALSE;
}
-/**
+
+/*
+ Check if key/flags allow DS-MRR/CPK strategy to be used
+
+ SYNOPSIS
+ DsMrr_impl::check_cpk_scan()
+ keyno Index that will be used
+ mrr_flags
+
+ DESCRIPTION
+ Check if key/flags allow DS-MRR/CPK strategy to be used.
+
+ RETURN
+ TRUE DS-MRR/CPK should be used
+ FALSE Otherwise
+*/
+
+bool DsMrr_impl::check_cpk_scan(uint keyno, uint mrr_flags)
+{
+ return test((mrr_flags & HA_MRR_SINGLE_POINT) &&
+ !(mrr_flags & HA_MRR_SORTED) &&
+ keyno == table->s->primary_key &&
+ h->primary_key_is_clustered());
+}
+
+
+/*
DS-MRR Internals: Choose between Default MRR implementation and DS-MRR
Make the choice between using Default MRR implementation and DS-MRR.
@@ -706,14 +968,18 @@
@retval FALSE DS-MRR implementation should be used
*/
+
bool DsMrr_impl::choose_mrr_impl(uint keyno, ha_rows rows, uint *flags,
uint *bufsz, COST_VECT *cost)
{
COST_VECT dsmrr_cost;
bool res;
THD *thd= current_thd;
+
+ doing_cpk_scan= check_cpk_scan(keyno, *flags);
if (thd->variables.optimizer_use_mrr == 2 || *flags & HA_MRR_INDEX_ONLY ||
- (keyno == table->s->primary_key && h->primary_key_is_clustered()) ||
+ (keyno == table->s->primary_key && h->primary_key_is_clustered() &&
+ !doing_cpk_scan) ||
key_uses_partial_cols(table, keyno))
{
/* Use the default implementation */
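The control flow that dsmrr_fill_buffer_cpk() and dsmrr_next_cpk() implement
above boils down to: buffer the equality keys, sort them, then probe the
clustered index in key order, reading every match for each key. A toy,
self-contained model of that idea, with std::multimap standing in for the
clustered primary key (illustration only, not server code):

#include <algorithm>
#include <cstdio>
#include <map>
#include <vector>

int main()
{
  /* "clustered index": rows stored in key order, possibly several per key */
  std::multimap<int, const char*> clustered_pk= {
    {11, "row-a"}, {11, "row-b"}, {22, "row-c"}, {33, "row-d"}};

  std::vector<int> lookup_keys= {33, 11, 22};   /* ranges arrive unsorted */

  /* steps 1+2: collect the keys and sort them (dsmrr_fill_buffer_cpk) */
  std::sort(lookup_keys.begin(), lookup_keys.end());

  /* step 3: probe in key order; in the real code this is index_read
     followed by index_next_same (dsmrr_next_cpk) */
  for (int key : lookup_keys)
  {
    auto range= clustered_pk.equal_range(key);
    for (auto it= range.first; it != range.second; ++it)
      std::printf("key=%d -> %s\n", key, it->second);
  }
  return 0;
}

Sorting the keys first is what turns the random-order lookups coming from the
join buffer into a single ordered pass over the clustered index.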
diff -urN --exclude='.*' maria-5.3-dsmrr-for-cpk-clean/sql/multi_range_read.h maria-5.3-dsmrr-for-cpk-noc/sql/multi_range_read.h
--- maria-5.3-dsmrr-for-cpk-clean/sql/multi_range_read.h 2010-06-22 19:10:46.000000000 +0400
+++ maria-5.3-dsmrr-for-cpk-noc/sql/multi_range_read.h 2010-06-22 23:28:40.000000000 +0400
@@ -1,16 +1,76 @@
/*
- This file contains declarations for
- - Disk-Sweep MultiRangeRead (DS-MRR) implementation
+ This file contains declarations for Disk-Sweep MultiRangeRead (DS-MRR)
+ implementation
*/
/**
- A Disk-Sweep MRR interface implementation
+ A Disk-Sweep implementation of MRR Interface (DS-MRR for short)
- This implementation makes range (and, in the future, 'ref') scans to read
- table rows in disk sweeps.
-
- Currently it is used by MyISAM and InnoDB. Potentially it can be used with
- any table handler that has non-clustered indexes and on-disk rows.
+ This is a "plugin"(*) for storage engines that allows make index scans
+ read table rows in rowid order. For disk-based storage engines, this is
+ faster than reading table rows in whatever-SQL-layer-makes-calls-in order.
+
+ (*) - only conceptually. No dynamic loading or binary compatibility of any
+ kind.
+
+ General scheme of things:
+
+ SQL Layer code
+ | | |
+ -v---v---v---- handler->multi_range_read_XXX() function calls
+ | | |
+ ____________________________________
+ / DS-MRR module \
+ | (scan indexes, order rowids, do |
+ | full record reads in rowid order) |
+ \____________________________________/
+ | | |
+ -|---|---|----- handler->read_range_first()/read_range_next(),
+ | | | handler->index_read(), handler->rnd_pos() calls.
+ | | |
+ v v v
+ Storage engine internals
+
+ Currently DS-MRR is used by MyISAM, InnoDB/XtraDB and Maria storage engines.
+ Potentially it can be used with any table handler that has disk-based data
+ storage and has better performance when reading data in rowid order.
+*/
+
+
+/*
+ DS-MRR implementation for one table. Create/use one object of this class for
+ each ha_{myisam/innobase/etc} object. That object will be further referred to
+ as "the handler"
+
+ There are actually three strategies
+ S1. Bypass DS-MRR, pass all calls to default implementation (i.e. to
+ MRR-to-non-MRR calls converter)
+ S2. Regular DS-MRR
+ S3. DS-MRR/CPK for doing scans on clustered primary keys.
+
+  S1 is used for cases that DS-MRR is unable to handle for some reason.
+
+ S2 is the actual DS-MRR. The basic algorithm is as follows:
+ 1. Scan the index (and only index, that is, with HA_EXTRA_KEYREAD on) and
+ fill the buffer with {rowid, range_id} pairs
+ 2. Sort the buffer by rowid
+ 3. for each {rowid, range_id} pair in the buffer
+ get record by rowid and return the {record, range_id} pair
+ 4. Repeat the above steps until we've exhausted the list of ranges we're
+ scanning.
+
+ S3 is the variant of DS-MRR for use with clustered primary keys (or any
+  clustered index). The idea is that in a clustered index it is sufficient to
+  access the index in index order, and we don't need an intermediate step to
+ get rowid (like step #1 in S2).
+
+ DS-MRR/CPK's basic algorithm is as follows:
+ 1. Collect a number of ranges (=lookup keys)
+ 2. Sort them so that they follow in index order.
+ 3. for each {lookup_key, range_id} pair in the buffer
+ get record(s) matching the lookup key and return {record, range_id} pairs
+ 4. Repeat the above steps until we've exhausted the list of ranges we're
+ scanning.
*/
class DsMrr_impl
@@ -21,21 +81,38 @@
DsMrr_impl()
: h2(NULL) {};
+ void init(handler *h_arg, TABLE *table_arg)
+ {
+ h= h_arg;
+ table= table_arg;
+ }
+ int dsmrr_init(handler *h, RANGE_SEQ_IF *seq_funcs, void *seq_init_param,
+ uint n_ranges, uint mode, HANDLER_BUFFER *buf);
+ void dsmrr_close();
+ int dsmrr_next(char **range_info);
+
+ ha_rows dsmrr_info(uint keyno, uint n_ranges, uint keys, uint key_parts,
+ uint *bufsz, uint *flags, COST_VECT *cost);
+
+ ha_rows dsmrr_info_const(uint keyno, RANGE_SEQ_IF *seq,
+ void *seq_init_param, uint n_ranges, uint *bufsz,
+ uint *flags, COST_VECT *cost);
+private:
/*
The "owner" handler object (the one that calls dsmrr_XXX functions.
It is used to retrieve full table rows by calling rnd_pos().
*/
handler *h;
TABLE *table; /* Always equal to h->table */
-private:
+
/* Secondary handler object. It is used for scanning the index */
handler *h2;
/* Buffer to store rowids, or (rowid, range_id) pairs */
- uchar *rowids_buf;
- uchar *rowids_buf_cur; /* Current position when reading/writing */
- uchar *rowids_buf_last; /* When reading: end of used buffer space */
- uchar *rowids_buf_end; /* End of the buffer */
+ uchar *mrr_buf;
+ uchar *mrr_buf_cur; /* Current position when reading/writing */
+ uchar *mrr_buf_last; /* When reading: end of used buffer space */
+ uchar *mrr_buf_end; /* End of the buffer */
bool dsmrr_eof; /* TRUE <=> We have reached EOF when reading index tuples */
@@ -43,28 +120,31 @@
bool is_mrr_assoc;
bool use_default_impl; /* TRUE <=> shortcut all calls to default MRR impl */
-public:
- void init(handler *h_arg, TABLE *table_arg)
- {
- h= h_arg;
- table= table_arg;
- }
- int dsmrr_init(handler *h, RANGE_SEQ_IF *seq_funcs, void *seq_init_param,
- uint n_ranges, uint mode, HANDLER_BUFFER *buf);
- void dsmrr_close();
- int dsmrr_fill_buffer();
- int dsmrr_next(char **range_info);
- ha_rows dsmrr_info(uint keyno, uint n_ranges, uint keys, uint *bufsz,
- uint *flags, COST_VECT *cost);
+ bool doing_cpk_scan; /* TRUE <=> DS-MRR/CPK variant is used */
+
+ /** DS-MRR/CPK variables start */
+
+ /* Length of lookup tuple being used, in bytes */
+ uint cpk_tuple_length;
+ /*
+    TRUE <=> We're scanning on a full primary key (and not on a prefix), and
+    so can get at most one match for each key
+ */
+ bool cpk_is_unique_scan;
+  /* TRUE <=> we're in the middle of enumerating records from a range */
+ bool cpk_have_range;
+ /* Valid if cpk_have_range==TRUE: range_id of the range we're enumerating */
+ char *cpk_saved_range_info;
- ha_rows dsmrr_info_const(uint keyno, RANGE_SEQ_IF *seq,
- void *seq_init_param, uint n_ranges, uint *bufsz,
- uint *flags, COST_VECT *cost);
-private:
bool choose_mrr_impl(uint keyno, ha_rows rows, uint *flags, uint *bufsz,
COST_VECT *cost);
bool get_disk_sweep_mrr_cost(uint keynr, ha_rows rows, uint flags,
uint *buffer_size, COST_VECT *cost);
+ bool check_cpk_scan(uint keyno, uint mrr_flags);
+ static int key_tuple_cmp(void* arg, uchar* key1, uchar* key2);
+ int dsmrr_fill_buffer();
+ void dsmrr_fill_buffer_cpk();
+ int dsmrr_next_cpk(char **range_info);
};
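The S2 strategy described in the header comment above can be modeled the same
way: here a plain vector index stands in for a rowid, and sorting the
collected rowids turns scattered fetches into one sequential sweep
(illustration only, not server code):

#include <algorithm>
#include <cstdio>
#include <vector>

int main()
{
  /* "table": full rows addressed by rowid */
  std::vector<const char*> table= {"r0", "r1", "r2", "r3", "r4", "r5"};

  /* step 1: the index scan produced rowids in key order, not rowid order */
  std::vector<size_t> rowids= {5, 1, 4, 0};

  std::sort(rowids.begin(), rowids.end());      /* step 2: sort by rowid */

  for (size_t rid : rowids)                     /* step 3: one disk sweep */
    std::printf("rowid=%zu -> %s\n", rid, table[rid]);
  return 0;
}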
diff -urN --exclude='.*' maria-5.3-dsmrr-for-cpk-clean/sql/opt_range.cc maria-5.3-dsmrr-for-cpk-noc/sql/opt_range.cc
--- maria-5.3-dsmrr-for-cpk-clean/sql/opt_range.cc 2010-06-22 19:10:46.000000000 +0400
+++ maria-5.3-dsmrr-for-cpk-noc/sql/opt_range.cc 2010-06-22 23:28:40.000000000 +0400
@@ -8006,6 +8006,7 @@
quick->mrr_buf_size= thd->variables.mrr_buff_size;
if (table->file->multi_range_read_info(quick->index, 1, (uint)records,
+ uint(-1),
&quick->mrr_buf_size,
&quick->mrr_flags, &cost))
goto err;
diff -urN --exclude='.*' maria-5.3-dsmrr-for-cpk-clean/sql/sql_join_cache.cc maria-5.3-dsmrr-for-cpk-noc/sql/sql_join_cache.cc
--- maria-5.3-dsmrr-for-cpk-clean/sql/sql_join_cache.cc 2010-06-22 19:10:46.000000000 +0400
+++ maria-5.3-dsmrr-for-cpk-noc/sql/sql_join_cache.cc 2010-06-22 23:28:40.000000000 +0400
@@ -2376,8 +2376,8 @@
*/
if (!file->inited)
file->ha_index_init(join_tab->ref.key, 1);
- if ((error= file->multi_range_read_init(seq_funcs, (void*) this, ranges,
- mrr_mode, &mrr_buff)))
+ if ((error= file->multi_range_read_init(seq_funcs, (void*) this, ranges,
+ mrr_mode, &mrr_buff)))
rc= error < 0 ? NESTED_LOOP_NO_MORE_ROWS: NESTED_LOOP_ERROR;
return rc;
diff -urN --exclude='.*' maria-5.3-dsmrr-for-cpk-clean/sql/sql_select.cc maria-5.3-dsmrr-for-cpk-noc/sql/sql_select.cc
--- maria-5.3-dsmrr-for-cpk-clean/sql/sql_select.cc 2010-06-22 19:10:46.000000000 +0400
+++ maria-5.3-dsmrr-for-cpk-noc/sql/sql_select.cc 2010-06-22 19:06:54.000000000 +0400
@@ -7318,10 +7318,11 @@
case JT_EQ_REF:
if (cache_level <= 4)
return 0;
- flags= HA_MRR_NO_NULL_ENDPOINTS;
+ flags= HA_MRR_NO_NULL_ENDPOINTS | HA_MRR_SINGLE_POINT;
if (tab->table->covering_keys.is_set(tab->ref.key))
flags|= HA_MRR_INDEX_ONLY;
rows= tab->table->file->multi_range_read_info(tab->ref.key, 10, 20,
+ tab->ref.key_parts,
&bufsz, &flags, &cost);
if ((rows != HA_POS_ERROR) && !(flags & HA_MRR_USE_DEFAULT_IMPL) &&
(!(flags & HA_MRR_NO_ASSOCIATION) || cache_level > 6) &&
diff -urN --exclude='.*' maria-5.3-dsmrr-for-cpk-clean/storage/maria/ha_maria.cc maria-5.3-dsmrr-for-cpk-noc/storage/maria/ha_maria.cc
--- maria-5.3-dsmrr-for-cpk-clean/storage/maria/ha_maria.cc 2010-06-22 19:10:46.000000000 +0400
+++ maria-5.3-dsmrr-for-cpk-noc/storage/maria/ha_maria.cc 2010-06-22 23:28:40.000000000 +0400
@@ -3501,8 +3501,8 @@
***************************************************************************/
int ha_maria::multi_range_read_init(RANGE_SEQ_IF *seq, void *seq_init_param,
- uint n_ranges, uint mode,
- HANDLER_BUFFER *buf)
+ uint n_ranges, uint mode,
+ HANDLER_BUFFER *buf)
{
return ds_mrr.dsmrr_init(this, seq, seq_init_param, n_ranges, mode, buf);
}
@@ -3528,11 +3528,11 @@
}
ha_rows ha_maria::multi_range_read_info(uint keyno, uint n_ranges, uint keys,
- uint *bufsz, uint *flags,
- COST_VECT *cost)
+ uint key_parts, uint *bufsz,
+ uint *flags, COST_VECT *cost)
{
ds_mrr.init(this, table);
- return ds_mrr.dsmrr_info(keyno, n_ranges, keys, bufsz, flags, cost);
+ return ds_mrr.dsmrr_info(keyno, n_ranges, keys, key_parts, bufsz, flags, cost);
}
/* MyISAM MRR implementation ends */
diff -urN --exclude='.*' maria-5.3-dsmrr-for-cpk-clean/storage/maria/ha_maria.h maria-5.3-dsmrr-for-cpk-noc/storage/maria/ha_maria.h
--- maria-5.3-dsmrr-for-cpk-clean/storage/maria/ha_maria.h 2010-06-22 19:10:46.000000000 +0400
+++ maria-5.3-dsmrr-for-cpk-noc/storage/maria/ha_maria.h 2010-06-22 23:28:40.000000000 +0400
@@ -181,7 +181,8 @@
uint n_ranges, uint *bufsz,
uint *flags, COST_VECT *cost);
ha_rows multi_range_read_info(uint keyno, uint n_ranges, uint keys,
- uint *bufsz, uint *flags, COST_VECT *cost);
+ uint key_parts, uint *bufsz,
+ uint *flags, COST_VECT *cost);
/* Index condition pushdown implementation */
Item *idx_cond_push(uint keyno, Item* idx_cond);
diff -urN --exclude='.*' maria-5.3-dsmrr-for-cpk-clean/storage/myisam/ha_myisam.cc maria-5.3-dsmrr-for-cpk-noc/storage/myisam/ha_myisam.cc
--- maria-5.3-dsmrr-for-cpk-clean/storage/myisam/ha_myisam.cc 2010-06-22 19:10:46.000000000 +0400
+++ maria-5.3-dsmrr-for-cpk-noc/storage/myisam/ha_myisam.cc 2010-06-22 23:28:40.000000000 +0400
@@ -2244,11 +2244,11 @@
}
ha_rows ha_myisam::multi_range_read_info(uint keyno, uint n_ranges, uint keys,
- uint *bufsz, uint *flags,
- COST_VECT *cost)
+ uint key_parts, uint *bufsz,
+ uint *flags, COST_VECT *cost)
{
ds_mrr.init(this, table);
- return ds_mrr.dsmrr_info(keyno, n_ranges, keys, bufsz, flags, cost);
+ return ds_mrr.dsmrr_info(keyno, n_ranges, keys, key_parts, bufsz, flags, cost);
}
/* MyISAM MRR implementation ends */
diff -urN --exclude='.*' maria-5.3-dsmrr-for-cpk-clean/storage/myisam/ha_myisam.h maria-5.3-dsmrr-for-cpk-noc/storage/myisam/ha_myisam.h
--- maria-5.3-dsmrr-for-cpk-clean/storage/myisam/ha_myisam.h 2010-06-22 19:10:46.000000000 +0400
+++ maria-5.3-dsmrr-for-cpk-noc/storage/myisam/ha_myisam.h 2010-06-22 23:28:40.000000000 +0400
@@ -169,7 +169,8 @@
uint n_ranges, uint *bufsz,
uint *flags, COST_VECT *cost);
ha_rows multi_range_read_info(uint keyno, uint n_ranges, uint keys,
- uint *bufsz, uint *flags, COST_VECT *cost);
+ uint key_parts, uint *bufsz,
+ uint *flags, COST_VECT *cost);
/* Index condition pushdown implementation */
Item *idx_cond_push(uint keyno, Item* idx_cond);
diff -urN --exclude='.*' maria-5.3-dsmrr-for-cpk-clean/storage/xtradb/handler/ha_innodb.cc maria-5.3-dsmrr-for-cpk-noc/storage/xtradb/handler/ha_innodb.cc
--- maria-5.3-dsmrr-for-cpk-clean/storage/xtradb/handler/ha_innodb.cc 2010-06-22 19:10:46.000000000 +0400
+++ maria-5.3-dsmrr-for-cpk-noc/storage/xtradb/handler/ha_innodb.cc 2010-06-22 23:28:40.000000000 +0400
@@ -11025,7 +11025,8 @@
*/
int ha_innobase::multi_range_read_init(RANGE_SEQ_IF *seq, void *seq_init_param,
- uint n_ranges, uint mode, HANDLER_BUFFER *buf)
+ uint n_ranges, uint mode,
+ HANDLER_BUFFER *buf)
{
return ds_mrr.dsmrr_init(this, seq, seq_init_param, n_ranges, mode, buf);
}
@@ -11052,12 +11053,13 @@
return res;
}
-ha_rows ha_innobase::multi_range_read_info(uint keyno, uint n_ranges,
- uint keys, uint *bufsz,
+ha_rows ha_innobase::multi_range_read_info(uint keyno, uint n_ranges, uint keys,
+ uint key_parts, uint *bufsz,
uint *flags, COST_VECT *cost)
{
ds_mrr.init(this, table);
- ha_rows res= ds_mrr.dsmrr_info(keyno, n_ranges, keys, bufsz, flags, cost);
+ ha_rows res= ds_mrr.dsmrr_info(keyno, n_ranges, keys, key_parts, bufsz,
+ flags, cost);
return res;
}
diff -urN --exclude='.*' maria-5.3-dsmrr-for-cpk-clean/storage/xtradb/handler/ha_innodb.h maria-5.3-dsmrr-for-cpk-noc/storage/xtradb/handler/ha_innodb.h
--- maria-5.3-dsmrr-for-cpk-clean/storage/xtradb/handler/ha_innodb.h 2010-06-22 19:10:46.000000000 +0400
+++ maria-5.3-dsmrr-for-cpk-noc/storage/xtradb/handler/ha_innodb.h 2010-06-22 23:28:40.000000000 +0400
@@ -217,7 +217,8 @@
uint n_ranges, uint *bufsz,
uint *flags, COST_VECT *cost);
ha_rows multi_range_read_info(uint keyno, uint n_ranges, uint keys,
- uint *bufsz, uint *flags, COST_VECT *cost);
+ uint key_parts, uint *bufsz,
+ uint *flags, COST_VECT *cost);
DsMrr_impl ds_mrr;
Item *idx_cond_push(uint keyno, Item* idx_cond);
BR
Sergey
--
Sergey Petrunia, Software Developer
Monty Program AB, http://askmonty.org
Blog: http://s.petrunia.net/blog
[Maria-developers] Progress (by Knielsen): New replication APIs (107)
by worklog-noreply@askmonty.org 21 Jun '10
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: New replication APIs
CREATION DATE..: Mon, 15 Mar 2010, 13:55
SUPERVISOR.....: Knielsen
IMPLEMENTOR....: Knielsen
COPIES TO......: Sergei
CATEGORY.......: Server-Sprint
TASK ID........: 107 (http://askmonty.org/worklog/?tid=107)
VERSION........: Server-9.x
STATUS.........: Un-Assigned
PRIORITY.......: 60
WORKED HOURS...: 69
ESTIMATE.......: 0 (hours remain)
ORIG. ESTIMATE.: 0
PROGRESS NOTES:
-=-=(Knielsen - Mon, 21 Jun 2010, 08:36)=-=-
Research and design thoughts.
Worked 19 hours and estimate 0 hours remain (original estimate increased by 19 hours).
-=-=(Knielsen - Mon, 07 Jun 2010, 12:11)=-=-
High Level Description modified.
--- /tmp/wklog.107.old.31097 2010-06-07 12:11:57.000000000 +0000
+++ /tmp/wklog.107.new.31097 2010-06-07 12:11:57.000000000 +0000
@@ -7,3 +7,6 @@
https://lists.launchpad.net/maria-developers/msg01998.html
+Wiki page for the project:
+
+ http://askmonty.org/wiki/ReplicationProject
-=-=(Knielsen - Mon, 29 Mar 2010, 07:33)=-=-
Research and design discussions: Galera, 2pc/XA, group commit, multi-engine transactions.
Worked 14 hours and estimate 0 hours remain (original estimate increased by 14 hours).
-=-=(Knielsen - Wed, 24 Mar 2010, 10:39)=-=-
Design discussions
Worked 11 hours and estimate 0 hours remain (original estimate increased by 11 hours).
-=-=(Knielsen - Mon, 15 Mar 2010, 14:28)=-=-
Research into the problem, and discussions on phone/mailing list
Worked 25 hours and estimate 0 hours remain (original estimate increased by 25 hours).
-=-=(Guest - Mon, 15 Mar 2010, 14:18)=-=-
High-Level Specification modified.
--- /tmp/wklog.107.old.9086 2010-03-15 14:18:18.000000000 +0000
+++ /tmp/wklog.107.new.9086 2010-03-15 14:18:18.000000000 +0000
@@ -1 +1,43 @@
+Current ideas/status after discussions on the mailing list:
+
+ - Implement a set of plugin APIs and use them to move all of the existing
+ MySQL replication into a (set of) plugins.
+
+ - Design the APIs so that they can support full MySQL replication, but also
+ so that they do not hardcode assumptions about how this replication
+ implementation is done, and so that they will be suitable for other types of
+ replication (Tungsten, Galera, parallel replication, ...).
+
+ - APIs need to include the concept of a global transaction ID. Need to
+ determine the extent to which the semantics of such ID will be defined
+ by the API, and to which extend it will be defined by the plugin
+ implementations.
+
+ - APIs should properly support reliable crash-recovery with decent
+ performance (eg. not require multiple mandatory fsync()s per commit, and
+ not make group commit impossible).
+
+ - Would be nice if the API provided facilities for implementing good
+ consistency checking support (mainly checking master tables against slave
+ tables is hard here I think, but also applying wrong binlog data and
+ individual event checksums).
+
+
+Steps to make this more concrete:
+
+ - Investigate the current MySQL replication, and list all of the places where
+ a plugin implementation will need to connect/hook into the MySQL server.
+ * handler::{write,update,delete}_row()
+ * Statement execution
+ * Transaction start/commit
+ * Table open
+ * Query safe/not/safe for statement based replication
+ * Statement-based logging details (user variables, random seed, etc.)
+ * ...
+
+ - Use this list to make an initial sketch of the set of APIs we need.
+
+ - Use the list to determine the feasibility of this project and the level of
+ detail in the API needed to support a full replication implementation as a
+ plugin.
-=-=(Serg - Mon, 15 Mar 2010, 14:13)=-=-
Observers changed: Sergei
DESCRIPTION:
This is a top-level task for the project of designing a new set of replication
APIs for MariaDB.
This task is for the initial discussion of what to do and where to focus.
The project is started in this email thread:
https://lists.launchpad.net/maria-developers/msg01998.html
Wiki page for the project:
http://askmonty.org/wiki/ReplicationProject
HIGH-LEVEL SPECIFICATION:
Current ideas/status after discussions on the mailing list:
- Implement a set of plugin APIs and use them to move all of the existing
MySQL replication into a (set of) plugins.
- Design the APIs so that they can support full MySQL replication, but also
so that they do not hardcode assumptions about how this replication
implementation is done, and so that they will be suitable for other types of
replication (Tungsten, Galera, parallel replication, ...).
- APIs need to include the concept of a global transaction ID. Need to
determine the extent to which the semantics of such an ID will be defined
by the API, and to which extent it will be defined by the plugin
implementations.
- APIs should properly support reliable crash-recovery with decent
performance (eg. not require multiple mandatory fsync()s per commit, and
not make group commit impossible).
- Would be nice if the API provided facilities for implementing good
consistency checking support (mainly checking master tables against slave
tables is hard here I think, but also applying wrong binlog data and
individual event checksums).
Steps to make this more concrete:
- Investigate the current MySQL replication, and list all of the places where
a plugin implementation will need to connect/hook into the MySQL server
(a hypothetical sketch of such hooks follows after this list).
* handler::{write,update,delete}_row()
* Statement execution
* Transaction start/commit
* Table open
* Query safe/not safe for statement-based replication
* Statement-based logging details (user variables, random seed, etc.)
* ...
- Use this list to make an initial sketch of the set of APIs we need.
- Use the list to determine the feasibility of this project and the level of
detail in the API needed to support a full replication implementation as a
plugin.
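A purely hypothetical sketch of what such hook points could look like as a
plugin API; none of these class or function names exist in the server, they
only illustrate the shape of the interface under discussion:

#include <cstdio>
#include <string>

struct ReplicationObserver               /* hypothetical plugin interface */
{
  virtual ~ReplicationObserver() {}
  virtual void on_write_row(const std::string &table) {}
  virtual void on_statement(const std::string &query) {}
  virtual void on_commit(unsigned long long global_trx_id) {}
};

struct BinlogObserver : ReplicationObserver   /* one possible plugin */
{
  void on_commit(unsigned long long gtid)
  { std::printf("commit, global trx id %llu\n", gtid); }
};

int main()                               /* stand-in for server call sites */
{
  BinlogObserver obs;
  obs.on_statement("INSERT INTO t1 VALUES (1)");
  obs.on_write_row("t1");
  obs.on_commit(42);
  return 0;
}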
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v3.5.9)
[Maria-developers] Progress (by Knielsen): Replication API for stacked event generators (120)
by worklog-noreply@askmonty.org 21 Jun '10
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: Replication API for stacked event generators
CREATION DATE..: Mon, 07 Jun 2010, 13:13
SUPERVISOR.....: Knielsen
IMPLEMENTOR....:
COPIES TO......:
CATEGORY.......: Server-RawIdeaBin
TASK ID........: 120 (http://askmonty.org/worklog/?tid=120)
VERSION........: Server-9.x
STATUS.........: Un-Assigned
PRIORITY.......: 60
WORKED HOURS...: 2
ESTIMATE.......: 0 (hours remain)
ORIG. ESTIMATE.: 0
PROGRESS NOTES:
-=-=(Knielsen - Mon, 21 Jun 2010, 08:35)=-=-
Research and design thoughts.
Worked 2 hours and estimate 0 hours remain (original estimate increased by 2 hours).
DESCRIPTION:
A part of the replication project, MWL#107.
Events are produced by event Generators. Examples are
- Generation of statement-based replication events
- Generation of row-based events
- Generation of PBXT engine-level replication events
and perhaps reading events from the relay log on a slave is also an example of
generating events.
Event generators can be stacked, and a generator may defer event generation to
the next one down the stack. For example, the row-level replication event
generator may defer DDL to the statement-level replication event generator.
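A purely hypothetical illustration of the stacking idea; none of these classes
exist, and the "is this DDL?" test is a deliberately crude stand-in:

#include <cstdio>
#include <cstring>

struct EventGenerator
{
  EventGenerator *next;                  /* generator below us in the stack */
  explicit EventGenerator(EventGenerator *n) : next(n) {}
  virtual ~EventGenerator() {}
  virtual void generate(const char *change)= 0;
};

struct StatementGenerator : EventGenerator    /* bottom of the stack */
{
  StatementGenerator() : EventGenerator(0) {}
  void generate(const char *change)
  { std::printf("statement event: %s\n", change); }
};

struct RowGenerator : EventGenerator
{
  explicit RowGenerator(EventGenerator *n) : EventGenerator(n) {}
  void generate(const char *change)
  {
    /* crude DDL check: defer anything starting with DROP/CREATE/ALTER */
    if (!std::strncmp(change, "DROP", 4) ||
        !std::strncmp(change, "CREATE", 6) ||
        !std::strncmp(change, "ALTER", 5))
      next->generate(change);            /* defer down the stack */
    else
      std::printf("row event: %s\n", change);
  }
};

int main()
{
  StatementGenerator stmt;
  RowGenerator rows(&stmt);              /* row generator stacked on top */
  rows.generate("UPDATE t1 ...");        /* handled at row level */
  rows.generate("DROP TABLE t1");        /* deferred to statement level */
  return 0;
}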
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v3.5.9)
[Maria-developers] Progress (by Knielsen): Store in binlog text of statements that caused RBR events (47)
by worklog-noreply@askmonty.org 21 Jun '10
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: Store in binlog text of statements that caused RBR events
CREATION DATE..: Sat, 15 Aug 2009, 23:48
SUPERVISOR.....: Monty
IMPLEMENTOR....:
COPIES TO......: Knielsen, Serg
CATEGORY.......: Server-Sprint
TASK ID........: 47 (http://askmonty.org/worklog/?tid=47)
VERSION........: Server-9.x
STATUS.........: Code-Review
PRIORITY.......: 60
WORKED HOURS...: 42
ESTIMATE.......: 0 (hours remain)
ORIG. ESTIMATE.: 35
PROGRESS NOTES:
-=-=(Knielsen - Mon, 21 Jun 2010, 08:32)=-=-
Final review.
Assist with some problems applying the patch.
Worked 1 hour and estimate 0 hours remain (original estimate increased by 1 hour).
-=-=(Guest - Thu, 17 Jun 2010, 00:38)=-=-
Dependency deleted: 39 no longer depends on 47
-=-=(Knielsen - Mon, 07 Jun 2010, 07:13)=-=-
Help debug some test failures seen in Buildbot.
Worked 6 hours and estimate 0 hours remain (original estimate increased by 6 hours).
-=-=(Knielsen - Mon, 31 May 2010, 06:49)=-=-
Help Alexi debug+fix some test problems in the patch.
Worked 4 hours and estimate 0 hours remain (original estimate unchanged).
-=-=(Knielsen - Tue, 25 May 2010, 08:29)=-=-
Help debug strange problem in mysqlbinlog.test.
Worked 1 hour and estimate 4 hours remain (original estimate unchanged).
-=-=(Knielsen - Mon, 17 May 2010, 08:45)=-=-
Merge with latest trunk and run Buildbot tests.
Worked 1 hour and estimate 5 hours remain (original estimate unchanged).
-=-=(Knielsen - Wed, 05 May 2010, 13:53)=-=-
Review of fixes to first review done. No new issues found.
Worked 2 hours and estimate 6 hours remain (original estimate unchanged).
-=-=(Knielsen - Fri, 23 Apr 2010, 12:51)=-=-
Status updated.
--- /tmp/wklog.47.old.28747 2010-04-23 12:51:36.000000000 +0000
+++ /tmp/wklog.47.new.28747 2010-04-23 12:51:36.000000000 +0000
@@ -1 +1 @@
-In-Progress
+Code-Review
-=-=(Knielsen - Tue, 06 Apr 2010, 15:26)=-=-
Code review (mailed to maria-developers@).
Worked 7 hours and estimate 8 hours remain (original estimate unchanged).
-=-=(Knielsen - Tue, 06 Apr 2010, 15:25)=-=-
Status updated.
--- /tmp/wklog.47.old.12734 2010-04-06 15:25:54.000000000 +0000
+++ /tmp/wklog.47.new.12734 2010-04-06 15:25:54.000000000 +0000
@@ -1 +1 @@
-Code-Review
+In-Progress
------------------------------------------------------------
-=-=(View All Progress Notes, 35 total)=-=-
http://askmonty.org/worklog/index.pl?tid=47&nolimit=1
DESCRIPTION:
Store in binlog (and show in mysqlbinlog output) texts of statements that
caused RBR events
This is needed for (list from Monty):
- Easier to understand why updates happened
- Would make it easier to find out where in the application things went
wrong (as you can search for exact strings)
- Allow one to filter things based on comments in the statement.
The cost of this can be that the binlog will be approximately 2x in size
(especially inserts of big blobs would be a bit painful), so this should
be an optional feature.
HIGH-LEVEL SPECIFICATION:
Content
~~~~~~~
1. Annotate_rows_log_event
2. Server option: --binlog-annotate-rows-events
3. Server option: --replicate-annotate-rows-events
4. mysqlbinlog option: --print-annotate-rows-events
5. mysqlbinlog output
1. Annotate_rows_log_event [ ANNOTATE_ROWS_EVENT ]
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Describes the query that caused the corresponding rows events. It has an empty
post-header and contains the query text in its data part. Example:
************************
ANNOTATE_ROWS_EVENT
************************
00000220 | B6 A0 2C 4B | time_when = 1261215926
00000224 | 33 | event_type = 51
00000225 | 64 00 00 00 | server_id = 100
00000229 | 36 00 00 00 | event_len = 54
0000022D | 56 02 00 00 | log_pos = 00000256
00000231 | 00 00 | flags = <none>
------------------------
00000233 | 49 4E 53 45 | query = "INSERT INTO t1 VALUES (1), (2), (3)"
00000237 | 52 54 20 49 |
0000023B | 4E 54 4F 20 |
0000023F | 74 31 20 56 |
00000243 | 41 4C 55 45 |
00000247 | 53 20 28 31 |
0000024B | 29 2C 20 28 |
0000024F | 32 29 2C 20 |
00000253 | 28 33 29 |
************************
In the binary log, the Annotate_rows event follows the (possible) 'BEGIN' Query
event and precedes the first of the Table map events that accompany the
corresponding rows events. (See the example in the "mysqlbinlog output"
section below.)
2. Server option: --binlog-annotate-rows-events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Tells the master to write Annotate_rows events to the binary log.
* Variable Name: binlog_annotate_rows_events
* Scope: Global & Session
* Access Type: Dynamic
* Data Type: bool
* Default Value: OFF
NOTE. The session value allows annotating only selected statements:
...
SET SESSION binlog_annotate_rows_events=ON;
... statements to be annotated ...
SET SESSION binlog_annotate_rows_events=OFF;
... statements not to be annotated ...
3. Server option: --replicate-annotate-rows-events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Tells the slave to reproduce Annotate_rows events received from the master
in its own binary log (sensible only in combination with the log-slave-updates
option).
* Variable Name: replicate_annotate_rows_events
* Scope: Global
* Access Type: Read only
* Data Type: bool
* Default Value: OFF
NOTE. Why do we additionally need this 'replicate' option? Why not make
the slave reproduce these events whenever its binlog-annotate-rows-events
global value is ON? Because, for example, we may want to configure a slave
that reproduces Annotate_rows events but has the global
binlog-annotate-rows-events = OFF, meaning that to be the default value for
the client threads (see also "How slave treats replicate-annotate-rows-events
option" in the LLD part).
4. mysqlbinlog option: --print-annotate-rows-events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
With this option, mysqlbinlog prints the content of Annotate_rows events (if
the binary log contains them). Without this option (i.e. by default),
mysqlbinlog skips Annotate_rows events.
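For example (the binary log file name here is only an illustration):
mysqlbinlog --print-annotate-rows-events master-bin.000001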
5. mysqlbinlog output
~~~~~~~~~~~~~~~~~~~~~
With --print-annotate-rows-events, mysqlbinlog outputs Annotate_rows events
in a form like this:
...
# at 1646
#091219 12:45:26 server id 100 end_log_pos 1714 Query thread_id=1
exec_time=0 error_code=0
SET TIMESTAMP=1261215926/*!*/;
BEGIN
/*!*/;
# at 1714
# at 1812
# at 1853
# at 1894
# at 1938
#091219 12:45:26 server id 100 end_log_pos 1812 Query: `DELETE t1, t2 FROM
t1 INNER JOIN t2 INNER JOIN t3 WHERE t1.a=t2.a AND t2.a=t3.a`
#091219 12:45:26 server id 100 end_log_pos 1853 Table_map: `test`.`t1`
mapped to number 16
#091219 12:45:26 server id 100 end_log_pos 1894 Table_map: `test`.`t2`
mapped to number 17
#091219 12:45:26 server id 100 end_log_pos 1938 Delete_rows: table id 16
#091219 12:45:26 server id 100 end_log_pos 1982 Delete_rows: table id 17
flags: STMT_END_F
...
LOW-LEVEL DESIGN:
Content
~~~~~~~
1. Annotate_rows event number
2. Outline of Annotate_rows event behavior
3. How Master writes Annotate_rows events to the binary log
4. How slave treats replicate-annotate-rows-events option
5. How slave IO thread requests Annotate_rows events
6. How master executes the request
7. How slave SQL thread processes Annotate_rows events
8. General remarks
1. Annotate_rows event number
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
To avoid possible event number conflicts with MySQL/Sun, we leave a gap
between the last MySQL event number and the Annotate_rows event number:
enum Log_event_type
{ ...
INCIDENT_EVENT= 26,
// New MySQL event numbers are to be added here
MYSQL_EVENTS_END,
MARIA_EVENTS_BEGIN= 51,
// New Maria event numbers start from here
ANNOTATE_ROWS_EVENT= 51,
ENUM_END_EVENT
};
together with the corresponding extension of 'post_header_len' array in the
Format description event. (This extension does not affect the compatibility
of the binary log). Here is how the Format description event looks with
this extension:
************************
FORMAT_DESCRIPTION_EVENT
************************
00000004 | A1 A0 2C 4B | time_when = 1261215905
00000008 | 0F | event_type = 15
00000009 | 64 00 00 00 | server_id = 100
0000000D | 7F 00 00 00 | event_len = 127
00000011 | 83 00 00 00 | log_pos = 00000083
00000015 | 01 00 | flags = LOG_EVENT_BINLOG_IN_USE_F
------------------------
00000017 | 04 00 | binlog_ver = 4
00000019 | 35 2E 32 2E | server_ver = 5.2.0-MariaDB-alpha-debug-log
..... ...
0000004B | A1 A0 2C 4B | time_created = 1261215905
0000004F | 13 | common_header_len = 19
------------------------
post_header_len
------------------------
00000050 | 38 | 56 - START_EVENT_V3 [1]
..... ...
00000069 | 02 | 2 - INCIDENT_EVENT [26]
0000006A | 00 | 0 - RESERVED [27]
..... ...
00000081 | 00 | 0 - RESERVED [50]
00000082 | 00 | 0 - ANNOTATE_ROWS_EVENT [51]
************************
2. Outline of Annotate_rows event behavior
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Each Annotate_rows_log_event object has two private members describing the
corresponding query:
char *m_query_txt;
uint m_query_len;
When the object is created for writing to a binary log, this query is taken
from 'thd' (for brevity, below we omit the 'Annotate_rows_log_event::' prefix
as well as other implementation details):
Annotate_rows_log_event(THD *thd)
{
m_query_txt = thd->query();
m_query_len = thd->query_length();
}
When the object is read from a binary log, the query is taken from the buffer
containing the binary log representation of the event (this buffer is allocated
in the Log_event object from which all log events are derived):
Annotate_rows_log_event(char *buf, uint event_len,
Format_description_log_event *desc)
{
m_query_len = event_len - desc->common_header_len;
m_query_txt = buf + desc->common_header_len;
}
The events are written to the binary log by the Log_event::write() member
which calls virtual write_data_header() and write_data_body() members
("data header" and "post header" are synonym in replication terminology).
In our case, data header is empty and data body is just the query:
bool write_data_body(IO_CACHE *file)
{
return my_b_safe_write(file, (uchar*) m_query_txt, m_query_len);
}
Printing the event is just printing the query:
void Annotate_rows_log_event::print(FILE *file, PRINT_EVENT_INFO *pinfo)
{
my_b_printf(&pinfo->head_cache, "\tQuery: `%s`\n", m_query_txt);
}
3. How Master writes Annotate_rows events to the binary log
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The event is written to the binary log just before the group of Table_map
events which precede the corresponding Rows events (one query may generate
several Table map events in the binary log, but the corresponding
Annotate_rows event must be written only once, before the first Table map
event; hence the boolean variable 'with_annotate' below):
int write_locked_table_maps(THD *thd)
{ ...
bool with_annotate= thd->variables.binlog_annotate_rows_events;
...
for (uint i= 0; i < ... <number of tables> ...; ++i)
{ ...
thd->binlog_write_table_map(table, ..., with_annotate);
with_annotate= 0; // write the Annotate event at most once
...
}
...
}
int THD::binlog_write_table_map(TABLE *table, ..., bool with_annotate)
{ ...
Table_map_log_event the_event(...);
...
if (with_annotate)
{
Annotate_rows_log_event anno(this);
mysql_bin_log.write(&anno);
}
mysql_bin_log.write(&the_event);
...
}
4. How slave treats replicate-annotate-rows-events option
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The replicate-annotate-rows-events option is treated simply as the session
value of the binlog_annotate_rows_events variable for the slave IO and
SQL threads. This value is set during initialization of these threads:
pthread_handler_t handle_slave_io(void *arg)
{
THD *thd= new THD;
...
init_slave_thread(thd, SLAVE_THD_IO);
...
}
pthread_handler_t handle_slave_sql(void *arg)
{
THD *thd= new THD;
...
init_slave_thread(thd, SLAVE_THD_SQL);
...
}
int init_slave_thread(THD* thd, SLAVE_THD_TYPE thd_type)
{ ...
thd->variables.binlog_annotate_rows_events=
opt_replicate_annotate_rows_events;
...
}
5. How slave IO thread requests Annotate_rows events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
If the replicate-annotate-rows-events option is not set on a slave, there
is no need for the master to send Annotate_rows events to this slave. The
slave (or mysqlbinlog in the remote case), before requesting a binlog dump
via the COM_BINLOG_DUMP command, informs the master whether it should send
these events by executing the newly added COM_BINLOG_DUMP_OPTIONS_EXT
server command:
case COM_BINLOG_DUMP_OPTIONS_EXT:
thd->binlog_dump_flags_ext= packet[0];
my_ok(thd);
break;
Note. We add this new command, rather than changing COM_BINLOG_DUMP itself,
to avoid possible conflicts with MySQL/Sun.
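For completeness, here is a sketch of how the slave IO thread might issue
this command (assumed code, not part of the patch text: the placement
inside request_dump() and the BINLOG_SEND_ANNOTATE_ROWS_EVENT flag follow
the conventions used elsewhere in this design):
static int request_dump(MYSQL* mysql, Master_info* mi, ...)
{ ...
  uchar options_buf[1];
  options_buf[0]= opt_replicate_annotate_rows_events ?
                  BINLOG_SEND_ANNOTATE_ROWS_EVENT : 0;
  /* tell the master which optional events to send, then proceed
     with the usual COM_BINLOG_DUMP request */
  if (simple_command(mysql, COM_BINLOG_DUMP_OPTIONS_EXT,
                     options_buf, 1, 0))
    return 1;
  ...
}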
6. How master executes the request
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
case COM_BINLOG_DUMP:
{ ...
flags= uint2korr(packet + 4);
...
mysql_binlog_send(thd, ..., flags);
...
}
void mysql_binlog_send(THD* thd, ..., ushort flags)
{ ...
Log_event::read_log_event(&log, packet, ...);
...
if ((*packet)[EVENT_TYPE_OFFSET + 1] != ANNOTATE_ROWS_EVENT ||
flags & BINLOG_SEND_ANNOTATE_ROWS_EVENT)
{
my_net_write(net, packet->ptr(), packet->length());
}
...
}
7. How slave SQL thread processes Annotate_rows events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The slave processes each received event by "applying" it, i.e. by
calling the Log_event::apply_event() function, which in turn calls
the virtual do_apply_event() member specific to each type of
event.
int exec_relay_log_event(THD* thd, Relay_log_info* rli)
{ ...
Log_event *ev = next_event(rli);
...
apply_event_and_update_pos(ev, ...);
if (ev->get_type_code() != FORMAT_DESCRIPTION_EVENT)
delete ev;
...
}
int apply_event_and_update_pos(Log_event *ev, ...)
{ ...
ev->apply_event(...);
...
}
int Log_event::apply_event(...)
{
return do_apply_event(...);
}
What does it mean to "apply" an Annotate_rows event? It means to set the
current thd query to the query described by the event, i.e. to the query
which caused the subsequent Rows events (see "How Master writes
Annotate_rows events to the binary log" above to follow what happens when
the subsequent Rows events are applied):
int Annotate_rows_log_event::do_apply_event(...)
{
thd->set_query(m_query_txt, m_query_len);
}
NOTE. I am not sure, but possibly the current values of thd->query and
thd->query_length should be saved before calling set_query() and restored
when the Annotate_rows_log_event object is deleted.
Is this really needed?
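If it turns out to be needed, one possible shape is sketched below (a
sketch only; the m_saved_query_txt/m_saved_query_len members are assumed,
not decided):
int Annotate_rows_log_event::do_apply_event(...)
{
  /* assumed: remember the previous query so that it can be
     restored when this event object is freed */
  m_saved_query_txt= thd->query();
  m_saved_query_len= thd->query_length();
  thd->set_query(m_query_txt, m_query_len);
  return 0;
}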
After calling this do_apply_event() function, we must not delete the
Annotate_rows_log_event object immediately (see exec_relay_log_event()
above) because thd->query now points to the string inside this object.
We can keep the pointer to this object in the Relay_log_info:
class Relay_log_info
{
public:
...
void set_annotate_event(Annotate_rows_log_event*);
Annotate_rows_log_event* get_annotate_event();
void free_annotate_event();
...
private:
Annotate_rows_log_event* m_annotate_event;
};
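These members might be implemented as follows (a sketch under the
assumption that Relay_log_info keeps the SQL thread's THD in sql_thd, as
the existing code does; detaching the query here also addresses the NOTE
above):
void Relay_log_info::set_annotate_event(Annotate_rows_log_event* ev)
{
  free_annotate_event(); /* never leak a previously saved event */
  m_annotate_event= ev;
}
Annotate_rows_log_event* Relay_log_info::get_annotate_event()
{
  return m_annotate_event;
}
void Relay_log_info::free_annotate_event()
{
  if (m_annotate_event)
  {
    /* thd->query still points into the event: detach it first */
    sql_thd->set_query(NULL, 0);
    delete m_annotate_event;
    m_annotate_event= NULL;
  }
}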
The saved Annotate_rows object should be deleted once all the corresponding
Rows events have been processed:
int exec_relay_log_event(THD* thd, Relay_log_info* rli)
{ ...
Log_event *ev= next_event(rli);
...
apply_event_and_update_pos(ev, ...);
if (rli->get_annotate_event() && is_last_rows_event(ev))
rli->free_annotate_event();
else if (ev->get_type_code() == ANNOTATE_ROWS_EVENT)
rli->set_annotate_event((Annotate_rows_log_event*) ev);
else if (ev->get_type_code() != FORMAT_DESCRIPTION_EVENT)
delete ev;
...
}
where
bool is_last_rows_event(Log_event* ev)
{
Log_event_type type= ev->get_type_code();
if (IS_ROWS_EVENT_TYPE(type))
{
Rows_log_event* rows= (Rows_log_event*)ev;
return rows->get_flags(Rows_log_event::STMT_END_F);
}
return 0;
}
#define IS_ROWS_EVENT_TYPE(type) ((type) == WRITE_ROWS_EVENT || \
(type) == UPDATE_ROWS_EVENT || \
(type) == DELETE_ROWS_EVENT)
8. General remarks
~~~~~~~~~~~~~~~~~~
Kristian noticed that introducing a new log event type should be
coordinated somehow with MySQL/Sun:
Kristian: The numeric code for this event must be assigned carefully.
It should be coordinated with MySQL/Sun, otherwise we can get into a
situation where MySQL uses the same numeric code for one event that
MariaDB uses for ANNOTATE_ROWS_EVENT, which would make merging the two
impossible.
Alex: I reserved about 20 numbers so as not to have possible conflicts
with MySQL.
Kristian: Still, I think it would be appropriate to send a polite email
to internals@lists.mysql.com about this and suggesting to reserve the
event number.
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v3.5.9)
19 Jun '10
Sergei (and everyone else),
The Release Notes and Changelog pages for the MariaDB 5.2.1-beta release
are ready:
http://askmonty.org/wiki/Manual:MariaDB_5.2.1_Release_Notes
http://askmonty.org/wiki/Manual:MariaDB_5.2.1_Changelog
Please let me know if the Release Notes should mention anything else
or if there is anything on that page which should be changed. The
Changelog should have the full list of commits from the 5.2.0-beta up
through the commit with the 5.2.1-beta tag.
The download page for this release is also ready to go, but I haven't
activated it yet. I will activate it (i.e. link to it from the download
page, and other wiki pages) once the mirrors have been seeded (later
tonight or tomorrow).
Thanks.
--
Daniel Bartholomew
Monty Program - http://askmonty.org
[Maria-developers] Updated (by Guest): Add a mysqlbinlog option to change the used database (36)
by worklog-noreply@askmonty.org 18 Jun '10
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: Add a mysqlbinlog option to change the used database
CREATION DATE..: Fri, 07 Aug 2009, 14:57
SUPERVISOR.....: Monty
IMPLEMENTOR....:
COPIES TO......:
CATEGORY.......: Server-Sprint
TASK ID........: 36 (http://askmonty.org/worklog/?tid=36)
VERSION........: Server-5.3
STATUS.........: Complete
PRIORITY.......: 60
WORKED HOURS...: 49
ESTIMATE.......: 0 (hours remain)
ORIG. ESTIMATE.: 0
PROGRESS NOTES:
-=-=(Guest - Fri, 18 Jun 2010, 15:20)=-=-
Version updated.
--- /tmp/wklog.36.old.11335 2010-06-18 15:20:26.000000000 +0000
+++ /tmp/wklog.36.new.11335 2010-06-18 15:20:26.000000000 +0000
@@ -1 +1 @@
-Server-9.x
+Server-5.3
-=-=(Guest - Thu, 17 Jun 2010, 00:39)=-=-
Dependency deleted: 39 no longer depends on 36
-=-=(Guest - Sat, 07 Nov 2009, 22:43)=-=-
Category updated.
--- /tmp/wklog.36.old.9112 2009-11-07 22:43:50.000000000 +0200
+++ /tmp/wklog.36.new.9112 2009-11-07 22:43:50.000000000 +0200
@@ -1 +1 @@
-Server-RawIdeaBin
+Server-Sprint
-=-=(Guest - Sat, 07 Nov 2009, 22:43)=-=-
Status updated.
--- /tmp/wklog.36.old.9112 2009-11-07 22:43:50.000000000 +0200
+++ /tmp/wklog.36.new.9112 2009-11-07 22:43:50.000000000 +0200
@@ -1 +1 @@
-Un-Assigned
+Complete
-=-=(Bothorsen - Tue, 03 Nov 2009, 13:49)=-=-
More cleanup work done by Alexi, Bo and Sergey.
Worked 4 hours and estimate 0 hours remain (original estimate increased by 4 hours).
-=-=(Bothorsen - Tue, 03 Nov 2009, 13:49)=-=-
Sergey and Bo has been working on getting the patch ready, and Alexi has fixed some issues with the
patch.
Worked 15 hours and estimate 0 hours remain (original estimate increased by 15 hours).
-=-=(Bothorsen - Tue, 03 Nov 2009, 13:47)=-=-
Alexi has implemented a patch for this item.
Worked 30 hours and estimate 0 hours remain (original estimate increased by 30 hours).
-=-=(Guest - Tue, 15 Sep 2009, 18:04)=-=-
Low Level Design modified.
--- /tmp/wklog.36.old.19322 2009-09-15 18:04:49.000000000 +0300
+++ /tmp/wklog.36.new.19322 2009-09-15 18:04:49.000000000 +0300
@@ -191,7 +191,7 @@
- In process_event() function add switch case for Load_log_event and
add print_use_stmt() invocations where needed (according to the
- events lis above), e.g.:
+ events list above), e.g.:
Exit_status process_event(
PRINT_EVENT_INFO *print_event_info,
-=-=(Guest - Tue, 15 Sep 2009, 15:53)=-=-
Low Level Design modified.
--- /tmp/wklog.36.old.13421 2009-09-15 15:53:31.000000000 +0300
+++ /tmp/wklog.36.new.13421 2009-09-15 15:53:31.000000000 +0300
@@ -150,10 +150,17 @@
following events (see process_event() function):
- Query_log_event
-- Execute_load_query_log_event
-- Create_file_log_event
-
-TODO. Needed to check this list requires carefully !!!
+- Load_log_event
+- Execute_load_query_log_event [ :public Query_log_event ]
+- Create_file_log_event [ :public Load_log_event ]
+
+TODO. Needed to check this list carefully (not sure for Create_file_log_event)
+ Notes.
+ - In replication, only Query_log_event and Load_log_event uses
+ rpl_filter->get_rewrite_db();
+ - In mysqlbinlog (process_event), Execute_load_query_log_event
+ and Create_file_log_event are processed in separate switch
+ cases. And Load_log_event is processed in the default switch case.
Conditions for emiting use-statement:
- LOG_EVENT_SUPPRESS_USE_F is OFF for the event
@@ -182,8 +189,9 @@
*/
}
-- In process_event() function add print_use_stmt() invocations where
- needed (according to the events lis above), e.g.:
+- In process_event() function add switch case for Load_log_event and
+ add print_use_stmt() invocations where needed (according to the
+ events lis above), e.g.:
Exit_status process_event(
PRINT_EVENT_INFO *print_event_info,
@@ -207,6 +215,11 @@
}
break;
...
+ case LOAD_EVENT:
+ print_use_stmt((Load_log_event*)ev, print_event_info);
+ break;
+ default:
+ ...
}
...
}
-=-=(Guest - Tue, 15 Sep 2009, 12:12)=-=-
Low Level Design modified.
--- /tmp/wklog.36.old.3961 2009-09-15 12:12:26.000000000 +0300
+++ /tmp/wklog.36.new.3961 2009-09-15 12:12:26.000000000 +0300
@@ -144,6 +144,8 @@
3. Supporting rewrite-db for SBR events
---------------------------------------
+Limited to emiting USE <db_to> instead of USE <db_from>.
+
USE statements can be emited by mysqlbinlog as a result of processing the
following events (see process_event() function):
------------------------------------------------------------
-=-=(View All Progress Notes, 20 total)=-=-
http://askmonty.org/worklog/index.pl?tid=36&nolimit=1
DESCRIPTION:
Sometimes there is a need to take a binary log and apply it to a database with
a different name than the original name of the database on the binlog producer.
If one is using statement-based replication, he can achieve this by grepping
the "USE dbname" statements out of the output of mysqlbinlog(*). With
row-based replication this is no longer possible, as the database name is
encoded within the BINLOG '....' statement.
This task is about adding an option to mysqlbinlog that would allow changing
the names of the used databases in both RBR and SBR events.
(*) this implies that all statements refer to tables in the current database,
doesn't catch updates made inside stored functions and so forth, but still
works for a practically important subset of cases.
HIGH-LEVEL SPECIFICATION:
Context
-------
(See http://askmonty.org/wiki/index.php/Scratch/ReplicationOptions for global
overview)
At the moment, the server has a replication slave option
  --replicate-rewrite-db="from->to"
The option affects:
- Table_map_log_event (all RBR events)
- Load_log_event (LOAD DATA)
- Query_log_event (SBR-based updates, with the usual assumption that the
statement refers to tables in the current database, so that changing the
current database will make the statement work on a table in a different
database).
See also MySQL BUG#42941. Note this bug is fixed in MySQL 5.1.37, which is not
merged into MariaDB at the time of writing, but planned to be merged before
release.
What we could do
----------------
Option1: make mysqlbinlog accept --replicate-rewrite-db option
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Make mysqlbinlog accept --replicate-rewrite-db options and process them to
the same extent as a replication slave would process the
--replicate-rewrite-db option.
Option2: Add database-agnostic RBR events and --strip-db option
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Right now RBR events require a database name. It is not possible to have an
RBR event stream that won't mention which database the events are for. When I
tried to use a debugger and specify an empty database name, the attempt to
apply the binlog resulted in this error:
090809 17:38:44 [ERROR] Slave SQL: Error 'Table '.tablename' doesn't exist' on
opening tables,
We could do as follows:
- Make the server interpret an empty database name in an RBR event (i.e. in a
  Table_map_log_event) as "use the current database". The binlog slave thread
  probably should not allow such events, as it doesn't have a natural current
  database.
- Add a mysqlbinlog --strip-db option that would
  = not produce any "USE dbname" statements
  = change the database name for all RBR events to be empty
That way, the mysqlbinlog output will be database-agnostic and apply to the
current database.
(This has the usual limitation that we assume all statements in the binlog
refer to the current database.)
Option3: Enhance database rewrite
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
If there is a need to support database change for statements that use
dbname.tablename notation and are replicated as statements (i.e. are DDL
statements and/or DML statements that are binlogged as statements),
then that could be supported as follows:
- Make the server's parser recognize a special form of comment
    /* !database-alias(oldname,newname) */
  and save the mapping somewhere
- Put hooks in the table open and name resolution code to use the saved
  mapping.
Once we've done the above, it will be easy to perform a complete database
name change in the binary log, with no compromises or restrictions.
It will be possible to do the rewrites either on the slave
(--replicate-rewrite-db will work for all kinds of statements), or in
mysqlbinlog (adding a comment is easy and doesn't require mysqlbinlog to
parse the statement).
LOW-LEVEL DESIGN:
Content
-------
1. Adding rewrite-db option
2. Supporting rewrite-db option for RBR events
3. Supporting rewrite-db option for SBR events
(Limited to affecting only USE statements)
4. Current status
1. Adding rewrite-db option
---------------------------
1.1. Syntax:
--rewrite-db='db_from->db_to'
1.2. Add 'OPT_REWRITE_DB' to 'options_client' (in client_priv.h).
1.3. In mysqlbinlog.cc:
- Add { "rewrite-db", OPT_REWRITE_DB, ...} record to my_long_options:
- Add Rpl_filter object to mysqlbinlog.cc
Rpl_filter* binlog_filter;
- Add corresponding switch case to get_one_option():
case OPT_REWRITE_DB:
<extract db-from and db-to strings>
binlog_filter->add_db_rewrite(db_from, db_to);
break;
.
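The extraction itself could be done roughly as follows (a sketch only:
error handling is simplified and the local variable names are
illustrative):

    case OPT_REWRITE_DB:
    {
      char* ptr= strstr(argument, "->");          // find the separator
      if (!ptr || ptr == argument || !ptr[2])
      {
        fprintf(stderr, "Bad --rewrite-db value, expected 'db_from->db_to'\n");
        return 1;
      }
      char* db_from= my_strndup(argument, ptr - argument, MYF(MY_WME));
      char* db_to=   my_strdup(ptr + 2, MYF(MY_WME));
      binlog_filter->add_db_rewrite(db_from, db_to);
      break;
    }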
Note. To make Rpl_filter usable in a MYSQL_CLIENT context, a few small
additional changes are required:
- In sql_list.cc/h, Sql_alloc::operator new(size_t) and
  Sql_alloc::operator new[](size_t) use sql_alloc(), which is
  THD-dependent. These are to be modified as follows:
  #ifdef MYSQL_CLIENT
  extern MEM_ROOT sql_list_client_mem_root; // defined in sql_list.cc
  #endif

  class Sql_alloc
  { ...
    static void *operator new(size_t size) throw ()
    {
  #ifndef MYSQL_CLIENT
      return sql_alloc(size);
  #else
      return alloc_root(&sql_list_client_mem_root, size);
  #endif
    }
    static void *operator new[](size_t size) throw ()
    {
  #ifndef MYSQL_CLIENT
      return sql_alloc(size);
  #else
      return alloc_root(&sql_list_client_mem_root, size);
  #endif
    }
    ...
  };
- In rpl_filter.cc:

    Rpl_filter::Rpl_filter() :
      ...
    {
    #ifdef MYSQL_CLIENT
      init_alloc_root(&sql_list_client_mem_root, ...);
    #endif
      ...
    }

    Rpl_filter::~Rpl_filter()
    { ...
    #ifdef MYSQL_CLIENT
      free_root(&sql_list_client_mem_root, ...);
    #endif
    }
2. Supporting rewrite-db for RBR events
---------------------------------------
In the binlog, each row operation event is preceded by Table map event(s)
which map table id(s) to database and table names. So, it's enough to
support rewriting the database name in a Table map.
2.1. Add a rewrite_db() member to Table_map_log_event:

  int Table_map_log_event::rewrite_db(
        const char* new_db,
        size_t new_db_len,
        const Format_description_log_event* desc)
  {
    /* 1. In the temp_buf member (possibly reallocating it) rewrite
          the event length, db length, and db parts
       2. Change the m_dblen and m_dbnam members
    */
  }

Comment. This function assumes that the temp_buf member contains the Table
map binlog representation (temp_buf is used for creating the corresponding
BINLOG statement).
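For illustration, the buffer surgery could look roughly like the sketch
below. It assumes the usual v4 event layout (a common header of
desc->common_header_len bytes, a Table map post-header of
desc->post_header_len[TABLE_MAP_EVENT-1] bytes, then a one-byte db length
followed by the zero-terminated db name) and that new_db is zero-terminated;
this is a sketch, not the actual patch:

  int Table_map_log_event::rewrite_db(
        const char* new_db, size_t new_db_len,
        const Format_description_log_event* desc)
  {
    /* offset of the one-byte db length within temp_buf */
    size_t dblen_offset= desc->common_header_len +
                         desc->post_header_len[TABLE_MAP_EVENT - 1];
    size_t old_db_len= m_dblen;
    ulong  old_event_len= uint4korr(temp_buf + EVENT_LEN_OFFSET);
    ulong  new_event_len= old_event_len - old_db_len + new_db_len;

    char* new_buf= (char*) my_malloc(new_event_len, MYF(MY_WME));
    if (!new_buf)
      return 1;

    /* copy everything up to and including the db length byte,
       then patch the db length and the 4-byte event length */
    memcpy(new_buf, temp_buf, dblen_offset + 1);
    new_buf[dblen_offset]= (char) new_db_len;
    int4store(new_buf + EVENT_LEN_OFFSET, new_event_len);

    /* write the new db name, including its terminating zero */
    memcpy(new_buf + dblen_offset + 1, new_db, new_db_len + 1);

    /* copy the rest of the event unchanged */
    memcpy(new_buf + dblen_offset + 1 + new_db_len + 1,
           temp_buf + dblen_offset + 1 + old_db_len + 1,
           old_event_len - (dblen_offset + 1 + old_db_len + 1));

    my_free(temp_buf, MYF(0));
    temp_buf= new_buf;

    m_dblen= new_db_len;
    /* m_dbnam must be updated here as well, in whatever way the class
       manages that member */
    return 0;
  }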
2.2. In mysqlbinlog, modify the corresponding switch case in the
process_event() function:

  Exit_status process_event(
                PRINT_EVENT_INFO *print_event_info,
                Log_event *ev, ...)
  {
    ...
    switch (ev_type) {
    ...
    case TABLE_MAP_EVENT:
    {
      Table_map_log_event *map= ((Table_map_log_event *)ev);
      if (shall_skip_database(map->get_db_name()))
      { ...
      }
      // WL36
      size_t new_len= 0;
      const char* new_db= binlog_filter->get_rewrite_db(
                            map->get_db_name(), &new_len);
      if (new_len && map->rewrite_db(new_db, new_len,
                                     glob_description_event))
      {
        error("Could not rewrite database name");
        goto err;
      }
    }
    /* fall through to the Rows events */
    case WRITE_ROWS_EVENT:
    case DELETE_ROWS_EVENT:
    case UPDATE_ROWS_EVENT:
      ...
    }
    ...
  }
Comment. Rpl_filter::get_rewrite_db(db_from, &len): if the filter contains a
(db_from, db_to) pair, this function returns a pointer to db_to and sets
len = the length of db_to; otherwise, it returns db_from and does not change
the len value.
3. Supporting rewrite-db for SBR events
---------------------------------------
Limited to emitting USE <db_to> instead of USE <db_from>.
USE statements can be emitted by mysqlbinlog as a result of processing the
following events (see the process_event() function):
- Query_log_event
- Load_log_event
- Execute_load_query_log_event [ :public Query_log_event ]
- Create_file_log_event [ :public Load_log_event ]
TODO. This list needs to be checked carefully (not sure about
Create_file_log_event).
Notes.
- In replication, only Query_log_event and Load_log_event use
  rpl_filter->get_rewrite_db();
- In mysqlbinlog (process_event), Execute_load_query_log_event
  and Create_file_log_event are processed in separate switch
  cases, and Load_log_event is processed in the default switch case.
Conditions for emitting a use-statement:
- LOG_EVENT_SUPPRESS_USE_F is OFF for the event
  (e.g. it is ON for a 'create database' statement)
- the event's db name differs from the db_name in PRINT_EVENT_INFO
  (PRINT_EVENT_INFO keeps the db name of the last issued USE statement;
  initially, this db name is empty).
3.1. In mysqlbinlog.cc
- Add the following function:
  void print_use_stmt(Log_event* event, PRINT_EVENT_INFO* pinfo)
  {
    if (event->flags & LOG_EVENT_SUPPRESS_USE_F)
      return;
    /*
      - For the events listed above, get db_from = event->db;
      - If db_from is the same as pinfo->db, then return;
      - If there is a rewrite-db rule db_from->db_to,
        set db = db_to; else set db = db_from;
      - Print "use <db>" to the mysqlbinlog output;
      - Set pinfo->db = db_from
        (this suppresses the emitting of use-statements by the
        corresponding log event's print-function)
    */
  }
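Filled in, the body could look like the sketch below (get_db() stands for
whatever accessor the listed event classes provide for their db member and
is illustrative only):

  void print_use_stmt(Log_event* event, PRINT_EVENT_INFO* pinfo)
  {
    if (event->flags & LOG_EVENT_SUPPRESS_USE_F)
      return;

    const char* db_from= event->get_db(); // illustrative accessor
    if (!db_from || !strcmp(db_from, pinfo->db))
      return;                             // same db as the last USE

    size_t len= 0;
    const char* db= binlog_filter->get_rewrite_db(db_from, &len);
    fprintf(result_file, "use %s%s\n", db, pinfo->delimiter);

    /* remember db_from (not db!) so that the event's own print()
       does not emit a second USE statement */
    strmov(pinfo->db, db_from);
  }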
- In the process_event() function, add a switch case for Load_log_event and
  add print_use_stmt() invocations where needed (according to the
  events list above), e.g.:

  Exit_status process_event(
                PRINT_EVENT_INFO *print_event_info,
                Log_event *ev, ...)
  {
    ...
    switch (ev_type) {
    case QUERY_EVENT:
      if (shall_skip_database(((Query_log_event*)ev)->db))
        goto end;
      if (opt_base64_output_mode == BASE64_OUTPUT_ALWAYS)
      {
        // Possibly, in case of a rewrite-db rule for ev->db,
        // a warning should be emitted here (see the note below)
        ... write_event_header_and_base64(ev, ...) ...
      }
      else
      {
        print_use_stmt((Query_log_event*)ev, print_event_info);
        ev->print(result_file, print_event_info);
      }
      break;
    ...
    case LOAD_EVENT:
      print_use_stmt((Load_log_event*)ev, print_event_info);
      break;
    default:
      ...
    }
    ...
  }
Note. write_event_header_and_base64() does not print a use-statement. It
produces a BINLOG statement using the ev->temp_buf content (i.e. the binary
log representation of the event). We don't rewrite temp_buf here with the
db_to name (as we do for the Table map event); this implies the limitation
mentioned in section 3 above.
Question: Is support for rewrite-db + --base64-output really needed
currently?
4. Current status
-----------------
The outlined design (implemented for mysql-5.1.37) has been tested on
simple test cases.
TODO. 1. Check the list of events which can emit a use-statement.
      2. Support for rewrite-db + --base64-output?
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v3.5.9)
Hi everyone,
I'm currently working on a Windows installer for MariaDB, and I have two
options for you to consider. This mail covers the first of them.
The first and currently biggest contender is CPack + NSIS. This
combination has two very big things going for it: it's the same one that
MySQL uses, and it integrates really well with the CMake system. In
fact, all you have to do with this solution is to install NSIS on your
system and run "cpack.exe" in a directory where you already built the
solution.
NSIS creates a single binary exe file that installs in C:\Program
Files\MariaDB-5.1.47 (for example).
NSIS is very limited in what you can actually do with the system. For
example, there is no support for asking the user whether he wants to
delete the database files; they just vanish. This is potentially
*extremely* bad. However, I have a theory on how to work around this
particular problem by hacking the nsis.cmake file.
NSIS does not support upgrading of packages. Instead, it does "upgrades"
by allowing packages with different versions to install next to each
other. So if you installed the 5.1.47 version and want to upgrade to
5.1.49, you simply install 5.1.49 and copy your database files over (or,
even better, use database files in a different directory). When you are
ready, you can remove the 5.1.47 package.
This clearly has some advantages, but it's just not the way most
software updates run. When you update to a newer version of most Windows
software, and certainly on all systems using apt or RPM, you just
replace the old version with the new one.
There is no support for setting up the database in the installer, or
setting up MariaDB as a service. CMake+NSIS is just a dumb file copy
system. MySQL works around this by running another executable at the end
of the install process and this program does the setup. IMHO, that's a
very good solution, and it also allows the user to run the setup program
again later. But it's still a workaround due to the limited installer
system.
NSIS would be my choice for an installer right now. But because of the
limitations, I'd consider this a temporary solution until we have a
better one. See my next mail for a better but much more complex system.
Comments, please.
Bo Thorsen.
Monty Program AB.
--
MariaDB: MySQL replacement
Community developed. Feature enhanced. Backward compatible.
Hi again,
The other contender for the installer system of MariaDB in 2010 is CPack +
WiX. This is a much more powerful solution, but also one that will take
a lot longer to implement.
CPack doesn't actually support WiX yet, but there is a patch out there
to implement the support. This patch is so simple, I don't understand
why they didn't just add it already. All it does is copy the built files
into a directory structure, and call the WiX binaries. It doesn't output
a specification file for the installer, like the CPack NSIS integration
does.
Instead, the implementor has to supply a .xsl file which the WiX
binaries take as input for creating a .xml file, which another WiX
binary uses to build the package.
The actual package is a single .msi file which runs like any other
graphical Windows installer.
Without CPack, the implementor writes the .xml file by hand. The CPack
integration makes it simpler to identify the files that will be
installed. If the implementor writes the .xml file manually, we always
have to keep the CMake-built files and the WiX spec in sync. So even
though the CPack integration is really small, it does make sense.
WiX is capable of very powerful installers that would work exactly like
I'd hope to achieve. This means seamless upgrading, user account
creation (for setting MariaDB up as a service), service installation
etc. These are all things that NSIS just can't do directly, where we'd
be trying to bend the system to support what we want.
The downside of using WiX is that it's going to take a lot longer to
implement a good installer than it is to implement a simple installer
with NSIS. I already have a patch for a complete installer with NSIS,
albeit one that doesn't ask about deleting database files and can't set
MariaDB up as a service. Making it to this point with WiX is not
that easy.
I'm convinced that once the WiX installer is done, it's going to be easy
to maintain it. Probably as easy as maintaining the NSIS system. And
implementing features in the installer will be a lot simpler with WiX,
because the system is designed to be powerful.
I would like to hear some discussion about this. Should I start spending
the longer time on this, or go with the simple NSIS solution for now?
Bo Thorsen.
Monty Program AB.
--
MariaDB: MySQL replacement
Community developed. Feature enhanced. Backward compatible.
[Maria-developers] [Branch ~maria-captains/maria/5.1-converting] Rev 2868: Fixed compiler warnings
by noreply@launchpad.net 17 Jun '10
------------------------------------------------------------
revno: 2868
committer: Michael Widenius <monty(a)askmonty.org>
branch nick: maria-5.1
timestamp: Wed 2010-06-16 01:00:51 +0300
message:
Fixed compiler warnings
modified:
sql/log_event.cc
storage/maria/ma_state.c
storage/maria/maria_chk.c
storage/myisam/mi_dynrec.c
support-files/compiler_warnings.supp
--
lp:~maria-captains/maria/5.1-converting
https://code.launchpad.net/~maria-captains/maria/5.1-converting
[Maria-developers] [Branch ~maria-captains/maria/5.1-converting] Rev 2867: Don't flush pinned pages in checkpoint (fix for my last push)
by noreply@launchpad.net 17 Jun '10
------------------------------------------------------------
revno: 2867
committer: Michael Widenius <monty(a)askmonty.org>
branch nick: maria-5.1
timestamp: Wed 2010-06-16 00:39:28 +0300
message:
Don't flush pinned pages in checkpoint (fix for my last push)
modified:
storage/maria/ma_pagecache.c
storage/maria/unittest/ma_pagecache_single.c
--
lp:~maria-captains/maria/5.1-converting
https://code.launchpad.net/~maria-captains/maria/5.1-converting
[Maria-developers] [Branch ~maria-captains/maria/5.1-converting] Rev 2866: merged
by noreply@launchpad.net 17 Jun '10
Merge authors:
Bo Thorsen (bo.thorsen)
Michael Widenius (monty)
Sergei (sergii)
------------------------------------------------------------
revno: 2866 [merge]
committer: Sergei Golubchik <sergii(a)pisem.net>
branch nick: 5.1
timestamp: Mon 2010-06-14 19:05:32 +0200
message:
merged
modified:
CMakeLists.txt
client/mysqldump.c
client/mysqltest.cc
mysql-test/r/mysqldump.result
mysql-test/r/openssl_1.result
mysql-test/suite/maria/r/maria-recover.result
mysql-test/suite/maria/r/maria3.result
mysql-test/suite/maria/t/maria3.test
storage/maria/ha_maria.cc
storage/maria/ha_maria.h
storage/maria/ma_blockrec.h
storage/maria/ma_init.c
storage/maria/ma_open.c
storage/maria/ma_pagecache.c
storage/maria/ma_recovery.c
storage/maria/ma_state.c
storage/maria/ma_static.c
storage/maria/maria_def.h
--
lp:~maria-captains/maria/5.1-converting
https://code.launchpad.net/~maria-captains/maria/5.1-converting
[Maria-developers] [Branch ~maria-captains/maria/5.1-converting] Rev 2865: mtr: when applying @opt_extra_mysqld_opt for --help,
by noreply@launchpad.net 17 Jun '10
------------------------------------------------------------
revno: 2865
committer: Sergei Golubchik <sergii(a)pisem.net>
branch nick: 5.1
timestamp: Mon 2010-06-14 18:57:30 +0200
message:
mtr: when applying @opt_extra_mysqld_opt for --help,
filter out --binlog-format - it makes mysqld to fail without --log-bin,
and we don't need either anyway for --help to work.
modified:
mysql-test/mysql-test-run.pl
--
lp:~maria-captains/maria/5.1-converting
https://code.launchpad.net/~maria-captains/maria/5.1-converting
[Maria-developers] [Branch ~maria-captains/maria/5.1-converting] Rev 2864: ugly-ugly. $with_plugin_innobase was hard-coded in configure.in in
by noreply@launchpad.net 17 Jun '10
------------------------------------------------------------
revno: 2864
committer: Sergei Golubchik <sergii(a)pisem.net>
branch nick: 5.1
timestamp: Thu 2010-06-10 19:35:18 +0200
message:
ugly-ugly. $with_plugin_innobase was hard-coded in configure.in in
modified:
storage/xtradb/plug.in
--
lp:~maria-captains/maria/5.1-converting
https://code.launchpad.net/~maria-captains/maria/5.1-converting
[Maria-developers] [Branch ~maria-captains/maria/5.1-converting] Rev 2863: fixed for mysql-test-run to
by noreply@launchpad.net 17 Jun '10
------------------------------------------------------------
revno: 2863
committer: Sergei Golubchik <sergii(a)pisem.net>
branch nick: 5.1
timestamp: Thu 2010-06-10 11:11:52 +0200
message:
fixed for mysql-test-run to
* fully support --mysqld=--plugin-load=xxxx
* uniformly support all loadable plugins, no need to hard-code
every new plugin in mtr
* autodetect MTR_VS_CONFIG on windows
removed:
mysql-test/suite/pbxt/t/udf-master.opt
mysql-test/suite/rpl/t/rpl_plugin_load-master.opt
mysql-test/suite/rpl/t/rpl_plugin_load-slave.opt
mysql-test/suite/rpl/t/rpl_udf-master.opt
mysql-test/suite/rpl/t/rpl_udf-slave.opt
mysql-test/t/fulltext_plugin-master.opt
mysql-test/t/plugin-master.opt
mysql-test/t/plugin_not_embedded-master.opt
mysql-test/t/udf-master.opt
mysql-test/t/udf_query_cache-master.opt
modified:
mysql-test/include/have_example_plugin.inc
mysql-test/include/have_simple_parser.inc
mysql-test/include/have_udf.inc
mysql-test/include/rpl_udf.inc
mysql-test/lib/My/File/Path.pm
mysql-test/lib/mtr_cases.pm
mysql-test/mysql-test-run.pl
mysql-test/r/information_schema.result
mysql-test/r/innodb_ignore_builtin.result
mysql-test/suite/pbxt/t/udf.test
mysql-test/t/bug46261-master.opt
mysql-test/t/bug46261.test
mysql-test/t/information_schema.test
mysql-test/t/innodb_ignore_builtin.test
mysql-test/t/mysqld_option_err.test
mysql-test/t/plugin.test
mysql-test/t/plugin_load-master.opt
mysql-test/t/plugin_not_embedded.test
mysql-test/t/udf.test
mysql-test/t/udf_query_cache.test
--
lp:~maria-captains/maria/5.1-converting
https://code.launchpad.net/~maria-captains/maria/5.1-converting
[Maria-developers] [Branch ~maria-captains/maria/5.1-converting] Rev 2862: allow federated and innodb_plugin to be built
by noreply@launchpad.net 17 Jun '10
------------------------------------------------------------
revno: 2862
committer: Sergei Golubchik <sergii(a)pisem.net>
branch nick: 5.1
timestamp: Wed 2010-06-09 23:29:18 +0200
message:
allow federated and innodb_plugin to be built
renamed:
storage/federated/plug.in.disabled => storage/federated/plug.in
storage/innodb_plugin/plug.in.disabled => storage/innodb_plugin/plug.in
modified:
storage/federated/Makefile.am
storage/federatedx/Makefile.am
storage/federatedx/ha_federatedx.cc
storage/federatedx/plug.in
storage/xtradb/CMakeLists.txt
storage/xtradb/Makefile.am
storage/xtradb/handler/ha_innodb.cc
storage/xtradb/plug.in
storage/federated/plug.in
storage/innodb_plugin/plug.in
--
lp:~maria-captains/maria/5.1-converting
https://code.launchpad.net/~maria-captains/maria/5.1-converting
[Maria-developers] [Branch ~maria-captains/maria/5.1-converting] Rev 2861: fix questionable UNIV_EXPECT's in the xtradb that confused old gcc.
by noreply@launchpad.net 17 Jun '10
------------------------------------------------------------
revno: 2861
committer: Sergei Golubchik <sergii(a)pisem.net>
branch nick: 5.1
timestamp: Wed 2010-06-09 13:53:51 +0200
message:
fix questionable UNIV_EXPECT's in the xtradb that confused old gcc.
modified:
storage/xtradb/include/rem0rec.ic
--
lp:~maria-captains/maria/5.1-converting
https://code.launchpad.net/~maria-captains/maria/5.1-converting
[Maria-developers] [Branch ~maria-captains/maria/5.1-converting] Rev 2860: Automerge MariaDB 5.1.47 release into main.
by noreply@launchpad.net 17 Jun '10
Merge authors:
<Dao-Gang.Qu(a)sun.com>
<Li-Bing.Song(a)sun.com>
Aleksandr Kuzminsky (akuzminsky)
Alexander Barkov <bar(a)mysql.com>
Alexander Nozdrin <alik(a)sun.com>...
Related merge proposals:
https://code.launchpad.net/~paul-mccullagh/maria/add-xtstat-util/+merge/250…
proposed by: Paul McCullagh (paul-mccullagh)
https://code.launchpad.net/~paul-mccullagh/maria/pbxt-1.0.11/+merge/24882
proposed by: Paul McCullagh (paul-mccullagh)
------------------------------------------------------------
revno: 2860 [merge]
committer: knielsen(a)knielsen-hq.org
branch nick: mariadb-5.1
timestamp: Mon 2010-05-31 10:43:34 +0200
message:
Automerge MariaDB 5.1.47 release into main.
removed:
mysql-test/include/ctype_innodb_like.inc
mysql-test/include/have_innodb.inc
mysql-test/include/innodb_trx_weight.inc
mysql-test/r/innodb-autoinc-44030.result
mysql-test/r/innodb-autoinc.result
mysql-test/r/innodb_bug21704.result
mysql-test/r/innodb_bug38231.result
mysql-test/r/innodb_bug40565.result
mysql-test/r/innodb_bug42101-nonzero.result
mysql-test/r/innodb_bug42101.result
mysql-test/r/innodb_bug44032.result
mysql-test/r/innodb_bug44369.result
mysql-test/r/innodb_bug45357.result
mysql-test/r/innodb_bug46000.result
mysql-test/r/innodb_bug47777.result
mysql-test/suite/innodb/include/have_innodb_plugin.inc
mysql-test/suite/innodb/include/innodb-index.inc
mysql-test/suite/innodb/r/innodb-analyze.result
mysql-test/suite/innodb/r/innodb-consistent.result
mysql-test/suite/innodb/r/innodb-index.result
mysql-test/suite/innodb/r/innodb-index_ucs2.result
mysql-test/suite/innodb/r/innodb-timeout.result
mysql-test/suite/innodb/r/innodb-use-sys-malloc.result
mysql-test/suite/innodb/r/innodb-zip.result
mysql-test/suite/innodb/r/innodb_bug36169.result
mysql-test/suite/innodb/r/innodb_bug36172.result
mysql-test/suite/innodb/r/innodb_bug40360.result
mysql-test/suite/innodb/r/innodb_bug41904.result
mysql-test/suite/innodb/r/innodb_bug44571.result
mysql-test/suite/innodb/r/innodb_bug46676.result
mysql-test/suite/innodb/r/innodb_bug47167.result
mysql-test/suite/innodb/r/innodb_information_schema.result
mysql-test/suite/innodb/t/disabled.def
mysql-test/suite/innodb/t/innodb-analyze.test
mysql-test/suite/innodb/t/innodb-consistent-master.opt
mysql-test/suite/innodb/t/innodb-consistent.test
mysql-test/suite/innodb/t/innodb-index.test
mysql-test/suite/innodb/t/innodb-index_ucs2.test
mysql-test/suite/innodb/t/innodb-timeout.test
mysql-test/suite/innodb/t/innodb-use-sys-malloc-master.opt
mysql-test/suite/innodb/t/innodb-use-sys-malloc.test
mysql-test/suite/innodb/t/innodb-zip.test
mysql-test/suite/innodb/t/innodb_bug36169.test
mysql-test/suite/innodb/t/innodb_bug36172.test
mysql-test/suite/innodb/t/innodb_bug40360.test
mysql-test/suite/innodb/t/innodb_bug41904.test
mysql-test/suite/innodb/t/innodb_bug44571.test
mysql-test/suite/innodb/t/innodb_bug46676.test
mysql-test/suite/innodb/t/innodb_bug47167.test
mysql-test/suite/innodb/t/innodb_information_schema.test
mysql-test/t/innodb-autoinc-44030.test
mysql-test/t/innodb-autoinc.test
mysql-test/t/innodb_bug21704.test
mysql-test/t/innodb_bug38231.test
mysql-test/t/innodb_bug40565.test
mysql-test/t/innodb_bug42101-nonzero-master.opt
mysql-test/t/innodb_bug42101-nonzero.test
mysql-test/t/innodb_bug42101.test
mysql-test/t/innodb_bug44032.test
mysql-test/t/innodb_bug44369.test
mysql-test/t/innodb_bug45357.test
mysql-test/t/innodb_bug46000.test
mysql-test/t/innodb_bug47777.test
storage/innobase/
storage/innobase/CMakeLists.txt
storage/innobase/Makefile.am
storage/innobase/btr/
storage/innobase/btr/btr0btr.c
storage/innobase/btr/btr0cur.c
storage/innobase/btr/btr0pcur.c
storage/innobase/btr/btr0sea.c
storage/innobase/buf/
storage/innobase/buf/buf0buf.c
storage/innobase/buf/buf0flu.c
storage/innobase/buf/buf0lru.c
storage/innobase/buf/buf0rea.c
storage/innobase/data/
storage/innobase/data/data0data.c
storage/innobase/data/data0type.c
storage/innobase/dict/
storage/innobase/dict/dict0boot.c
storage/innobase/dict/dict0crea.c
storage/innobase/dict/dict0dict.c
storage/innobase/dict/dict0load.c
storage/innobase/dict/dict0mem.c
storage/innobase/dyn/
storage/innobase/dyn/dyn0dyn.c
storage/innobase/eval/
storage/innobase/eval/eval0eval.c
storage/innobase/eval/eval0proc.c
storage/innobase/fil/
storage/innobase/fil/fil0fil.c
storage/innobase/fsp/
storage/innobase/fsp/fsp0fsp.c
storage/innobase/fut/
storage/innobase/fut/fut0fut.c
storage/innobase/fut/fut0lst.c
storage/innobase/ha/
storage/innobase/ha/ha0ha.c
storage/innobase/ha/hash0hash.c
storage/innobase/handler/
storage/innobase/handler/ha_innodb.cc
storage/innobase/handler/ha_innodb.h
storage/innobase/ibuf/
storage/innobase/ibuf/ibuf0ibuf.c
storage/innobase/include/
storage/innobase/include/btr0btr.h
storage/innobase/include/btr0btr.ic
storage/innobase/include/btr0cur.h
storage/innobase/include/btr0cur.ic
storage/innobase/include/btr0pcur.h
storage/innobase/include/btr0pcur.ic
storage/innobase/include/btr0sea.h
storage/innobase/include/btr0sea.ic
storage/innobase/include/btr0types.h
storage/innobase/include/buf0buf.h
storage/innobase/include/buf0buf.ic
storage/innobase/include/buf0flu.h
storage/innobase/include/buf0flu.ic
storage/innobase/include/buf0lru.h
storage/innobase/include/buf0lru.ic
storage/innobase/include/buf0rea.h
storage/innobase/include/buf0types.h
storage/innobase/include/data0data.h
storage/innobase/include/data0data.ic
storage/innobase/include/data0type.h
storage/innobase/include/data0type.ic
storage/innobase/include/data0types.h
storage/innobase/include/db0err.h
storage/innobase/include/dict0boot.h
storage/innobase/include/dict0boot.ic
storage/innobase/include/dict0crea.h
storage/innobase/include/dict0crea.ic
storage/innobase/include/dict0dict.h
storage/innobase/include/dict0dict.ic
storage/innobase/include/dict0load.h
storage/innobase/include/dict0load.ic
storage/innobase/include/dict0mem.h
storage/innobase/include/dict0mem.ic
storage/innobase/include/dict0types.h
storage/innobase/include/dyn0dyn.h
storage/innobase/include/dyn0dyn.ic
storage/innobase/include/eval0eval.h
storage/innobase/include/eval0eval.ic
storage/innobase/include/eval0proc.h
storage/innobase/include/eval0proc.ic
storage/innobase/include/fil0fil.h
storage/innobase/include/fsp0fsp.h
storage/innobase/include/fsp0fsp.ic
storage/innobase/include/fsp0types.h
storage/innobase/include/fut0fut.h
storage/innobase/include/fut0fut.ic
storage/innobase/include/fut0lst.h
storage/innobase/include/fut0lst.ic
storage/innobase/include/ha0ha.h
storage/innobase/include/ha0ha.ic
storage/innobase/include/ha_prototypes.h
storage/innobase/include/hash0hash.h
storage/innobase/include/hash0hash.ic
storage/innobase/include/ibuf0ibuf.h
storage/innobase/include/ibuf0ibuf.ic
storage/innobase/include/ibuf0types.h
storage/innobase/include/lock0iter.h
storage/innobase/include/lock0lock.h
storage/innobase/include/lock0lock.ic
storage/innobase/include/lock0priv.h
storage/innobase/include/lock0priv.ic
storage/innobase/include/lock0types.h
storage/innobase/include/log0log.h
storage/innobase/include/log0log.ic
storage/innobase/include/log0recv.h
storage/innobase/include/log0recv.ic
storage/innobase/include/mach0data.h
storage/innobase/include/mach0data.ic
storage/innobase/include/mem0dbg.h
storage/innobase/include/mem0dbg.ic
storage/innobase/include/mem0mem.h
storage/innobase/include/mem0mem.ic
storage/innobase/include/mem0pool.h
storage/innobase/include/mem0pool.ic
storage/innobase/include/mtr0log.h
storage/innobase/include/mtr0log.ic
storage/innobase/include/mtr0mtr.h
storage/innobase/include/mtr0mtr.ic
storage/innobase/include/mtr0types.h
storage/innobase/include/os0file.h
storage/innobase/include/os0proc.h
storage/innobase/include/os0proc.ic
storage/innobase/include/os0sync.h
storage/innobase/include/os0sync.ic
storage/innobase/include/os0thread.h
storage/innobase/include/os0thread.ic
storage/innobase/include/page0cur.h
storage/innobase/include/page0cur.ic
storage/innobase/include/page0page.h
storage/innobase/include/page0page.ic
storage/innobase/include/page0types.h
storage/innobase/include/pars0grm.h
storage/innobase/include/pars0opt.h
storage/innobase/include/pars0opt.ic
storage/innobase/include/pars0pars.h
storage/innobase/include/pars0pars.ic
storage/innobase/include/pars0sym.h
storage/innobase/include/pars0sym.ic
storage/innobase/include/pars0types.h
storage/innobase/include/que0que.h
storage/innobase/include/que0que.ic
storage/innobase/include/que0types.h
storage/innobase/include/read0read.h
storage/innobase/include/read0read.ic
storage/innobase/include/read0types.h
storage/innobase/include/rem0cmp.h
storage/innobase/include/rem0cmp.ic
storage/innobase/include/rem0rec.h
storage/innobase/include/rem0rec.ic
storage/innobase/include/rem0types.h
storage/innobase/include/row0ins.h
storage/innobase/include/row0ins.ic
storage/innobase/include/row0mysql.h
storage/innobase/include/row0mysql.ic
storage/innobase/include/row0purge.h
storage/innobase/include/row0purge.ic
storage/innobase/include/row0row.h
storage/innobase/include/row0row.ic
storage/innobase/include/row0sel.h
storage/innobase/include/row0sel.ic
storage/innobase/include/row0types.h
storage/innobase/include/row0uins.h
storage/innobase/include/row0uins.ic
storage/innobase/include/row0umod.h
storage/innobase/include/row0umod.ic
storage/innobase/include/row0undo.h
storage/innobase/include/row0undo.ic
storage/innobase/include/row0upd.h
storage/innobase/include/row0upd.ic
storage/innobase/include/row0vers.h
storage/innobase/include/row0vers.ic
storage/innobase/include/srv0que.h
storage/innobase/include/srv0srv.h
storage/innobase/include/srv0srv.ic
storage/innobase/include/srv0start.h
storage/innobase/include/sync0arr.h
storage/innobase/include/sync0arr.ic
storage/innobase/include/sync0rw.h
storage/innobase/include/sync0rw.ic
storage/innobase/include/sync0sync.h
storage/innobase/include/sync0sync.ic
storage/innobase/include/sync0types.h
storage/innobase/include/thr0loc.h
storage/innobase/include/thr0loc.ic
storage/innobase/include/trx0purge.h
storage/innobase/include/trx0purge.ic
storage/innobase/include/trx0rec.h
storage/innobase/include/trx0rec.ic
storage/innobase/include/trx0roll.h
storage/innobase/include/trx0roll.ic
storage/innobase/include/trx0rseg.h
storage/innobase/include/trx0rseg.ic
storage/innobase/include/trx0sys.h
storage/innobase/include/trx0sys.ic
storage/innobase/include/trx0trx.h
storage/innobase/include/trx0trx.ic
storage/innobase/include/trx0types.h
storage/innobase/include/trx0undo.h
storage/innobase/include/trx0undo.ic
storage/innobase/include/trx0xa.h
storage/innobase/include/univ.i
storage/innobase/include/usr0sess.h
storage/innobase/include/usr0sess.ic
storage/innobase/include/usr0types.h
storage/innobase/include/ut0byte.h
storage/innobase/include/ut0byte.ic
storage/innobase/include/ut0dbg.h
storage/innobase/include/ut0list.h
storage/innobase/include/ut0list.ic
storage/innobase/include/ut0lst.h
storage/innobase/include/ut0mem.h
storage/innobase/include/ut0mem.ic
storage/innobase/include/ut0rnd.h
storage/innobase/include/ut0rnd.ic
storage/innobase/include/ut0sort.h
storage/innobase/include/ut0ut.h
storage/innobase/include/ut0ut.ic
storage/innobase/include/ut0vec.h
storage/innobase/include/ut0vec.ic
storage/innobase/include/ut0wqueue.h
storage/innobase/lock/
storage/innobase/lock/lock0iter.c
storage/innobase/lock/lock0lock.c
storage/innobase/log/
storage/innobase/log/log0log.c
storage/innobase/log/log0recv.c
storage/innobase/mach/
storage/innobase/mach/mach0data.c
storage/innobase/mem/
storage/innobase/mem/mem0dbg.c
storage/innobase/mem/mem0mem.c
storage/innobase/mem/mem0pool.c
storage/innobase/mtr/
storage/innobase/mtr/mtr0log.c
storage/innobase/mtr/mtr0mtr.c
storage/innobase/os/
storage/innobase/os/os0file.c
storage/innobase/os/os0proc.c
storage/innobase/os/os0sync.c
storage/innobase/os/os0thread.c
storage/innobase/page/
storage/innobase/page/page0cur.c
storage/innobase/page/page0page.c
storage/innobase/pars/
storage/innobase/pars/lexyy.c
storage/innobase/pars/make_bison.sh
storage/innobase/pars/make_flex.sh
storage/innobase/pars/pars0grm.c
storage/innobase/pars/pars0grm.h
storage/innobase/pars/pars0grm.y
storage/innobase/pars/pars0lex.l
storage/innobase/pars/pars0opt.c
storage/innobase/pars/pars0pars.c
storage/innobase/pars/pars0sym.c
storage/innobase/plug.in.disabled
storage/innobase/que/
storage/innobase/que/que0que.c
storage/innobase/read/
storage/innobase/read/read0read.c
storage/innobase/rem/
storage/innobase/rem/rem0cmp.c
storage/innobase/rem/rem0rec.c
storage/innobase/row/
storage/innobase/row/row0ins.c
storage/innobase/row/row0mysql.c
storage/innobase/row/row0purge.c
storage/innobase/row/row0row.c
storage/innobase/row/row0sel.c
storage/innobase/row/row0uins.c
storage/innobase/row/row0umod.c
storage/innobase/row/row0undo.c
storage/innobase/row/row0upd.c
storage/innobase/row/row0vers.c
storage/innobase/srv/
storage/innobase/srv/srv0que.c
storage/innobase/srv/srv0srv.c
storage/innobase/srv/srv0start.c
storage/innobase/sync/
storage/innobase/sync/sync0arr.c
storage/innobase/sync/sync0rw.c
storage/innobase/sync/sync0sync.c
storage/innobase/thr/
storage/innobase/thr/thr0loc.c
storage/innobase/trx/
storage/innobase/trx/trx0purge.c
storage/innobase/trx/trx0rec.c
storage/innobase/trx/trx0roll.c
storage/innobase/trx/trx0rseg.c
storage/innobase/trx/trx0sys.c
storage/innobase/trx/trx0trx.c
storage/innobase/trx/trx0undo.c
storage/innobase/usr/
storage/innobase/usr/usr0sess.c
storage/innobase/ut/
storage/innobase/ut/ut0byte.c
storage/innobase/ut/ut0dbg.c
storage/innobase/ut/ut0list.c
storage/innobase/ut/ut0mem.c
storage/innobase/ut/ut0rnd.c
storage/innobase/ut/ut0ut.c
storage/innobase/ut/ut0vec.c
storage/innobase/ut/ut0wqueue.c
storage/innodb_plugin/
storage/innodb_plugin/CMakeLists.txt
storage/innodb_plugin/COPYING
storage/innodb_plugin/COPYING.Google
storage/innodb_plugin/COPYING.Percona
storage/innodb_plugin/COPYING.Sun_Microsystems
storage/innodb_plugin/ChangeLog
storage/innodb_plugin/Doxyfile
storage/innodb_plugin/Makefile.am
storage/innodb_plugin/btr/
storage/innodb_plugin/btr/btr0btr.c
storage/innodb_plugin/btr/btr0cur.c
storage/innodb_plugin/btr/btr0pcur.c
storage/innodb_plugin/btr/btr0sea.c
storage/innodb_plugin/buf/
storage/innodb_plugin/buf/buf0buddy.c
storage/innodb_plugin/buf/buf0buf.c
storage/innodb_plugin/buf/buf0flu.c
storage/innodb_plugin/buf/buf0lru.c
storage/innodb_plugin/buf/buf0rea.c
storage/innodb_plugin/compile-innodb
storage/innodb_plugin/compile-innodb-debug
storage/innodb_plugin/data/
storage/innodb_plugin/data/data0data.c
storage/innodb_plugin/data/data0type.c
storage/innodb_plugin/dict/
storage/innodb_plugin/dict/dict0boot.c
storage/innodb_plugin/dict/dict0crea.c
storage/innodb_plugin/dict/dict0dict.c
storage/innodb_plugin/dict/dict0load.c
storage/innodb_plugin/dict/dict0mem.c
storage/innodb_plugin/dyn/
storage/innodb_plugin/dyn/dyn0dyn.c
storage/innodb_plugin/eval/
storage/innodb_plugin/eval/eval0eval.c
storage/innodb_plugin/eval/eval0proc.c
storage/innodb_plugin/fil/
storage/innodb_plugin/fil/fil0fil.c
storage/innodb_plugin/fsp/
storage/innodb_plugin/fsp/fsp0fsp.c
storage/innodb_plugin/fut/
storage/innodb_plugin/fut/fut0fut.c
storage/innodb_plugin/fut/fut0lst.c
storage/innodb_plugin/ha/
storage/innodb_plugin/ha/ha0ha.c
storage/innodb_plugin/ha/ha0storage.c
storage/innodb_plugin/ha/hash0hash.c
storage/innodb_plugin/ha_innodb.def
storage/innodb_plugin/handler/
storage/innodb_plugin/handler/ha_innodb.cc
storage/innodb_plugin/handler/ha_innodb.h
storage/innodb_plugin/handler/handler0alter.cc
storage/innodb_plugin/handler/i_s.cc
storage/innodb_plugin/handler/i_s.h
storage/innodb_plugin/handler/mysql_addons.cc
storage/innodb_plugin/ibuf/
storage/innodb_plugin/ibuf/ibuf0ibuf.c
storage/innodb_plugin/include/
storage/innodb_plugin/include/btr0btr.h
storage/innodb_plugin/include/btr0btr.ic
storage/innodb_plugin/include/btr0cur.h
storage/innodb_plugin/include/btr0cur.ic
storage/innodb_plugin/include/btr0pcur.h
storage/innodb_plugin/include/btr0pcur.ic
storage/innodb_plugin/include/btr0sea.h
storage/innodb_plugin/include/btr0sea.ic
storage/innodb_plugin/include/btr0types.h
storage/innodb_plugin/include/buf0buddy.h
storage/innodb_plugin/include/buf0buddy.ic
storage/innodb_plugin/include/buf0buf.h
storage/innodb_plugin/include/buf0buf.ic
storage/innodb_plugin/include/buf0flu.h
storage/innodb_plugin/include/buf0flu.ic
storage/innodb_plugin/include/buf0lru.h
storage/innodb_plugin/include/buf0lru.ic
storage/innodb_plugin/include/buf0rea.h
storage/innodb_plugin/include/buf0types.h
storage/innodb_plugin/include/data0data.h
storage/innodb_plugin/include/data0data.ic
storage/innodb_plugin/include/data0type.h
storage/innodb_plugin/include/data0type.ic
storage/innodb_plugin/include/data0types.h
storage/innodb_plugin/include/db0err.h
storage/innodb_plugin/include/dict0boot.h
storage/innodb_plugin/include/dict0boot.ic
storage/innodb_plugin/include/dict0crea.h
storage/innodb_plugin/include/dict0crea.ic
storage/innodb_plugin/include/dict0dict.h
storage/innodb_plugin/include/dict0dict.ic
storage/innodb_plugin/include/dict0load.h
storage/innodb_plugin/include/dict0load.ic
storage/innodb_plugin/include/dict0mem.h
storage/innodb_plugin/include/dict0mem.ic
storage/innodb_plugin/include/dict0types.h
storage/innodb_plugin/include/dyn0dyn.h
storage/innodb_plugin/include/dyn0dyn.ic
storage/innodb_plugin/include/eval0eval.h
storage/innodb_plugin/include/eval0eval.ic
storage/innodb_plugin/include/eval0proc.h
storage/innodb_plugin/include/eval0proc.ic
storage/innodb_plugin/include/fil0fil.h
storage/innodb_plugin/include/fsp0fsp.h
storage/innodb_plugin/include/fsp0fsp.ic
storage/innodb_plugin/include/fsp0types.h
storage/innodb_plugin/include/fut0fut.h
storage/innodb_plugin/include/fut0fut.ic
storage/innodb_plugin/include/fut0lst.h
storage/innodb_plugin/include/fut0lst.ic
storage/innodb_plugin/include/ha0ha.h
storage/innodb_plugin/include/ha0ha.ic
storage/innodb_plugin/include/ha0storage.h
storage/innodb_plugin/include/ha0storage.ic
storage/innodb_plugin/include/ha_prototypes.h
storage/innodb_plugin/include/handler0alter.h
storage/innodb_plugin/include/hash0hash.h
storage/innodb_plugin/include/hash0hash.ic
storage/innodb_plugin/include/ibuf0ibuf.h
storage/innodb_plugin/include/ibuf0ibuf.ic
storage/innodb_plugin/include/ibuf0types.h
storage/innodb_plugin/include/lock0iter.h
storage/innodb_plugin/include/lock0lock.h
storage/innodb_plugin/include/lock0lock.ic
storage/innodb_plugin/include/lock0priv.h
storage/innodb_plugin/include/lock0priv.ic
storage/innodb_plugin/include/lock0types.h
storage/innodb_plugin/include/log0log.h
storage/innodb_plugin/include/log0log.ic
storage/innodb_plugin/include/log0recv.h
storage/innodb_plugin/include/log0recv.ic
storage/innodb_plugin/include/mach0data.h
storage/innodb_plugin/include/mach0data.ic
storage/innodb_plugin/include/mem0dbg.h
storage/innodb_plugin/include/mem0dbg.ic
storage/innodb_plugin/include/mem0mem.h
storage/innodb_plugin/include/mem0mem.ic
storage/innodb_plugin/include/mem0pool.h
storage/innodb_plugin/include/mem0pool.ic
storage/innodb_plugin/include/mtr0log.h
storage/innodb_plugin/include/mtr0log.ic
storage/innodb_plugin/include/mtr0mtr.h
storage/innodb_plugin/include/mtr0mtr.ic
storage/innodb_plugin/include/mtr0types.h
storage/innodb_plugin/include/mysql_addons.h
storage/innodb_plugin/include/os0file.h
storage/innodb_plugin/include/os0proc.h
storage/innodb_plugin/include/os0proc.ic
storage/innodb_plugin/include/os0sync.h
storage/innodb_plugin/include/os0sync.ic
storage/innodb_plugin/include/os0thread.h
storage/innodb_plugin/include/os0thread.ic
storage/innodb_plugin/include/page0cur.h
storage/innodb_plugin/include/page0cur.ic
storage/innodb_plugin/include/page0page.h
storage/innodb_plugin/include/page0page.ic
storage/innodb_plugin/include/page0types.h
storage/innodb_plugin/include/page0zip.h
storage/innodb_plugin/include/page0zip.ic
storage/innodb_plugin/include/pars0grm.h
storage/innodb_plugin/include/pars0opt.h
storage/innodb_plugin/include/pars0opt.ic
storage/innodb_plugin/include/pars0pars.h
storage/innodb_plugin/include/pars0pars.ic
storage/innodb_plugin/include/pars0sym.h
storage/innodb_plugin/include/pars0sym.ic
storage/innodb_plugin/include/pars0types.h
storage/innodb_plugin/include/que0que.h
storage/innodb_plugin/include/que0que.ic
storage/innodb_plugin/include/que0types.h
storage/innodb_plugin/include/read0read.h
storage/innodb_plugin/include/read0read.ic
storage/innodb_plugin/include/read0types.h
storage/innodb_plugin/include/rem0cmp.h
storage/innodb_plugin/include/rem0cmp.ic
storage/innodb_plugin/include/rem0rec.h
storage/innodb_plugin/include/rem0rec.ic
storage/innodb_plugin/include/rem0types.h
storage/innodb_plugin/include/row0ext.h
storage/innodb_plugin/include/row0ext.ic
storage/innodb_plugin/include/row0ins.h
storage/innodb_plugin/include/row0ins.ic
storage/innodb_plugin/include/row0merge.h
storage/innodb_plugin/include/row0mysql.h
storage/innodb_plugin/include/row0mysql.ic
storage/innodb_plugin/include/row0purge.h
storage/innodb_plugin/include/row0purge.ic
storage/innodb_plugin/include/row0row.h
storage/innodb_plugin/include/row0row.ic
storage/innodb_plugin/include/row0sel.h
storage/innodb_plugin/include/row0sel.ic
storage/innodb_plugin/include/row0types.h
storage/innodb_plugin/include/row0uins.h
storage/innodb_plugin/include/row0uins.ic
storage/innodb_plugin/include/row0umod.h
storage/innodb_plugin/include/row0umod.ic
storage/innodb_plugin/include/row0undo.h
storage/innodb_plugin/include/row0undo.ic
storage/innodb_plugin/include/row0upd.h
storage/innodb_plugin/include/row0upd.ic
storage/innodb_plugin/include/row0vers.h
storage/innodb_plugin/include/row0vers.ic
storage/innodb_plugin/include/srv0que.h
storage/innodb_plugin/include/srv0srv.h
storage/innodb_plugin/include/srv0srv.ic
storage/innodb_plugin/include/srv0start.h
storage/innodb_plugin/include/sync0arr.h
storage/innodb_plugin/include/sync0arr.ic
storage/innodb_plugin/include/sync0rw.h
storage/innodb_plugin/include/sync0rw.ic
storage/innodb_plugin/include/sync0sync.h
storage/innodb_plugin/include/sync0sync.ic
storage/innodb_plugin/include/sync0types.h
storage/innodb_plugin/include/thr0loc.h
storage/innodb_plugin/include/thr0loc.ic
storage/innodb_plugin/include/trx0i_s.h
storage/innodb_plugin/include/trx0purge.h
storage/innodb_plugin/include/trx0purge.ic
storage/innodb_plugin/include/trx0rec.h
storage/innodb_plugin/include/trx0rec.ic
storage/innodb_plugin/include/trx0roll.h
storage/innodb_plugin/include/trx0roll.ic
storage/innodb_plugin/include/trx0rseg.h
storage/innodb_plugin/include/trx0rseg.ic
storage/innodb_plugin/include/trx0sys.h
storage/innodb_plugin/include/trx0sys.ic
storage/innodb_plugin/include/trx0trx.h
storage/innodb_plugin/include/trx0trx.ic
storage/innodb_plugin/include/trx0types.h
storage/innodb_plugin/include/trx0undo.h
storage/innodb_plugin/include/trx0undo.ic
storage/innodb_plugin/include/trx0xa.h
storage/innodb_plugin/include/univ.i
storage/innodb_plugin/include/usr0sess.h
storage/innodb_plugin/include/usr0sess.ic
storage/innodb_plugin/include/usr0types.h
storage/innodb_plugin/include/ut0auxconf.h
storage/innodb_plugin/include/ut0byte.h
storage/innodb_plugin/include/ut0byte.ic
storage/innodb_plugin/include/ut0dbg.h
storage/innodb_plugin/include/ut0list.h
storage/innodb_plugin/include/ut0list.ic
storage/innodb_plugin/include/ut0lst.h
storage/innodb_plugin/include/ut0mem.h
storage/innodb_plugin/include/ut0mem.ic
storage/innodb_plugin/include/ut0rnd.h
storage/innodb_plugin/include/ut0rnd.ic
storage/innodb_plugin/include/ut0sort.h
storage/innodb_plugin/include/ut0ut.h
storage/innodb_plugin/include/ut0ut.ic
storage/innodb_plugin/include/ut0vec.h
storage/innodb_plugin/include/ut0vec.ic
storage/innodb_plugin/include/ut0wqueue.h
storage/innodb_plugin/lock/
storage/innodb_plugin/lock/lock0iter.c
storage/innodb_plugin/lock/lock0lock.c
storage/innodb_plugin/log/
storage/innodb_plugin/log/log0log.c
storage/innodb_plugin/log/log0recv.c
storage/innodb_plugin/mach/
storage/innodb_plugin/mach/mach0data.c
storage/innodb_plugin/mem/
storage/innodb_plugin/mem/mem0dbg.c
storage/innodb_plugin/mem/mem0mem.c
storage/innodb_plugin/mem/mem0pool.c
storage/innodb_plugin/mtr/
storage/innodb_plugin/mtr/mtr0log.c
storage/innodb_plugin/mtr/mtr0mtr.c
storage/innodb_plugin/mysql-test/
storage/innodb_plugin/mysql-test/ctype_innodb_like.inc
storage/innodb_plugin/mysql-test/have_innodb.inc
storage/innodb_plugin/mysql-test/innodb-analyze.result
storage/innodb_plugin/mysql-test/innodb-analyze.test
storage/innodb_plugin/mysql-test/innodb-autoinc.result
storage/innodb_plugin/mysql-test/innodb-autoinc.test
storage/innodb_plugin/mysql-test/innodb-consistent-master.opt
storage/innodb_plugin/mysql-test/innodb-consistent.result
storage/innodb_plugin/mysql-test/innodb-consistent.test
storage/innodb_plugin/mysql-test/innodb-index.inc
storage/innodb_plugin/mysql-test/innodb-index.result
storage/innodb_plugin/mysql-test/innodb-index.test
storage/innodb_plugin/mysql-test/innodb-index_ucs2.result
storage/innodb_plugin/mysql-test/innodb-index_ucs2.test
storage/innodb_plugin/mysql-test/innodb-lock.result
storage/innodb_plugin/mysql-test/innodb-lock.test
storage/innodb_plugin/mysql-test/innodb-master.opt
storage/innodb_plugin/mysql-test/innodb-replace.result
storage/innodb_plugin/mysql-test/innodb-replace.test
storage/innodb_plugin/mysql-test/innodb-semi-consistent-master.opt
storage/innodb_plugin/mysql-test/innodb-semi-consistent.result
storage/innodb_plugin/mysql-test/innodb-semi-consistent.test
storage/innodb_plugin/mysql-test/innodb-timeout.result
storage/innodb_plugin/mysql-test/innodb-timeout.test
storage/innodb_plugin/mysql-test/innodb-use-sys-malloc-master.opt
storage/innodb_plugin/mysql-test/innodb-use-sys-malloc.result
storage/innodb_plugin/mysql-test/innodb-use-sys-malloc.test
storage/innodb_plugin/mysql-test/innodb-zip.result
storage/innodb_plugin/mysql-test/innodb-zip.test
storage/innodb_plugin/mysql-test/innodb.result
storage/innodb_plugin/mysql-test/innodb.test
storage/innodb_plugin/mysql-test/innodb_bug21704.result
storage/innodb_plugin/mysql-test/innodb_bug21704.test
storage/innodb_plugin/mysql-test/innodb_bug34053.result
storage/innodb_plugin/mysql-test/innodb_bug34053.test
storage/innodb_plugin/mysql-test/innodb_bug34300.result
storage/innodb_plugin/mysql-test/innodb_bug34300.test
storage/innodb_plugin/mysql-test/innodb_bug35220.result
storage/innodb_plugin/mysql-test/innodb_bug35220.test
storage/innodb_plugin/mysql-test/innodb_bug36169.result
storage/innodb_plugin/mysql-test/innodb_bug36169.test
storage/innodb_plugin/mysql-test/innodb_bug36172.result
storage/innodb_plugin/mysql-test/innodb_bug36172.test
storage/innodb_plugin/mysql-test/innodb_bug40360.result
storage/innodb_plugin/mysql-test/innodb_bug40360.test
storage/innodb_plugin/mysql-test/innodb_bug40565.result
storage/innodb_plugin/mysql-test/innodb_bug40565.test
storage/innodb_plugin/mysql-test/innodb_bug41904.result
storage/innodb_plugin/mysql-test/innodb_bug41904.test
storage/innodb_plugin/mysql-test/innodb_bug42101-nonzero-master.opt
storage/innodb_plugin/mysql-test/innodb_bug42101-nonzero.result
storage/innodb_plugin/mysql-test/innodb_bug42101-nonzero.test
storage/innodb_plugin/mysql-test/innodb_bug42101.result
storage/innodb_plugin/mysql-test/innodb_bug42101.test
storage/innodb_plugin/mysql-test/innodb_bug44032.result
storage/innodb_plugin/mysql-test/innodb_bug44032.test
storage/innodb_plugin/mysql-test/innodb_bug44369.result
storage/innodb_plugin/mysql-test/innodb_bug44369.test
storage/innodb_plugin/mysql-test/innodb_bug44571.result
storage/innodb_plugin/mysql-test/innodb_bug44571.test
storage/innodb_plugin/mysql-test/innodb_bug45357.result
storage/innodb_plugin/mysql-test/innodb_bug45357.test
storage/innodb_plugin/mysql-test/innodb_bug46000.result
storage/innodb_plugin/mysql-test/innodb_bug46000.test
storage/innodb_plugin/mysql-test/innodb_file_format.result
storage/innodb_plugin/mysql-test/innodb_file_format.test
storage/innodb_plugin/mysql-test/innodb_information_schema.result
storage/innodb_plugin/mysql-test/innodb_information_schema.test
storage/innodb_plugin/mysql-test/innodb_trx_weight.inc
storage/innodb_plugin/mysql-test/innodb_trx_weight.result
storage/innodb_plugin/mysql-test/innodb_trx_weight.test
storage/innodb_plugin/mysql-test/patches/
storage/innodb_plugin/mysql-test/patches/README
storage/innodb_plugin/mysql-test/patches/index_merge_innodb-explain.diff
storage/innodb_plugin/mysql-test/patches/information_schema.diff
storage/innodb_plugin/mysql-test/patches/innodb-index.diff
storage/innodb_plugin/mysql-test/patches/innodb_file_per_table.diff
storage/innodb_plugin/mysql-test/patches/innodb_lock_wait_timeout.diff
storage/innodb_plugin/mysql-test/patches/innodb_thread_concurrency_basic.diff
storage/innodb_plugin/mysql-test/patches/partition_innodb.diff
storage/innodb_plugin/os/
storage/innodb_plugin/os/os0file.c
storage/innodb_plugin/os/os0proc.c
storage/innodb_plugin/os/os0sync.c
storage/innodb_plugin/os/os0thread.c
storage/innodb_plugin/page/
storage/innodb_plugin/page/page0cur.c
storage/innodb_plugin/page/page0page.c
storage/innodb_plugin/page/page0zip.c
storage/innodb_plugin/pars/
storage/innodb_plugin/pars/lexyy.c
storage/innodb_plugin/pars/make_bison.sh
storage/innodb_plugin/pars/make_flex.sh
storage/innodb_plugin/pars/pars0grm.c
storage/innodb_plugin/pars/pars0grm.y
storage/innodb_plugin/pars/pars0lex.l
storage/innodb_plugin/pars/pars0opt.c
storage/innodb_plugin/pars/pars0pars.c
storage/innodb_plugin/pars/pars0sym.c
storage/innodb_plugin/plug.in.disabled
storage/innodb_plugin/que/
storage/innodb_plugin/que/que0que.c
storage/innodb_plugin/read/
storage/innodb_plugin/read/read0read.c
storage/innodb_plugin/rem/
storage/innodb_plugin/rem/rem0cmp.c
storage/innodb_plugin/rem/rem0rec.c
storage/innodb_plugin/revert_gen.sh
storage/innodb_plugin/row/
storage/innodb_plugin/row/row0ext.c
storage/innodb_plugin/row/row0ins.c
storage/innodb_plugin/row/row0merge.c
storage/innodb_plugin/row/row0mysql.c
storage/innodb_plugin/row/row0purge.c
storage/innodb_plugin/row/row0row.c
storage/innodb_plugin/row/row0sel.c
storage/innodb_plugin/row/row0uins.c
storage/innodb_plugin/row/row0umod.c
storage/innodb_plugin/row/row0undo.c
storage/innodb_plugin/row/row0upd.c
storage/innodb_plugin/row/row0vers.c
storage/innodb_plugin/scripts/
storage/innodb_plugin/scripts/export.sh
storage/innodb_plugin/scripts/install_innodb_plugins.sql
storage/innodb_plugin/scripts/install_innodb_plugins_win.sql
storage/innodb_plugin/setup.sh
storage/innodb_plugin/srv/
storage/innodb_plugin/srv/srv0que.c
storage/innodb_plugin/srv/srv0srv.c
storage/innodb_plugin/srv/srv0start.c
storage/innodb_plugin/sync/
storage/innodb_plugin/sync/sync0arr.c
storage/innodb_plugin/sync/sync0rw.c
storage/innodb_plugin/sync/sync0sync.c
storage/innodb_plugin/thr/
storage/innodb_plugin/thr/thr0loc.c
storage/innodb_plugin/trx/
storage/innodb_plugin/trx/trx0i_s.c
storage/innodb_plugin/trx/trx0purge.c
storage/innodb_plugin/trx/trx0rec.c
storage/innodb_plugin/trx/trx0roll.c
storage/innodb_plugin/trx/trx0rseg.c
storage/innodb_plugin/trx/trx0sys.c
storage/innodb_plugin/trx/trx0trx.c
storage/innodb_plugin/trx/trx0undo.c
storage/innodb_plugin/usr/
storage/innodb_plugin/usr/usr0sess.c
storage/innodb_plugin/ut/
storage/innodb_plugin/ut/ut0auxconf_atomic_pthread_t_gcc.c
storage/innodb_plugin/ut/ut0auxconf_atomic_pthread_t_solaris.c
storage/innodb_plugin/ut/ut0auxconf_have_gcc_atomics.c
storage/innodb_plugin/ut/ut0auxconf_have_solaris_atomics.c
storage/innodb_plugin/ut/ut0auxconf_pause.c
storage/innodb_plugin/ut/ut0auxconf_sizeof_pthread_t.c
storage/innodb_plugin/ut/ut0byte.c
storage/innodb_plugin/ut/ut0dbg.c
storage/innodb_plugin/ut/ut0list.c
storage/innodb_plugin/ut/ut0mem.c
storage/innodb_plugin/ut/ut0rnd.c
storage/innodb_plugin/ut/ut0ut.c
storage/innodb_plugin/ut/ut0vec.c
storage/innodb_plugin/ut/ut0wqueue.c
added:
include/my_valgrind.h
mysql-test/include/ctype_innodb_like.inc
mysql-test/include/have_innodb.inc
mysql-test/include/have_innodb_plugin.inc
mysql-test/include/innodb_trx_weight.inc
mysql-test/include/min_null_cond.inc
mysql-test/include/not_binlog_format_row.inc
mysql-test/include/view_alias.inc
mysql-test/r/bug39022.result
mysql-test/r/bug46261.result
mysql-test/r/log_tables_upgrade.result
mysql-test/r/no_binlog.result
mysql-test/r/partition_debug_sync.result
mysql-test/r/plugin_not_embedded.result
mysql-test/r/view_alias.result
mysql-test/std_data/binlog_savepoint.000001
mysql-test/std_data/bug46565.ARZ
mysql-test/std_data/bug46565.frm
mysql-test/std_data/bug48265.frm
mysql-test/std_data/bug48449.frm
mysql-test/std_data/bug49823.CSM
mysql-test/std_data/bug49823.CSV
mysql-test/std_data/bug49823.frm
mysql-test/suite/engines/
mysql-test/suite/engines/README
mysql-test/suite/engines/funcs/
mysql-test/suite/engines/funcs/r/
mysql-test/suite/engines/funcs/r/ai_init_alter_table.result
mysql-test/suite/engines/funcs/r/ai_init_create_table.result
mysql-test/suite/engines/funcs/r/ai_init_insert.result
mysql-test/suite/engines/funcs/r/ai_init_insert_id.result
mysql-test/suite/engines/funcs/r/ai_overflow_error.result
mysql-test/suite/engines/funcs/r/ai_reset_by_truncate.result
mysql-test/suite/engines/funcs/r/ai_sql_auto_is_null.result
mysql-test/suite/engines/funcs/r/an_calendar.result
mysql-test/suite/engines/funcs/r/an_number.result
mysql-test/suite/engines/funcs/r/an_string.result
mysql-test/suite/engines/funcs/r/comment_column.result
mysql-test/suite/engines/funcs/r/comment_column2.result
mysql-test/suite/engines/funcs/r/comment_table.result
mysql-test/suite/engines/funcs/r/crash_manycolumns_number.result
mysql-test/suite/engines/funcs/r/crash_manycolumns_string.result
mysql-test/suite/engines/funcs/r/crash_manyindexes_number.result
mysql-test/suite/engines/funcs/r/crash_manyindexes_string.result
mysql-test/suite/engines/funcs/r/crash_manytables_number.result
mysql-test/suite/engines/funcs/r/crash_manytables_string.result
mysql-test/suite/engines/funcs/r/date_function.result
mysql-test/suite/engines/funcs/r/datetime_function.result
mysql-test/suite/engines/funcs/r/db_alter_character_set.result
mysql-test/suite/engines/funcs/r/db_alter_character_set_collate.result
mysql-test/suite/engines/funcs/r/db_alter_collate_ascii.result
mysql-test/suite/engines/funcs/r/db_alter_collate_utf8.result
mysql-test/suite/engines/funcs/r/db_create_character_set.result
mysql-test/suite/engines/funcs/r/db_create_character_set_collate.result
mysql-test/suite/engines/funcs/r/db_create_drop.result
mysql-test/suite/engines/funcs/r/db_create_error.result
mysql-test/suite/engines/funcs/r/db_create_error_reserved.result
mysql-test/suite/engines/funcs/r/db_create_if_not_exists.result
mysql-test/suite/engines/funcs/r/db_drop_error.result
mysql-test/suite/engines/funcs/r/db_use_error.result
mysql-test/suite/engines/funcs/r/de_autoinc.result
mysql-test/suite/engines/funcs/r/de_calendar_range.result
mysql-test/suite/engines/funcs/r/de_ignore.result
mysql-test/suite/engines/funcs/r/de_limit.result
mysql-test/suite/engines/funcs/r/de_multi_db_table.result
mysql-test/suite/engines/funcs/r/de_multi_db_table_using.result
mysql-test/suite/engines/funcs/r/de_multi_table.result
mysql-test/suite/engines/funcs/r/de_multi_table_using.result
mysql-test/suite/engines/funcs/r/de_number_range.result
mysql-test/suite/engines/funcs/r/de_quick.result
mysql-test/suite/engines/funcs/r/de_string_range.result
mysql-test/suite/engines/funcs/r/de_truncate.result
mysql-test/suite/engines/funcs/r/de_truncate_autoinc.result
mysql-test/suite/engines/funcs/r/fu_aggregate_avg_number.result
mysql-test/suite/engines/funcs/r/fu_aggregate_count_number.result
mysql-test/suite/engines/funcs/r/fu_aggregate_max_number.result
mysql-test/suite/engines/funcs/r/fu_aggregate_max_subquery.result
mysql-test/suite/engines/funcs/r/fu_aggregate_min_number.result
mysql-test/suite/engines/funcs/r/fu_aggregate_sum_number.result
mysql-test/suite/engines/funcs/r/general_no_data.result
mysql-test/suite/engines/funcs/r/general_not_null.result
mysql-test/suite/engines/funcs/r/general_null.result
mysql-test/suite/engines/funcs/r/in_calendar_2_unique_constraints_duplicate_update.result
mysql-test/suite/engines/funcs/r/in_calendar_pk_constraint_duplicate_update.result
mysql-test/suite/engines/funcs/r/in_calendar_pk_constraint_error.result
mysql-test/suite/engines/funcs/r/in_calendar_pk_constraint_ignore.result
mysql-test/suite/engines/funcs/r/in_calendar_unique_constraint_duplicate_update.result
mysql-test/suite/engines/funcs/r/in_calendar_unique_constraint_error.result
mysql-test/suite/engines/funcs/r/in_calendar_unique_constraint_ignore.result
mysql-test/suite/engines/funcs/r/in_enum_null.result
mysql-test/suite/engines/funcs/r/in_enum_null_boundary_error.result
mysql-test/suite/engines/funcs/r/in_enum_null_large_error.result
mysql-test/suite/engines/funcs/r/in_insert_select.result
mysql-test/suite/engines/funcs/r/in_insert_select_autoinc.result
mysql-test/suite/engines/funcs/r/in_insert_select_unique_violation.result
mysql-test/suite/engines/funcs/r/in_lob_boundary_error.result
mysql-test/suite/engines/funcs/r/in_multicolumn_calendar_pk_constraint_duplicate_update.result
mysql-test/suite/engines/funcs/r/in_multicolumn_calendar_pk_constraint_error.result
mysql-test/suite/engines/funcs/r/in_multicolumn_calendar_pk_constraint_ignore.result
mysql-test/suite/engines/funcs/r/in_multicolumn_calendar_unique_constraint_duplicate_update.result
mysql-test/suite/engines/funcs/r/in_multicolumn_calendar_unique_constraint_error.result
mysql-test/suite/engines/funcs/r/in_multicolumn_calendar_unique_constraint_ignore.result
mysql-test/suite/engines/funcs/r/in_multicolumn_number_pk_constraint_duplicate_update.result
mysql-test/suite/engines/funcs/r/in_multicolumn_number_pk_constraint_error.result
mysql-test/suite/engines/funcs/r/in_multicolumn_number_pk_constraint_ignore.result
mysql-test/suite/engines/funcs/r/in_multicolumn_number_unique_constraint_duplicate_update.result
mysql-test/suite/engines/funcs/r/in_multicolumn_number_unique_constraint_error.result
mysql-test/suite/engines/funcs/r/in_multicolumn_number_unique_constraint_ignore.result
mysql-test/suite/engines/funcs/r/in_multicolumn_string_pk_constraint_duplicate_update.result
mysql-test/suite/engines/funcs/r/in_multicolumn_string_pk_constraint_error.result
mysql-test/suite/engines/funcs/r/in_multicolumn_string_pk_constraint_ignore.result
mysql-test/suite/engines/funcs/r/in_multicolumn_string_unique_constraint_duplicate_update.result
mysql-test/suite/engines/funcs/r/in_multicolumn_string_unique_constraint_error.result
mysql-test/suite/engines/funcs/r/in_multicolumn_string_unique_constraint_ignore.result
mysql-test/suite/engines/funcs/r/in_number_2_unique_constraints_duplicate_update.result
mysql-test/suite/engines/funcs/r/in_number_boundary_error.result
mysql-test/suite/engines/funcs/r/in_number_decimal_boundary_error.result
mysql-test/suite/engines/funcs/r/in_number_length.result
mysql-test/suite/engines/funcs/r/in_number_null.result
mysql-test/suite/engines/funcs/r/in_number_pk_constraint_duplicate_update.result
mysql-test/suite/engines/funcs/r/in_number_pk_constraint_error.result
mysql-test/suite/engines/funcs/r/in_number_pk_constraint_ignore.result
mysql-test/suite/engines/funcs/r/in_number_unique_constraint_duplicate_update.result
mysql-test/suite/engines/funcs/r/in_number_unique_constraint_error.result
mysql-test/suite/engines/funcs/r/in_number_unique_constraint_ignore.result
mysql-test/suite/engines/funcs/r/in_set_null.result
mysql-test/suite/engines/funcs/r/in_set_null_boundary_error.result
mysql-test/suite/engines/funcs/r/in_set_null_large.result
mysql-test/suite/engines/funcs/r/in_string_2_unique_constraints_duplicate_update.result
mysql-test/suite/engines/funcs/r/in_string_boundary_error.result
mysql-test/suite/engines/funcs/r/in_string_not_null.result
mysql-test/suite/engines/funcs/r/in_string_null.result
mysql-test/suite/engines/funcs/r/in_string_pk_constraint_duplicate_update.result
mysql-test/suite/engines/funcs/r/in_string_pk_constraint_error.result
mysql-test/suite/engines/funcs/r/in_string_pk_constraint_ignore.result
mysql-test/suite/engines/funcs/r/in_string_set_enum_fail.result
mysql-test/suite/engines/funcs/r/in_string_unique_constraint_duplicate_update.result
mysql-test/suite/engines/funcs/r/in_string_unique_constraint_error.result
mysql-test/suite/engines/funcs/r/in_string_unique_constraint_ignore.result
mysql-test/suite/engines/funcs/r/ix_drop.result
mysql-test/suite/engines/funcs/r/ix_drop_error.result
mysql-test/suite/engines/funcs/r/ix_index_decimals.result
mysql-test/suite/engines/funcs/r/ix_index_lob.result
mysql-test/suite/engines/funcs/r/ix_index_non_string.result
mysql-test/suite/engines/funcs/r/ix_index_string.result
mysql-test/suite/engines/funcs/r/ix_index_string_length.result
mysql-test/suite/engines/funcs/r/ix_unique_decimals.result
mysql-test/suite/engines/funcs/r/ix_unique_lob.result
mysql-test/suite/engines/funcs/r/ix_unique_non_string.result
mysql-test/suite/engines/funcs/r/ix_unique_string.result
mysql-test/suite/engines/funcs/r/ix_unique_string_length.result
mysql-test/suite/engines/funcs/r/ix_using_order.result
mysql-test/suite/engines/funcs/r/jp_comment_column.result
mysql-test/suite/engines/funcs/r/jp_comment_older_compatibility1.result
mysql-test/suite/engines/funcs/r/jp_comment_table.result
mysql-test/suite/engines/funcs/r/ld_all_number_string_calendar_types.result
mysql-test/suite/engines/funcs/r/ld_bit.result
mysql-test/suite/engines/funcs/r/ld_enum_set.result
mysql-test/suite/engines/funcs/r/ld_less_columns.result
mysql-test/suite/engines/funcs/r/ld_more_columns_truncated.result
mysql-test/suite/engines/funcs/r/ld_null.result
mysql-test/suite/engines/funcs/r/ld_quote.result
mysql-test/suite/engines/funcs/r/ld_simple.result
mysql-test/suite/engines/funcs/r/ld_starting.result
mysql-test/suite/engines/funcs/r/ld_unique_error1.result
mysql-test/suite/engines/funcs/r/ld_unique_error1_local.result
mysql-test/suite/engines/funcs/r/ld_unique_error2.result
mysql-test/suite/engines/funcs/r/ld_unique_error2_local.result
mysql-test/suite/engines/funcs/r/ld_unique_error3.result
mysql-test/suite/engines/funcs/r/ld_unique_error3_local.result
mysql-test/suite/engines/funcs/r/ps_number_length.result
mysql-test/suite/engines/funcs/r/ps_number_null.result
mysql-test/suite/engines/funcs/r/ps_string_not_null.result
mysql-test/suite/engines/funcs/r/ps_string_null.result
mysql-test/suite/engines/funcs/r/re_number_range.result
mysql-test/suite/engines/funcs/r/re_number_range_set.result
mysql-test/suite/engines/funcs/r/re_number_select.result
mysql-test/suite/engines/funcs/r/re_string_range.result
mysql-test/suite/engines/funcs/r/re_string_range_set.result
mysql-test/suite/engines/funcs/r/rpl000010.result
mysql-test/suite/engines/funcs/r/rpl000011.result
mysql-test/suite/engines/funcs/r/rpl000013.result
mysql-test/suite/engines/funcs/r/rpl000017.result
mysql-test/suite/engines/funcs/r/rpl_000015.result
mysql-test/suite/engines/funcs/r/rpl_LD_INFILE.result
mysql-test/suite/engines/funcs/r/rpl_REDIRECT.result
mysql-test/suite/engines/funcs/r/rpl_alter.result
mysql-test/suite/engines/funcs/r/rpl_alter_db.result
mysql-test/suite/engines/funcs/r/rpl_bit.result
mysql-test/suite/engines/funcs/r/rpl_bit_npk.result
mysql-test/suite/engines/funcs/r/rpl_change_master.result
mysql-test/suite/engines/funcs/r/rpl_create_database.result
mysql-test/suite/engines/funcs/r/rpl_do_grant.result
mysql-test/suite/engines/funcs/r/rpl_drop.result
mysql-test/suite/engines/funcs/r/rpl_drop_db.result
mysql-test/suite/engines/funcs/r/rpl_dual_pos_advance.result
mysql-test/suite/engines/funcs/r/rpl_empty_master_crash.result
mysql-test/suite/engines/funcs/r/rpl_err_ignoredtable.result
mysql-test/suite/engines/funcs/r/rpl_flushlog_loop.result
mysql-test/suite/engines/funcs/r/rpl_free_items.result
mysql-test/suite/engines/funcs/r/rpl_get_lock.result
mysql-test/suite/engines/funcs/r/rpl_ignore_grant.result
mysql-test/suite/engines/funcs/r/rpl_ignore_revoke.result
mysql-test/suite/engines/funcs/r/rpl_ignore_table_update.result
mysql-test/suite/engines/funcs/r/rpl_init_slave.result
mysql-test/suite/engines/funcs/r/rpl_insert.result
mysql-test/suite/engines/funcs/r/rpl_insert_select.result
mysql-test/suite/engines/funcs/r/rpl_loaddata2.result
mysql-test/suite/engines/funcs/r/rpl_loaddata_m.result
mysql-test/suite/engines/funcs/r/rpl_loaddata_s.result
mysql-test/suite/engines/funcs/r/rpl_loaddatalocal.result
mysql-test/suite/engines/funcs/r/rpl_loadfile.result
mysql-test/suite/engines/funcs/r/rpl_log_pos.result
mysql-test/suite/engines/funcs/r/rpl_many_optimize.result
mysql-test/suite/engines/funcs/r/rpl_master_pos_wait.result
mysql-test/suite/engines/funcs/r/rpl_misc_functions.result
mysql-test/suite/engines/funcs/r/rpl_multi_delete.result
mysql-test/suite/engines/funcs/r/rpl_multi_delete2.result
mysql-test/suite/engines/funcs/r/rpl_multi_update4.result
mysql-test/suite/engines/funcs/r/rpl_ps.result
mysql-test/suite/engines/funcs/r/rpl_rbr_to_sbr.result
mysql-test/suite/engines/funcs/r/rpl_relayspace.result
mysql-test/suite/engines/funcs/r/rpl_replicate_ignore_db.result
mysql-test/suite/engines/funcs/r/rpl_row_NOW.result
mysql-test/suite/engines/funcs/r/rpl_row_USER.result
mysql-test/suite/engines/funcs/r/rpl_row_drop.result
mysql-test/suite/engines/funcs/r/rpl_row_func001.result
mysql-test/suite/engines/funcs/r/rpl_row_inexist_tbl.result
mysql-test/suite/engines/funcs/r/rpl_row_max_relay_size.result
mysql-test/suite/engines/funcs/r/rpl_row_reset_slave.result
mysql-test/suite/engines/funcs/r/rpl_row_sp001.result
mysql-test/suite/engines/funcs/r/rpl_row_sp005.result
mysql-test/suite/engines/funcs/r/rpl_row_sp008.result
mysql-test/suite/engines/funcs/r/rpl_row_sp009.result
mysql-test/suite/engines/funcs/r/rpl_row_sp010.result
mysql-test/suite/engines/funcs/r/rpl_row_sp011.result
mysql-test/suite/engines/funcs/r/rpl_row_sp012.result
mysql-test/suite/engines/funcs/r/rpl_row_stop_middle.result
mysql-test/suite/engines/funcs/r/rpl_row_trig001.result
mysql-test/suite/engines/funcs/r/rpl_row_trig002.result
mysql-test/suite/engines/funcs/r/rpl_row_trig003.result
mysql-test/suite/engines/funcs/r/rpl_row_until.result
mysql-test/suite/engines/funcs/r/rpl_row_view01.result
mysql-test/suite/engines/funcs/r/rpl_server_id1.result
mysql-test/suite/engines/funcs/r/rpl_server_id2.result
mysql-test/suite/engines/funcs/r/rpl_session_var.result
mysql-test/suite/engines/funcs/r/rpl_sf.result
mysql-test/suite/engines/funcs/r/rpl_skip_error.result
mysql-test/suite/engines/funcs/r/rpl_slave_status.result
mysql-test/suite/engines/funcs/r/rpl_sp.result
mysql-test/suite/engines/funcs/r/rpl_sp004.result
mysql-test/suite/engines/funcs/r/rpl_sp_effects.result
mysql-test/suite/engines/funcs/r/rpl_start_stop_slave.result
mysql-test/suite/engines/funcs/r/rpl_stm_max_relay_size.result
mysql-test/suite/engines/funcs/r/rpl_stm_mystery22.result
mysql-test/suite/engines/funcs/r/rpl_stm_no_op.result
mysql-test/suite/engines/funcs/r/rpl_stm_reset_slave.result
mysql-test/suite/engines/funcs/r/rpl_switch_stm_row_mixed.result
mysql-test/suite/engines/funcs/r/rpl_temp_table.result
mysql-test/suite/engines/funcs/r/rpl_temporary.result
mysql-test/suite/engines/funcs/r/rpl_trigger.result
mysql-test/suite/engines/funcs/r/rpl_trunc_temp.result
mysql-test/suite/engines/funcs/r/rpl_user_variables.result
mysql-test/suite/engines/funcs/r/rpl_variables.result
mysql-test/suite/engines/funcs/r/rpl_view.result
mysql-test/suite/engines/funcs/r/se_join_cross.result
mysql-test/suite/engines/funcs/r/se_join_default.result
mysql-test/suite/engines/funcs/r/se_join_inner.result
mysql-test/suite/engines/funcs/r/se_join_left.result
mysql-test/suite/engines/funcs/r/se_join_left_outer.result
mysql-test/suite/engines/funcs/r/se_join_natural_left.result
mysql-test/suite/engines/funcs/r/se_join_natural_left_outer.result
mysql-test/suite/engines/funcs/r/se_join_natural_right.result
mysql-test/suite/engines/funcs/r/se_join_natural_right_outer.result
mysql-test/suite/engines/funcs/r/se_join_right.result
mysql-test/suite/engines/funcs/r/se_join_right_outer.result
mysql-test/suite/engines/funcs/r/se_join_straight.result
mysql-test/suite/engines/funcs/r/se_rowid.result
mysql-test/suite/engines/funcs/r/se_string_distinct.result
mysql-test/suite/engines/funcs/r/se_string_from.result
mysql-test/suite/engines/funcs/r/se_string_groupby.result
mysql-test/suite/engines/funcs/r/se_string_having.result
mysql-test/suite/engines/funcs/r/se_string_limit.result
mysql-test/suite/engines/funcs/r/se_string_orderby.result
mysql-test/suite/engines/funcs/r/se_string_union.result
mysql-test/suite/engines/funcs/r/se_string_where.result
mysql-test/suite/engines/funcs/r/se_string_where_and.result
mysql-test/suite/engines/funcs/r/se_string_where_or.result
mysql-test/suite/engines/funcs/r/sf_alter.result
mysql-test/suite/engines/funcs/r/sf_cursor.result
mysql-test/suite/engines/funcs/r/sf_simple1.result
mysql-test/suite/engines/funcs/r/sp_alter.result
mysql-test/suite/engines/funcs/r/sp_cursor.result
mysql-test/suite/engines/funcs/r/sp_simple1.result
mysql-test/suite/engines/funcs/r/sq_all.result
mysql-test/suite/engines/funcs/r/sq_any.result
mysql-test/suite/engines/funcs/r/sq_corr.result
mysql-test/suite/engines/funcs/r/sq_error.result
mysql-test/suite/engines/funcs/r/sq_exists.result
mysql-test/suite/engines/funcs/r/sq_from.result
mysql-test/suite/engines/funcs/r/sq_in.result
mysql-test/suite/engines/funcs/r/sq_row.result
mysql-test/suite/engines/funcs/r/sq_scalar.result
mysql-test/suite/engines/funcs/r/sq_some.result
mysql-test/suite/engines/funcs/r/ta_2part_column_to_pk.result
mysql-test/suite/engines/funcs/r/ta_2part_diff_string_to_pk.result
mysql-test/suite/engines/funcs/r/ta_2part_diff_to_pk.result
mysql-test/suite/engines/funcs/r/ta_2part_string_to_pk.result
mysql-test/suite/engines/funcs/r/ta_3part_column_to_pk.result
mysql-test/suite/engines/funcs/r/ta_3part_string_to_pk.result
mysql-test/suite/engines/funcs/r/ta_add_column.result
mysql-test/suite/engines/funcs/r/ta_add_column2.result
mysql-test/suite/engines/funcs/r/ta_add_column_first.result
mysql-test/suite/engines/funcs/r/ta_add_column_first2.result
mysql-test/suite/engines/funcs/r/ta_add_column_middle.result
mysql-test/suite/engines/funcs/r/ta_add_column_middle2.result
mysql-test/suite/engines/funcs/r/ta_add_string.result
mysql-test/suite/engines/funcs/r/ta_add_string2.result
mysql-test/suite/engines/funcs/r/ta_add_string_first.result
mysql-test/suite/engines/funcs/r/ta_add_string_first2.result
mysql-test/suite/engines/funcs/r/ta_add_string_middle.result
mysql-test/suite/engines/funcs/r/ta_add_string_middle2.result
mysql-test/suite/engines/funcs/r/ta_add_string_unique_index.result
mysql-test/suite/engines/funcs/r/ta_add_unique_index.result
mysql-test/suite/engines/funcs/r/ta_column_from_unsigned.result
mysql-test/suite/engines/funcs/r/ta_column_from_zerofill.result
mysql-test/suite/engines/funcs/r/ta_column_to_index.result
mysql-test/suite/engines/funcs/r/ta_column_to_not_null.result
mysql-test/suite/engines/funcs/r/ta_column_to_null.result
mysql-test/suite/engines/funcs/r/ta_column_to_pk.result
mysql-test/suite/engines/funcs/r/ta_column_to_unsigned.result
mysql-test/suite/engines/funcs/r/ta_column_to_zerofill.result
mysql-test/suite/engines/funcs/r/ta_drop_column.result
mysql-test/suite/engines/funcs/r/ta_drop_index.result
mysql-test/suite/engines/funcs/r/ta_drop_pk_autoincrement.result
mysql-test/suite/engines/funcs/r/ta_drop_pk_number.result
mysql-test/suite/engines/funcs/r/ta_drop_pk_string.result
mysql-test/suite/engines/funcs/r/ta_drop_string_index.result
mysql-test/suite/engines/funcs/r/ta_orderby.result
mysql-test/suite/engines/funcs/r/ta_rename.result
mysql-test/suite/engines/funcs/r/ta_set_drop_default.result
mysql-test/suite/engines/funcs/r/ta_string_drop_column.result
mysql-test/suite/engines/funcs/r/ta_string_to_index.result
mysql-test/suite/engines/funcs/r/ta_string_to_not_null.result
mysql-test/suite/engines/funcs/r/ta_string_to_null.result
mysql-test/suite/engines/funcs/r/ta_string_to_pk.result
mysql-test/suite/engines/funcs/r/tc_column_autoincrement.result
mysql-test/suite/engines/funcs/r/tc_column_comment.result
mysql-test/suite/engines/funcs/r/tc_column_comment_string.result
mysql-test/suite/engines/funcs/r/tc_column_default_decimal.result
mysql-test/suite/engines/funcs/r/tc_column_default_number.result
mysql-test/suite/engines/funcs/r/tc_column_default_string.result
mysql-test/suite/engines/funcs/r/tc_column_enum.result
mysql-test/suite/engines/funcs/r/tc_column_enum_long.result
mysql-test/suite/engines/funcs/r/tc_column_key.result
mysql-test/suite/engines/funcs/r/tc_column_key_length.result
mysql-test/suite/engines/funcs/r/tc_column_length.result
mysql-test/suite/engines/funcs/r/tc_column_length_decimals.result
mysql-test/suite/engines/funcs/r/tc_column_length_zero.result
mysql-test/suite/engines/funcs/r/tc_column_not_null.result
mysql-test/suite/engines/funcs/r/tc_column_null.result
mysql-test/suite/engines/funcs/r/tc_column_primary_key_number.result
mysql-test/suite/engines/funcs/r/tc_column_primary_key_string.result
mysql-test/suite/engines/funcs/r/tc_column_serial.result
mysql-test/suite/engines/funcs/r/tc_column_set.result
mysql-test/suite/engines/funcs/r/tc_column_set_long.result
mysql-test/suite/engines/funcs/r/tc_column_unique_key.result
mysql-test/suite/engines/funcs/r/tc_column_unique_key_string.result
mysql-test/suite/engines/funcs/r/tc_column_unsigned.result
mysql-test/suite/engines/funcs/r/tc_column_zerofill.result
mysql-test/suite/engines/funcs/r/tc_drop_table.result
mysql-test/suite/engines/funcs/r/tc_multicolumn_different.result
mysql-test/suite/engines/funcs/r/tc_multicolumn_same.result
mysql-test/suite/engines/funcs/r/tc_multicolumn_same_string.result
mysql-test/suite/engines/funcs/r/tc_partition_analyze.result
mysql-test/suite/engines/funcs/r/tc_partition_change_from_range_to_hash_key.result
mysql-test/suite/engines/funcs/r/tc_partition_check.result
mysql-test/suite/engines/funcs/r/tc_partition_hash.result
mysql-test/suite/engines/funcs/r/tc_partition_hash_date_function.result
mysql-test/suite/engines/funcs/r/tc_partition_key.result
mysql-test/suite/engines/funcs/r/tc_partition_linear_key.result
mysql-test/suite/engines/funcs/r/tc_partition_list_directory.result
mysql-test/suite/engines/funcs/r/tc_partition_list_error.result
mysql-test/suite/engines/funcs/r/tc_partition_optimize.result
mysql-test/suite/engines/funcs/r/tc_partition_rebuild.result
mysql-test/suite/engines/funcs/r/tc_partition_remove.result
mysql-test/suite/engines/funcs/r/tc_partition_reorg_divide.result
mysql-test/suite/engines/funcs/r/tc_partition_reorg_hash_key.result
mysql-test/suite/engines/funcs/r/tc_partition_reorg_merge.result
mysql-test/suite/engines/funcs/r/tc_partition_repair.result
mysql-test/suite/engines/funcs/r/tc_partition_sub1.result
mysql-test/suite/engines/funcs/r/tc_partition_sub2.result
mysql-test/suite/engines/funcs/r/tc_partition_value.result
mysql-test/suite/engines/funcs/r/tc_partition_value_error.result
mysql-test/suite/engines/funcs/r/tc_partition_value_specific.result
mysql-test/suite/engines/funcs/r/tc_rename.result
mysql-test/suite/engines/funcs/r/tc_rename_across_database.result
mysql-test/suite/engines/funcs/r/tc_rename_error.result
mysql-test/suite/engines/funcs/r/tc_structure_comment.result
mysql-test/suite/engines/funcs/r/tc_structure_create_like.result
mysql-test/suite/engines/funcs/r/tc_structure_create_like_string.result
mysql-test/suite/engines/funcs/r/tc_structure_create_select.result
mysql-test/suite/engines/funcs/r/tc_structure_create_select_string.result
mysql-test/suite/engines/funcs/r/tc_structure_string_comment.result
mysql-test/suite/engines/funcs/r/tc_temporary_column.result
mysql-test/suite/engines/funcs/r/tc_temporary_column_length.result
mysql-test/suite/engines/funcs/r/time_function.result
mysql-test/suite/engines/funcs/r/tr_all_type_triggers.result
mysql-test/suite/engines/funcs/r/tr_delete.result
mysql-test/suite/engines/funcs/r/tr_delete_new_error.result
mysql-test/suite/engines/funcs/r/tr_insert.result
mysql-test/suite/engines/funcs/r/tr_insert_after_error.result
mysql-test/suite/engines/funcs/r/tr_insert_old_error.result
mysql-test/suite/engines/funcs/r/tr_update.result
mysql-test/suite/engines/funcs/r/tr_update_after_error.result
mysql-test/suite/engines/funcs/r/up_calendar_range.result
mysql-test/suite/engines/funcs/r/up_ignore.result
mysql-test/suite/engines/funcs/r/up_limit.result
mysql-test/suite/engines/funcs/r/up_multi_db_table.result
mysql-test/suite/engines/funcs/r/up_multi_table.result
mysql-test/suite/engines/funcs/r/up_nullcheck.result
mysql-test/suite/engines/funcs/r/up_number_range.result
mysql-test/suite/engines/funcs/r/up_string_range.result
mysql-test/suite/engines/funcs/t/
mysql-test/suite/engines/funcs/t/ai_init_alter_table.test
mysql-test/suite/engines/funcs/t/ai_init_create_table.test
mysql-test/suite/engines/funcs/t/ai_init_insert.test
mysql-test/suite/engines/funcs/t/ai_init_insert_id.test
mysql-test/suite/engines/funcs/t/ai_overflow_error.test
mysql-test/suite/engines/funcs/t/ai_reset_by_truncate.test
mysql-test/suite/engines/funcs/t/ai_sql_auto_is_null.test
mysql-test/suite/engines/funcs/t/an_calendar.test
mysql-test/suite/engines/funcs/t/an_number.test
mysql-test/suite/engines/funcs/t/an_string.test
mysql-test/suite/engines/funcs/t/comment_column.test
mysql-test/suite/engines/funcs/t/comment_column2.test
mysql-test/suite/engines/funcs/t/comment_table.test
mysql-test/suite/engines/funcs/t/crash_manycolumns_number.test
mysql-test/suite/engines/funcs/t/crash_manycolumns_string.test
mysql-test/suite/engines/funcs/t/crash_manyindexes_number.test
mysql-test/suite/engines/funcs/t/crash_manyindexes_string.test
mysql-test/suite/engines/funcs/t/crash_manytables_number.test
mysql-test/suite/engines/funcs/t/crash_manytables_string.test
mysql-test/suite/engines/funcs/t/data1.inc
mysql-test/suite/engines/funcs/t/data2.inc
mysql-test/suite/engines/funcs/t/date_function.test
mysql-test/suite/engines/funcs/t/datetime_function.test
mysql-test/suite/engines/funcs/t/db_alter_character_set.test
mysql-test/suite/engines/funcs/t/db_alter_character_set_collate.test
mysql-test/suite/engines/funcs/t/db_alter_collate_ascii.test
mysql-test/suite/engines/funcs/t/db_alter_collate_utf8.test
mysql-test/suite/engines/funcs/t/db_create_character_set.test
mysql-test/suite/engines/funcs/t/db_create_character_set_collate.test
mysql-test/suite/engines/funcs/t/db_create_drop.test
mysql-test/suite/engines/funcs/t/db_create_error.test
mysql-test/suite/engines/funcs/t/db_create_error_reserved.test
mysql-test/suite/engines/funcs/t/db_create_if_not_exists.test
mysql-test/suite/engines/funcs/t/db_drop_error.test
mysql-test/suite/engines/funcs/t/db_use_error.test
mysql-test/suite/engines/funcs/t/de_autoinc.test
mysql-test/suite/engines/funcs/t/de_calendar_range.test
mysql-test/suite/engines/funcs/t/de_ignore.test
mysql-test/suite/engines/funcs/t/de_limit.test
mysql-test/suite/engines/funcs/t/de_multi_db_table.test
mysql-test/suite/engines/funcs/t/de_multi_db_table_using.test
mysql-test/suite/engines/funcs/t/de_multi_table.test
mysql-test/suite/engines/funcs/t/de_multi_table_using.test
mysql-test/suite/engines/funcs/t/de_number_range.test
mysql-test/suite/engines/funcs/t/de_quick.test
mysql-test/suite/engines/funcs/t/de_string_range.test
mysql-test/suite/engines/funcs/t/de_truncate.test
mysql-test/suite/engines/funcs/t/de_truncate_autoinc.test
mysql-test/suite/engines/funcs/t/disabled.def
mysql-test/suite/engines/funcs/t/fu_aggregate_avg_number.test
mysql-test/suite/engines/funcs/t/fu_aggregate_count_number.test
mysql-test/suite/engines/funcs/t/fu_aggregate_max_number.test
mysql-test/suite/engines/funcs/t/fu_aggregate_max_subquery.test
mysql-test/suite/engines/funcs/t/fu_aggregate_min_number.test
mysql-test/suite/engines/funcs/t/fu_aggregate_sum_number.test
mysql-test/suite/engines/funcs/t/general_no_data.test
mysql-test/suite/engines/funcs/t/general_not_null.test
mysql-test/suite/engines/funcs/t/general_null.test
mysql-test/suite/engines/funcs/t/in_calendar_2_unique_constraints_duplicate_update.test
mysql-test/suite/engines/funcs/t/in_calendar_pk_constraint_duplicate_update.test
mysql-test/suite/engines/funcs/t/in_calendar_pk_constraint_error.test
mysql-test/suite/engines/funcs/t/in_calendar_pk_constraint_ignore.test
mysql-test/suite/engines/funcs/t/in_calendar_unique_constraint_duplicate_update.test
mysql-test/suite/engines/funcs/t/in_calendar_unique_constraint_error.test
mysql-test/suite/engines/funcs/t/in_calendar_unique_constraint_ignore.test
mysql-test/suite/engines/funcs/t/in_enum_null.test
mysql-test/suite/engines/funcs/t/in_enum_null_boundary_error.test
mysql-test/suite/engines/funcs/t/in_enum_null_large_error.test
mysql-test/suite/engines/funcs/t/in_insert_select.test
mysql-test/suite/engines/funcs/t/in_insert_select_autoinc.test
mysql-test/suite/engines/funcs/t/in_insert_select_unique_violation.test
mysql-test/suite/engines/funcs/t/in_lob_boundary_error.test
mysql-test/suite/engines/funcs/t/in_multicolumn_calendar_pk_constraint_duplicate_update.test
mysql-test/suite/engines/funcs/t/in_multicolumn_calendar_pk_constraint_error.test
mysql-test/suite/engines/funcs/t/in_multicolumn_calendar_pk_constraint_ignore.test
mysql-test/suite/engines/funcs/t/in_multicolumn_calendar_unique_constraint_duplicate_update.test
mysql-test/suite/engines/funcs/t/in_multicolumn_calendar_unique_constraint_error.test
mysql-test/suite/engines/funcs/t/in_multicolumn_calendar_unique_constraint_ignore.test
mysql-test/suite/engines/funcs/t/in_multicolumn_number_pk_constraint_duplicate_update.test
mysql-test/suite/engines/funcs/t/in_multicolumn_number_pk_constraint_error.test
mysql-test/suite/engines/funcs/t/in_multicolumn_number_pk_constraint_ignore.test
mysql-test/suite/engines/funcs/t/in_multicolumn_number_unique_constraint_duplicate_update.test
mysql-test/suite/engines/funcs/t/in_multicolumn_number_unique_constraint_error.test
mysql-test/suite/engines/funcs/t/in_multicolumn_number_unique_constraint_ignore.test
mysql-test/suite/engines/funcs/t/in_multicolumn_string_pk_constraint_duplicate_update.test
mysql-test/suite/engines/funcs/t/in_multicolumn_string_pk_constraint_error.test
mysql-test/suite/engines/funcs/t/in_multicolumn_string_pk_constraint_ignore.test
mysql-test/suite/engines/funcs/t/in_multicolumn_string_unique_constraint_duplicate_update.test
mysql-test/suite/engines/funcs/t/in_multicolumn_string_unique_constraint_error.test
mysql-test/suite/engines/funcs/t/in_multicolumn_string_unique_constraint_ignore.test
mysql-test/suite/engines/funcs/t/in_number_2_unique_constraints_duplicate_update.test
mysql-test/suite/engines/funcs/t/in_number_boundary_error.test
mysql-test/suite/engines/funcs/t/in_number_decimal_boundary_error.test
mysql-test/suite/engines/funcs/t/in_number_length.test
mysql-test/suite/engines/funcs/t/in_number_null.test
mysql-test/suite/engines/funcs/t/in_number_pk_constraint_duplicate_update.test
mysql-test/suite/engines/funcs/t/in_number_pk_constraint_error.test
mysql-test/suite/engines/funcs/t/in_number_pk_constraint_ignore.test
mysql-test/suite/engines/funcs/t/in_number_unique_constraint_duplicate_update.test
mysql-test/suite/engines/funcs/t/in_number_unique_constraint_error.test
mysql-test/suite/engines/funcs/t/in_number_unique_constraint_ignore.test
mysql-test/suite/engines/funcs/t/in_set_null.test
mysql-test/suite/engines/funcs/t/in_set_null_boundary_error.test
mysql-test/suite/engines/funcs/t/in_set_null_large.test
mysql-test/suite/engines/funcs/t/in_string_2_unique_constraints_duplicate_update.test
mysql-test/suite/engines/funcs/t/in_string_boundary_error.test
mysql-test/suite/engines/funcs/t/in_string_not_null.test
mysql-test/suite/engines/funcs/t/in_string_null.test
mysql-test/suite/engines/funcs/t/in_string_pk_constraint_duplicate_update.test
mysql-test/suite/engines/funcs/t/in_string_pk_constraint_error.test
mysql-test/suite/engines/funcs/t/in_string_pk_constraint_ignore.test
mysql-test/suite/engines/funcs/t/in_string_set_enum_fail.test
mysql-test/suite/engines/funcs/t/in_string_unique_constraint_duplicate_update.test
mysql-test/suite/engines/funcs/t/in_string_unique_constraint_error.test
mysql-test/suite/engines/funcs/t/in_string_unique_constraint_ignore.test
mysql-test/suite/engines/funcs/t/ix_drop.test
mysql-test/suite/engines/funcs/t/ix_drop_error.test
mysql-test/suite/engines/funcs/t/ix_index_decimals.test
mysql-test/suite/engines/funcs/t/ix_index_lob.test
mysql-test/suite/engines/funcs/t/ix_index_non_string.test
mysql-test/suite/engines/funcs/t/ix_index_string.test
mysql-test/suite/engines/funcs/t/ix_index_string_length.test
mysql-test/suite/engines/funcs/t/ix_unique_decimals.test
mysql-test/suite/engines/funcs/t/ix_unique_lob.test
mysql-test/suite/engines/funcs/t/ix_unique_non_string.test
mysql-test/suite/engines/funcs/t/ix_unique_string.test
mysql-test/suite/engines/funcs/t/ix_unique_string_length.test
mysql-test/suite/engines/funcs/t/ix_using_order.test
mysql-test/suite/engines/funcs/t/jp_comment_column.test
mysql-test/suite/engines/funcs/t/jp_comment_older_compatibility1.test
mysql-test/suite/engines/funcs/t/jp_comment_table.test
mysql-test/suite/engines/funcs/t/ld_all_number_string_calendar_types.test
mysql-test/suite/engines/funcs/t/ld_bit.test
mysql-test/suite/engines/funcs/t/ld_enum_set.test
mysql-test/suite/engines/funcs/t/ld_less_columns.test
mysql-test/suite/engines/funcs/t/ld_more_columns_truncated.test
mysql-test/suite/engines/funcs/t/ld_null.test
mysql-test/suite/engines/funcs/t/ld_quote.test
mysql-test/suite/engines/funcs/t/ld_simple.test
mysql-test/suite/engines/funcs/t/ld_starting.test
mysql-test/suite/engines/funcs/t/ld_unique_error1.test
mysql-test/suite/engines/funcs/t/ld_unique_error1_local.test
mysql-test/suite/engines/funcs/t/ld_unique_error2.test
mysql-test/suite/engines/funcs/t/ld_unique_error2_local.test
mysql-test/suite/engines/funcs/t/ld_unique_error3.test
mysql-test/suite/engines/funcs/t/ld_unique_error3_local.test
mysql-test/suite/engines/funcs/t/load_bit.inc
mysql-test/suite/engines/funcs/t/load_enum_set.inc
mysql-test/suite/engines/funcs/t/load_less_columns.inc
mysql-test/suite/engines/funcs/t/load_more_columns.inc
mysql-test/suite/engines/funcs/t/load_null.inc
mysql-test/suite/engines/funcs/t/load_null2.inc
mysql-test/suite/engines/funcs/t/load_quote.inc
mysql-test/suite/engines/funcs/t/load_simple.inc
mysql-test/suite/engines/funcs/t/load_starting.inc
mysql-test/suite/engines/funcs/t/load_unique_error1.inc
mysql-test/suite/engines/funcs/t/load_unique_error2.inc
mysql-test/suite/engines/funcs/t/load_unique_error3.inc
mysql-test/suite/engines/funcs/t/ps_number_length.test
mysql-test/suite/engines/funcs/t/ps_number_null.test
mysql-test/suite/engines/funcs/t/ps_string_not_null.test
mysql-test/suite/engines/funcs/t/ps_string_null.test
mysql-test/suite/engines/funcs/t/re_number_range.test
mysql-test/suite/engines/funcs/t/re_number_range_set.test
mysql-test/suite/engines/funcs/t/re_number_select.test
mysql-test/suite/engines/funcs/t/re_string_range.test
mysql-test/suite/engines/funcs/t/re_string_range_set.test
mysql-test/suite/engines/funcs/t/rpl000010-slave.opt
mysql-test/suite/engines/funcs/t/rpl000010.test
mysql-test/suite/engines/funcs/t/rpl000011.test
mysql-test/suite/engines/funcs/t/rpl000013.test
mysql-test/suite/engines/funcs/t/rpl000017-slave.opt
mysql-test/suite/engines/funcs/t/rpl000017.test
mysql-test/suite/engines/funcs/t/rpl_000015.test
mysql-test/suite/engines/funcs/t/rpl_LD_INFILE.test
mysql-test/suite/engines/funcs/t/rpl_REDIRECT.test
mysql-test/suite/engines/funcs/t/rpl_alter.test
mysql-test/suite/engines/funcs/t/rpl_alter_db.test
mysql-test/suite/engines/funcs/t/rpl_bit.test
mysql-test/suite/engines/funcs/t/rpl_bit_npk.test
mysql-test/suite/engines/funcs/t/rpl_change_master.test
mysql-test/suite/engines/funcs/t/rpl_create_database-master.opt
mysql-test/suite/engines/funcs/t/rpl_create_database-slave.opt
mysql-test/suite/engines/funcs/t/rpl_create_database.test
mysql-test/suite/engines/funcs/t/rpl_do_grant.test
mysql-test/suite/engines/funcs/t/rpl_drop.test
mysql-test/suite/engines/funcs/t/rpl_drop_db.test
mysql-test/suite/engines/funcs/t/rpl_dual_pos_advance-master.opt
mysql-test/suite/engines/funcs/t/rpl_dual_pos_advance.test
mysql-test/suite/engines/funcs/t/rpl_empty_master_crash-master.opt
mysql-test/suite/engines/funcs/t/rpl_empty_master_crash.test
mysql-test/suite/engines/funcs/t/rpl_err_ignoredtable-slave.opt
mysql-test/suite/engines/funcs/t/rpl_err_ignoredtable.test
mysql-test/suite/engines/funcs/t/rpl_flushlog_loop.test
mysql-test/suite/engines/funcs/t/rpl_free_items-slave.opt
mysql-test/suite/engines/funcs/t/rpl_free_items.test
mysql-test/suite/engines/funcs/t/rpl_get_lock.test
mysql-test/suite/engines/funcs/t/rpl_ignore_grant-slave.opt
mysql-test/suite/engines/funcs/t/rpl_ignore_grant.test
mysql-test/suite/engines/funcs/t/rpl_ignore_revoke-slave.opt
mysql-test/suite/engines/funcs/t/rpl_ignore_revoke.test
mysql-test/suite/engines/funcs/t/rpl_ignore_table_update-slave.opt
mysql-test/suite/engines/funcs/t/rpl_ignore_table_update.test
mysql-test/suite/engines/funcs/t/rpl_init_slave-slave.opt
mysql-test/suite/engines/funcs/t/rpl_init_slave.test
mysql-test/suite/engines/funcs/t/rpl_insert.test
mysql-test/suite/engines/funcs/t/rpl_insert_select.test
mysql-test/suite/engines/funcs/t/rpl_loaddata2.test
mysql-test/suite/engines/funcs/t/rpl_loaddata_m-master.opt
mysql-test/suite/engines/funcs/t/rpl_loaddata_m.test
mysql-test/suite/engines/funcs/t/rpl_loaddata_s-slave.opt
mysql-test/suite/engines/funcs/t/rpl_loaddata_s.test
mysql-test/suite/engines/funcs/t/rpl_loaddatalocal.test
mysql-test/suite/engines/funcs/t/rpl_loadfile.test
mysql-test/suite/engines/funcs/t/rpl_log_pos.test
mysql-test/suite/engines/funcs/t/rpl_many_optimize.test
mysql-test/suite/engines/funcs/t/rpl_master_pos_wait.test
mysql-test/suite/engines/funcs/t/rpl_misc_functions.test
mysql-test/suite/engines/funcs/t/rpl_multi_delete-slave.opt
mysql-test/suite/engines/funcs/t/rpl_multi_delete.test
mysql-test/suite/engines/funcs/t/rpl_multi_delete2-slave.opt
mysql-test/suite/engines/funcs/t/rpl_multi_delete2.test
mysql-test/suite/engines/funcs/t/rpl_multi_update4-slave.opt
mysql-test/suite/engines/funcs/t/rpl_multi_update4.test
mysql-test/suite/engines/funcs/t/rpl_ps.test
mysql-test/suite/engines/funcs/t/rpl_rbr_to_sbr.test
mysql-test/suite/engines/funcs/t/rpl_relayspace-slave.opt
mysql-test/suite/engines/funcs/t/rpl_relayspace.test
mysql-test/suite/engines/funcs/t/rpl_replicate_ignore_db-slave.opt
mysql-test/suite/engines/funcs/t/rpl_replicate_ignore_db.test
mysql-test/suite/engines/funcs/t/rpl_row_NOW.test
mysql-test/suite/engines/funcs/t/rpl_row_USER.test
mysql-test/suite/engines/funcs/t/rpl_row_drop.test
mysql-test/suite/engines/funcs/t/rpl_row_func001.test
mysql-test/suite/engines/funcs/t/rpl_row_inexist_tbl-slave.opt
mysql-test/suite/engines/funcs/t/rpl_row_inexist_tbl.test
mysql-test/suite/engines/funcs/t/rpl_row_max_relay_size.test
mysql-test/suite/engines/funcs/t/rpl_row_reset_slave.test
mysql-test/suite/engines/funcs/t/rpl_row_sp001.test
mysql-test/suite/engines/funcs/t/rpl_row_sp005.test
mysql-test/suite/engines/funcs/t/rpl_row_sp008.test
mysql-test/suite/engines/funcs/t/rpl_row_sp009.test
mysql-test/suite/engines/funcs/t/rpl_row_sp010.test
mysql-test/suite/engines/funcs/t/rpl_row_sp011.test
mysql-test/suite/engines/funcs/t/rpl_row_sp012.test
mysql-test/suite/engines/funcs/t/rpl_row_stop_middle.test
mysql-test/suite/engines/funcs/t/rpl_row_trig001.test
mysql-test/suite/engines/funcs/t/rpl_row_trig002.test
mysql-test/suite/engines/funcs/t/rpl_row_trig003.test
mysql-test/suite/engines/funcs/t/rpl_row_until.test
mysql-test/suite/engines/funcs/t/rpl_row_view01.test
mysql-test/suite/engines/funcs/t/rpl_server_id1.test
mysql-test/suite/engines/funcs/t/rpl_server_id2-slave.opt
mysql-test/suite/engines/funcs/t/rpl_server_id2.test
mysql-test/suite/engines/funcs/t/rpl_session_var.test
mysql-test/suite/engines/funcs/t/rpl_sf.test
mysql-test/suite/engines/funcs/t/rpl_skip_error-slave.opt
mysql-test/suite/engines/funcs/t/rpl_skip_error.test
mysql-test/suite/engines/funcs/t/rpl_slave_status.test
mysql-test/suite/engines/funcs/t/rpl_sp-master.opt
mysql-test/suite/engines/funcs/t/rpl_sp-slave.opt
mysql-test/suite/engines/funcs/t/rpl_sp.test
mysql-test/suite/engines/funcs/t/rpl_sp004.test
mysql-test/suite/engines/funcs/t/rpl_sp_effects-master.opt
mysql-test/suite/engines/funcs/t/rpl_sp_effects-slave.opt
mysql-test/suite/engines/funcs/t/rpl_sp_effects.test
mysql-test/suite/engines/funcs/t/rpl_start_stop_slave.test
mysql-test/suite/engines/funcs/t/rpl_stm_max_relay_size.test
mysql-test/suite/engines/funcs/t/rpl_stm_mystery22.test
mysql-test/suite/engines/funcs/t/rpl_stm_no_op.test
mysql-test/suite/engines/funcs/t/rpl_stm_reset_slave.test
mysql-test/suite/engines/funcs/t/rpl_switch_stm_row_mixed.test
mysql-test/suite/engines/funcs/t/rpl_temp_table.test
mysql-test/suite/engines/funcs/t/rpl_temporary.test
mysql-test/suite/engines/funcs/t/rpl_trigger.test
mysql-test/suite/engines/funcs/t/rpl_trunc_temp.test
mysql-test/suite/engines/funcs/t/rpl_user_variables.test
mysql-test/suite/engines/funcs/t/rpl_variables-master.opt
mysql-test/suite/engines/funcs/t/rpl_variables.test
mysql-test/suite/engines/funcs/t/rpl_view-slave.opt
mysql-test/suite/engines/funcs/t/rpl_view.test
mysql-test/suite/engines/funcs/t/se_join_cross.test
mysql-test/suite/engines/funcs/t/se_join_default.test
mysql-test/suite/engines/funcs/t/se_join_inner.test
mysql-test/suite/engines/funcs/t/se_join_left.test
mysql-test/suite/engines/funcs/t/se_join_left_outer.test
mysql-test/suite/engines/funcs/t/se_join_natural_left.test
mysql-test/suite/engines/funcs/t/se_join_natural_left_outer.test
mysql-test/suite/engines/funcs/t/se_join_natural_right.test
mysql-test/suite/engines/funcs/t/se_join_natural_right_outer.test
mysql-test/suite/engines/funcs/t/se_join_right.test
mysql-test/suite/engines/funcs/t/se_join_right_outer.test
mysql-test/suite/engines/funcs/t/se_join_straight.test
mysql-test/suite/engines/funcs/t/se_rowid.test
mysql-test/suite/engines/funcs/t/se_string_distinct.test
mysql-test/suite/engines/funcs/t/se_string_from.test
mysql-test/suite/engines/funcs/t/se_string_groupby.test
mysql-test/suite/engines/funcs/t/se_string_having.test
mysql-test/suite/engines/funcs/t/se_string_limit.test
mysql-test/suite/engines/funcs/t/se_string_orderby.test
mysql-test/suite/engines/funcs/t/se_string_union.test
mysql-test/suite/engines/funcs/t/se_string_where.test
mysql-test/suite/engines/funcs/t/se_string_where_and.test
mysql-test/suite/engines/funcs/t/se_string_where_or.test
mysql-test/suite/engines/funcs/t/sf_alter.test
mysql-test/suite/engines/funcs/t/sf_cursor.test
mysql-test/suite/engines/funcs/t/sf_simple1.test
mysql-test/suite/engines/funcs/t/sp_alter.test
mysql-test/suite/engines/funcs/t/sp_cursor.test
mysql-test/suite/engines/funcs/t/sp_simple1.test
mysql-test/suite/engines/funcs/t/sq_all.test
mysql-test/suite/engines/funcs/t/sq_any.test
mysql-test/suite/engines/funcs/t/sq_corr.test
mysql-test/suite/engines/funcs/t/sq_error.test
mysql-test/suite/engines/funcs/t/sq_exists.test
mysql-test/suite/engines/funcs/t/sq_from.test
mysql-test/suite/engines/funcs/t/sq_in.test
mysql-test/suite/engines/funcs/t/sq_row.test
mysql-test/suite/engines/funcs/t/sq_scalar.test
mysql-test/suite/engines/funcs/t/sq_some.test
mysql-test/suite/engines/funcs/t/ta_2part_column_to_pk.test
mysql-test/suite/engines/funcs/t/ta_2part_diff_string_to_pk.test
mysql-test/suite/engines/funcs/t/ta_2part_diff_to_pk.test
mysql-test/suite/engines/funcs/t/ta_2part_string_to_pk.test
mysql-test/suite/engines/funcs/t/ta_3part_column_to_pk.test
mysql-test/suite/engines/funcs/t/ta_3part_string_to_pk.test
mysql-test/suite/engines/funcs/t/ta_add_column.test
mysql-test/suite/engines/funcs/t/ta_add_column2.test
mysql-test/suite/engines/funcs/t/ta_add_column_first.test
mysql-test/suite/engines/funcs/t/ta_add_column_first2.test
mysql-test/suite/engines/funcs/t/ta_add_column_middle.test
mysql-test/suite/engines/funcs/t/ta_add_column_middle2.test
mysql-test/suite/engines/funcs/t/ta_add_string.test
mysql-test/suite/engines/funcs/t/ta_add_string2.test
mysql-test/suite/engines/funcs/t/ta_add_string_first.test
mysql-test/suite/engines/funcs/t/ta_add_string_first2.test
mysql-test/suite/engines/funcs/t/ta_add_string_middle.test
mysql-test/suite/engines/funcs/t/ta_add_string_middle2.test
mysql-test/suite/engines/funcs/t/ta_add_string_unique_index.test
mysql-test/suite/engines/funcs/t/ta_add_unique_index.test
mysql-test/suite/engines/funcs/t/ta_column_from_unsigned.test
mysql-test/suite/engines/funcs/t/ta_column_from_zerofill.test
mysql-test/suite/engines/funcs/t/ta_column_to_index.test
mysql-test/suite/engines/funcs/t/ta_column_to_not_null.test
mysql-test/suite/engines/funcs/t/ta_column_to_null.test
mysql-test/suite/engines/funcs/t/ta_column_to_pk.test
mysql-test/suite/engines/funcs/t/ta_column_to_unsigned.test
mysql-test/suite/engines/funcs/t/ta_column_to_zerofill.test
mysql-test/suite/engines/funcs/t/ta_drop_column.test
mysql-test/suite/engines/funcs/t/ta_drop_index.test
mysql-test/suite/engines/funcs/t/ta_drop_pk_autoincrement.test
mysql-test/suite/engines/funcs/t/ta_drop_pk_number.test
mysql-test/suite/engines/funcs/t/ta_drop_pk_string.test
mysql-test/suite/engines/funcs/t/ta_drop_string_index.test
mysql-test/suite/engines/funcs/t/ta_orderby.test
mysql-test/suite/engines/funcs/t/ta_rename.test
mysql-test/suite/engines/funcs/t/ta_set_drop_default.test
mysql-test/suite/engines/funcs/t/ta_string_drop_column.test
mysql-test/suite/engines/funcs/t/ta_string_to_index.test
mysql-test/suite/engines/funcs/t/ta_string_to_not_null.test
mysql-test/suite/engines/funcs/t/ta_string_to_null.test
mysql-test/suite/engines/funcs/t/ta_string_to_pk.test
mysql-test/suite/engines/funcs/t/tc_column_autoincrement.test
mysql-test/suite/engines/funcs/t/tc_column_comment.test
mysql-test/suite/engines/funcs/t/tc_column_comment_string.test
mysql-test/suite/engines/funcs/t/tc_column_default_decimal.test
mysql-test/suite/engines/funcs/t/tc_column_default_number.test
mysql-test/suite/engines/funcs/t/tc_column_default_string.test
mysql-test/suite/engines/funcs/t/tc_column_enum.test
mysql-test/suite/engines/funcs/t/tc_column_enum_long.test
mysql-test/suite/engines/funcs/t/tc_column_key.test
mysql-test/suite/engines/funcs/t/tc_column_key_length.test
mysql-test/suite/engines/funcs/t/tc_column_length.test
mysql-test/suite/engines/funcs/t/tc_column_length_decimals.test
mysql-test/suite/engines/funcs/t/tc_column_length_zero.test
mysql-test/suite/engines/funcs/t/tc_column_not_null.test
mysql-test/suite/engines/funcs/t/tc_column_null.test
mysql-test/suite/engines/funcs/t/tc_column_primary_key_number.test
mysql-test/suite/engines/funcs/t/tc_column_primary_key_string.test
mysql-test/suite/engines/funcs/t/tc_column_serial.test
mysql-test/suite/engines/funcs/t/tc_column_set.test
mysql-test/suite/engines/funcs/t/tc_column_set_long.test
mysql-test/suite/engines/funcs/t/tc_column_unique_key.test
mysql-test/suite/engines/funcs/t/tc_column_unique_key_string.test
mysql-test/suite/engines/funcs/t/tc_column_unsigned.test
mysql-test/suite/engines/funcs/t/tc_column_zerofill.test
mysql-test/suite/engines/funcs/t/tc_drop_table.test
mysql-test/suite/engines/funcs/t/tc_multicolumn_different.test
mysql-test/suite/engines/funcs/t/tc_multicolumn_same.test
mysql-test/suite/engines/funcs/t/tc_multicolumn_same_string.test
mysql-test/suite/engines/funcs/t/tc_partition_analyze.test
mysql-test/suite/engines/funcs/t/tc_partition_change_from_range_to_hash_key.test
mysql-test/suite/engines/funcs/t/tc_partition_check.test
mysql-test/suite/engines/funcs/t/tc_partition_hash.test
mysql-test/suite/engines/funcs/t/tc_partition_hash_date_function.test
mysql-test/suite/engines/funcs/t/tc_partition_key.test
mysql-test/suite/engines/funcs/t/tc_partition_linear_key.test
mysql-test/suite/engines/funcs/t/tc_partition_list_directory.test
mysql-test/suite/engines/funcs/t/tc_partition_list_error.test
mysql-test/suite/engines/funcs/t/tc_partition_optimize.test
mysql-test/suite/engines/funcs/t/tc_partition_rebuild.test
mysql-test/suite/engines/funcs/t/tc_partition_remove.test
mysql-test/suite/engines/funcs/t/tc_partition_reorg_divide.test
mysql-test/suite/engines/funcs/t/tc_partition_reorg_hash_key.test
mysql-test/suite/engines/funcs/t/tc_partition_reorg_merge.test
mysql-test/suite/engines/funcs/t/tc_partition_repair.test
mysql-test/suite/engines/funcs/t/tc_partition_sub1.test
mysql-test/suite/engines/funcs/t/tc_partition_sub2.test
mysql-test/suite/engines/funcs/t/tc_partition_value.test
mysql-test/suite/engines/funcs/t/tc_partition_value_error.test
mysql-test/suite/engines/funcs/t/tc_partition_value_specific.test
mysql-test/suite/engines/funcs/t/tc_rename.test
mysql-test/suite/engines/funcs/t/tc_rename_across_database.test
mysql-test/suite/engines/funcs/t/tc_rename_error.test
mysql-test/suite/engines/funcs/t/tc_structure_comment.test
mysql-test/suite/engines/funcs/t/tc_structure_create_like.test
mysql-test/suite/engines/funcs/t/tc_structure_create_like_string.test
mysql-test/suite/engines/funcs/t/tc_structure_create_select.test
mysql-test/suite/engines/funcs/t/tc_structure_create_select_string.test
mysql-test/suite/engines/funcs/t/tc_structure_string_comment.test
mysql-test/suite/engines/funcs/t/tc_temporary_column.test
mysql-test/suite/engines/funcs/t/tc_temporary_column_length.test
mysql-test/suite/engines/funcs/t/time_function.test
mysql-test/suite/engines/funcs/t/tr_all_type_triggers.test
mysql-test/suite/engines/funcs/t/tr_delete.test
mysql-test/suite/engines/funcs/t/tr_delete_new_error.test
mysql-test/suite/engines/funcs/t/tr_insert.test
mysql-test/suite/engines/funcs/t/tr_insert_after_error.test
mysql-test/suite/engines/funcs/t/tr_insert_old_error.test
mysql-test/suite/engines/funcs/t/tr_update.test
mysql-test/suite/engines/funcs/t/tr_update_after_error.test
mysql-test/suite/engines/funcs/t/up_calendar_range.test
mysql-test/suite/engines/funcs/t/up_ignore.test
mysql-test/suite/engines/funcs/t/up_limit.test
mysql-test/suite/engines/funcs/t/up_multi_db_table.test
mysql-test/suite/engines/funcs/t/up_multi_table.test
mysql-test/suite/engines/funcs/t/up_nullcheck.test
mysql-test/suite/engines/funcs/t/up_number_range.test
mysql-test/suite/engines/funcs/t/up_string_range.test
mysql-test/suite/engines/funcs/t/wait_show_pattern.inc
mysql-test/suite/engines/funcs/t/wait_slave_status.inc
mysql-test/suite/engines/iuds/
mysql-test/suite/engines/iuds/r/
mysql-test/suite/engines/iuds/r/delete_decimal.result
mysql-test/suite/engines/iuds/r/delete_time.result
mysql-test/suite/engines/iuds/r/delete_year.result
mysql-test/suite/engines/iuds/r/insert_calendar.result
mysql-test/suite/engines/iuds/r/insert_decimal.result
mysql-test/suite/engines/iuds/r/insert_number.result
mysql-test/suite/engines/iuds/r/insert_time.result
mysql-test/suite/engines/iuds/r/insert_year.result
mysql-test/suite/engines/iuds/r/strings_charsets_update_delete.result
mysql-test/suite/engines/iuds/r/strings_update_delete.result
mysql-test/suite/engines/iuds/r/type_bit_iuds.result
mysql-test/suite/engines/iuds/r/update_decimal.result
mysql-test/suite/engines/iuds/r/update_delete_calendar.result
mysql-test/suite/engines/iuds/r/update_delete_number.result
mysql-test/suite/engines/iuds/r/update_time.result
mysql-test/suite/engines/iuds/r/update_year.result
mysql-test/suite/engines/iuds/t/
mysql-test/suite/engines/iuds/t/delete_decimal.test
mysql-test/suite/engines/iuds/t/delete_time.test
mysql-test/suite/engines/iuds/t/delete_year.test
mysql-test/suite/engines/iuds/t/disabled.def
mysql-test/suite/engines/iuds/t/hindi.txt
mysql-test/suite/engines/iuds/t/insert_calendar.test
mysql-test/suite/engines/iuds/t/insert_decimal.test
mysql-test/suite/engines/iuds/t/insert_number.test
mysql-test/suite/engines/iuds/t/insert_time.test
mysql-test/suite/engines/iuds/t/insert_year.test
mysql-test/suite/engines/iuds/t/sample.txt
mysql-test/suite/engines/iuds/t/strings_charsets_update_delete.test
mysql-test/suite/engines/iuds/t/strings_update_delete.test
mysql-test/suite/engines/iuds/t/type_bit_iuds.test
mysql-test/suite/engines/iuds/t/update_decimal.test
mysql-test/suite/engines/iuds/t/update_delete_calendar.test
mysql-test/suite/engines/iuds/t/update_delete_number.test
mysql-test/suite/engines/iuds/t/update_time.test
mysql-test/suite/engines/iuds/t/update_year.test
mysql-test/suite/engines/rr_trx/
mysql-test/suite/engines/rr_trx/check_consistency.sql
mysql-test/suite/engines/rr_trx/include/
mysql-test/suite/engines/rr_trx/include/check_for_error_rollback.inc
mysql-test/suite/engines/rr_trx/include/check_for_error_rollback_skip.inc
mysql-test/suite/engines/rr_trx/include/check_repeatable_read_all_columns.inc
mysql-test/suite/engines/rr_trx/include/record_query_all_columns.inc
mysql-test/suite/engines/rr_trx/include/rr_init.test
mysql-test/suite/engines/rr_trx/init_innodb.txt
mysql-test/suite/engines/rr_trx/r/
mysql-test/suite/engines/rr_trx/r/init_innodb.result
mysql-test/suite/engines/rr_trx/r/rr_c_count_not_zero.result
mysql-test/suite/engines/rr_trx/r/rr_c_stats.result
mysql-test/suite/engines/rr_trx/r/rr_i_40-44.result
mysql-test/suite/engines/rr_trx/r/rr_id_3.result
mysql-test/suite/engines/rr_trx/r/rr_id_900.result
mysql-test/suite/engines/rr_trx/r/rr_insert_select_2.result
mysql-test/suite/engines/rr_trx/r/rr_iud_rollback-multi-50.result
mysql-test/suite/engines/rr_trx/r/rr_replace_7-8.result
mysql-test/suite/engines/rr_trx/r/rr_s_select-uncommitted.result
mysql-test/suite/engines/rr_trx/r/rr_sc_select-limit-nolimit_4.result
mysql-test/suite/engines/rr_trx/r/rr_sc_select-same_2.result
mysql-test/suite/engines/rr_trx/r/rr_sc_sum_total.result
mysql-test/suite/engines/rr_trx/r/rr_u_10-19.result
mysql-test/suite/engines/rr_trx/r/rr_u_10-19_nolimit.result
mysql-test/suite/engines/rr_trx/r/rr_u_4.result
mysql-test/suite/engines/rr_trx/run.txt
mysql-test/suite/engines/rr_trx/run_stress_tx_rr.pl
mysql-test/suite/engines/rr_trx/t/
mysql-test/suite/engines/rr_trx/t/init_innodb.test
mysql-test/suite/engines/rr_trx/t/rr_c_count_not_zero.test
mysql-test/suite/engines/rr_trx/t/rr_c_stats.test
mysql-test/suite/engines/rr_trx/t/rr_i_40-44.test
mysql-test/suite/engines/rr_trx/t/rr_id_3.test
mysql-test/suite/engines/rr_trx/t/rr_id_900.test
mysql-test/suite/engines/rr_trx/t/rr_insert_select_2.test
mysql-test/suite/engines/rr_trx/t/rr_iud_rollback-multi-50.test
mysql-test/suite/engines/rr_trx/t/rr_replace_7-8.test
mysql-test/suite/engines/rr_trx/t/rr_s_select-uncommitted.test
mysql-test/suite/engines/rr_trx/t/rr_sc_select-limit-nolimit_4.test
mysql-test/suite/engines/rr_trx/t/rr_sc_select-same_2.test
mysql-test/suite/engines/rr_trx/t/rr_sc_sum_total.test
mysql-test/suite/engines/rr_trx/t/rr_u_10-19.test
mysql-test/suite/engines/rr_trx/t/rr_u_10-19_nolimit.test
mysql-test/suite/engines/rr_trx/t/rr_u_4.test
mysql-test/suite/innodb/r/innodb-autoinc-44030.result
mysql-test/suite/innodb/r/innodb-autoinc.result
mysql-test/suite/innodb/r/innodb-lock.result
mysql-test/suite/innodb/r/innodb-replace.result
mysql-test/suite/innodb/r/innodb-semi-consistent.result
mysql-test/suite/innodb/r/innodb-use-sys-malloc.result
mysql-test/suite/innodb/r/innodb_bug21704.result
mysql-test/suite/innodb/r/innodb_bug34053.result
mysql-test/suite/innodb/r/innodb_bug35220.result
mysql-test/suite/innodb/r/innodb_bug38231.result
mysql-test/suite/innodb/r/innodb_bug40565.result
mysql-test/suite/innodb/r/innodb_bug42101-nonzero.result
mysql-test/suite/innodb/r/innodb_bug42101.result
mysql-test/suite/innodb/r/innodb_bug44369.result
mysql-test/suite/innodb/r/innodb_bug45357.result
mysql-test/suite/innodb/r/innodb_bug46000.result
mysql-test/suite/innodb/r/innodb_bug47621.result
mysql-test/suite/innodb/r/innodb_bug47777.result
mysql-test/suite/innodb/r/innodb_bug51920.result
mysql-test/suite/innodb/r/innodb_bug52663.result
mysql-test/suite/innodb/r/innodb_misc1.result
mysql-test/suite/innodb/r/innodb_trx_weight.result
mysql-test/suite/innodb/t/disabled.def
mysql-test/suite/innodb/t/innodb-autoinc-44030.test
mysql-test/suite/innodb/t/innodb-autoinc.test
mysql-test/suite/innodb/t/innodb-lock.test
mysql-test/suite/innodb/t/innodb-master.opt
mysql-test/suite/innodb/t/innodb-replace.test
mysql-test/suite/innodb/t/innodb-semi-consistent-master.opt
mysql-test/suite/innodb/t/innodb-semi-consistent.test
mysql-test/suite/innodb/t/innodb_bug21704.test
mysql-test/suite/innodb/t/innodb_bug34053.test
mysql-test/suite/innodb/t/innodb_bug35220.test
mysql-test/suite/innodb/t/innodb_bug38231.test
mysql-test/suite/innodb/t/innodb_bug40565.test
mysql-test/suite/innodb/t/innodb_bug42101-nonzero-master.opt
mysql-test/suite/innodb/t/innodb_bug42101-nonzero.test
mysql-test/suite/innodb/t/innodb_bug42101.test
mysql-test/suite/innodb/t/innodb_bug44369.test
mysql-test/suite/innodb/t/innodb_bug45357.test
mysql-test/suite/innodb/t/innodb_bug46000.test
mysql-test/suite/innodb/t/innodb_bug47621.test
mysql-test/suite/innodb/t/innodb_bug47777.test
mysql-test/suite/innodb/t/innodb_bug51920.test
mysql-test/suite/innodb/t/innodb_bug52663-master.opt
mysql-test/suite/innodb/t/innodb_bug52663.test
mysql-test/suite/innodb/t/innodb_misc1-master.opt
mysql-test/suite/innodb/t/innodb_misc1.test
mysql-test/suite/innodb/t/innodb_trx_weight.test
mysql-test/suite/innodb_plugin/
mysql-test/suite/innodb_plugin/include/
mysql-test/suite/innodb_plugin/include/ctype_innodb_like.inc
mysql-test/suite/innodb_plugin/include/innodb-index.inc
mysql-test/suite/innodb_plugin/include/innodb_trx_weight.inc
mysql-test/suite/innodb_plugin/r/
mysql-test/suite/innodb_plugin/r/innodb-analyze.result
mysql-test/suite/innodb_plugin/r/innodb-autoinc-44030.result
mysql-test/suite/innodb_plugin/r/innodb-autoinc.result
mysql-test/suite/innodb_plugin/r/innodb-consistent.result
mysql-test/suite/innodb_plugin/r/innodb-index.result
mysql-test/suite/innodb_plugin/r/innodb-index_ucs2.result
mysql-test/suite/innodb_plugin/r/innodb-lock.result
mysql-test/suite/innodb_plugin/r/innodb-replace.result
mysql-test/suite/innodb_plugin/r/innodb-semi-consistent.result
mysql-test/suite/innodb_plugin/r/innodb-timeout.result
mysql-test/suite/innodb_plugin/r/innodb-use-sys-malloc.result
mysql-test/suite/innodb_plugin/r/innodb-zip.result
mysql-test/suite/innodb_plugin/r/innodb.result
mysql-test/suite/innodb_plugin/r/innodb_bug21704.result
mysql-test/suite/innodb_plugin/r/innodb_bug34053.result
mysql-test/suite/innodb_plugin/r/innodb_bug34300.result
mysql-test/suite/innodb_plugin/r/innodb_bug35220.result
mysql-test/suite/innodb_plugin/r/innodb_bug36169.result
mysql-test/suite/innodb_plugin/r/innodb_bug36172.result
mysql-test/suite/innodb_plugin/r/innodb_bug38231.result
mysql-test/suite/innodb_plugin/r/innodb_bug39438.result
mysql-test/suite/innodb_plugin/r/innodb_bug40360.result
mysql-test/suite/innodb_plugin/r/innodb_bug40565.result
mysql-test/suite/innodb_plugin/r/innodb_bug41904.result
mysql-test/suite/innodb_plugin/r/innodb_bug42101-nonzero.result
mysql-test/suite/innodb_plugin/r/innodb_bug42101.result
mysql-test/suite/innodb_plugin/r/innodb_bug44032.result
mysql-test/suite/innodb_plugin/r/innodb_bug44369.result
mysql-test/suite/innodb_plugin/r/innodb_bug44571.result
mysql-test/suite/innodb_plugin/r/innodb_bug45357.result
mysql-test/suite/innodb_plugin/r/innodb_bug46000.result
mysql-test/suite/innodb_plugin/r/innodb_bug46676.result
mysql-test/suite/innodb_plugin/r/innodb_bug47167.result
mysql-test/suite/innodb_plugin/r/innodb_bug47621.result
mysql-test/suite/innodb_plugin/r/innodb_bug47622.result
mysql-test/suite/innodb_plugin/r/innodb_bug47777.result
mysql-test/suite/innodb_plugin/r/innodb_bug51378.result
mysql-test/suite/innodb_plugin/r/innodb_bug51920.result
mysql-test/suite/innodb_plugin/r/innodb_bug52663.result
mysql-test/suite/innodb_plugin/r/innodb_bug52745.result
mysql-test/suite/innodb_plugin/r/innodb_file_format.result
mysql-test/suite/innodb_plugin/r/innodb_information_schema.result
mysql-test/suite/innodb_plugin/r/innodb_trx_weight.result
mysql-test/suite/innodb_plugin/t/
mysql-test/suite/innodb_plugin/t/innodb-analyze.test
mysql-test/suite/innodb_plugin/t/innodb-autoinc-44030.test
mysql-test/suite/innodb_plugin/t/innodb-autoinc.test
mysql-test/suite/innodb_plugin/t/innodb-consistent-master.opt
mysql-test/suite/innodb_plugin/t/innodb-consistent.test
mysql-test/suite/innodb_plugin/t/innodb-index.test
mysql-test/suite/innodb_plugin/t/innodb-index_ucs2.test
mysql-test/suite/innodb_plugin/t/innodb-lock.test
mysql-test/suite/innodb_plugin/t/innodb-master.opt
mysql-test/suite/innodb_plugin/t/innodb-replace.test
mysql-test/suite/innodb_plugin/t/innodb-semi-consistent-master.opt
mysql-test/suite/innodb_plugin/t/innodb-semi-consistent.test
mysql-test/suite/innodb_plugin/t/innodb-timeout.test
mysql-test/suite/innodb_plugin/t/innodb-use-sys-malloc-master.opt
mysql-test/suite/innodb_plugin/t/innodb-use-sys-malloc.test
mysql-test/suite/innodb_plugin/t/innodb-zip.test
mysql-test/suite/innodb_plugin/t/innodb.test
mysql-test/suite/innodb_plugin/t/innodb_bug21704.test
mysql-test/suite/innodb_plugin/t/innodb_bug34053.test
mysql-test/suite/innodb_plugin/t/innodb_bug34300.test
mysql-test/suite/innodb_plugin/t/innodb_bug35220.test
mysql-test/suite/innodb_plugin/t/innodb_bug36169.test
mysql-test/suite/innodb_plugin/t/innodb_bug36172.test
mysql-test/suite/innodb_plugin/t/innodb_bug38231.test
mysql-test/suite/innodb_plugin/t/innodb_bug39438-master.opt
mysql-test/suite/innodb_plugin/t/innodb_bug39438.test
mysql-test/suite/innodb_plugin/t/innodb_bug40360.test
mysql-test/suite/innodb_plugin/t/innodb_bug40565.test
mysql-test/suite/innodb_plugin/t/innodb_bug41904.test
mysql-test/suite/innodb_plugin/t/innodb_bug42101-nonzero-master.opt
mysql-test/suite/innodb_plugin/t/innodb_bug42101-nonzero.test
mysql-test/suite/innodb_plugin/t/innodb_bug42101.test
mysql-test/suite/innodb_plugin/t/innodb_bug44032.test
mysql-test/suite/innodb_plugin/t/innodb_bug44369.test
mysql-test/suite/innodb_plugin/t/innodb_bug44571.test
mysql-test/suite/innodb_plugin/t/innodb_bug45357.test
mysql-test/suite/innodb_plugin/t/innodb_bug46000.test
mysql-test/suite/innodb_plugin/t/innodb_bug46676.test
mysql-test/suite/innodb_plugin/t/innodb_bug47167.test
mysql-test/suite/innodb_plugin/t/innodb_bug47621.test
mysql-test/suite/innodb_plugin/t/innodb_bug47622.test
mysql-test/suite/innodb_plugin/t/innodb_bug47777.test
mysql-test/suite/innodb_plugin/t/innodb_bug51378.test
mysql-test/suite/innodb_plugin/t/innodb_bug51920.test
mysql-test/suite/innodb_plugin/t/innodb_bug52663.test
mysql-test/suite/innodb_plugin/t/innodb_bug52745.test
mysql-test/suite/innodb_plugin/t/innodb_file_format.test
mysql-test/suite/innodb_plugin/t/innodb_information_schema.test
mysql-test/suite/innodb_plugin/t/innodb_trx_weight.test
mysql-test/suite/rpl/r/rpl_show_slave_running.result
mysql-test/suite/rpl/r/rpl_slow_query_log.result
mysql-test/suite/rpl/r/rpl_stm_sql_mode.result
mysql-test/suite/rpl/r/rpl_typeconv_innodb.result
mysql-test/suite/rpl/t/rpl_begin_commit_rollback-master.opt
mysql-test/suite/rpl/t/rpl_show_slave_running.test
mysql-test/suite/rpl/t/rpl_slow_query_log-slave.opt
mysql-test/suite/rpl/t/rpl_slow_query_log.test
mysql-test/suite/rpl/t/rpl_stm_sql_mode.test
mysql-test/suite/rpl/t/rpl_typeconv-slave.opt
mysql-test/suite/rpl/t/rpl_typeconv_innodb.test
mysql-test/suite/sys_vars/r/secure_file_priv.result
mysql-test/suite/sys_vars/t/secure_file_priv-master.opt
mysql-test/suite/sys_vars/t/secure_file_priv.test
mysql-test/t/bug39022.test
mysql-test/t/bug46261-master.opt
mysql-test/t/bug46261.test
mysql-test/t/log_tables_upgrade.test
mysql-test/t/no_binlog.test
mysql-test/t/partition_debug_sync.test
mysql-test/t/plugin_not_embedded-master.opt
mysql-test/t/plugin_not_embedded.test
mysql-test/t/view_alias.test
storage/innobase/
storage/innobase/CMakeLists.txt
storage/innobase/Makefile.am
storage/innobase/btr/
storage/innobase/btr/btr0btr.c
storage/innobase/btr/btr0cur.c
storage/innobase/btr/btr0pcur.c
storage/innobase/btr/btr0sea.c
storage/innobase/buf/
storage/innobase/buf/buf0buf.c
storage/innobase/buf/buf0flu.c
storage/innobase/buf/buf0lru.c
storage/innobase/buf/buf0rea.c
storage/innobase/data/
storage/innobase/data/data0data.c
storage/innobase/data/data0type.c
storage/innobase/dict/
storage/innobase/dict/dict0boot.c
storage/innobase/dict/dict0crea.c
storage/innobase/dict/dict0dict.c
storage/innobase/dict/dict0load.c
storage/innobase/dict/dict0mem.c
storage/innobase/dyn/
storage/innobase/dyn/dyn0dyn.c
storage/innobase/eval/
storage/innobase/eval/eval0eval.c
storage/innobase/eval/eval0proc.c
storage/innobase/fil/
storage/innobase/fil/fil0fil.c
storage/innobase/fsp/
storage/innobase/fsp/fsp0fsp.c
storage/innobase/fut/
storage/innobase/fut/fut0fut.c
storage/innobase/fut/fut0lst.c
storage/innobase/ha/
storage/innobase/ha/ha0ha.c
storage/innobase/ha/hash0hash.c
storage/innobase/handler/
storage/innobase/handler/ha_innodb.cc
storage/innobase/handler/ha_innodb.h
storage/innobase/ibuf/
storage/innobase/ibuf/ibuf0ibuf.c
storage/innobase/include/
storage/innobase/include/btr0btr.h
storage/innobase/include/btr0btr.ic
storage/innobase/include/btr0cur.h
storage/innobase/include/btr0cur.ic
storage/innobase/include/btr0pcur.h
storage/innobase/include/btr0pcur.ic
storage/innobase/include/btr0sea.h
storage/innobase/include/btr0sea.ic
storage/innobase/include/btr0types.h
storage/innobase/include/buf0buf.h
storage/innobase/include/buf0buf.ic
storage/innobase/include/buf0flu.h
storage/innobase/include/buf0flu.ic
storage/innobase/include/buf0lru.h
storage/innobase/include/buf0lru.ic
storage/innobase/include/buf0rea.h
storage/innobase/include/buf0types.h
storage/innobase/include/data0data.h
storage/innobase/include/data0data.ic
storage/innobase/include/data0type.h
storage/innobase/include/data0type.ic
storage/innobase/include/data0types.h
storage/innobase/include/db0err.h
storage/innobase/include/dict0boot.h
storage/innobase/include/dict0boot.ic
storage/innobase/include/dict0crea.h
storage/innobase/include/dict0crea.ic
storage/innobase/include/dict0dict.h
storage/innobase/include/dict0dict.ic
storage/innobase/include/dict0load.h
storage/innobase/include/dict0load.ic
storage/innobase/include/dict0mem.h
storage/innobase/include/dict0mem.ic
storage/innobase/include/dict0types.h
storage/innobase/include/dyn0dyn.h
storage/innobase/include/dyn0dyn.ic
storage/innobase/include/eval0eval.h
storage/innobase/include/eval0eval.ic
storage/innobase/include/eval0proc.h
storage/innobase/include/eval0proc.ic
storage/innobase/include/fil0fil.h
storage/innobase/include/fsp0fsp.h
storage/innobase/include/fsp0fsp.ic
storage/innobase/include/fsp0types.h
storage/innobase/include/fut0fut.h
storage/innobase/include/fut0fut.ic
storage/innobase/include/fut0lst.h
storage/innobase/include/fut0lst.ic
storage/innobase/include/ha0ha.h
storage/innobase/include/ha0ha.ic
storage/innobase/include/ha_prototypes.h
storage/innobase/include/hash0hash.h
storage/innobase/include/hash0hash.ic
storage/innobase/include/ibuf0ibuf.h
storage/innobase/include/ibuf0ibuf.ic
storage/innobase/include/ibuf0types.h
storage/innobase/include/lock0iter.h
storage/innobase/include/lock0lock.h
storage/innobase/include/lock0lock.ic
storage/innobase/include/lock0priv.h
storage/innobase/include/lock0priv.ic
storage/innobase/include/lock0types.h
storage/innobase/include/log0log.h
storage/innobase/include/log0log.ic
storage/innobase/include/log0recv.h
storage/innobase/include/log0recv.ic
storage/innobase/include/mach0data.h
storage/innobase/include/mach0data.ic
storage/innobase/include/mem0dbg.h
storage/innobase/include/mem0dbg.ic
storage/innobase/include/mem0mem.h
storage/innobase/include/mem0mem.ic
storage/innobase/include/mem0pool.h
storage/innobase/include/mem0pool.ic
storage/innobase/include/mtr0log.h
storage/innobase/include/mtr0log.ic
storage/innobase/include/mtr0mtr.h
storage/innobase/include/mtr0mtr.ic
storage/innobase/include/mtr0types.h
storage/innobase/include/os0file.h
storage/innobase/include/os0proc.h
storage/innobase/include/os0proc.ic
storage/innobase/include/os0sync.h
storage/innobase/include/os0sync.ic
storage/innobase/include/os0thread.h
storage/innobase/include/os0thread.ic
storage/innobase/include/page0cur.h
storage/innobase/include/page0cur.ic
storage/innobase/include/page0page.h
storage/innobase/include/page0page.ic
storage/innobase/include/page0types.h
storage/innobase/include/pars0grm.h
storage/innobase/include/pars0opt.h
storage/innobase/include/pars0opt.ic
storage/innobase/include/pars0pars.h
storage/innobase/include/pars0pars.ic
storage/innobase/include/pars0sym.h
storage/innobase/include/pars0sym.ic
storage/innobase/include/pars0types.h
storage/innobase/include/que0que.h
storage/innobase/include/que0que.ic
storage/innobase/include/que0types.h
storage/innobase/include/read0read.h
storage/innobase/include/read0read.ic
storage/innobase/include/read0types.h
storage/innobase/include/rem0cmp.h
storage/innobase/include/rem0cmp.ic
storage/innobase/include/rem0rec.h
storage/innobase/include/rem0rec.ic
storage/innobase/include/rem0types.h
storage/innobase/include/row0ins.h
storage/innobase/include/row0ins.ic
storage/innobase/include/row0mysql.h
storage/innobase/include/row0mysql.ic
storage/innobase/include/row0purge.h
storage/innobase/include/row0purge.ic
storage/innobase/include/row0row.h
storage/innobase/include/row0row.ic
storage/innobase/include/row0sel.h
storage/innobase/include/row0sel.ic
storage/innobase/include/row0types.h
storage/innobase/include/row0uins.h
storage/innobase/include/row0uins.ic
storage/innobase/include/row0umod.h
storage/innobase/include/row0umod.ic
storage/innobase/include/row0undo.h
storage/innobase/include/row0undo.ic
storage/innobase/include/row0upd.h
storage/innobase/include/row0upd.ic
storage/innobase/include/row0vers.h
storage/innobase/include/row0vers.ic
storage/innobase/include/srv0que.h
storage/innobase/include/srv0srv.h
storage/innobase/include/srv0srv.ic
storage/innobase/include/srv0start.h
storage/innobase/include/sync0arr.h
storage/innobase/include/sync0arr.ic
storage/innobase/include/sync0rw.h
storage/innobase/include/sync0rw.ic
storage/innobase/include/sync0sync.h
storage/innobase/include/sync0sync.ic
storage/innobase/include/sync0types.h
storage/innobase/include/thr0loc.h
storage/innobase/include/thr0loc.ic
storage/innobase/include/trx0purge.h
storage/innobase/include/trx0purge.ic
storage/innobase/include/trx0rec.h
storage/innobase/include/trx0rec.ic
storage/innobase/include/trx0roll.h
storage/innobase/include/trx0roll.ic
storage/innobase/include/trx0rseg.h
storage/innobase/include/trx0rseg.ic
storage/innobase/include/trx0sys.h
storage/innobase/include/trx0sys.ic
storage/innobase/include/trx0trx.h
storage/innobase/include/trx0trx.ic
storage/innobase/include/trx0types.h
storage/innobase/include/trx0undo.h
storage/innobase/include/trx0undo.ic
storage/innobase/include/trx0xa.h
storage/innobase/include/univ.i
storage/innobase/include/usr0sess.h
storage/innobase/include/usr0sess.ic
storage/innobase/include/usr0types.h
storage/innobase/include/ut0byte.h
storage/innobase/include/ut0byte.ic
storage/innobase/include/ut0dbg.h
storage/innobase/include/ut0list.h
storage/innobase/include/ut0list.ic
storage/innobase/include/ut0lst.h
storage/innobase/include/ut0mem.h
storage/innobase/include/ut0mem.ic
storage/innobase/include/ut0rnd.h
storage/innobase/include/ut0rnd.ic
storage/innobase/include/ut0sort.h
storage/innobase/include/ut0ut.h
storage/innobase/include/ut0ut.ic
storage/innobase/include/ut0vec.h
storage/innobase/include/ut0vec.ic
storage/innobase/include/ut0wqueue.h
storage/innobase/lock/
storage/innobase/lock/lock0iter.c
storage/innobase/lock/lock0lock.c
storage/innobase/log/
storage/innobase/log/log0log.c
storage/innobase/log/log0recv.c
storage/innobase/mach/
storage/innobase/mach/mach0data.c
storage/innobase/mem/
storage/innobase/mem/mem0dbg.c
storage/innobase/mem/mem0mem.c
storage/innobase/mem/mem0pool.c
storage/innobase/mtr/
storage/innobase/mtr/mtr0log.c
storage/innobase/mtr/mtr0mtr.c
storage/innobase/mysql-test/
storage/innobase/os/
storage/innobase/os/os0file.c
storage/innobase/os/os0proc.c
storage/innobase/os/os0sync.c
storage/innobase/os/os0thread.c
storage/innobase/page/
storage/innobase/page/page0cur.c
storage/innobase/page/page0page.c
storage/innobase/pars/
storage/innobase/pars/lexyy.c
storage/innobase/pars/make_bison.sh
storage/innobase/pars/make_flex.sh
storage/innobase/pars/pars0grm.c
storage/innobase/pars/pars0grm.h
storage/innobase/pars/pars0grm.y
storage/innobase/pars/pars0lex.l
storage/innobase/pars/pars0opt.c
storage/innobase/pars/pars0pars.c
storage/innobase/pars/pars0sym.c
storage/innobase/plug.in.disabled
storage/innobase/que/
storage/innobase/que/que0que.c
storage/innobase/read/
storage/innobase/read/read0read.c
storage/innobase/rem/
storage/innobase/rem/rem0cmp.c
storage/innobase/rem/rem0rec.c
storage/innobase/row/
storage/innobase/row/row0ins.c
storage/innobase/row/row0mysql.c
storage/innobase/row/row0purge.c
storage/innobase/row/row0row.c
storage/innobase/row/row0sel.c
storage/innobase/row/row0uins.c
storage/innobase/row/row0umod.c
storage/innobase/row/row0undo.c
storage/innobase/row/row0upd.c
storage/innobase/row/row0vers.c
storage/innobase/srv/
storage/innobase/srv/srv0que.c
storage/innobase/srv/srv0srv.c
storage/innobase/srv/srv0start.c
storage/innobase/sync/
storage/innobase/sync/sync0arr.c
storage/innobase/sync/sync0rw.c
storage/innobase/sync/sync0sync.c
storage/innobase/thr/
storage/innobase/thr/thr0loc.c
storage/innobase/trx/
storage/innobase/trx/trx0purge.c
storage/innobase/trx/trx0rec.c
storage/innobase/trx/trx0roll.c
storage/innobase/trx/trx0rseg.c
storage/innobase/trx/trx0sys.c
storage/innobase/trx/trx0trx.c
storage/innobase/trx/trx0undo.c
storage/innobase/usr/
storage/innobase/usr/usr0sess.c
storage/innobase/ut/
storage/innobase/ut/ut0byte.c
storage/innobase/ut/ut0dbg.c
storage/innobase/ut/ut0list.c
storage/innobase/ut/ut0mem.c
storage/innobase/ut/ut0rnd.c
storage/innobase/ut/ut0ut.c
storage/innobase/ut/ut0vec.c
storage/innobase/ut/ut0wqueue.c
storage/innodb_plugin/
storage/innodb_plugin/CMakeLists.txt
storage/innodb_plugin/COPYING
storage/innodb_plugin/COPYING.Google
storage/innodb_plugin/COPYING.Percona
storage/innodb_plugin/COPYING.Sun_Microsystems
storage/innodb_plugin/ChangeLog
storage/innodb_plugin/Doxyfile
storage/innodb_plugin/Makefile.am
storage/innodb_plugin/btr/
storage/innodb_plugin/btr/btr0btr.c
storage/innodb_plugin/btr/btr0cur.c
storage/innodb_plugin/btr/btr0pcur.c
storage/innodb_plugin/btr/btr0sea.c
storage/innodb_plugin/buf/
storage/innodb_plugin/buf/buf0buddy.c
storage/innodb_plugin/buf/buf0buf.c
storage/innodb_plugin/buf/buf0flu.c
storage/innodb_plugin/buf/buf0lru.c
storage/innodb_plugin/buf/buf0rea.c
storage/innodb_plugin/compile-innodb
storage/innodb_plugin/compile-innodb-debug
storage/innodb_plugin/data/
storage/innodb_plugin/data/data0data.c
storage/innodb_plugin/data/data0type.c
storage/innodb_plugin/dict/
storage/innodb_plugin/dict/dict0boot.c
storage/innodb_plugin/dict/dict0crea.c
storage/innodb_plugin/dict/dict0dict.c
storage/innodb_plugin/dict/dict0load.c
storage/innodb_plugin/dict/dict0mem.c
storage/innodb_plugin/dyn/
storage/innodb_plugin/dyn/dyn0dyn.c
storage/innodb_plugin/eval/
storage/innodb_plugin/eval/eval0eval.c
storage/innodb_plugin/eval/eval0proc.c
storage/innodb_plugin/fil/
storage/innodb_plugin/fil/fil0fil.c
storage/innodb_plugin/fsp/
storage/innodb_plugin/fsp/fsp0fsp.c
storage/innodb_plugin/fut/
storage/innodb_plugin/fut/fut0fut.c
storage/innodb_plugin/fut/fut0lst.c
storage/innodb_plugin/ha/
storage/innodb_plugin/ha/ha0ha.c
storage/innodb_plugin/ha/ha0storage.c
storage/innodb_plugin/ha/hash0hash.c
storage/innodb_plugin/ha_innodb.def
storage/innodb_plugin/handler/
storage/innodb_plugin/handler/ha_innodb.cc
storage/innodb_plugin/handler/ha_innodb.h
storage/innodb_plugin/handler/handler0alter.cc
storage/innodb_plugin/handler/i_s.cc
storage/innodb_plugin/handler/i_s.h
storage/innodb_plugin/handler/mysql_addons.cc
storage/innodb_plugin/ibuf/
storage/innodb_plugin/ibuf/ibuf0ibuf.c
storage/innodb_plugin/include/
storage/innodb_plugin/include/btr0btr.h
storage/innodb_plugin/include/btr0btr.ic
storage/innodb_plugin/include/btr0cur.h
storage/innodb_plugin/include/btr0cur.ic
storage/innodb_plugin/include/btr0pcur.h
storage/innodb_plugin/include/btr0pcur.ic
storage/innodb_plugin/include/btr0sea.h
storage/innodb_plugin/include/btr0sea.ic
storage/innodb_plugin/include/btr0types.h
storage/innodb_plugin/include/buf0buddy.h
storage/innodb_plugin/include/buf0buddy.ic
storage/innodb_plugin/include/buf0buf.h
storage/innodb_plugin/include/buf0buf.ic
storage/innodb_plugin/include/buf0flu.h
storage/innodb_plugin/include/buf0flu.ic
storage/innodb_plugin/include/buf0lru.h
storage/innodb_plugin/include/buf0lru.ic
storage/innodb_plugin/include/buf0rea.h
storage/innodb_plugin/include/buf0types.h
storage/innodb_plugin/include/data0data.h
storage/innodb_plugin/include/data0data.ic
storage/innodb_plugin/include/data0type.h
storage/innodb_plugin/include/data0type.ic
storage/innodb_plugin/include/data0types.h
storage/innodb_plugin/include/db0err.h
storage/innodb_plugin/include/dict0boot.h
storage/innodb_plugin/include/dict0boot.ic
storage/innodb_plugin/include/dict0crea.h
storage/innodb_plugin/include/dict0crea.ic
storage/innodb_plugin/include/dict0dict.h
storage/innodb_plugin/include/dict0dict.ic
storage/innodb_plugin/include/dict0load.h
storage/innodb_plugin/include/dict0load.ic
storage/innodb_plugin/include/dict0mem.h
storage/innodb_plugin/include/dict0mem.ic
storage/innodb_plugin/include/dict0types.h
storage/innodb_plugin/include/dyn0dyn.h
storage/innodb_plugin/include/dyn0dyn.ic
storage/innodb_plugin/include/eval0eval.h
storage/innodb_plugin/include/eval0eval.ic
storage/innodb_plugin/include/eval0proc.h
storage/innodb_plugin/include/eval0proc.ic
storage/innodb_plugin/include/fil0fil.h
storage/innodb_plugin/include/fsp0fsp.h
storage/innodb_plugin/include/fsp0fsp.ic
storage/innodb_plugin/include/fsp0types.h
storage/innodb_plugin/include/fut0fut.h
storage/innodb_plugin/include/fut0fut.ic
storage/innodb_plugin/include/fut0lst.h
storage/innodb_plugin/include/fut0lst.ic
storage/innodb_plugin/include/ha0ha.h
storage/innodb_plugin/include/ha0ha.ic
storage/innodb_plugin/include/ha0storage.h
storage/innodb_plugin/include/ha0storage.ic
storage/innodb_plugin/include/ha_prototypes.h
storage/innodb_plugin/include/handler0alter.h
storage/innodb_plugin/include/hash0hash.h
storage/innodb_plugin/include/hash0hash.ic
storage/innodb_plugin/include/ibuf0ibuf.h
storage/innodb_plugin/include/ibuf0ibuf.ic
storage/innodb_plugin/include/ibuf0types.h
storage/innodb_plugin/include/lock0iter.h
storage/innodb_plugin/include/lock0lock.h
storage/innodb_plugin/include/lock0lock.ic
storage/innodb_plugin/include/lock0priv.h
storage/innodb_plugin/include/lock0priv.ic
storage/innodb_plugin/include/lock0types.h
storage/innodb_plugin/include/log0log.h
storage/innodb_plugin/include/log0log.ic
storage/innodb_plugin/include/log0recv.h
storage/innodb_plugin/include/log0recv.ic
storage/innodb_plugin/include/mach0data.h
storage/innodb_plugin/include/mach0data.ic
storage/innodb_plugin/include/mem0dbg.h
storage/innodb_plugin/include/mem0dbg.ic
storage/innodb_plugin/include/mem0mem.h
storage/innodb_plugin/include/mem0mem.ic
storage/innodb_plugin/include/mem0pool.h
storage/innodb_plugin/include/mem0pool.ic
storage/innodb_plugin/include/mtr0log.h
storage/innodb_plugin/include/mtr0log.ic
storage/innodb_plugin/include/mtr0mtr.h
storage/innodb_plugin/include/mtr0mtr.ic
storage/innodb_plugin/include/mtr0types.h
storage/innodb_plugin/include/mysql_addons.h
storage/innodb_plugin/include/os0file.h
storage/innodb_plugin/include/os0proc.h
storage/innodb_plugin/include/os0proc.ic
storage/innodb_plugin/include/os0sync.h
storage/innodb_plugin/include/os0sync.ic
storage/innodb_plugin/include/os0thread.h
storage/innodb_plugin/include/os0thread.ic
storage/innodb_plugin/include/page0cur.h
storage/innodb_plugin/include/page0cur.ic
storage/innodb_plugin/include/page0page.h
storage/innodb_plugin/include/page0page.ic
storage/innodb_plugin/include/page0types.h
storage/innodb_plugin/include/page0zip.h
storage/innodb_plugin/include/page0zip.ic
storage/innodb_plugin/include/pars0grm.h
storage/innodb_plugin/include/pars0opt.h
storage/innodb_plugin/include/pars0opt.ic
storage/innodb_plugin/include/pars0pars.h
storage/innodb_plugin/include/pars0pars.ic
storage/innodb_plugin/include/pars0sym.h
storage/innodb_plugin/include/pars0sym.ic
storage/innodb_plugin/include/pars0types.h
storage/innodb_plugin/include/que0que.h
storage/innodb_plugin/include/que0que.ic
storage/innodb_plugin/include/que0types.h
storage/innodb_plugin/include/read0read.h
storage/innodb_plugin/include/read0read.ic
storage/innodb_plugin/include/read0types.h
storage/innodb_plugin/include/rem0cmp.h
storage/innodb_plugin/include/rem0cmp.ic
storage/innodb_plugin/include/rem0rec.h
storage/innodb_plugin/include/rem0rec.ic
storage/innodb_plugin/include/rem0types.h
storage/innodb_plugin/include/row0ext.h
storage/innodb_plugin/include/row0ext.ic
storage/innodb_plugin/include/row0ins.h
storage/innodb_plugin/include/row0ins.ic
storage/innodb_plugin/include/row0merge.h
storage/innodb_plugin/include/row0mysql.h
storage/innodb_plugin/include/row0mysql.ic
storage/innodb_plugin/include/row0purge.h
storage/innodb_plugin/include/row0purge.ic
storage/innodb_plugin/include/row0row.h
storage/innodb_plugin/include/row0row.ic
storage/innodb_plugin/include/row0sel.h
storage/innodb_plugin/include/row0sel.ic
storage/innodb_plugin/include/row0types.h
storage/innodb_plugin/include/row0uins.h
storage/innodb_plugin/include/row0uins.ic
storage/innodb_plugin/include/row0umod.h
storage/innodb_plugin/include/row0umod.ic
storage/innodb_plugin/include/row0undo.h
storage/innodb_plugin/include/row0undo.ic
storage/innodb_plugin/include/row0upd.h
storage/innodb_plugin/include/row0upd.ic
storage/innodb_plugin/include/row0vers.h
storage/innodb_plugin/include/row0vers.ic
storage/innodb_plugin/include/srv0que.h
storage/innodb_plugin/include/srv0srv.h
storage/innodb_plugin/include/srv0srv.ic
storage/innodb_plugin/include/srv0start.h
storage/innodb_plugin/include/sync0arr.h
storage/innodb_plugin/include/sync0arr.ic
storage/innodb_plugin/include/sync0rw.h
storage/innodb_plugin/include/sync0rw.ic
storage/innodb_plugin/include/sync0sync.h
storage/innodb_plugin/include/sync0sync.ic
storage/innodb_plugin/include/sync0types.h
storage/innodb_plugin/include/thr0loc.h
storage/innodb_plugin/include/thr0loc.ic
storage/innodb_plugin/include/trx0i_s.h
storage/innodb_plugin/include/trx0purge.h
storage/innodb_plugin/include/trx0purge.ic
storage/innodb_plugin/include/trx0rec.h
storage/innodb_plugin/include/trx0rec.ic
storage/innodb_plugin/include/trx0roll.h
storage/innodb_plugin/include/trx0roll.ic
storage/innodb_plugin/include/trx0rseg.h
storage/innodb_plugin/include/trx0rseg.ic
storage/innodb_plugin/include/trx0sys.h
storage/innodb_plugin/include/trx0sys.ic
storage/innodb_plugin/include/trx0trx.h
storage/innodb_plugin/include/trx0trx.ic
storage/innodb_plugin/include/trx0types.h
storage/innodb_plugin/include/trx0undo.h
storage/innodb_plugin/include/trx0undo.ic
storage/innodb_plugin/include/trx0xa.h
storage/innodb_plugin/include/univ.i
storage/innodb_plugin/include/usr0sess.h
storage/innodb_plugin/include/usr0sess.ic
storage/innodb_plugin/include/usr0types.h
storage/innodb_plugin/include/ut0auxconf.h
storage/innodb_plugin/include/ut0byte.h
storage/innodb_plugin/include/ut0byte.ic
storage/innodb_plugin/include/ut0dbg.h
storage/innodb_plugin/include/ut0list.h
storage/innodb_plugin/include/ut0list.ic
storage/innodb_plugin/include/ut0lst.h
storage/innodb_plugin/include/ut0mem.h
storage/innodb_plugin/include/ut0mem.ic
storage/innodb_plugin/include/ut0rbt.h
storage/innodb_plugin/include/ut0rnd.h
storage/innodb_plugin/include/ut0rnd.ic
storage/innodb_plugin/include/ut0sort.h
storage/innodb_plugin/include/ut0ut.h
storage/innodb_plugin/include/ut0ut.ic
storage/innodb_plugin/include/ut0vec.h
storage/innodb_plugin/include/ut0vec.ic
storage/innodb_plugin/include/ut0wqueue.h
storage/innodb_plugin/lock/
storage/innodb_plugin/lock/lock0iter.c
storage/innodb_plugin/lock/lock0lock.c
storage/innodb_plugin/log/
storage/innodb_plugin/log/log0log.c
storage/innodb_plugin/log/log0recv.c
storage/innodb_plugin/mach/
storage/innodb_plugin/mach/mach0data.c
storage/innodb_plugin/mem/
storage/innodb_plugin/mem/mem0dbg.c
storage/innodb_plugin/mem/mem0mem.c
storage/innodb_plugin/mem/mem0pool.c
storage/innodb_plugin/mtr/
storage/innodb_plugin/mtr/mtr0log.c
storage/innodb_plugin/mtr/mtr0mtr.c
storage/innodb_plugin/mysql-test/
storage/innodb_plugin/mysql-test/patches/
storage/innodb_plugin/mysql-test/patches/README
storage/innodb_plugin/mysql-test/patches/index_merge_innodb-explain.diff
storage/innodb_plugin/mysql-test/patches/information_schema.diff
storage/innodb_plugin/mysql-test/patches/innodb_file_per_table.diff
storage/innodb_plugin/mysql-test/patches/innodb_lock_wait_timeout.diff
storage/innodb_plugin/mysql-test/patches/innodb_thread_concurrency_basic.diff
storage/innodb_plugin/mysql-test/patches/partition_innodb.diff
storage/innodb_plugin/os/
storage/innodb_plugin/os/os0file.c
storage/innodb_plugin/os/os0proc.c
storage/innodb_plugin/os/os0sync.c
storage/innodb_plugin/os/os0thread.c
storage/innodb_plugin/page/
storage/innodb_plugin/page/page0cur.c
storage/innodb_plugin/page/page0page.c
storage/innodb_plugin/page/page0zip.c
storage/innodb_plugin/pars/
storage/innodb_plugin/pars/lexyy.c
storage/innodb_plugin/pars/make_bison.sh
storage/innodb_plugin/pars/make_flex.sh
storage/innodb_plugin/pars/pars0grm.c
storage/innodb_plugin/pars/pars0grm.y
storage/innodb_plugin/pars/pars0lex.l
storage/innodb_plugin/pars/pars0opt.c
storage/innodb_plugin/pars/pars0pars.c
storage/innodb_plugin/pars/pars0sym.c
storage/innodb_plugin/plug.in.disabled
storage/innodb_plugin/que/
storage/innodb_plugin/que/que0que.c
storage/innodb_plugin/read/
storage/innodb_plugin/read/read0read.c
storage/innodb_plugin/rem/
storage/innodb_plugin/rem/rem0cmp.c
storage/innodb_plugin/rem/rem0rec.c
storage/innodb_plugin/revert_gen.sh
storage/innodb_plugin/row/
storage/innodb_plugin/row/row0ext.c
storage/innodb_plugin/row/row0ins.c
storage/innodb_plugin/row/row0merge.c
storage/innodb_plugin/row/row0mysql.c
storage/innodb_plugin/row/row0purge.c
storage/innodb_plugin/row/row0row.c
storage/innodb_plugin/row/row0sel.c
storage/innodb_plugin/row/row0uins.c
storage/innodb_plugin/row/row0umod.c
storage/innodb_plugin/row/row0undo.c
storage/innodb_plugin/row/row0upd.c
storage/innodb_plugin/row/row0vers.c
storage/innodb_plugin/scripts/
storage/innodb_plugin/scripts/export.sh
storage/innodb_plugin/scripts/install_innodb_plugins.sql
storage/innodb_plugin/scripts/install_innodb_plugins_win.sql
storage/innodb_plugin/setup.sh
storage/innodb_plugin/srv/
storage/innodb_plugin/srv/srv0que.c
storage/innodb_plugin/srv/srv0srv.c
storage/innodb_plugin/srv/srv0start.c
storage/innodb_plugin/sync/
storage/innodb_plugin/sync/sync0arr.c
storage/innodb_plugin/sync/sync0rw.c
storage/innodb_plugin/sync/sync0sync.c
storage/innodb_plugin/thr/
storage/innodb_plugin/thr/thr0loc.c
storage/innodb_plugin/trx/
storage/innodb_plugin/trx/trx0i_s.c
storage/innodb_plugin/trx/trx0purge.c
storage/innodb_plugin/trx/trx0rec.c
storage/innodb_plugin/trx/trx0roll.c
storage/innodb_plugin/trx/trx0rseg.c
storage/innodb_plugin/trx/trx0sys.c
storage/innodb_plugin/trx/trx0trx.c
storage/innodb_plugin/trx/trx0undo.c
storage/innodb_plugin/usr/
storage/innodb_plugin/usr/usr0sess.c
storage/innodb_plugin/ut/
storage/innodb_plugin/ut/ut0auxconf_atomic_pthread_t_gcc.c
storage/innodb_plugin/ut/ut0auxconf_atomic_pthread_t_solaris.c
storage/innodb_plugin/ut/ut0auxconf_have_gcc_atomics.c
storage/innodb_plugin/ut/ut0auxconf_have_solaris_atomics.c
storage/innodb_plugin/ut/ut0auxconf_pause.c
storage/innodb_plugin/ut/ut0auxconf_sizeof_pthread_t.c
storage/innodb_plugin/ut/ut0byte.c
storage/innodb_plugin/ut/ut0dbg.c
storage/innodb_plugin/ut/ut0list.c
storage/innodb_plugin/ut/ut0mem.c
storage/innodb_plugin/ut/ut0rbt.c
storage/innodb_plugin/ut/ut0rnd.c
storage/innodb_plugin/ut/ut0ut.c
storage/innodb_plugin/ut/ut0vec.c
storage/innodb_plugin/ut/ut0wqueue.c
storage/pbxt/bin/
storage/pbxt/bin/Makefile.am
storage/pbxt/bin/xtstat_xt.cc
storage/xtradb/build/
storage/xtradb/build/debian/
storage/xtradb/build/debian/README.Maintainer
storage/xtradb/build/debian/additions/
storage/xtradb/build/debian/additions/Docs__Images__Makefile.in
storage/xtradb/build/debian/additions/Docs__Makefile.in
storage/xtradb/build/debian/additions/debian-start
storage/xtradb/build/debian/additions/debian-start.inc.sh
storage/xtradb/build/debian/additions/echo_stderr
storage/xtradb/build/debian/additions/innotop/
storage/xtradb/build/debian/additions/innotop/InnoDBParser.pm
storage/xtradb/build/debian/additions/innotop/changelog.innotop
storage/xtradb/build/debian/additions/innotop/innotop
storage/xtradb/build/debian/additions/innotop/innotop.1
storage/xtradb/build/debian/additions/msql2mysql.1
storage/xtradb/build/debian/additions/my.cnf
storage/xtradb/build/debian/additions/my_print_defaults.1
storage/xtradb/build/debian/additions/myisam_ftdump.1
storage/xtradb/build/debian/additions/myisamchk.1
storage/xtradb/build/debian/additions/myisamlog.1
storage/xtradb/build/debian/additions/myisampack.1
storage/xtradb/build/debian/additions/mysql-server.lintian-overrides
storage/xtradb/build/debian/additions/mysql_config.1
storage/xtradb/build/debian/additions/mysql_convert_table_format.1
storage/xtradb/build/debian/additions/mysql_find_rows.1
storage/xtradb/build/debian/additions/mysql_fix_extensions.1
storage/xtradb/build/debian/additions/mysql_install_db.1
storage/xtradb/build/debian/additions/mysql_secure_installation.1
storage/xtradb/build/debian/additions/mysql_setpermission.1
storage/xtradb/build/debian/additions/mysql_tableinfo.1
storage/xtradb/build/debian/additions/mysql_waitpid.1
storage/xtradb/build/debian/additions/mysqlbinlog.1
storage/xtradb/build/debian/additions/mysqlbug.1
storage/xtradb/build/debian/additions/mysqlcheck.1
storage/xtradb/build/debian/additions/mysqld_safe_syslog.cnf
storage/xtradb/build/debian/additions/mysqldumpslow.1
storage/xtradb/build/debian/additions/mysqlimport.1
storage/xtradb/build/debian/additions/mysqlmanager.1
storage/xtradb/build/debian/additions/mysqlreport
storage/xtradb/build/debian/additions/mysqlreport.1
storage/xtradb/build/debian/additions/mysqltest.1
storage/xtradb/build/debian/additions/pack_isam.1
storage/xtradb/build/debian/additions/resolve_stack_dump.1
storage/xtradb/build/debian/additions/resolveip.1
storage/xtradb/build/debian/changelog
storage/xtradb/build/debian/compat
storage/xtradb/build/debian/control
storage/xtradb/build/debian/copyright
storage/xtradb/build/debian/libpercona-xtradb-client-dev.README.Maintainer
storage/xtradb/build/debian/libpercona-xtradb-client-dev.dirs
storage/xtradb/build/debian/libpercona-xtradb-client-dev.docs
storage/xtradb/build/debian/libpercona-xtradb-client-dev.examples
storage/xtradb/build/debian/libpercona-xtradb-client-dev.files
storage/xtradb/build/debian/libpercona-xtradb-client-dev.links
storage/xtradb/build/debian/libpercona-xtradb-client16.dirs
storage/xtradb/build/debian/libpercona-xtradb-client16.docs
storage/xtradb/build/debian/libpercona-xtradb-client16.files
storage/xtradb/build/debian/libpercona-xtradb-client16.postinst
storage/xtradb/build/debian/patches/
storage/xtradb/build/debian/patches/00list
storage/xtradb/build/debian/patches/01_MAKEFILES__Docs_Images_Makefile.in.dpatch
storage/xtradb/build/debian/patches/01_MAKEFILES__Docs_Makefile.in.dpatch
storage/xtradb/build/debian/patches/33_scripts__mysql_create_system_tables__no_test.dpatch
storage/xtradb/build/debian/patches/38_scripts__mysqld_safe.sh__signals.dpatch
storage/xtradb/build/debian/patches/41_scripts__mysql_install_db.sh__no_test.dpatch
storage/xtradb/build/debian/patches/44_scripts__mysql_config__libs.dpatch
storage/xtradb/build/debian/patches/50_mysql-test__db_test.dpatch
storage/xtradb/build/debian/patches/60_percona_support.dpatch
storage/xtradb/build/debian/percona-xtradb-client-5.1.README.Debian
storage/xtradb/build/debian/percona-xtradb-client-5.1.dirs
storage/xtradb/build/debian/percona-xtradb-client-5.1.docs
storage/xtradb/build/debian/percona-xtradb-client-5.1.files
storage/xtradb/build/debian/percona-xtradb-client-5.1.links
storage/xtradb/build/debian/percona-xtradb-client-5.1.lintian-overrides
storage/xtradb/build/debian/percona-xtradb-client-5.1.menu
storage/xtradb/build/debian/percona-xtradb-common.dirs
storage/xtradb/build/debian/percona-xtradb-common.files
storage/xtradb/build/debian/percona-xtradb-common.lintian-overrides
storage/xtradb/build/debian/percona-xtradb-common.postrm
storage/xtradb/build/debian/percona-xtradb-server-5.1.NEWS
storage/xtradb/build/debian/percona-xtradb-server-5.1.README.Debian
storage/xtradb/build/debian/percona-xtradb-server-5.1.config
storage/xtradb/build/debian/percona-xtradb-server-5.1.dirs
storage/xtradb/build/debian/percona-xtradb-server-5.1.docs
storage/xtradb/build/debian/percona-xtradb-server-5.1.files
storage/xtradb/build/debian/percona-xtradb-server-5.1.links
storage/xtradb/build/debian/percona-xtradb-server-5.1.lintian-overrides
storage/xtradb/build/debian/percona-xtradb-server-5.1.logcheck.ignore.paranoid
storage/xtradb/build/debian/percona-xtradb-server-5.1.logcheck.ignore.server
storage/xtradb/build/debian/percona-xtradb-server-5.1.logcheck.ignore.workstation
storage/xtradb/build/debian/percona-xtradb-server-5.1.mysql.init
storage/xtradb/build/debian/percona-xtradb-server-5.1.percona-xtradb-server.logrotate
storage/xtradb/build/debian/percona-xtradb-server-5.1.postinst
storage/xtradb/build/debian/percona-xtradb-server-5.1.postrm
storage/xtradb/build/debian/percona-xtradb-server-5.1.preinst
storage/xtradb/build/debian/percona-xtradb-server-5.1.prerm
storage/xtradb/build/debian/percona-xtradb-server-5.1.templates
storage/xtradb/build/debian/po/
storage/xtradb/build/debian/po/POTFILES.in
storage/xtradb/build/debian/po/ar.po
storage/xtradb/build/debian/po/ca.po
storage/xtradb/build/debian/po/cs.po
storage/xtradb/build/debian/po/da.po
storage/xtradb/build/debian/po/de.po
storage/xtradb/build/debian/po/es.po
storage/xtradb/build/debian/po/eu.po
storage/xtradb/build/debian/po/fr.po
storage/xtradb/build/debian/po/gl.po
storage/xtradb/build/debian/po/it.po
storage/xtradb/build/debian/po/ja.po
storage/xtradb/build/debian/po/nb.po
storage/xtradb/build/debian/po/nl.po
storage/xtradb/build/debian/po/pt.po
storage/xtradb/build/debian/po/pt_BR.po
storage/xtradb/build/debian/po/ro.po
storage/xtradb/build/debian/po/ru.po
storage/xtradb/build/debian/po/sv.po
storage/xtradb/build/debian/po/templates.pot
storage/xtradb/build/debian/po/tr.po
storage/xtradb/build/debian/rules
storage/xtradb/build/debian/source.lintian-overrides
storage/xtradb/build/debian/watch
storage/xtradb/build/percona-sql.spec
renamed:
mysql-test/r/innodb_bug39438.result => mysql-test/suite/innodb/r/innodb_bug39438.result
mysql-test/r/variables+c.result => mysql-test/r/variables_community.result
mysql-test/t/innodb-use-sys-malloc.test => mysql-test/suite/innodb/t/innodb-use-sys-malloc.test
mysql-test/t/innodb_bug39438-master.opt => mysql-test/suite/innodb/t/innodb_bug39438-master.opt
mysql-test/t/innodb_bug39438.test => mysql-test/suite/innodb/t/innodb_bug39438.test
mysql-test/t/variables+c.test => mysql-test/t/variables_community.test
modified:
.bzrignore
COPYING
INSTALL-SOURCE
INSTALL-WIN-SOURCE
client/mysql.cc
client/mysql_upgrade.c
client/mysqladmin.cc
client/mysqlbinlog.cc
client/mysqlcheck.c
client/mysqldump.c
client/mysqlimport.c
client/mysqlshow.c
client/mysqlslap.c
client/mysqltest.cc
cmd-line-utils/readline/rlmbutil.h
configure.in
extra/libevent/event-internal.h
extra/yassl/include/yassl_error.hpp
extra/yassl/src/ssl.cpp
extra/yassl/src/yassl_error.cpp
include/Makefile.am
include/my_global.h
include/my_sys.h
include/mysql/plugin.h
include/mysql/plugin.h.pp
libmysql/libmysql.c
man/comp_err.1
man/innochecksum.1
man/make_win_bin_dist.1
man/msql2mysql.1
man/my_print_defaults.1
man/myisam_ftdump.1
man/myisamchk.1
man/myisamlog.1
man/myisampack.1
man/mysql-stress-test.pl.1
man/mysql-test-run.pl.1
man/mysql.1
man/mysql.server.1
man/mysql_client_test.1
man/mysql_config.1
man/mysql_convert_table_format.1
man/mysql_find_rows.1
man/mysql_fix_extensions.1
man/mysql_fix_privilege_tables.1
man/mysql_install_db.1
man/mysql_secure_installation.1
man/mysql_setpermission.1
man/mysql_tzinfo_to_sql.1
man/mysql_upgrade.1
man/mysql_waitpid.1
man/mysql_zap.1
man/mysqlaccess.1
man/mysqladmin.1
man/mysqlbinlog.1
man/mysqlbug.1
man/mysqlcheck.1
man/mysqld.8
man/mysqld_multi.1
man/mysqld_safe.1
man/mysqldump.1
man/mysqldumpslow.1
man/mysqlhotcopy.1
man/mysqlimport.1
man/mysqlmanager.8
man/mysqlshow.1
man/mysqlslap.1
man/mysqltest.1
man/ndbd.8
man/ndbd_redo_log_reader.1
man/ndbmtd.8
man/perror.1
man/replace.1
man/resolve_stack_dump.1
man/resolveip.1
mysql-test/Makefile.am
mysql-test/collections/default.daily
mysql-test/collections/default.push
mysql-test/extra/rpl_tests/rpl_get_master_version_and_clock.test
mysql-test/extra/rpl_tests/rpl_loaddata.test
mysql-test/include/mtr_warnings.sql
mysql-test/include/test_fieldsize.inc
mysql-test/lib/My/ConfigFactory.pm
mysql-test/lib/My/SafeProcess.pm
mysql-test/lib/My/SafeProcess/safe_process_win.cc
mysql-test/lib/mtr_cases.pm
mysql-test/lib/mtr_gprof.pl
mysql-test/lib/mtr_misc.pl
mysql-test/lib/mtr_report.pm
mysql-test/lib/mtr_stress.pl
mysql-test/lib/v1/mtr_stress.pl
mysql-test/lib/v1/mysql-test-run.pl
mysql-test/mysql-stress-test.pl
mysql-test/mysql-test-run.pl
mysql-test/r/archive.result
mysql-test/r/backup.result
mysql-test/r/bigint.result
mysql-test/r/compare.result
mysql-test/r/csv.result
mysql-test/r/ctype_ldml.result
mysql-test/r/ctype_ucs.result
mysql-test/r/default.result
mysql-test/r/delete.result
mysql-test/r/error_simulation.result
mysql-test/r/explain.result
mysql-test/r/fulltext.result
mysql-test/r/func_concat.result
mysql-test/r/func_gconcat.result
mysql-test/r/func_str.result
mysql-test/r/func_time.result
mysql-test/r/gis-rtree.result
mysql-test/r/group_by.result
mysql-test/r/group_min_max.result
mysql-test/r/handler_myisam.result
mysql-test/r/having.result
mysql-test/r/index_merge_myisam.result
mysql-test/r/information_schema.result
mysql-test/r/information_schema_all_engines.result
mysql-test/r/innodb_mysql.result
mysql-test/r/join.result
mysql-test/r/join_outer.result
mysql-test/r/loaddata.result
mysql-test/r/log_state.result
mysql-test/r/merge.result
mysql-test/r/metadata.result
mysql-test/r/multi_update.result
mysql-test/r/myisam.result
mysql-test/r/mysqlbinlog.result
mysql-test/r/mysqlbinlog_row_innodb.result
mysql-test/r/mysqltest.result
mysql-test/r/partition.result
mysql-test/r/partition_error.result
mysql-test/r/partition_innodb.result
mysql-test/r/partition_pruning.result
mysql-test/r/partition_range.result
mysql-test/r/ps.result
mysql-test/r/query_cache_with_views.result
mysql-test/r/row.result
mysql-test/r/select.result
mysql-test/r/show_check.result
mysql-test/r/skip_name_resolve.result
mysql-test/r/sp-bugs.result
mysql-test/r/sp-error.result
mysql-test/r/sp.result
mysql-test/r/sp_notembedded.result
mysql-test/r/sp_trans.result
mysql-test/r/subselect.result
mysql-test/r/subselect3.result
mysql-test/r/symlink.result
mysql-test/r/table_elim.result
mysql-test/r/trigger.result
mysql-test/r/type_bit.result
mysql-test/r/type_blob.result
mysql-test/r/type_date.result
mysql-test/r/type_datetime.result
mysql-test/r/type_timestamp.result
mysql-test/r/type_year.result
mysql-test/r/union.result
mysql-test/r/update.result
mysql-test/r/variables.result
mysql-test/r/variables_debug.result
mysql-test/r/view.result
mysql-test/r/view_grant.result
mysql-test/r/warnings.result
mysql-test/r/xa.result
mysql-test/suite/binlog/r/binlog_innodb_row.result
mysql-test/suite/binlog/r/binlog_row_mix_innodb_myisam.result
mysql-test/suite/binlog/r/binlog_stm_binlog.result
mysql-test/suite/binlog/r/binlog_stm_mix_innodb_myisam.result
mysql-test/suite/binlog/r/binlog_stm_unsafe_warning.result
mysql-test/suite/binlog/r/binlog_tmp_table.result
mysql-test/suite/binlog/r/binlog_unsafe.result
mysql-test/suite/binlog/t/binlog_innodb_row.test
mysql-test/suite/binlog/t/binlog_killed.test
mysql-test/suite/binlog/t/binlog_stm_binlog.test
mysql-test/suite/binlog/t/binlog_stm_unsafe_warning.test
mysql-test/suite/binlog/t/binlog_tmp_table.test
mysql-test/suite/federated/federated.result
mysql-test/suite/federated/federated.test
mysql-test/suite/funcs_1/r/is_columns_is.result
mysql-test/suite/funcs_1/r/is_tables_is.result
mysql-test/suite/maria/t/maria-recovery-bitmap.test
mysql-test/suite/parts/inc/partition_auto_increment.inc
mysql-test/suite/parts/r/partition_auto_increment_archive.result
mysql-test/suite/parts/r/partition_auto_increment_blackhole.result
mysql-test/suite/parts/r/partition_auto_increment_innodb.result
mysql-test/suite/parts/r/partition_auto_increment_maria.result
mysql-test/suite/parts/r/partition_auto_increment_memory.result
mysql-test/suite/parts/r/partition_auto_increment_myisam.result
mysql-test/suite/parts/r/partition_auto_increment_ndb.result
mysql-test/suite/pbxt/r/default.result
mysql-test/suite/pbxt/r/func_str.result
mysql-test/suite/pbxt/r/group_min_max.result
mysql-test/suite/pbxt/r/join_nested.result
mysql-test/suite/pbxt/r/mysqlshow.result
mysql-test/suite/pbxt/r/negation_elimination.result
mysql-test/suite/pbxt/r/null.result
mysql-test/suite/pbxt/r/order_by.result
mysql-test/suite/pbxt/r/pbxt_ref_int.result
mysql-test/suite/pbxt/r/pbxt_xa.result
mysql-test/suite/pbxt/r/range.result
mysql-test/suite/pbxt/r/select.result
mysql-test/suite/pbxt/r/select_safe.result
mysql-test/suite/pbxt/r/subselect.result
mysql-test/suite/pbxt/r/type_timestamp.result
mysql-test/suite/pbxt/t/pbxt_xa.test
mysql-test/suite/pbxt/t/select_safe.test
mysql-test/suite/rpl/r/rpl_begin_commit_rollback.result
mysql-test/suite/rpl/r/rpl_do_grant.result
mysql-test/suite/rpl/r/rpl_events.result
mysql-test/suite/rpl/r/rpl_get_master_version_and_clock.result
mysql-test/suite/rpl/r/rpl_innodb_mixed_dml.result
mysql-test/suite/rpl/r/rpl_row_create_table.result
mysql-test/suite/rpl/r/rpl_sp.result
mysql-test/suite/rpl/t/disabled.def
mysql-test/suite/rpl/t/rpl_begin_commit_rollback.test
mysql-test/suite/rpl/t/rpl_do_grant.test
mysql-test/suite/rpl/t/rpl_events.test
mysql-test/suite/rpl/t/rpl_get_master_version_and_clock.test
mysql-test/suite/rpl/t/rpl_loaddata_symlink.test
mysql-test/suite/rpl/t/rpl_row_create_table.test
mysql-test/suite/rpl/t/rpl_slave_skip.test
mysql-test/suite/sys_vars/r/log_basic.result
mysql-test/suite/sys_vars/r/log_bin_trust_routine_creators_basic.result
mysql-test/suite/sys_vars/r/myisam_sort_buffer_size_basic_32.result
mysql-test/suite/sys_vars/r/myisam_sort_buffer_size_basic_64.result
mysql-test/suite/sys_vars/r/slow_query_log_func.result
mysql-test/suite/sys_vars/t/innodb_table_locks_func.test
mysql-test/suite/sys_vars/t/slow_query_log_func.test
mysql-test/suite/sys_vars/t/sql_low_priority_updates_func.test
mysql-test/t/archive.test
mysql-test/t/bigint.test
mysql-test/t/csv.test
mysql-test/t/ctype_ldml.test
mysql-test/t/ctype_ucs.test
mysql-test/t/delete.test
mysql-test/t/disabled.def
mysql-test/t/error_simulation.test
mysql-test/t/explain.test
mysql-test/t/fulltext.test
mysql-test/t/func_concat.test
mysql-test/t/func_gconcat.test
mysql-test/t/func_str.test
mysql-test/t/gis-rtree.test
mysql-test/t/group_by.test
mysql-test/t/group_min_max.test
mysql-test/t/handler_myisam.test
mysql-test/t/having.test
mysql-test/t/information_schema_all_engines.test
mysql-test/t/innodb_mysql.test
mysql-test/t/join.test
mysql-test/t/join_outer.test
mysql-test/t/loaddata.test
mysql-test/t/merge.test
mysql-test/t/metadata.test
mysql-test/t/multi_update.test
mysql-test/t/myisam.test
mysql-test/t/mysql_upgrade.test
mysql-test/t/mysqlbinlog.test
mysql-test/t/mysqltest.test
mysql-test/t/partition.test
mysql-test/t/partition_error.test
mysql-test/t/partition_innodb.test
mysql-test/t/partition_innodb_plugin.test
mysql-test/t/partition_innodb_semi_consistent.test
mysql-test/t/partition_pruning.test
mysql-test/t/partition_range.test
mysql-test/t/ps.test
mysql-test/t/query_cache_with_views.test
mysql-test/t/row.test
mysql-test/t/skip_name_resolve.test
mysql-test/t/sp-bugs.test
mysql-test/t/sp_notembedded.test
mysql-test/t/subselect.test
mysql-test/t/symlink.test
mysql-test/t/trigger.test
mysql-test/t/type_bit.test
mysql-test/t/type_date.test
mysql-test/t/type_year.test
mysql-test/t/udf.test
mysql-test/t/update.test
mysql-test/t/variables.test
mysql-test/t/variables_debug.test
mysql-test/t/view.test
mysql-test/t/view_grant.test
mysql-test/t/xa.test
mysys/charset.c
mysys/default.c
mysys/mf_loadpath.c
mysys/mf_pack.c
mysys/my_alloc.c
mysys/my_file.c
mysys/my_getwd.c
mysys/my_init.c
mysys/my_symlink.c
scripts/fill_help_tables.sql
scripts/make_binary_distribution.sh
scripts/make_win_bin_dist
scripts/mysql_system_tables_fix.sql
scripts/mysqld_safe.sh
scripts/mysqlhotcopy.sh
server-tools/instance-manager/options.cc
sql/CMakeLists.txt
sql/debug_sync.cc
sql/debug_sync.h
sql/events.cc
sql/field.cc
sql/field.h
sql/field_conv.cc
sql/filesort.cc
sql/ha_ndbcluster.cc
sql/ha_partition.cc
sql/handler.cc
sql/handler.h
sql/item.cc
sql/item.h
sql/item_cmpfunc.cc
sql/item_cmpfunc.h
sql/item_create.cc
sql/item_create.h
sql/item_func.cc
sql/item_row.cc
sql/item_row.h
sql/item_strfunc.cc
sql/item_strfunc.h
sql/item_subselect.cc
sql/item_subselect.h
sql/item_sum.cc
sql/item_sum.h
sql/item_timefunc.cc
sql/log.cc
sql/log_event.cc
sql/log_event.h
sql/log_event_old.cc
sql/mysql_priv.h
sql/mysqld.cc
sql/opt_range.cc
sql/opt_range.h
sql/opt_sum.cc
sql/protocol.cc
sql/rpl_utility.cc
sql/rpl_utility.h
sql/set_var.cc
sql/share/errmsg.txt
sql/slave.cc
sql/sp.cc
sql/sp_cache.cc
sql/sp_head.cc
sql/sp_head.h
sql/sql_acl.cc
sql/sql_base.cc
sql/sql_class.cc
sql/sql_class.h
sql/sql_delete.cc
sql/sql_insert.cc
sql/sql_lex.cc
sql/sql_lex.h
sql/sql_load.cc
sql/sql_parse.cc
sql/sql_partition.cc
sql/sql_plugin.cc
sql/sql_profile.cc
sql/sql_repl.cc
sql/sql_select.cc
sql/sql_select.h
sql/sql_show.cc
sql/sql_table.cc
sql/sql_trigger.cc
sql/sql_update.cc
sql/sql_view.cc
sql/sql_yacc.yy
sql/table.cc
sql/table.h
storage/archive/ha_archive.cc
storage/csv/ha_tina.cc
storage/example/ha_example.h
storage/federated/ha_federated.cc
storage/federated/ha_federated.h
storage/myisam/ft_boolean_search.c
storage/myisam/ha_myisam.cc
storage/myisam/mi_check.c
storage/myisam/mi_delete_all.c
storage/myisam/mi_delete_table.c
storage/myisam/mi_dynrec.c
storage/myisam/mi_extra.c
storage/myisam/mi_locking.c
storage/myisam/mi_open.c
storage/myisam/mi_page.c
storage/myisam/mi_rnext.c
storage/myisam/mi_write.c
storage/myisam/myisamdef.h
storage/myisam/rt_index.c
storage/myisam/rt_split.c
storage/myisam/sort.c
storage/myisammrg/ha_myisammrg.cc
storage/myisammrg/myrg_open.c
storage/pbxt/ChangeLog
storage/pbxt/Makefile.am
storage/pbxt/src/backup_xt.cc
storage/pbxt/src/cache_xt.cc
storage/pbxt/src/cache_xt.h
storage/pbxt/src/database_xt.cc
storage/pbxt/src/database_xt.h
storage/pbxt/src/datadic_xt.cc
storage/pbxt/src/datadic_xt.h
storage/pbxt/src/datalog_xt.cc
storage/pbxt/src/filesys_xt.h
storage/pbxt/src/ha_pbxt.cc
storage/pbxt/src/index_xt.cc
storage/pbxt/src/index_xt.h
storage/pbxt/src/lock_xt.cc
storage/pbxt/src/lock_xt.h
storage/pbxt/src/locklist_xt.cc
storage/pbxt/src/myxt_xt.cc
storage/pbxt/src/pthread_xt.cc
storage/pbxt/src/pthread_xt.h
storage/pbxt/src/restart_xt.cc
storage/pbxt/src/restart_xt.h
storage/pbxt/src/strutil_xt.cc
storage/pbxt/src/tabcache_xt.cc
storage/pbxt/src/tabcache_xt.h
storage/pbxt/src/table_xt.cc
storage/pbxt/src/table_xt.h
storage/pbxt/src/thread_xt.cc
storage/pbxt/src/thread_xt.h
storage/pbxt/src/trace_xt.cc
storage/pbxt/src/trace_xt.h
storage/pbxt/src/xaction_xt.cc
storage/pbxt/src/xaction_xt.h
storage/pbxt/src/xactlog_xt.cc
storage/pbxt/src/xactlog_xt.h
storage/pbxt/src/xt_config.h
storage/pbxt/src/xt_defs.h
storage/xtradb/btr/btr0btr.c
storage/xtradb/btr/btr0cur.c
storage/xtradb/btr/btr0pcur.c
storage/xtradb/btr/btr0sea.c
storage/xtradb/buf/buf0buddy.c
storage/xtradb/buf/buf0buf.c
storage/xtradb/buf/buf0flu.c
storage/xtradb/buf/buf0rea.c
storage/xtradb/dict/dict0crea.c
storage/xtradb/dict/dict0dict.c
storage/xtradb/dict/dict0mem.c
storage/xtradb/fil/fil0fil.c
storage/xtradb/fsp/fsp0fsp.c
storage/xtradb/handler/ha_innodb.cc
storage/xtradb/handler/ha_innodb.h
storage/xtradb/handler/i_s.cc
storage/xtradb/handler/i_s.h
storage/xtradb/handler/innodb_patch_info.h
storage/xtradb/include/btr0btr.ic
storage/xtradb/include/buf0buddy.h
storage/xtradb/include/buf0buf.h
storage/xtradb/include/buf0buf.ic
storage/xtradb/include/buf0types.h
storage/xtradb/include/dict0dict.h
storage/xtradb/include/dict0mem.h
storage/xtradb/include/fil0fil.h
storage/xtradb/include/fsp0types.h
storage/xtradb/include/fut0fut.ic
storage/xtradb/include/ha_prototypes.h
storage/xtradb/include/page0cur.h
storage/xtradb/include/page0types.h
storage/xtradb/include/srv0srv.h
storage/xtradb/include/trx0sys.h
storage/xtradb/include/univ.i
storage/xtradb/include/ut0rnd.h
storage/xtradb/include/ut0rnd.ic
storage/xtradb/lock/lock0lock.c
storage/xtradb/log/log0log.c
storage/xtradb/log/log0recv.c
storage/xtradb/mtr/mtr0log.c
storage/xtradb/page/page0cur.c
storage/xtradb/page/page0zip.c
storage/xtradb/row/row0ins.c
storage/xtradb/row/row0merge.c
storage/xtradb/row/row0sel.c
storage/xtradb/srv/srv0srv.c
storage/xtradb/srv/srv0start.c
storage/xtradb/trx/trx0i_s.c
storage/xtradb/trx/trx0trx.c
support-files/compiler_warnings.supp
support-files/mysql.spec.sh
The size of the diff (1809184 lines) is larger than your specified limit of 5000 lines
--
lp:~maria-captains/maria/5.1-converting
https://code.launchpad.net/~maria-captains/maria/5.1-converting
Your team Maria developers is subscribed to branch lp:~maria-captains/maria/5.1-converting.
To unsubscribe from this branch go to https://code.launchpad.net/~maria-captains/maria/5.1-converting/+edit-subsc…

Re: [Maria-developers] [Commits] Rev 2861: fix questionable UNIV_EXPECT's in the xtradb that confused old gcc. in http://bazaar.launchpad.net/~maria-captains/maria/5.1/
by Michael Widenius 16 Jun '10
Hi!
>>>>> "serg" == serg <serg(a)askmonty.org> writes:
serg> At http://bazaar.launchpad.net/~maria-captains/maria/5.1/
serg> ------------------------------------------------------------
serg> revno: 2861
serg> revision-id: sergii(a)pisem.net-20100609115351-op2cui7bw14y76kp
serg> parent: knielsen(a)knielsen-hq.org-20100531084334-81f5z74nxx6v9zww
serg> committer: Sergei Golubchik <sergii(a)pisem.net>
serg> branch nick: 5.1
serg> timestamp: Wed 2010-06-09 13:53:51 +0200
serg> message:
serg> fix questionable UNIV_EXPECT's in the xtradb that confused old gcc.
I assume you have cc: the XtraDB developers about this change so that
we don't have to do it over and over again?
Regards,
Monty

[Maria-developers] [Commits] Rev 2866: mysqltest: use setenv, not putenv, to make gcov happy. in http://bazaar.launchpad.net/~maria-captains/maria/5.1/
by Michael Widenius 16 Jun '10
Hi!
>>>>> "serg" == serg <serg(a)askmonty.org> writes:
serg> At http://bazaar.launchpad.net/~maria-captains/maria/5.1/
serg> ------------------------------------------------------------
serg> revno: 2866
serg> revision-id: sergii(a)pisem.net-20100614091854-5ynq6lo943qlaacw
serg> parent: monty(a)askmonty.org-20100613221332-ldsnptg0j0mn8u9a
serg> committer: Sergei Golubchik <sergii(a)pisem.net>
serg> branch nick: 5.1
serg> timestamp: Mon 2010-06-14 11:18:54 +0200
serg> message:
serg> mysqltest: use setenv, not putenv, to make gcov happy.
serg> (backport from MySQL)
+static int setenv(const char *name, const char *value, int overwrite)
+{
+ size_t buflen= strlen(name) + strlen(value) + 2;
+ char *envvar= (char *)malloc(buflen);
+ if(!envvar)
+ return ENOMEM;
+ strcpy(envvar, name);
+ strcat(envvar, "=");
+ strcat(envvar, value);
+ putenv(envvar);
+ return 0;
+}
+#endif
I expected better from you :)
A much better version is:
strcat(strcat(strmov(envvar, name), "="), value);
The other question I have is: will this not cause a memory leak?
If we allocate the same string many times in here, it will definitely
be a memory leak, as putenv() will never free the old value.
Regards,
Monty
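A leak-aware sketch of such a setenv() fallback, assuming only standard putenv() semantics (illustrative only, not the committed code): it remembers the buffer last handed to putenv() for each name and frees it on the next call for that name, so repeated calls don't leak.

#include <errno.h>
#include <stdlib.h>
#include <string.h>

struct env_entry { char *name; char *buf; struct env_entry *next; };
static struct env_entry *env_list;

static int setenv_fallback(const char *name, const char *value, int overwrite)
{
  size_t buflen= strlen(name) + strlen(value) + 2;  /* name + '=' + value + NUL */
  char *envvar= (char*)malloc(buflen);
  struct env_entry *e;
  (void)overwrite;                                  /* not honoured in this sketch */
  if (!envvar)
    return ENOMEM;
  strcpy(envvar, name);
  strcat(envvar, "=");
  strcat(envvar, value);
  if (putenv(envvar) != 0)
  {
    free(envvar);
    return errno;
  }
  /* Free the buffer we installed for this name previously, if any. */
  for (e= env_list; e; e= e->next)
  {
    if (strcmp(e->name, name) == 0)
    {
      free(e->buf);                 /* previous buffer is no longer referenced */
      e->buf= envvar;
      return 0;
    }
  }
  e= (struct env_entry*)malloc(sizeof(*e));
  if (e && (e->name= (char*)malloc(strlen(name) + 1)) != NULL)
  {
    strcpy(e->name, name);
    e->buf= envvar;
    e->next= env_list;
    env_list= e;
  }
  else
    free(e);   /* tracking failed; the installed buffer simply stays live */
  return 0;
}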

[Maria-developers] Updated (by Psergey): Add DS-MRR support for clustered primary keys (121)
by worklog-noreply@askmonty.org 13 Jun '10
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: Add DS-MRR support for clustered primary keys
CREATION DATE..: Sat, 12 Jun 2010, 08:23
SUPERVISOR.....: Igor
IMPLEMENTOR....:
COPIES TO......:
CATEGORY.......: Client-BackLog
TASK ID........: 121 (http://askmonty.org/worklog/?tid=121)
VERSION........: Benchmarks-3.0
STATUS.........: Un-Assigned
PRIORITY.......: 60
WORKED HOURS...: 0
ESTIMATE.......: 0 (hours remain)
ORIG. ESTIMATE.: 0
PROGRESS NOTES:
-=-=(Psergey - Sun, 13 Jun 2010, 11:56)=-=-
High-Level Specification modified.
--- /tmp/wklog.121.old.9396 2010-06-13 11:56:34.000000000 +0000
+++ /tmp/wklog.121.new.9396 2010-06-13 11:56:34.000000000 +0000
@@ -1,17 +1,55 @@
-Basic idea: DS-MRR scan should be done as follows:
+1. Choices to be made
+---------------------
-1. Sort incoming keys
-2. Use the sorted keys to do a disk-ordered retrieval
+1.1 Handling of complex ranges
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+The "sort incoming keys" part is easy when we have only equality ranges.
+If we allow ranges of arbitrary form (including ranges with one endpoint
+being infinity and/or ranges overlapping with one another), sorting becomes
+non-trivial. Do we need to support this case or support only equality ranges?
-Unresolved questions:
+Decision: the new code should handle only the case with equality ranges.
+For non-equality ranges, the execution will proceed as before.
-* The "sort incoming keys" part is trivial when we have only equality ranges.
- If we allow ranges of arbitrary form ( including ranges with one endpoint
- being infinity or ranges overlapping with one another), sorting becomes
- non-trivial. Do we need to support this case or support only equality ranges?
+1.2 Handling index prefix scans
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+What do we do if asked to do a scan on a prefix of clustered PK?
-* Can/should we use the fact rowid=={clustered PK value} for InnoDB's clustered
- PKs? (current decision: No)
+Decision: handle this if the ranges are equality ranges. The difference from
+scan on full primary key is that
+- we will have to use read_range_XXX() or index_read()/index_next_same()
+ functions, while for full primary key value we could have used rnd_pos().
+- One equality range can produce multiple matching records.
-* Do we support scanning on a prefix of clustered PK? (Yes but scan on prefix
- of clustered PK will not be in disk order. We need to run it with regular mode)
+1.3 Use of knowledge that primary_key==rowid
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+Can/should we use the fact rowid=={clustered PK value} for InnoDB's clustered
+PKs?
+Decision: don't make this assumption.
+
+
+2. Code-level changes overview
+------------------------------
+
+DsMrr_impl::choose_mrr_impl():
+Enable MRR when
+ - ihis is a clustered primary key
+ - incoming ranges are single-point (HA_MRR_SINGLE_POINT is set)
+ - will need to make the SQL layer to set this flag
+ - incoming ranges are not already sorted (HA_MRR_SORTED is not set)
+
+(TODO do we need new cost formula?)
+
+DsMrr_impl::dsmrr_init()
+ - different elem_size (not rowid length but key tuple length)
+ - don't create the secondary handler object, we won't need it.
+
+DsMrr_impl::dsmrr_fill_buffer():
+ - need a variant of this function that would not access the index but just
+ fill and sort the array.
+
+DsMrr_impl::dsmrr_next():
+ - should abstract-out:
+ - buffer element size
+ - rnd_pos/index_read call.
+ - Also for CPK prefix scans there can be multi
-=-=(Psergey - Sun, 13 Jun 2010, 11:55)=-=-
High Level Description modified.
--- /tmp/wklog.121.old.9380 2010-06-13 11:55:42.000000000 +0000
+++ /tmp/wklog.121.new.9380 2010-06-13 11:55:42.000000000 +0000
@@ -1,18 +1,18 @@
-Currently, DS-MRR doesn't support operation over clustered primary keys. The
-reason for this was that
- - Clustered primary keys are stored in disk order and so, if the sequence of
- ranges is ordered, the reads will already go in disk order (and so DS-MRR's
- step of re-ordering reads is not necessary).
+Currently, DS-MRR doesn't allow to do MRR scans over clustered primary keys. The
+reason for this is that
+ - Clustered primary keys are stored in disk order, so, if the sequence of
+ scanned ranges is ordered, the reads will automatically happen in disk
+ order, and DS-MRR's step of re-ordering reads is redundant.
- Within DS-MRR implementation, the "get rowids from keys" step is not
necessary when using clustered primary key, because in InnoDB/XtraDB
clustered primary key values are the rowids.
-However, with BKA making the MRR calls, there are cases where DS-MRR over
+However, when MRR calls are made by BKA, there are cases where DS-MRR over
clustered primary key is beneficial:
* BKA may provide lookup keys that have duplicates and/or are in arbitrary
order. In that case, DS-MRR implementation may sort the key values and
- order them, so that it hits the disk in key order.
+ order them, so that it hits the disk in key(=disk) order.
* When running multi-table join with high @@join_cache_level value (and so,
linked join buffers), lack of MRR implementation causes the chain of linked
@@ -20,3 +20,9 @@
* TODO anything else?
+This WL entry is about addressing the above by adding support of DS-MRR over
+clustered primary key that would work according to this strategy:
+1. Sort incoming keys
+2. Use the sorted keys to do a disk-ordered retrieval.
+
+
-=-=(Psergey - Sun, 13 Jun 2010, 09:42)=-=-
High-Level Specification modified.
--- /tmp/wklog.121.old.5009 2010-06-13 09:42:38.000000000 +0000
+++ /tmp/wklog.121.new.5009 2010-06-13 09:42:38.000000000 +0000
@@ -6,11 +6,12 @@
Unresolved questions:
* The "sort incoming keys" part is trivial when we have only equality ranges.
- If we allow ranges of arbitrary form ( ncluding ranges with one endpoint
+ If we allow ranges of arbitrary form ( including ranges with one endpoint
being infinity or ranges overlapping with one another), sorting becomes
- non-trival. Do we need to support this case or support only equality ranges?
+ non-trivial. Do we need to support this case or support only equality ranges?
* Can/should we use the fact rowid=={clustered PK value} for InnoDB's clustered
- PKs?
+ PKs? (current decision: No)
-* Do we support scanning on a prefix of clustered PK? (seems to be yes?)
+* Do we support scanning on a prefix of clustered PK? (Yes but scan on prefix
+ of clustered PK will not be in disk order. We need to run it with regular mode)
-=-=(Psergey - Sat, 12 Jun 2010, 08:39)=-=-
High-Level Specification modified.
--- /tmp/wklog.121.old.538 2010-06-12 08:39:46.000000000 +0000
+++ /tmp/wklog.121.new.538 2010-06-12 08:39:46.000000000 +0000
@@ -1 +1,16 @@
+Basic idea: DS-MRR scan should be done as follows:
+1. Sort incoming keys
+2. Use the sorted keys to do a disk-ordered retrieval
+
+Unresolved questions:
+
+* The "sort incoming keys" part is trivial when we have only equality ranges.
+ If we allow ranges of arbitrary form ( ncluding ranges with one endpoint
+ being infinity or ranges overlapping with one another), sorting becomes
+ non-trival. Do we need to support this case or support only equality ranges?
+
+* Can/should we use the fact rowid=={clustered PK value} for InnoDB's clustered
+ PKs?
+
+* Do we support scanning on a prefix of clustered PK? (seems to be yes?)
DESCRIPTION:
Currently, DS-MRR doesn't allow MRR scans over clustered primary keys. The
reason for this is that
- Clustered primary keys are stored in disk order, so, if the sequence of
scanned ranges is ordered, the reads will automatically happen in disk
order, and DS-MRR's step of re-ordering reads is redundant.
- Within DS-MRR implementation, the "get rowids from keys" step is not
necessary when using clustered primary key, because in InnoDB/XtraDB
clustered primary key values are the rowids.
However, when MRR calls are made by BKA, there are cases where DS-MRR over
clustered primary key is beneficial:
* BKA may provide lookup keys that have duplicates and/or are in arbitrary
order. In that case, the DS-MRR implementation may sort the key values, so that
it hits the disk in key(=disk) order.
* When running multi-table join with high @@join_cache_level value (and so,
linked join buffers), lack of MRR implementation causes the chain of linked
join buffers to break. (TODO and so what? Is that really a problem?)
* TODO anything else?
This WL entry is about addressing the above by adding support for DS-MRR over
a clustered primary key that would work according to this strategy (sketched below):
1. Sort incoming keys
2. Use the sorted keys to do a disk-ordered retrieval.
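As a rough illustration of this two-step strategy, here is a sketch with a hypothetical key type and a stubbed retrieval call (not the actual handler code):

#include <stdlib.h>

/* Hypothetical stand-ins: a lookup key here is just a clustered-PK value,
   and fetch_by_pk() stands for whatever handler call retrieves the row. */
typedef unsigned long long pk_t;

static void fetch_by_pk(pk_t pk) { (void)pk; /* stub: read the row at 'pk' */ }

static int cmp_pk(const void *a, const void *b)
{
  pk_t x= *(const pk_t*)a, y= *(const pk_t*)b;
  return (x > y) - (x < y);
}

/* 1. Sort incoming keys: for a clustered PK, key order equals disk order.
   2. Retrieve in that order, so the reads sweep the disk sequentially.   */
static void dsmrr_scan(pk_t *keys, size_t n)
{
  size_t i;
  qsort(keys, n, sizeof(pk_t), cmp_pk);
  for (i= 0; i < n; i++)
    fetch_by_pk(keys[i]);
}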
HIGH-LEVEL SPECIFICATION:
1. Choices to be made
---------------------
1.1 Handling of complex ranges
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The "sort incoming keys" part is easy when we have only equality ranges.
If we allow ranges of arbitrary form (including ranges with one endpoint
being infinity and/or ranges overlapping with one another), sorting becomes
non-trivial. Do we need to support this case or support only equality ranges?
Decision: the new code should handle only the case with equality ranges.
For non-equality ranges, the execution will proceed as before.
1.2 Handling index prefix scans
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
What do we do if asked to do a scan on a prefix of clustered PK?
Decision: handle this if the ranges are equality ranges. The difference from
a scan on the full primary key (contrasted in the sketch after this list) is that
- we will have to use read_range_XXX() or index_read()/index_next_same()
functions, while for full primary key value we could have used rnd_pos().
- One equality range can produce multiple matching records.
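A simplified contrast of the two retrieval modes, using stand-in stubs for the calls named above (rnd_pos() vs. index_read()/index_next_same(); the real handler signatures differ):

/* Hypothetical stubs standing in for the handler calls. */
static int rnd_pos_fetch(const char *full_pk)       /* full PK: one row   */
{ (void)full_pk; return 0; }
static int index_read_first(const char *prefix)     /* first prefix match */
{ (void)prefix; return 0; }
static int index_next_same(const char *prefix)      /* next match, or -1  */
{ (void)prefix; return -1; }

static void scan_full_pk(const char *pk)
{
  rnd_pos_fetch(pk);                /* exactly one matching row */
}

static void scan_pk_prefix(const char *prefix)
{
  /* One equality range on a PK prefix can match many rows: iterate. */
  int err= index_read_first(prefix);
  while (err == 0)
    err= index_next_same(prefix);
}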
1.3 Use of knowledge that primary_key==rowid
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Can/should we use the fact rowid=={clustered PK value} for InnoDB's clustered
PKs?
Decision: don't make this assumption.
2. Code-level changes overview
------------------------------
DsMrr_impl::choose_mrr_impl():
Enable MRR when (see the sketch at the end of this section):
- this is a clustered primary key
- incoming ranges are single-point (HA_MRR_SINGLE_POINT is set)
- will need to make the SQL layer set this flag
- incoming ranges are not already sorted (HA_MRR_SORTED is not set)
(TODO do we need new cost formula?)
DsMrr_impl::dsmrr_init()
- different elem_size (not rowid length but key tuple length)
- don't create the secondary handler object, we won't need it.
DsMrr_impl::dsmrr_fill_buffer():
- need a variant of this function that would not access the index but just
fill and sort the array.
DsMrr_impl::dsmrr_next():
- should abstract-out:
- buffer element size
- rnd_pos/index_read call.
- Also for CPK prefix scans there can be multiple matching records per range.
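A minimal sketch of the enabling check in DsMrr_impl::choose_mrr_impl() as described above, with illustrative flag values and a hypothetical helper name (not the server's actual code):

/* Illustrative flag bits; the server's actual HA_MRR_* values differ. */
#define HA_MRR_SINGLE_POINT 1U
#define HA_MRR_SORTED       2U

/* Hypothetical helper mirroring the three enabling conditions above. */
static int use_cpk_dsmrr(int scan_is_on_clustered_pk, unsigned mrr_flags)
{
  return scan_is_on_clustered_pk &&             /* clustered primary key          */
         (mrr_flags & HA_MRR_SINGLE_POINT) &&   /* equality (single-point) ranges */
         !(mrr_flags & HA_MRR_SORTED);          /* caller needs no sorted output  */
}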
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v3.5.9)

[Maria-developers] Updated (by Psergey): Add DS-MRR support for clustered primary keys (121)
by worklog-noreply@askmonty.org 13 Jun '10
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: Add DS-MRR support for clustered primary keys
CREATION DATE..: Sat, 12 Jun 2010, 08:23
SUPERVISOR.....: Igor
IMPLEMENTOR....:
COPIES TO......:
CATEGORY.......: Client-BackLog
TASK ID........: 121 (http://askmonty.org/worklog/?tid=121)
VERSION........: Benchmarks-3.0
STATUS.........: Un-Assigned
PRIORITY.......: 60
WORKED HOURS...: 0
ESTIMATE.......: 0 (hours remain)
ORIG. ESTIMATE.: 0
PROGRESS NOTES:
-=-=(Psergey - Sun, 13 Jun 2010, 11:55)=-=-
High Level Description modified.
--- /tmp/wklog.121.old.9380 2010-06-13 11:55:42.000000000 +0000
+++ /tmp/wklog.121.new.9380 2010-06-13 11:55:42.000000000 +0000
@@ -1,18 +1,18 @@
-Currently, DS-MRR doesn't support operation over clustered primary keys. The
-reason for this was that
- - Clustered primary keys are stored in disk order and so, if the sequence of
- ranges is ordered, the reads will already go in disk order (and so DS-MRR's
- step of re-ordering reads is not necessary).
+Currently, DS-MRR doesn't allow to do MRR scans over clustered primary keys. The
+reason for this is that
+ - Clustered primary keys are stored in disk order, so, if the sequence of
+ scanned ranges is ordered, the reads will automatically happen in disk
+ order, and DS-MRR's step of re-ordering reads is redundant.
- Within DS-MRR implementation, the "get rowids from keys" step is not
necessary when using clustered primary key, because in InnoDB/XtraDB
clustered primary key values are the rowids.
-However, with BKA making the MRR calls, there are cases where DS-MRR over
+However, when MRR calls are made by BKA, there are cases where DS-MRR over
clustered primary key is beneficial:
* BKA may provide lookup keys that have duplicates and/or are in arbitrary
order. In that case, DS-MRR implementation may sort the key values and
- order them, so that it hits the disk in key order.
+ order them, so that it hits the disk in key(=disk) order.
* When running multi-table join with high @@join_cache_level value (and so,
linked join buffers), lack of MRR implementation causes the chain of linked
@@ -20,3 +20,9 @@
* TODO anything else?
+This WL entry is about addressing the above by adding support of DS-MRR over
+clustered primary key that would work according to this strategy:
+1. Sort incoming keys
+2. Use the sorted keys to do a disk-ordered retrieval.
+
+
-=-=(Psergey - Sun, 13 Jun 2010, 09:42)=-=-
High-Level Specification modified.
--- /tmp/wklog.121.old.5009 2010-06-13 09:42:38.000000000 +0000
+++ /tmp/wklog.121.new.5009 2010-06-13 09:42:38.000000000 +0000
@@ -6,11 +6,12 @@
Unresolved questions:
* The "sort incoming keys" part is trivial when we have only equality ranges.
- If we allow ranges of arbitrary form ( ncluding ranges with one endpoint
+ If we allow ranges of arbitrary form ( including ranges with one endpoint
being infinity or ranges overlapping with one another), sorting becomes
- non-trival. Do we need to support this case or support only equality ranges?
+ non-trivial. Do we need to support this case or support only equality ranges?
* Can/should we use the fact rowid=={clustered PK value} for InnoDB's clustered
- PKs?
+ PKs? (current decision: No)
-* Do we support scanning on a prefix of clustered PK? (seems to be yes?)
+* Do we support scanning on a prefix of clustered PK? (Yes but scan on prefix
+ of clustered PK will not be in disk order. We need to run it with regular mode)
-=-=(Psergey - Sat, 12 Jun 2010, 08:39)=-=-
High-Level Specification modified.
--- /tmp/wklog.121.old.538 2010-06-12 08:39:46.000000000 +0000
+++ /tmp/wklog.121.new.538 2010-06-12 08:39:46.000000000 +0000
@@ -1 +1,16 @@
+Basic idea: DS-MRR scan should be done as follows:
+1. Sort incoming keys
+2. Use the sorted keys to do a disk-ordered retrieval
+
+Unresolved questions:
+
+* The "sort incoming keys" part is trivial when we have only equality ranges.
+ If we allow ranges of arbitrary form ( ncluding ranges with one endpoint
+ being infinity or ranges overlapping with one another), sorting becomes
+ non-trival. Do we need to support this case or support only equality ranges?
+
+* Can/should we use the fact rowid=={clustered PK value} for InnoDB's clustered
+ PKs?
+
+* Do we support scanning on a prefix of clustered PK? (seems to be yes?)
DESCRIPTION:
Currently, DS-MRR doesn't allow MRR scans over clustered primary keys. The
reason for this is that
- Clustered primary keys are stored in disk order, so, if the sequence of
scanned ranges is ordered, the reads will automatically happen in disk
order, and DS-MRR's step of re-ordering reads is redundant.
- Within DS-MRR implementation, the "get rowids from keys" step is not
necessary when using clustered primary key, because in InnoDB/XtraDB
clustered primary key values are the rowids.
However, when MRR calls are made by BKA, there are cases where DS-MRR over
clustered primary key is beneficial:
* BKA may provide lookup keys that have duplicates and/or are in arbitrary
order. In that case, the DS-MRR implementation may sort the key values, so that
it hits the disk in key(=disk) order.
* When running multi-table join with high @@join_cache_level value (and so,
linked join buffers), lack of MRR implementation causes the chain of linked
join buffers to break. (TODO and so what? Is that really a problem?)
* TODO anything else?
This WL entry is about addressing the above by adding support for DS-MRR over
a clustered primary key that would work according to this strategy:
1. Sort incoming keys
2. Use the sorted keys to do a disk-ordered retrieval.
HIGH-LEVEL SPECIFICATION:
Basic idea: DS-MRR scan should be done as follows:
1. Sort incoming keys
2. Use the sorted keys to do a disk-ordered retrieval
Unresolved questions:
* The "sort incoming keys" part is trivial when we have only equality ranges.
If we allow ranges of arbitrary form (including ranges with one endpoint
being infinity or ranges overlapping with one another), sorting becomes
non-trivial. Do we need to support this case or support only equality ranges?
* Can/should we use the fact rowid=={clustered PK value} for InnoDB's clustered
PKs? (current decision: No)
* Do we support scanning on a prefix of clustered PK? (Yes, but a scan on a prefix
of the clustered PK will not be in disk order; we need to run it in regular mode.)
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v3.5.9)

[Maria-developers] Updated (by Psergey): Add DS-MRR support for clustered primary keys (121)
by worklog-noreply@askmonty.org 13 Jun '10
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: Add DS-MRR support for clustered primary keys
CREATION DATE..: Sat, 12 Jun 2010, 08:23
SUPERVISOR.....: Igor
IMPLEMENTOR....:
COPIES TO......:
CATEGORY.......: Client-BackLog
TASK ID........: 121 (http://askmonty.org/worklog/?tid=121)
VERSION........: Benchmarks-3.0
STATUS.........: Un-Assigned
PRIORITY.......: 60
WORKED HOURS...: 0
ESTIMATE.......: 0 (hours remain)
ORIG. ESTIMATE.: 0
PROGRESS NOTES:
-=-=(Psergey - Sun, 13 Jun 2010, 09:42)=-=-
High-Level Specification modified.
--- /tmp/wklog.121.old.5009 2010-06-13 09:42:38.000000000 +0000
+++ /tmp/wklog.121.new.5009 2010-06-13 09:42:38.000000000 +0000
@@ -6,11 +6,12 @@
Unresolved questions:
* The "sort incoming keys" part is trivial when we have only equality ranges.
- If we allow ranges of arbitrary form ( ncluding ranges with one endpoint
+ If we allow ranges of arbitrary form ( including ranges with one endpoint
being infinity or ranges overlapping with one another), sorting becomes
- non-trival. Do we need to support this case or support only equality ranges?
+ non-trivial. Do we need to support this case or support only equality ranges?
* Can/should we use the fact rowid=={clustered PK value} for InnoDB's clustered
- PKs?
+ PKs? (current decision: No)
-* Do we support scanning on a prefix of clustered PK? (seems to be yes?)
+* Do we support scanning on a prefix of clustered PK? (Yes but scan on prefix
+ of clustered PK will not be in disk order. We need to run it with regular mode)
-=-=(Psergey - Sat, 12 Jun 2010, 08:39)=-=-
High-Level Specification modified.
--- /tmp/wklog.121.old.538 2010-06-12 08:39:46.000000000 +0000
+++ /tmp/wklog.121.new.538 2010-06-12 08:39:46.000000000 +0000
@@ -1 +1,16 @@
+Basic idea: DS-MRR scan should be done as follows:
+1. Sort incoming keys
+2. Use the sorted keys to do a disk-ordered retrieval
+
+Unresolved questions:
+
+* The "sort incoming keys" part is trivial when we have only equality ranges.
+ If we allow ranges of arbitrary form ( ncluding ranges with one endpoint
+ being infinity or ranges overlapping with one another), sorting becomes
+ non-trival. Do we need to support this case or support only equality ranges?
+
+* Can/should we use the fact rowid=={clustered PK value} for InnoDB's clustered
+ PKs?
+
+* Do we support scanning on a prefix of clustered PK? (seems to be yes?)
DESCRIPTION:
Currently, DS-MRR doesn't support operation over clustered primary keys. The
reason for this was that
- Clustered primary keys are stored in disk order and so, if the sequence of
ranges is ordered, the reads will already go in disk order (and so DS-MRR's
step of re-ordering reads is not necessary).
- Within DS-MRR implementation, the "get rowids from keys" step is not
necessary when using clustered primary key, because in InnoDB/XtraDB
clustered primary key values are the rowids.
However, with BKA making the MRR calls, there are cases where DS-MRR over
clustered primary key is beneficial:
* BKA may provide lookup keys that have duplicates and/or are in arbitrary
order. In that case, the DS-MRR implementation may sort the key values, so that
it hits the disk in key order.
* When running multi-table join with high @@join_cache_level value (and so,
linked join buffers), lack of MRR implementation causes the chain of linked
join buffers to break. (TODO and so what? Is that really a problem?)
* TODO anything else?
HIGH-LEVEL SPECIFICATION:
Basic idea: DS-MRR scan should be done as follows:
1. Sort incoming keys
2. Use the sorted keys to do a disk-ordered retrieval
Unresolved questions:
* The "sort incoming keys" part is trivial when we have only equality ranges.
If we allow ranges of arbitrary form ( including ranges with one endpoint
being infinity or ranges overlapping with one another), sorting becomes
non-trivial. Do we need to support this case or support only equality ranges?
* Can/should we use the fact rowid=={clustered PK value} for InnoDB's clustered
PKs? (current decision: No)
* Do we support scanning on a prefix of clustered PK? (Yes but scan on prefix
of clustered PK will not be in disk order. We need to run it with regular mode)
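To make the two steps above concrete, here is a minimal sketch in C,
assuming fixed-length, equality-only lookup keys. The key format and the
fetch callback are illustrative placeholders, not the real handler/MRR
interface:

#include <stdlib.h>
#include <string.h>

#define KEY_LEN 8

static int key_cmp(const void *a, const void *b)
{
  return memcmp(a, b, KEY_LEN);
}

void dsmrr_scan(unsigned char (*keys)[KEY_LEN], size_t n_keys,
                void (*fetch_row)(const unsigned char *key))
{
  size_t i;
  /* Step 1: sort the incoming keys; for a clustered PK, key order is
     also disk order. */
  qsort(keys, n_keys, KEY_LEN, key_cmp);
  /* Step 2: retrieve rows in sorted order, skipping any duplicate keys
     that BKA may have passed in. */
  for (i= 0; i < n_keys; i++)
  {
    if (i > 0 && memcmp(keys[i - 1], keys[i], KEY_LEN) == 0)
      continue;
    fetch_row(keys[i]);
  }
}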
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v3.5.9)

[Maria-developers] Updated (by Psergey): Add DS-MRR support for clustered primary keys (121)
by worklog-noreply@askmonty.org 12 Jun '10
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: Add DS-MRR support for clustered primary keys
CREATION DATE..: Sat, 12 Jun 2010, 08:23
SUPERVISOR.....: Igor
IMPLEMENTOR....:
COPIES TO......:
CATEGORY.......: Client-BackLog
TASK ID........: 121 (http://askmonty.org/worklog/?tid=121)
VERSION........: Benchmarks-3.0
STATUS.........: Un-Assigned
PRIORITY.......: 60
WORKED HOURS...: 0
ESTIMATE.......: 0 (hours remain)
ORIG. ESTIMATE.: 0
PROGRESS NOTES:
-=-=(Psergey - Sat, 12 Jun 2010, 08:39)=-=-
High-Level Specification modified.
--- /tmp/wklog.121.old.538 2010-06-12 08:39:46.000000000 +0000
+++ /tmp/wklog.121.new.538 2010-06-12 08:39:46.000000000 +0000
@@ -1 +1,16 @@
+Basic idea: DS-MRR scan should be done as follows:
+1. Sort incoming keys
+2. Use the sorted keys to do a disk-ordered retrieval
+
+Unresolved questions:
+
+* The "sort incoming keys" part is trivial when we have only equality ranges.
+ If we allow ranges of arbitrary form ( ncluding ranges with one endpoint
+ being infinity or ranges overlapping with one another), sorting becomes
+ non-trival. Do we need to support this case or support only equality ranges?
+
+* Can/should we use the fact rowid=={clustered PK value} for InnoDB's clustered
+ PKs?
+
+* Do we support scanning on a prefix of clustered PK? (seems to be yes?)
DESCRIPTION:
Currently, DS-MRR doesn't support operation over clustered primary keys. The
reason for this was that
- Clustered primary keys are stored in disk order and so, if the sequence of
ranges is ordered, the reads will already go in disk order (and so DS-MRR's
step of re-ordering reads is not necessary).
- Within DS-MRR implementation, the "get rowids from keys" step is not
necessary when using clustered primary key, because in InnoDB/XtraDB
clustered primary key values are the rowids.
However, with BKA making the MRR calls, there are cases where DS-MRR over
clustered primary key is beneficial:
* BKA may provide lookup keys that have duplicates and/or are in arbitrary
order. In that case, DS-MRR implementation may sort the key values and
order them, so that it hits the disk in key order.
* When running multi-table join with high @@join_cache_level value (and so,
linked join buffers), lack of MRR implementation causes the chain of linked
join buffers to break. (TODO and so what? Is that really a problem?)
* TODO anything else?
HIGH-LEVEL SPECIFICATION:
Basic idea: DS-MRR scan should be done as follows:
1. Sort incoming keys
2. Use the sorted keys to do a disk-ordered retrieval
Unresolved questions:
* The "sort incoming keys" part is trivial when we have only equality ranges.
If we allow ranges of arbitrary form ( ncluding ranges with one endpoint
being infinity or ranges overlapping with one another), sorting becomes
non-trival. Do we need to support this case or support only equality ranges?
* Can/should we use the fact rowid=={clustered PK value} for InnoDB's clustered
PKs?
* Do we support scanning on a prefix of clustered PK? (seems to be yes?)
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v3.5.9)

[Maria-developers] New (by Psergey): Add DS-MRR support for clustered primary keys (121)
by worklog-noreply@askmonty.org 12 Jun '10
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: Add DS-MRR support for clustered primary keys
CREATION DATE..: Sat, 12 Jun 2010, 08:23
SUPERVISOR.....: Igor
IMPLEMENTOR....:
COPIES TO......:
CATEGORY.......: Client-BackLog
TASK ID........: 121 (http://askmonty.org/worklog/?tid=121)
VERSION........: Benchmarks-3.0
STATUS.........: Un-Assigned
PRIORITY.......: 60
WORKED HOURS...: 0
ESTIMATE.......: 0 (hours remain)
ORIG. ESTIMATE.: 0
PROGRESS NOTES:
DESCRIPTION:
Currently, DS-MRR doesn't support operation over clustered primary keys. The
reason for this was that
- Clustered primary keys are stored in disk order and so, if the sequence of
ranges is ordered, the reads will already go in disk order (and so DS-MRR's
step of re-ordering reads is not necessary).
- Within DS-MRR implementation, the "get rowids from keys" step is not
necessary when using clustered primary key, because in InnoDB/XtraDB
clustered primary key values are the rowids.
However, with BKA making the MRR calls, there are cases where DS-MRR over
clustered primary key is beneficial:
* BKA may provide lookup keys that have duplicates and/or are in arbitrary
order. In that case, DS-MRR implementation may sort the key values and
order them, so that it hits the disk in key order.
* When running multi-table join with high @@join_cache_level value (and so,
linked join buffers), lack of MRR implementation causes the chain of linked
join buffers to break. (TODO and so what? Is that really a problem?)
* TODO anything else?
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v3.5.9)

Welcome,
We designed a custom Storage Engine (CLDB) for MySQL/MariaDB
which now passes all preliminary tests.
I wonder how we can solve the licensing problem...
We would not share our sources; we would only ship a shared library
(*.so)...
Which license would a customer who bought MariaDB with CLDB need in
order to stay legal...
If it is possible, can a corporation use MariaDB under the GPL and
just sell CLDB (the storage engine) under another license (which one)??
Thank You for Your quick reply...
--
___________________________________________________________
Mateusz Matan
IT Security R&D Department, C/C++ programmer
ComArch S.A., Al. Jana Pawła II 41d, 31-864 Kraków
tel: (+48 12) 684 8411
e-mail: Mateusz.Matan(a)comarch.pl

09 Jun '10
Hi,
we've talked about engine attributes in the CREATE TABLE,
and that one should be able to specify them per partition as well.
Now, thinking about it, I'm not quite sure what the semantics should be.
What is your use case ? How do you want them to work ?
I see different possibilities. Say, there is
create table ... (.....) XXX=1
partition by list (a)
(
partition p0 values in (1) YYY=2,
partition p1 values in (2)
);
1. We can say that XXX should be listed in the engine's
hton->table_options, and YYY - in the hton->partition_options.
This works fine because the engine can use table_share->option_struct
and it will contain correct values, independent of whether a table is
partitioned or not.
But it will break when partitioning starts to support different engines
in different partitions.
2. We can say that XXX is partition engine options, and a pluggable
engine can only see YYY. YYY should come from hton->partition_options,
and the engine's table level attributes are not applicable.
The drawback - every engine needs to have special code to take care of
the partitioned case, and to duplicate all table-level options in the
partition-level options.
3. Same as 2, but YYY can come from either table level or partition
level arrays. Every engine still needs the special code for the partitioned
case, but does not need to duplicate the table_options array.
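For illustration, option 3 from the engine's side could look roughly
like this; the struct and field names are hypothetical, not the real
plugin API:

/* Hypothetical engine-side lookup for option 3: prefer the value the
   user set at the partition level, fall back to the table level. */
struct engine_options
{
  unsigned long long yyy;   /* the YYY attribute                       */
  int yyy_is_set;           /* did the user specify YYY at this level? */
};

static unsigned long long
resolve_yyy(const struct engine_options *table_level,
            const struct engine_options *partition_level)
{
  if (partition_level && partition_level->yyy_is_set)
    return partition_level->yyy;    /* e.g. partition p0 ... YYY=2   */
  return table_level->yyy;          /* inherit the table-level value */
}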
Regards,
Sergei

[Maria-developers] Ever wanted to save/restore your GDB breakpoints? Here's a solution.
by Timour Katchaounov 09 Jun '10
Hi,
Today I got really annoyed while debugging a server crash with GDB.
I never managed to make the same GDB reattach to a new process after
a crash. This means that after each crash one has to restart GDB, and
reattach to the new process.
As a result all breakpoints are lost, and need to be set each time,
which is pretty annoying. Today I thought that others might have been
equally annoyed, and might have figured out a solution. Indeed, Google is
our friend, and I found the following short snippet that should be
added to .gdbinit:
define bsave
shell rm -f brestore.txt
set logging file brestore.txt
set logging on
info break
set logging off
# reformat on-the-fly to a valid gdb command file
shell perl -n -e 'print "break $1\n" if /^\d+.+?(\S+)$/g' brestore.txt > brestore.gdb
end
document bsave
store actual breakpoints
end
define brestore
source brestore.gdb
end
document brestore
restore breakpoints saved by bsave
end
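Usage is then just "bsave" at the gdb prompt before the crash, and
"brestore" after reattaching gdb to the new process, to get all the
breakpoints back.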
The credits go to this page:
http://stackoverflow.com/questions/501486/getting-gdb-to-save-a-list-of-bre…
Happy debugging,
Timour

[Maria-developers] bzr commit into MariaDB 5.1, with Maria 1.5:maria branch (knielsen:2850)
by knielsen@knielsen-hq.org 09 Jun '10
#At lp:maria
2850 knielsen(a)knielsen-hq.org 2010-06-09
MWL#116: Fix a couple of races in group commit.
modified:
include/atomic/gcc_builtins.h
include/atomic/x86-gcc.h
sql/handler.cc
=== modified file 'include/atomic/gcc_builtins.h'
--- a/include/atomic/gcc_builtins.h 2008-02-06 16:55:04 +0000
+++ b/include/atomic/gcc_builtins.h 2010-06-09 11:17:39 +0000
@@ -19,8 +19,9 @@
v= __sync_lock_test_and_set(a, v);
#define make_atomic_cas_body(S) \
int ## S sav; \
- sav= __sync_val_compare_and_swap(a, *cmp, set); \
- if (!(ret= (sav == *cmp))) *cmp= sav;
+ int ## S cmp_val= *cmp; \
+ sav= __sync_val_compare_and_swap(a, cmp_val, set);\
+ if (!(ret= (sav == cmp_val))) *cmp= sav
#ifdef MY_ATOMIC_MODE_DUMMY
#define make_atomic_load_body(S) ret= *a
=== modified file 'include/atomic/x86-gcc.h'
--- a/include/atomic/x86-gcc.h 2007-02-28 16:50:51 +0000
+++ b/include/atomic/x86-gcc.h 2010-06-09 11:17:39 +0000
@@ -38,15 +38,33 @@
#define asm __asm__
#endif
+/*
+ The atomic operations imply a memory barrier for the CPU, to ensure that all
+ prior writes are flushed from cache, and all subsequent reads reloaded into
+ cache.
+
+ We need to imply a similar memory barrier for the compiler, so that all
+ pending stores (to memory that may be aliased in other parts of the code)
+ will be flushed to memory before the operation, and all reads from such
+ memory be re-loaded. This is achieved by adding the "memory" pseudo-register
+ to the clobber list, see GCC documentation for more explanation.
+
+ The compiler and CPU memory barriers are needed to make sure changes in one
+ thread are made visible in another by the atomic operation.
+*/
#ifndef MY_ATOMIC_NO_XADD
#define make_atomic_add_body(S) \
- asm volatile (LOCK_prefix "; xadd %0, %1;" : "+r" (v) , "+m" (*a))
+ asm volatile (LOCK_prefix "; xadd %0, %1;" : "+r" (v) , "+m" (*a): : "memory")
#endif
#define make_atomic_fas_body(S) \
- asm volatile ("xchg %0, %1;" : "+r" (v) , "+m" (*a))
+ asm volatile ("xchg %0, %1;" : "+r" (v) , "+m" (*a) : : "memory")
#define make_atomic_cas_body(S) \
+ int ## S sav; \
asm volatile (LOCK_prefix "; cmpxchg %3, %0; setz %2;" \
- : "+m" (*a), "+a" (*cmp), "=q" (ret): "r" (set))
+ : "+m" (*a), "=a" (sav), "=q" (ret) \
+ : "r" (set), "a" (*cmp) : "memory"); \
+ if (!ret) \
+ *cmp= sav
#ifdef MY_ATOMIC_MODE_DUMMY
#define make_atomic_load_body(S) ret=*a
@@ -59,9 +77,9 @@
#define make_atomic_load_body(S) \
ret=0; \
asm volatile (LOCK_prefix "; cmpxchg %2, %0" \
- : "+m" (*a), "+a" (ret): "r" (ret))
+ : "+m" (*a), "+a" (ret) : "r" (ret) : "memory")
#define make_atomic_store_body(S) \
- asm volatile ("; xchg %0, %1;" : "+m" (*a), "+r" (v))
+ asm volatile ("; xchg %0, %1;" : "+m" (*a), "+r" (v) : : "memory")
#endif
/* TODO test on intel whether the below helps. on AMD it makes no difference */
=== modified file 'sql/handler.cc'
--- a/sql/handler.cc 2010-05-26 08:16:18 +0000
+++ b/sql/handler.cc 2010-06-09 11:17:39 +0000
@@ -1103,14 +1103,30 @@ ha_check_and_coalesce_trx_read_only(THD
static THD *
enqueue_atomic(THD *thd)
{
- my_atomic_rwlock_wrlock(&LOCK_group_commit_queue);
+ THD *orig_queue;
+
thd->next_commit_ordered= group_commit_queue;
+
+ my_atomic_rwlock_wrlock(&LOCK_group_commit_queue);
+ do
+ {
+ /*
+ Save the read value of group_commit_queue in each iteration of the loop.
+ When my_atomic_casptr() returns TRUE, we know that orig_queue is equal
+ to the value of group_commit_queue when we enqueued.
+
+ However, as soon as we enqueue, thd->next_commit_ordered may be
+ invalidated by another thread (the group commit leader). So we need to
+ save the old queue value in a local variable orig_queue like this.
+ */
+ orig_queue= thd->next_commit_ordered;
+ }
while (!my_atomic_casptr((void **)(&group_commit_queue),
(void **)(&thd->next_commit_ordered),
- thd))
- ;
+ thd));
my_atomic_rwlock_wrunlock(&LOCK_group_commit_queue);
- return thd->next_commit_ordered;
+
+ return orig_queue;
}
static THD *
@@ -1399,6 +1415,9 @@ int ha_commit_trans(THD *thd, bool all)
int cookie;
if (tc_log->use_group_log_xid)
{
+ // ToDo: if xid==NULL here, we may use is_group_commit_leader uninitialised.
+ // ToDo: Same for cookie below when xid==NULL.
+ // Seems we generally need to check the case xid==NULL.
if (is_group_commit_leader)
{
pthread_mutex_lock(&LOCK_group_commit);
@@ -1434,9 +1453,18 @@ int ha_commit_trans(THD *thd, bool all)
}
pthread_mutex_unlock(&LOCK_group_commit);
- /* Wake up everyone except ourself. */
- while ((queue= queue->next_commit_ordered) != NULL)
- group_commit_wakeup_other(queue);
+ /* Wake up everyone except ourself. */
+ THD *current= queue->next_commit_ordered;
+ while (current != NULL)
+ {
+ /*
+ Careful not to access current->next_commit_ordered after waking up
+ the other thread! As it may change immediately after wakeup.
+ */
+ THD *next= current->next_commit_ordered;
+ group_commit_wakeup_other(current);
+ current= next;
+ }
}
else
{
Hi,
I had a problem with my group commit patch, and tracked it down to a problem
in my_atomic.
The issue is that my_atomic_cas*(val, cmp, new) accesses *cmp after successful
CAS operation (in one place reading it, in another place writing it). Here is
the fix:
Index: work-5.1-groupcommit/include/atomic/gcc_builtins.h
===================================================================
--- work-5.1-groupcommit.orig/include/atomic/gcc_builtins.h 2010-06-09 11:53:59.000000000 +0200
+++ work-5.1-groupcommit/include/atomic/gcc_builtins.h 2010-06-09 11:54:06.000000000 +0200
@@ -19,8 +19,9 @@
v= __sync_lock_test_and_set(a, v);
#define make_atomic_cas_body(S) \
int ## S sav; \
- sav= __sync_val_compare_and_swap(a, *cmp, set); \
- if (!(ret= (sav == *cmp))) *cmp= sav;
+ int ## S cmp_val= *cmp; \
+ sav= __sync_val_compare_and_swap(a, cmp_val, set);\
+ if (!(ret= (sav == cmp_val))) *cmp= sav
#ifdef MY_ATOMIC_MODE_DUMMY
#define make_atomic_load_body(S) ret= *a
Index: work-5.1-groupcommit/include/atomic/x86-gcc.h
===================================================================
--- work-5.1-groupcommit.orig/include/atomic/x86-gcc.h 2010-06-09 11:53:59.000000000 +0200
+++ work-5.1-groupcommit/include/atomic/x86-gcc.h 2010-06-09 11:54:06.000000000 +0200
@@ -45,8 +45,12 @@
#define make_atomic_fas_body(S) \
asm volatile ("xchg %0, %1;" : "+r" (v) , "+m" (*a))
#define make_atomic_cas_body(S) \
+ int ## S sav; \
asm volatile (LOCK_prefix "; cmpxchg %3, %0; setz %2;" \
- : "+m" (*a), "+a" (*cmp), "=q" (ret): "r" (set))
+ : "+m" (*a), "=a" (sav), "=q" (ret) \
+ : "r" (set), "a" (*cmp)); \
+ if (!ret) \
+ *cmp= sav
#ifdef MY_ATOMIC_MODE_DUMMY
#define make_atomic_load_body(S) ret=*a
This makes the behaviour consistent with the other implementations, see for
example generic-msvc.h.
(It is also pretty important for correct operation. In my code, I use
my_atomic_casptr() to atomically enqueue a struct from one thread, and as soon
as it is enqueued another thread may grab the struct and change it, including
the field pointed to by cmp. So it is essential that my_atomic_casptr()
neither reads nor writes *cmp after successful CAS, as the above patch
ensures.)
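As a minimal standalone illustration of that enqueue pattern (using the
GCC builtin directly rather than my_atomic_casptr(), and with
illustrative names):

struct node { struct node *next; };

static struct node *queue_head;   /* shared, lock-free list head */

void enqueue(struct node *n)
{
  struct node *old_head;
  do
  {
    /* Re-read the head on every retry. Once the CAS succeeds the node
       is published, and another thread may immediately rewrite n->next,
       so neither n nor the compare value may be touched after that. */
    old_head= queue_head;
    n->next= old_head;
  } while (!__sync_bool_compare_and_swap(&queue_head, old_head, n));
}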
It was btw. a bit funny how I tracked this down. After digging for a few hours
in my own code without success I got the idea to check the CAS implementation,
and spotted the problem in gcc_builtins.h. After fixing I was baffled as to
why my code still failed, until I realised I was not using gcc_builtins.h but
x86-gcc.h, and found and fixed the similar problem there ;-)
Now the question is, where should I push this (if at all)? Any opinions?
-----------------------------------------------------------------------
While I was there, I also noticed another potential problem in gcc_builtins.h,
suggesting this patch:
Index: work-5.1-groupcommit/include/atomic/x86-gcc.h
===================================================================
--- work-5.1-groupcommit.orig/include/atomic/x86-gcc.h 2010-06-09 11:37:12.000000000 +0200
+++ work-5.1-groupcommit/include/atomic/x86-gcc.h 2010-06-09 11:52:47.000000000 +0200
@@ -38,17 +38,31 @@
#define asm __asm__
#endif
+/*
+ The atomic operations imply a memory barrier for the CPU, to ensure that all
+ prior writes are flushed from cache, and all subsequent reads reloaded into
+ cache.
+
+ We need to imply a similar memory barrier for the compiler, so that all
+ pending stores (to memory that may be aliased in other parts of the code)
+ will be flushed to memory before the operation, and all reads from such
+ memory be re-loaded. This is achieved by adding the "memory" pseudo-register
+ to the clobber list, see GCC documentation for more explanation.
+
+ The compiler and CPU memory barriers are needed to make sure changes in one
+ thread are made visible in another by the atomic operation.
+*/
#ifndef MY_ATOMIC_NO_XADD
#define make_atomic_add_body(S) \
- asm volatile (LOCK_prefix "; xadd %0, %1;" : "+r" (v) , "+m" (*a))
+ asm volatile (LOCK_prefix "; xadd %0, %1;" : "+r" (v) , "+m" (*a): : "memory")
#endif
#define make_atomic_fas_body(S) \
- asm volatile ("xchg %0, %1;" : "+r" (v) , "+m" (*a))
+ asm volatile ("xchg %0, %1;" : "+r" (v) , "+m" (*a) : : "memory")
#define make_atomic_cas_body(S) \
int ## S sav; \
asm volatile (LOCK_prefix "; cmpxchg %3, %0; setz %2;" \
: "+m" (*a), "=a" (sav), "=q" (ret) \
- : "r" (set), "a" (*cmp)); \
+ : "r" (set), "a" (*cmp) : "memory"); \
if (!ret) \
*cmp= sav
@@ -63,9 +77,9 @@
#define make_atomic_load_body(S) \
ret=0; \
asm volatile (LOCK_prefix "; cmpxchg %2, %0" \
- : "+m" (*a), "+a" (ret): "r" (ret))
+ : "+m" (*a), "+a" (ret) : "r" (ret) : "memory")
#define make_atomic_store_body(S) \
- asm volatile ("; xchg %0, %1;" : "+m" (*a), "+r" (v))
+ asm volatile ("; xchg %0, %1;" : "+m" (*a), "+r" (v) : : "memory")
#endif
/* TODO test on intel whether the below helps. on AMD it makes no difference */
The comment in the patch explains the idea I think. Basically, these memory
barrier operations need to be a compiler barrier also. Otherwise there is
nothing that prevents GCC from moving unrelated stores across the memory
barrier operation. This is a problem for example when filling in a structure
and then atomically linking it into a shared list. The CPU memory barrier
ensures that the fields in the structure will be visible to other threads on
the CPU level, but it is also necessary to tell GCC to keep the stores into
the struct fields prior to the store linking into the shared list.
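A tiny standalone example of that fill-then-publish pattern, simplified
to a plain compiler barrier (the actual patch gets the same effect from
the "memory" clobber on the atomic operation itself, and real code would
publish with an atomic CAS rather than a plain store):

struct entry
{
  int payload;
  struct entry *next;
};

static struct entry *shared_list;

void publish(struct entry *e, int value)
{
  e->payload= value;                /* fill in the fields first...    */
  e->next= shared_list;
  /* Compiler barrier: forbid GCC from sinking the stores above past
     the store that links the entry into the shared list. */
  asm volatile("" ::: "memory");
  shared_list= e;                   /* ...then make the entry visible */
}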
(It also makes the volatile declaration on the updated memory location
unnecessary, but maybe it is needed for other implementations (though it
shouldn't be, volatile is almost always wrong)).
- Kristian.

[Maria-developers] Patch to build xtradb and federatedx statically on Windows
by Bo Thorsen 09 Jun '10
Hi everyone,
Recently, we found out that xtradb and federatedx were compiled as
plugins on Windows. This is not correct because they are core storage
engines and should be linked statically instead.
On Unix, it's possible to build both types, but the CMake files don't
really handle this. They just look in plug.in, and if there is a dynamic
line, the storage engine is built dynamically. I'm hesitant to fix this
properly, as it's probably already done in MySQL 5.5, or at least it
will be. I see no reason to come up with a good solution for a temporary
problem - there are many other issues on Windows that are more pressing.
FederatedX couldn't compile as a statically linked plugin at all on
Windows. I have fixed this now, but I *really* don't like the way we do
these hacks on Windows. It's absurdly difficult to keep track of when a
variable says xtradb and when it says innobase. And debugging
CMakeLists.txt files is not too much fun. Anyway, I did the same thing
on FederatedX as is done in XtraDB, and it works now.
Finally, I removed a subdir that didn't have a CMakeLists.txt; this
removed a warning during the cmake run.
With this patch, the Windows zip file release build should work
correctly and produce a MariaDB with both XtraDB and FederatedX in it.
Should I check this into ~maria-captains/maria/5.1, or is there a branch
more suitable at the moment?
Bo Thorsen.
Monty Program AB.
--
MariaDB: MySQL replacement
Community developed. Feature enhanced. Backward compatible.

[Maria-developers] Rev 2795: Uninitialized memory problem fixed. in file:///home/bell/maria/bzr/work-maria-5.3-scache2/
by sanja@askmonty.org 08 Jun '10
At file:///home/bell/maria/bzr/work-maria-5.3-scache2/
------------------------------------------------------------
revno: 2795
revision-id: sanja(a)askmonty.org-20100608122454-dtlvue8n2s55yi67
parent: sanja(a)askmonty.org-20100608120847-v4loj2gdqjflbarv
committer: sanja(a)askmonty.org
branch nick: work-maria-5.3-scache2
timestamp: Tue 2010-06-08 15:24:54 +0300
message:
Uninitialized memory problem fixed.
=== modified file 'sql/table.cc'
--- a/sql/table.cc 2010-06-07 07:58:45 +0000
+++ b/sql/table.cc 2010-06-08 12:24:54 +0000
@@ -5176,6 +5176,7 @@
key_part_info->offset= (*reg_field)->offset(record[0]);
key_part_info->length= (uint16) (*reg_field)->pack_length();
keyinfo->key_length+= key_part_info->length;
+ key_part_info->key_part_flag= 0;
/* TODO:
The below method of computing the key format length of the
key part is a copy/paste from opt_range.cc, and table.cc.

[Maria-developers] Rev 2794: Fixed uninitialized memory in file:///home/bell/maria/bzr/work-maria-5.3-scache2/
by sanja@askmonty.org 08 Jun '10
At file:///home/bell/maria/bzr/work-maria-5.3-scache2/
------------------------------------------------------------
revno: 2794
revision-id: sanja(a)askmonty.org-20100608120847-v4loj2gdqjflbarv
parent: sanja(a)askmonty.org-20100608114156-36q3me04ecxh2by6
committer: sanja(a)askmonty.org
branch nick: work-maria-5.3-scache2
timestamp: Tue 2010-06-08 15:08:47 +0300
message:
Fixed uninitialized memory
=== modified file 'sql/sql_subquery_cache.cc'
--- a/sql/sql_subquery_cache.cc 2010-06-08 11:41:56 +0000
+++ b/sql/sql_subquery_cache.cc 2010-06-08 12:08:47 +0000
@@ -50,6 +50,7 @@
tab_ref->null_rejecting= 1;
tab_ref->disable_cache= FALSE;
tab_ref->has_record= 0;
+ tab_ref->use_count= 0;
KEY_PART_INFO *cur_key_part= tmp_key->key_part;
store_key **ref_key= tab_ref->key_copy;

[Maria-developers] Rev 2793: Fixed uninitialized memory. in file:///home/bell/maria/bzr/work-maria-5.3-scache2/
by sanja@askmonty.org 08 Jun '10
At file:///home/bell/maria/bzr/work-maria-5.3-scache2/
------------------------------------------------------------
revno: 2793
revision-id: sanja(a)askmonty.org-20100608114156-36q3me04ecxh2by6
parent: sanja(a)askmonty.org-20100608093610-efg156vu4mi4u9pp
committer: sanja(a)askmonty.org
branch nick: work-maria-5.3-scache2
timestamp: Tue 2010-06-08 14:41:56 +0300
message:
Fixed uninitialized memory.
=== modified file 'sql/mysql_priv.h'
--- a/sql/mysql_priv.h 2010-05-31 21:25:54 +0000
+++ b/sql/mysql_priv.h 2010-06-08 11:41:56 +0000
@@ -1274,7 +1274,7 @@
select_result *result, SELECT_LEX_UNIT *unit,
SELECT_LEX *select_lex);
-struct st_join_table *create_index_lookup_join_tab(TABLE *table);
+struct st_join_table *create_index_lookup_join_tab(TABLE *table, int key_no);
int join_read_key2(THD *thd, struct st_join_table *tab, TABLE *table,
struct st_table_ref *table_ref);
void free_underlaid_joins(THD *thd, SELECT_LEX *select);
=== modified file 'sql/sql_select.cc'
--- a/sql/sql_select.cc 2010-05-31 21:25:54 +0000
+++ b/sql/sql_select.cc 2010-06-08 11:41:56 +0000
@@ -7627,11 +7627,12 @@
Creates and fills JOIN_TAB for index look up in temporary table
@param table The table where to look up
+ @param key_no Number of key
@return JOIN_TAB object or NULL in case of error
*/
-JOIN_TAB *create_index_lookup_join_tab(TABLE *table)
+JOIN_TAB *create_index_lookup_join_tab(TABLE *table, int key_no)
{
JOIN_TAB *tab;
DBUG_ENTER("create_index_lookup_join_tab");
@@ -7640,13 +7641,12 @@
DBUG_RETURN(NULL);
tab->read_record.table= table;
tab->read_record.file=table->file;
- /*tab->read_record.unlock_row= rr_unlock_row;*/
tab->next_select=0;
tab->sorted= 1;
+ tab->ref.key= key_no;
table->status= STATUS_NO_RECORD;
tab->read_first_record= join_read_key;
- /*tab->read_record.unlock_row= join_read_key_unlock_row;*/
tab->read_record.read_record= join_no_more_records;
if (table->covering_keys.is_set(tab->ref.key) &&
!table->no_keyread)
=== modified file 'sql/sql_subquery_cache.cc'
--- a/sql/sql_subquery_cache.cc 2010-05-31 21:25:54 +0000
+++ b/sql/sql_subquery_cache.cc 2010-06-08 11:41:56 +0000
@@ -225,7 +225,7 @@
(uchar*)&field_counter) < 0) ||
createtmp_table_search_structures(table_thd, cache_table, li_items,
&tab_ref) ||
- !(tab= create_index_lookup_join_tab(cache_table)))
+ !(tab= create_index_lookup_join_tab(cache_table, 0)))
{
DBUG_PRINT("error", ("creating index failed"));
goto error;

[Maria-developers] Rev 2792: Fixed memory management problem (Item can't contain other Item due to way of Items destruction). in file:///home/bell/maria/bzr/work-maria-5.3-scache2/
by sanja@askmonty.org 08 Jun '10
At file:///home/bell/maria/bzr/work-maria-5.3-scache2/
------------------------------------------------------------
revno: 2792
revision-id: sanja(a)askmonty.org-20100608093610-efg156vu4mi4u9pp
parent: sanja(a)askmonty.org-20100608074734-1m60ib2tac7y9m33
committer: sanja(a)askmonty.org
branch nick: work-maria-5.3-scache2
timestamp: Tue 2010-06-08 12:36:10 +0300
message:
Fixed memory management problem (Item can't contain other Item due to way of Items destruction).
=== modified file 'sql/item_cmpfunc.cc'
--- a/sql/item_cmpfunc.cc 2010-05-31 21:25:54 +0000
+++ b/sql/item_cmpfunc.cc 2010-06-08 09:36:10 +0000
@@ -1737,13 +1737,16 @@
not_null_tables_cache|= args[1]->not_null_tables();
const_item_cache&= args[1]->const_item();
DBUG_ASSERT(scache == NULL);
+ DBUG_ASSERT(value_for_scache == NULL);
if (args[0]->cols() ==1 &&
thd->variables.optimizer_switch & OPTIMIZER_SWITCH_SUBQUERY_CACHE &&
!(sub->engine->uncacheable() & (UNCACHEABLE_RAND |
UNCACHEABLE_SIDEEFFECT)))
{
sub->depends_on.push_front((Item**)&cache);
- scache= new Subquery_cache_tmptable(thd, sub->depends_on, &result);
+ value_for_scache= new Item_bool_cache;
+ scache= new Subquery_cache_tmptable(thd, sub->depends_on,
+ value_for_scache);
}
fixed= 1;
return FALSE;
@@ -1851,8 +1854,8 @@
/* put result in the cache */
if (scache)
{
- result.set(tmp, null_value);
- scache->put_value(&result);
+ value_for_scache->set(tmp, null_value);
+ scache->put_value(value_for_scache);
}
DBUG_RETURN(tmp);
}
@@ -1876,6 +1879,7 @@
delete scache;
scache= 0;
}
+ value_for_scache= 0;
DBUG_VOID_RETURN;
}
=== modified file 'sql/item_cmpfunc.h'
--- a/sql/item_cmpfunc.h 2010-05-31 21:25:54 +0000
+++ b/sql/item_cmpfunc.h 2010-06-08 09:36:10 +0000
@@ -241,7 +241,7 @@
/* Subquery cache */
Subquery_cache *scache;
/* result representation for the subquery cache */
- Item_bool_cache result;
+ Item_bool_cache *value_for_scache;
bool save_cache;
/*
Stores the value of "NULL IN (SELECT ...)" for uncorrelated subqueries:
@@ -252,7 +252,8 @@
my_bool result_for_null_param;
public:
Item_in_optimizer(Item *a, Item_in_subselect *b):
- Item_bool_func(a, my_reinterpret_cast(Item *)(b)), cache(0), scache(NULL),
+ Item_bool_func(a, my_reinterpret_cast(Item *)(b)), cache(0),
+ scache(NULL), value_for_scache(NULL),
save_cache(0), result_for_null_param(UNKNOWN)
{}
bool fix_fields(THD *, Item **);
=== modified file 'sql/item_subselect.cc'
--- a/sql/item_subselect.cc 2010-05-31 21:25:54 +0000
+++ b/sql/item_subselect.cc 2010-06-08 09:36:10 +0000
@@ -34,10 +34,10 @@
Item_subselect::Item_subselect():
Item_result_field(), value_assigned(0), thd(0), substitution(0),
- engine(0), old_engine(0), scache(0), used_tables_cache(0),
- have_to_be_excluded(0), const_item_cache(1), inside_first_fix_fields(0),
- done_first_fix_fields(FALSE), eliminated(FALSE), engine_changed(0),
- changed(0), is_correlated(FALSE)
+ engine(0), old_engine(0), scache(0), value_for_scache(0),
+ used_tables_cache(0), have_to_be_excluded(0), const_item_cache(1),
+ inside_first_fix_fields(0), done_first_fix_fields(FALSE),
+ eliminated(FALSE), engine_changed(0), changed(0), is_correlated(FALSE)
{
with_subselect= 1;
reset();
@@ -121,6 +121,7 @@
delete scache;
scache= 0;
}
+ value_for_scache= 0;
reset();
value_assigned= 0;
DBUG_VOID_RETURN;
@@ -129,7 +130,7 @@
void Item_singlerow_subselect::cleanup()
{
DBUG_ENTER("Item_singlerow_subselect::cleanup");
- value= 0; row= 0;
+ value= 0; row= 0; null_value_item= 0;
Item_subselect::cleanup();
DBUG_VOID_RETURN;
}
@@ -572,7 +573,7 @@
Item_singlerow_subselect::Item_singlerow_subselect(st_select_lex *select_lex)
- :Item_subselect(), value(0)
+ :Item_subselect(), value(0), null_value_item(0)
{
DBUG_ENTER("Item_singlerow_subselect::Item_singlerow_subselect");
init(select_lex, new select_singlerow_subselect(this));
@@ -760,6 +761,7 @@
(uint)depends_on.elements,
(uint)test(thd->variables.optimizer_switch & OPTIMIZER_SWITCH_SUBQUERY_CACHE)));
engine->fix_length_and_dec(row= &value);
+ DBUG_ASSERT(scache == NULL);
if (depends_on.elements &&
optimizer_flag(thd, OPTIMIZER_SWITCH_SUBQUERY_CACHE) &&
!(engine->uncacheable() & (UNCACHEABLE_RAND |
@@ -839,6 +841,22 @@
DBUG_RETURN(NULL);
}
+
+/**
+ Puts NULL value as result in the cache
+*/
+
+void Item_singlerow_subselect::put_null_value_in_scache()
+{
+ if (!value_for_scache)
+ {
+ value_for_scache= new Item_bool_cache;
+ value_for_scache->set(0, TRUE); // NULL
+ }
+ DBUG_ASSERT(value_for_scache->null_value);
+ scache->put_value(value_for_scache);
+}
+
double Item_singlerow_subselect::val_real()
{
Item *cached_value;
@@ -870,7 +888,7 @@
reset();
DBUG_PRINT("info", ("error: %u", (uint)err));
if (scache && !err)
- scache->put_value(&const_null_value);
+ put_null_value_in_scache();
DBUG_RETURN(0);
}
}
@@ -906,7 +924,7 @@
reset();
DBUG_PRINT("info", ("error: %u", (uint)err));
if (scache && !err)
- scache->put_value(&const_null_value);
+ put_null_value_in_scache();
DBUG_RETURN(0);
}
}
@@ -942,7 +960,7 @@
reset();
DBUG_PRINT("info", ("error: %u", (uint)err));
if (scache && !err)
- scache->put_value(&const_null_value);
+ put_null_value_in_scache();
DBUG_RETURN(0);
}
}
@@ -979,7 +997,7 @@
reset();
DBUG_PRINT("info", ("error: %u", (uint)err));
if (scache && !err)
- scache->put_value(&const_null_value);
+ put_null_value_in_scache();
DBUG_RETURN(0);
}
}
@@ -1016,7 +1034,7 @@
reset();
DBUG_PRINT("info", ("error: %u", (uint)err));
if (scache && !err)
- scache->put_value(&const_null_value);
+ put_null_value_in_scache();
DBUG_RETURN(0);
}
}
@@ -1108,13 +1126,16 @@
max_columns= engine->cols();
/* We need only 1 row to determine existence */
unit->global_parameters->select_limit= new Item_int((int32) 1);
+
+ DBUG_ASSERT(scache == NULL);
+ DBUG_ASSERT(value_for_scache == NULL);
if (substype() == EXISTS_SUBS && depends_on.elements &&
optimizer_flag(thd, OPTIMIZER_SWITCH_SUBQUERY_CACHE) &&
!(engine->uncacheable() & (UNCACHEABLE_RAND |
UNCACHEABLE_SIDEEFFECT)))
{
- DBUG_ASSERT(scache == NULL);
- scache= new Subquery_cache_tmptable(thd, depends_on, &result);
+ value_for_scache= new Item_bool_cache;
+ scache= new Subquery_cache_tmptable(thd, depends_on, value_for_scache);
DBUG_PRINT("info", ("cache: 0x%lx", (ulong) scache));
}
DBUG_VOID_RETURN;
@@ -1141,8 +1162,8 @@
if (scache)
{
- result.set(value, FALSE);
- scache->put_value(&result);
+ value_for_scache->set(value, FALSE);
+ scache->put_value(value_for_scache);
}
DBUG_RETURN((double) value);
@@ -1170,8 +1191,8 @@
if (scache)
{
- result.set(value, FALSE);
- scache->put_value(&result);
+ value_for_scache->set(value, FALSE);
+ scache->put_value(value_for_scache);
}
DBUG_RETURN(value);
@@ -1213,8 +1234,8 @@
if (scache)
{
- result.set(value, FALSE);
- scache->put_value(&result);
+ value_for_scache->set(value, FALSE);
+ scache->put_value(value_for_scache);
}
str->set((ulonglong)value,&my_charset_bin);
@@ -1257,8 +1278,8 @@
if (scache)
{
- result.set(value, FALSE);
- scache->put_value(&result);
+ value_for_scache->set(value, FALSE);
+ scache->put_value(value_for_scache);
}
int2my_decimal(E_DEC_FATAL_ERROR, value, 0, decimal_value);
@@ -1287,8 +1308,8 @@
if (scache)
{
- result.set(value, FALSE);
- scache->put_value(&result);
+ value_for_scache->set(value, FALSE);
+ scache->put_value(value_for_scache);
}
DBUG_RETURN(value != 0);
=== modified file 'sql/item_subselect.h'
--- a/sql/item_subselect.h 2010-05-31 21:25:54 +0000
+++ b/sql/item_subselect.h 2010-06-08 09:36:10 +0000
@@ -60,8 +60,8 @@
subselect_engine *old_engine;
/* subquery cache */
Subquery_cache *scache;
- /* null consrtant for caching */
- Item_null const_null_value;
+ /* subquery cache value for NULL and TRUE/FALSE subqueries */
+ Item_bool_cache *value_for_scache;
/* cache of used external tables */
table_map used_tables_cache;
/* allowed number of columns (1 for single value subqueries) */
@@ -217,10 +217,15 @@
{
protected:
Item_cache *value, **row;
+ /* null value for subquery cache value */
+ Item_null *null_value_item;
+
+ void put_null_value_in_scache();
public:
Item_singlerow_subselect(st_select_lex *select_lex);
- Item_singlerow_subselect() :Item_subselect(), value(0), row (0) {}
+ Item_singlerow_subselect() :Item_subselect(), value(0), row (0),
+ null_value_item(0) {}
void cleanup();
subs_type substype() { return SINGLEROW_SUBS; }
@@ -284,8 +289,6 @@
{
protected:
bool value; /* value of this item (boolean: exists/not-exists) */
- /* result representation for the subquery cache */
- Item_bool_cache result;
public:
Item_exists_subselect(st_select_lex *select_lex);
Is the adutko-ultrasparc3 buildbot too slow? Would it be missed?

[Maria-developers] Rev 2791: bugfixes lost in moving between trees in file:///home/bell/maria/bzr/work-maria-5.3-scache2/
by sanja@askmonty.org 08 Jun '10
At file:///home/bell/maria/bzr/work-maria-5.3-scache2/
------------------------------------------------------------
revno: 2791
revision-id: sanja(a)askmonty.org-20100608074734-1m60ib2tac7y9m33
parent: sanja(a)askmonty.org-20100607075845-lo3tcaiuk54qqlw0
committer: sanja(a)askmonty.org
branch nick: work-maria-5.3-scache2
timestamp: Tue 2010-06-08 10:47:34 +0300
message:
bugfixes lost in moving between trees
=== modified file 'sql/item.cc'
--- a/sql/item.cc 2010-05-31 21:25:54 +0000
+++ b/sql/item.cc 2010-06-08 07:47:34 +0000
@@ -5152,6 +5152,10 @@
int Item_field::save_in_field(Field *to, bool no_conversions)
{
+ /* if it is external field */
+ if (unlikely(depended_from))
+ return save_field_in_field(field, &null_value, to, no_conversions);
+
return save_field_in_field(result_field, &null_value, to, no_conversions);
}
@@ -6359,7 +6363,7 @@
int Item_ref::save_in_field(Field *to, bool no_conversions)
{
int res;
- if (result_field)
+ if (result_field && !depended_from)
return save_field_in_field(result_field, &null_value, to, no_conversions);
res= (*ref)->save_in_field(to, no_conversions);
null_value= (*ref)->null_value;

Re: [Maria-developers] [Commits] Rev 2797: MWL#90, code movearound to unify merged and non-merged semi-join materialization processing in file:///home/psergey/dev/maria-5.3-subqueries-r12/
by Sergey Petrunya 07 Jun '10
Hi Monty,
On Fri, Jun 04, 2010 at 05:36:59PM +0300, Michael Widenius wrote:
> Note that this is not a full review, but just a quick scan of some of
> the things in the commit. (One suspicious thing found...)
Thanks for the feedback! Reply summary:
- This commit was primarily to get a buildbot run, hence the presence of loads
of commented-out old code and lack of real comments. I'll address this a bit
later.
- The suspicious thing confirmed and fixed.
- Style suggestions followed.
> >>>>> "Sergey" == Sergey Petrunya <psergey(a)askmonty.org> writes:
>
> <cut>
>
> Sergey> === modified file 'sql/item_cmpfunc.cc'
> Sergey> --- a/sql/item_cmpfunc.cc 2010-05-25 06:32:15 +0000
> Sergey> +++ b/sql/item_cmpfunc.cc 2010-06-04 13:40:57 +0000
> Sergey> @@ -5733,6 +5733,8 @@
> Sergey> It's a field from an materialized semi-join. We can substitute it only
> Sergey> for a field from the same semi-join.
> Sergey> */
> Sergey> +#if 0
> Sergey> + psergey3:remove:
>
> Please don't ever use #if 0; Instead use something like:
>
> #ifdef LEFT_FOR_TESTING_WILL_BE_REMOVED_BY_PSERGEY_SOON
>
> Even better to just remove the code (after all, we can always find it
> in bzr)
>
> <cut>
> Sergey> - if (item->field->table->reginfo.join_tab >= first)
> Sergey> + //if (item->field->table->reginfo.join_tab >= first)
>
> Same here; Don't leave the old code around
>
> <cut>
>
> Sergey> +bool join_tab_execution_startup(JOIN_TAB *tab)
> Sergey> {
> Sergey> + DBUG_ENTER("join_tab_execution_startup");
> Sergey> Item_in_subselect *in_subs;
>
> Please put DBUG_ENTER after all declarations.
> (So that we have same layout in C and C++)
>
> <cut>
>
> Sergey> +#if 0
>
> Replace with proper #if or remove code
> <cut>
>
> Sergey> +++ b/sql/sql_select.cc 2010-06-04 13:40:57 +0000
> Sergey> @@ -1008,15 +1006,26 @@
> Sergey> /*
> Sergey> Permorm the the optimization on fields evaluation mentioned above
> Sergey> for all on expressions.
> Sergey> - */
> Sergey> - for (JOIN_TAB *tab= join_tab + const_tables; tab < join_tab + tables ; tab++)
> Sergey> + */
> Sergey> +
> Sergey> {
> Sergey> - if (*tab->on_expr_ref)
> Sergey> + List_iterator<JOIN_TAB_RANGE> it(join_tab_ranges);
> Sergey> + JOIN_TAB_RANGE *jt_range;
> Sergey> + bool first= TRUE;
>
> Wouldn't it be better to set first to const_tables and then to 0 ?
>
> Sergey> + while ((jt_range= it++))
> Sergey> {
> Sergey> + for (JOIN_TAB *tab= jt_range->start + (first ? const_tables : 0);
>
> If you do the above, then you can just do 'jt_range->start + first' here
>
> Sergey> + tab < jt_range->end; tab++)
> Sergey> + {
> Sergey> + if (*tab->on_expr_ref)
> Sergey> + {
> Sergey> + *tab->on_expr_ref= substitute_for_best_equal_field(*tab->on_expr_ref,
> Sergey> + tab->cond_equal,
> Sergey> + map2table);
> Sergey> + (*tab->on_expr_ref)->update_used_tables();
> Sergey> + }
> Sergey> + }
> Sergey> + first= FALSE;
> Sergey> }
> Sergey> }
>
> A comment for the above outer loop would be nice.
> (It's not obvious why only the first element in join_tab_ranges has
> const tables)
>
> Sergey> @@ -1026,6 +1035,7 @@
> Sergey> {
> Sergey> conds=new Item_int((longlong) 0,1); // Always false
> Sergey> }
> Sergey> +
> Sergey> if (make_join_select(this, select, conds))
> Sergey> {
> Sergey> zero_result_cause=
> Sergey> @@ -1289,7 +1299,8 @@
> Sergey> if (need_tmp || select_distinct || group_list || order)
> Sergey> {
> Sergey> for (uint i = const_tables; i < tables; i++)
> Sergey> - join_tab[i].table->prepare_for_position();
> Sergey> + table[i]->prepare_for_position();
> Sergey> +
>
> Isn't table[] in other order than join_tab?
> (I thought that only join_tab has the const tables first)
Right. Will fix this.
> <cut>
>
> Sergey> +JOIN_TAB *first_linear_tab(JOIN *join, bool after_const_tables)
> Sergey> +{
> Sergey> + JOIN_TAB *first= join->join_tab;
> Sergey> + if (after_const_tables)
> Sergey> + first += join->const_tables;
>
> remove space before '+'
>
> Sergey> + if (first < join->join_tab + join->top_jtrange_tables)
> Sergey> + return first;
> Sergey> + else
> Sergey> + return NULL;
> Sergey> +}
>
> Better to do:
>
> return (first < join->join_tab + join->top_jtrange_tables) ? first : 0;
>
> Or:
>
> if (first < join->join_tab + join->top_jtrange_tables)
> return first;
> return NULL;
>
Changed.
> <cut>
>
> Sergey> +JOIN_TAB *next_linear_tab(JOIN* join, JOIN_TAB* tab, bool include_bush_roots) //psergey2: added
>
> Remove comments; It's trival to see that the function was added :)
>
> Sergey> +{
> Sergey> + if (include_bush_roots && tab->bush_children)
> Sergey> + return tab->bush_children->start;
> Sergey> +
> Sergey> + if (tab->last_leaf_in_bush)
> Sergey> + tab= tab->bush_root_tab;
> Sergey> +
> Sergey> + if (tab->bush_root_tab)
> Sergey> + return ++tab;
>
> Add an assert before if (tab->last_leaf_in_bush):
>
> DBUG_ASSERT(!tab->last_leaf_in_bush || tab->bush_root_tab);
> Just to declare that the above code is safe!
Done.
> Sergey> +
> Sergey> + if (++tab == join->join_tab + join->top_jtrange_tables /*join->join_tab_ranges.head()->end*/)
>
> Move the comment to previous row and make it more clear what you are testing
> (The current comment doesn't tell me much)
>
> Sergey> + return NULL;
> Sergey> +
> Sergey> + if (!include_bush_roots && tab->bush_children)
> Sergey> + {
> Sergey> + tab= tab->bush_children->start;
> Sergey> + }
> Sergey> + return tab;
>
> Why not do:
>
> return ((!include_bush_roots && tab->bush_children) ?
> tab->bush_children->start : tab);
>
> Sergey> + if ((start? tab: ++tab) == join->join_tab_ranges.head()->end)
> Sergey> + return NULL; /* End */
>
> I think the above code would be more clear if you would do:
>
> if (start)
> tab++;
> if (...)
>
> This makes it clear that the ++tab is not just for the test but also
> for future usage of tab.
Changed.
> Sergey> +
> Sergey> + if (tab->bush_children)
> Sergey> + return tab->bush_children->start;
> Sergey> +
> Sergey> + return tab;
>
> Could be combined with ?
>
> Sergey> +}
> Sergey> +
> Sergey> +
> Sergey> +static Item *null_ptr= NULL;
>
> Can we make this const, so that if anyone tried to change this memory
> location we would get an exception?
Done
> <cut>
>
> Sergey> DBUG_RETURN(TRUE); /* purecov: inspected */
>
> Sergey> join_tab= parent->join_tab_reexec;
> Sergey> + //psergey2: hopefully this is ok:
> Sergey> + // join_tab_ranges.head()->start= join_tab;
> Sergey> + // join_tab_ranges.head()->end= join_tab + 1;
>
> Better to know than to hope :)
It wasn't ok actually, already changed.
>
> (sorry, don't have time to look at the rest now)
BR
Sergey
--
Sergey Petrunia, Software Developer
Monty Program AB, http://askmonty.org
Blog: http://s.petrunia.net/blog

[Maria-developers] New (by Knielsen test): Replication API for stacked event generators (120)
by worklog-noreply@askmonty.org 07 Jun '10
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: Replication API for stacked event generators
CREATION DATE..: Mon, 07 Jun 2010, 13:13
SUPERVISOR.....: Knielsen
IMPLEMENTOR....:
COPIES TO......:
CATEGORY.......: Server-RawIdeaBin
TASK ID........: 120 (http://askmonty.org/worklog/?tid=120)
VERSION........: Server-9.x
STATUS.........: Un-Assigned
PRIORITY.......: 60
WORKED HOURS...: 0
ESTIMATE.......: 0 (hours remain)
ORIG. ESTIMATE.: 0
PROGRESS NOTES:
DESCRIPTION:
A part of the replication project, MWL#107.
Events are produced by event Generators. Examples are
- Generation of statement-based replication events
- Generation of row-based events
- Generation of PBXT engine-level replication events
and perhaps reading events from the relay log on a slave is also an example of
generating events.
Event generators can be stacked, and a generator may defer event generation to
the next one down the stack. For example, the row-level replication event
generator may defer DDL to the statement-level replication event generator.
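As a purely illustrative sketch (none of these names are a committed
API), such a stack might be walked like this:

struct event_generator
{
  /* Return nonzero if an event was generated at this level, zero to
     defer to the next generator down the stack. */
  int (*generate)(struct event_generator *self, void *change_ctx);
  struct event_generator *next;     /* next generator down the stack */
};

static int generate_event(struct event_generator *gen, void *change_ctx)
{
  for (; gen; gen= gen->next)
  {
    /* E.g. the row-based generator returns 0 for DDL, deferring it to
       the statement-based generator below. */
    if (gen->generate(gen, change_ctx))
      return 1;
  }
  return 0;   /* no generator produced an event */
}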
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v3.5.9)

[Maria-developers] Updated (by Knielsen): New replication APIs (107)
by worklog-noreply@askmonty.org 07 Jun '10
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: New replication APIs
CREATION DATE..: Mon, 15 Mar 2010, 13:55
SUPERVISOR.....: Knielsen
IMPLEMENTOR....: Knielsen
COPIES TO......: Sergei
CATEGORY.......: Server-Sprint
TASK ID........: 107 (http://askmonty.org/worklog/?tid=107)
VERSION........: Server-9.x
STATUS.........: Un-Assigned
PRIORITY.......: 60
WORKED HOURS...: 50
ESTIMATE.......: 0 (hours remain)
ORIG. ESTIMATE.: 0
PROGRESS NOTES:
-=-=(Knielsen - Mon, 07 Jun 2010, 12:11)=-=-
High Level Description modified.
--- /tmp/wklog.107.old.31097 2010-06-07 12:11:57.000000000 +0000
+++ /tmp/wklog.107.new.31097 2010-06-07 12:11:57.000000000 +0000
@@ -7,3 +7,6 @@
https://lists.launchpad.net/maria-developers/msg01998.html
+Wiki page for the project:
+
+ http://askmonty.org/wiki/ReplicationProject
-=-=(Knielsen - Mon, 29 Mar 2010, 07:33)=-=-
Research and design discussions: Galera, 2pc/XA, group commit, multi-engine transactions.
Worked 14 hours and estimate 0 hours remain (original estimate increased by 14 hours).
-=-=(Knielsen - Wed, 24 Mar 2010, 10:39)=-=-
Design discussions
Worked 11 hours and estimate 0 hours remain (original estimate increased by 11 hours).
-=-=(Knielsen - Mon, 15 Mar 2010, 14:28)=-=-
Research into the problem, and discussions on phone/mailing list
Worked 25 hours and estimate 0 hours remain (original estimate increased by 25 hours).
-=-=(Guest - Mon, 15 Mar 2010, 14:18)=-=-
High-Level Specification modified.
--- /tmp/wklog.107.old.9086 2010-03-15 14:18:18.000000000 +0000
+++ /tmp/wklog.107.new.9086 2010-03-15 14:18:18.000000000 +0000
@@ -1 +1,43 @@
+Current ideas/status after discussions on the mailing list:
+
+ - Implement a set of plugin APIs and use them to move all of the existing
+ MySQL replication into a (set of) plugins.
+
+ - Design the APIs so that they can support full MySQL replication, but also
+ so that they do not hardcode assumptions about how this replication
+ implementation is done, and so that they will be suitable for other types of
+ replication (Tungsten, Galera, parallel replication, ...).
+
+ - APIs need to include the concept of a global transaction ID. Need to
+ determine the extent to which the semantics of such ID will be defined
+ by the API, and to what extent it will be defined by the plugin
+ implementations.
+
+ - APIs should properly support reliable crash-recovery with decent
+ performance (eg. not require multiple mandatory fsync()s per commit, and
+ not make group commit impossible).
+
+ - Would be nice if the API provided facilities for implementing good
+ consistency checking support (mainly checking master tables against slave
+ tables is hard here I think, but also applying wrong binlog data and
+ individual event checksums).
+
+
+Steps to make this more concrete:
+
+ - Investigate the current MySQL replication, and list all of the places where
+ a plugin implementation will need to connect/hook into the MySQL server.
+ * handler::{write,update,delete}_row()
+ * Statement execution
+ * Transaction start/commit
+ * Table open
+ * Query safe/not safe for statement-based replication
+ * Statement-based logging details (user variables, random seed, etc.)
+ * ...
+
+ - Use this list to make an initial sketch of the set of APIs we need.
+
+ - Use the list to determine the feasibility of this project and the level of
+ detail in the API needed to support a full replication implementation as a
+ plugin.
-=-=(Serg - Mon, 15 Mar 2010, 14:13)=-=-
Observers changed: Sergei
DESCRIPTION:
This is a top-level task for the project of designing a new set of replication
APIs for MariaDB.
This task is for the initial discussion of what to do and where to focus.
The project is started in this email thread:
https://lists.launchpad.net/maria-developers/msg01998.html
Wiki page for the project:
http://askmonty.org/wiki/ReplicationProject
HIGH-LEVEL SPECIFICATION:
Current ideas/status after discussions on the mailing list:
- Implement a set of plugin APIs and use them to move all of the existing
MySQL replication into a (set of) plugins.
- Design the APIs so that they can support full MySQL replication, but also
so that they do not hardcode assumptions about how this replication
implementation is done, and so that they will be suitable for other types of
replication (Tungsten, Galera, parallel replication, ...).
- APIs need to include the concept of a global transaction ID. Need to
determine the extent to which the semantics of such ID will be defined
by the API, and to what extent it will be defined by the plugin
implementations.
- APIs should properly support reliable crash-recovery with decent
performance (eg. not require multiple mandatory fsync()s per commit, and
not make group commit impossible).
- Would be nice if the API provided facilities for implementing good
consistency checking support (mainly checking master tables against slave
tables is hard here I think, but also applying wrong binlog data and
individual event checksums).
Steps to make this more concrete:
- Investigate the current MySQL replication, and list all of the places where
a plugin implementation will need to connect/hook into the MySQL server.
* handler::{write,update,delete}_row()
* Statement execution
* Transaction start/commit
* Table open
* Query safe/not safe for statement-based replication
* Statement-based logging details (user variables, random seed, etc.)
* ...
- Use this list to make an initial sketch of the set of APIs we need.
- Use the list to determine the feasibility of this project and the level of
detail in the API needed to support a full replication implementation as a
plugin.
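Purely as an illustration of what such an initial sketch might look like
(all names below are invented, this is not an existing API), the hook list
above could map to a callback table roughly like:

    #include <stddef.h>                 /* size_t */

    class THD;                          /* server thread context */
    struct TABLE;                       /* opened table descriptor */

    /* Hypothetical hook table: one entry per place where a replication
       plugin would connect into the server. */
    struct replication_hooks
    {
      int (*row_write)(THD *thd, TABLE *table, const unsigned char *record);
      int (*row_update)(THD *thd, TABLE *table,
                        const unsigned char *before_rec,
                        const unsigned char *after_rec);
      int (*row_delete)(THD *thd, TABLE *table, const unsigned char *record);
      int (*statement)(THD *thd, const char *query, size_t query_len);
      int (*trans_start)(THD *thd);
      int (*trans_commit)(THD *thd);
      int (*table_open)(THD *thd, TABLE *table);
    };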
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v3.5.9)

[Maria-developers] Rev 2790: bugfixes in file:///home/bell/maria/bzr/work-maria-5.3-scache2/
by sanja@askmonty.org 07 Jun '10
At file:///home/bell/maria/bzr/work-maria-5.3-scache2/
------------------------------------------------------------
revno: 2790
revision-id: sanja(a)askmonty.org-20100607075845-lo3tcaiuk54qqlw0
parent: sanja(a)askmonty.org-20100531212554-oal32d5v360l6cul
committer: sanja(a)askmonty.org
branch nick: work-maria-5.3-scache2
timestamp: Mon 2010-06-07 10:58:45 +0300
message:
bugfixes
=== modified file 'mysql-test/r/subquery_cache.result'
--- a/mysql-test/r/subquery_cache.result 2010-05-31 21:25:54 +0000
+++ b/mysql-test/r/subquery_cache.result 2010-06-07 07:58:45 +0000
@@ -588,4 +588,28 @@
Subquery_cache_hit 0
Subquery_cache_miss 4
drop table t1;
+#test of sql_big_tables switch and outer table reference in subquery with grouping
+set option sql_big_tables=1;
+CREATE TABLE t1 (a INT PRIMARY KEY, b INT);
+INSERT INTO t1 VALUES (1,1),(2,1),(3,2),(4,2),(5,3),(6,3);
+SELECT (SELECT t1_outer.a FROM t1 AS t1_inner GROUP BY b LIMIT 1) FROM t1 AS t1_outer;
+(SELECT t1_outer.a FROM t1 AS t1_inner GROUP BY b LIMIT 1)
+1
+2
+3
+4
+5
+6
+drop table t1;
+set option sql_big_tables=0;
+#test of function reference to outer query
+set local group_concat_max_len=400;
+create table t2 (a int, b int);
+insert into t2 values (1,1), (2,2);
+select b x, (select group_concat(x) from t2) from t2;
+x (select group_concat(x) from t2)
+1 1,1
+2 2,2
+drop table t2;
+set local group_concat_max_len=default;
set optimizer_switch='subquery_cache=default';
=== modified file 'mysql-test/t/subquery_cache.test'
--- a/mysql-test/t/subquery_cache.test 2010-05-31 21:25:54 +0000
+++ b/mysql-test/t/subquery_cache.test 2010-06-07 07:58:45 +0000
@@ -201,4 +201,20 @@
show status like "subquery_cache%";
drop table t1;
+--echo #test of sql_big_tables switch and outer table reference in subquery with grouping
+set option sql_big_tables=1;
+CREATE TABLE t1 (a INT PRIMARY KEY, b INT);
+INSERT INTO t1 VALUES (1,1),(2,1),(3,2),(4,2),(5,3),(6,3);
+SELECT (SELECT t1_outer.a FROM t1 AS t1_inner GROUP BY b LIMIT 1) FROM t1 AS t1_outer;
+drop table t1;
+set option sql_big_tables=0;
+
+--echo #test of function reference to outer query
+set local group_concat_max_len=400;
+create table t2 (a int, b int);
+insert into t2 values (1,1), (2,2);
+select b x, (select group_concat(x) from t2) from t2;
+drop table t2;
+set local group_concat_max_len=default;
+
set optimizer_switch='subquery_cache=default';
=== modified file 'sql/table.cc'
--- a/sql/table.cc 2010-05-31 21:25:54 +0000
+++ b/sql/table.cc 2010-06-07 07:58:45 +0000
@@ -5187,10 +5187,16 @@
key_part_info->store_length= key_part_info->length;
if ((*reg_field)->real_maybe_null())
+ {
key_part_info->store_length+= HA_KEY_NULL_LENGTH;
+ keyinfo->key_length+= HA_KEY_NULL_LENGTH;
+ }
if ((*reg_field)->type() == MYSQL_TYPE_BLOB ||
(*reg_field)->real_type() == MYSQL_TYPE_VARCHAR)
+ {
key_part_info->store_length+= HA_KEY_BLOB_LENGTH;
+ keyinfo->key_length+= HA_KEY_BLOB_LENGTH; // ???
+ }
key_part_info->type= (uint8) (*reg_field)->key_type();
key_part_info->key_type =

[Maria-developers] Progress (by Knielsen): Efficient group commit for binary log (116)
by worklog-noreply@askmonty.org 07 Jun '10
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: Efficient group commit for binary log
CREATION DATE..: Mon, 26 Apr 2010, 13:28
SUPERVISOR.....: Knielsen
IMPLEMENTOR....: Knielsen
COPIES TO......: Serg
CATEGORY.......: Server-RawIdeaBin
TASK ID........: 116 (http://askmonty.org/worklog/?tid=116)
VERSION........: Server-9.x
STATUS.........: Assigned
PRIORITY.......: 60
WORKED HOURS...: 74
ESTIMATE.......: 0 (hours remain)
ORIG. ESTIMATE.: 0
PROGRESS NOTES:
-=-=(Knielsen - Mon, 07 Jun 2010, 07:16)=-=-
Missing hours in previous progress report.
Worked 3 hours and estimate 0 hours remain (original estimate increased by 3 hours).
-=-=(Knielsen - Mon, 07 Jun 2010, 07:15)=-=-
Some more benchmarking.
Blog about results and architecture.
Fix two bugs in the proof-of-concept patch, races that cause hangs in benchmarks.
Worked 11 hours and estimate 0 hours remain (original estimate increased by 11 hours).
-=-=(Guest - Tue, 01 Jun 2010, 14:20)=-=-
Status updated.
--- /tmp/wklog.116.old.32652 2010-06-01 14:20:15.000000000 +0000
+++ /tmp/wklog.116.new.32652 2010-06-01 14:20:15.000000000 +0000
@@ -1 +1 @@
-Un-Assigned
+Assigned
-=-=(Knielsen - Mon, 31 May 2010, 06:48)=-=-
Finish first architecture draft (changed my mind a number of times before I was satisfied).
Write up architecture in worklog.
Fix remaining test failures in proof-of-concept patch + implement xtradb part.
Run some benchmarks on proof-of-concept implementation.
Worked 11 hours and estimate 0 hours remain (original estimate increased by 11 hours).
-=-=(Knielsen - Tue, 25 May 2010, 13:19)=-=-
Low Level Design modified.
--- /tmp/wklog.116.old.14255 2010-05-25 13:19:00.000000000 +0000
+++ /tmp/wklog.116.new.14255 2010-05-25 13:19:00.000000000 +0000
@@ -1 +1,363 @@
+1. Changes for ha_commit_trans()
+
+The guts of the code for commit are in the function ha_commit_trans() (and in
+commit_one_phase() which is called from it). This must be extended to use the
+new prepare_ordered(), group_log_xid(), and commit_ordered() calls.
+
+1.1 Atomic queue of committing transactions
+
+To keep the right commit order among participants, we put transactions into a
+queue. The operations on the queue are non-locking:
+
+ - Insert THD at the head of the queue, and return old queue.
+
+ THD *enqueue_atomic(THD *thd)
+
+ - Fetch (and delete) the whole queue.
+
+ THD *atomic_grab_reverse_queue()
+
+These are simple to implement with atomic compare-and-set. Note that there is
+no ABA problem [2], as we do not delete individual elements from the queue, we
+grab the whole queue and replace it with NULL.
+
+A transaction enters the queue when it does prepare_ordered(). This way, the
+scheduling order for prepare_ordered() calls is what determines the sequence
+in the queue and effectively the commit order.
+
+The queue is grabbed by the code doing group_log_xid() and commit_ordered()
+calls. The queue is passed directly to group_log_xid(), and afterwards
+iterated to do individual commit_ordered() calls.
+
+Using a lock-free queue allows prepare_ordered() (for one transaction) to run
+in parallel with commit_ordered (in another transaction), increasing potential
+parallelism.
+
+The queue is simply a linked list of THD objects, linked through a
+THD::next_commit_ordered field. Since we add at the head of the queue, the
+list is actually in reverse order, so must be reversed when we grab and delete
+it.
+
+The reason that enqueue_atomic() returns the old queue is so that we can check
+if an insert goes to the head of the queue. The thread at the head of the
+queue will do the sequential part of group commit for everyone.
+
+
+1.2 Locks
+
+1.2.1 Global LOCK_prepare_ordered
+
+This lock is taken to serialise calls to prepare_ordered(). Note that
+effectively, the commit order is decided by the order in which threads obtain
+this lock.
+
+
+1.2.2 Global LOCK_group_commit and COND_group_commit
+
+This lock is used to protect the serial part of group commit. It is taken
+around the code where we grab the queue, call group_log_xid() on the queue,
+and call commit_ordered() on each element of the queue, to make sure they
+happen serialised and in consistent order. It also protects the variable
+group_commit_queue_busy, which is used when not using group_log_xid() to delay
+running over a new queue until the first queue is completely done.
+
+
+1.2.3 Global LOCK_commit_ordered
+
+This lock is taken around calls to commit_ordered(), to ensure they happen
+serialised.
+
+
+1.2.4 Per-thread thd->LOCK_commit_ordered and thd->COND_commit_ordered
+
+This lock protects the thd->group_commit_ready variable, as well as the
+condition variable used to wake up threads after log_xid() and
+commit_ordered() finish.
+
+
+1.2.5 Global LOCK_group_commit_queue
+
+This is only used on platforms with no native compare-and-set operations, to
+make the queue operations atomic.
+
+
+1.3 Commit algorithm.
+
+This is the basic algorithm, simplified by
+
+ - omitting some error handling
+
+ - omitting looping over all handlers when invoking handler methods
+
+ - omitting some possible optimisations when not all calls needed (see next
+ section).
+
+ - Omitting the case where no group_log_xid() is used, see below.
+
+---- BEGIN ALGORITHM ----
+ ht->prepare()
+
+ // Call prepare_ordered() and enqueue in correct commit order
+ lock(LOCK_prepare_ordered)
+ ht->prepare_ordered()
+ old_queue= enqueue_atomic(thd)
+ thd->group_commit_ready= FALSE
+ is_group_commit_leader= (old_queue == NULL)
+ unlock(LOCK_prepare_ordered)
+
+ if (is_group_commit_leader)
+
+ // The first in queue handles group commit for everyone
+
+ lock(LOCK_group_commit)
+ // Wait while queue is busy, see below for when this occurs
+ while (group_commit_queue_busy)
+ cond_wait(COND_group_commit)
+
+ // Grab and reverse the queue to get correct order of transactions
+ queue= atomic_grab_reverse_queue()
+
+ // This call will set individual error codes in thd->xid_error
+ // It also sets the cookie for unlog() in thd->xid_cookie
+ group_log_xid(queue)
+
+ lock(LOCK_commit_ordered)
+ for (other IN queue)
+ if (!other->xid_error)
+ ht->commit_ordered()
+ unlock(LOCK_commit_ordered)
+
+ unlock(LOCK_group_commit)
+
+ // Now we are done, so wake up all the others.
+ for (other IN TAIL(queue))
+ lock(other->LOCK_commit_ordered)
+ other->group_commit_ready= TRUE
+ cond_signal(other->COND_commit_ordered)
+ unlock(other->LOCK_commit_ordered)
+ else
+ // If not the leader, just wait until leader did the work for us.
+ lock(thd->LOCK_commit_ordered)
+ while (!thd->group_commit_ready)
+ cond_wait(thd->LOCK_commit_ordered, thd->COND_commit_ordered)
+ unlock(thd->LOCK_commit_ordered)
+
+ // Finally do any error reporting now that we're back in own thread.
+ if (thd->xid_error)
+ xid_delayed_error(thd)
+ else
+ ht->commit(thd)
+ unlog(thd->xid_cookie, thd->xid)
+---- END ALGORITHM ----
+
+If the transaction coordinator does not support group_log_xid(), we have to do
+things differently. In this case after the serialisation point at
+prepare_ordered(), we have to parallelise again when running log_xid()
+(otherwise we would lose group commit). But then when log_xid() is done, we
+have to serialise again to check for any error and call commit_ordered() in
+correct sequence for any transaction where log_xid() did not return error.
+
+The central part of the algorithm in this case (when using log_xid()) is:
+
+---- BEGIN ALGORITHM ----
+ cookie= log_xid(thd)
+ error= (cookie == 0)
+
+ if (is_group_commit_leader)
+
+ // The first to enqueue grabs the queue and runs first.
+ // But we must wait until a previous queue run is fully done.
+
+ lock(LOCK_group_commit)
+ while (group_commit_queue_busy)
+ cond_wait(COND_group_commit)
+ queue= atomic_grab_reverse_queue()
+ // The queue will be busy until last thread in it is done.
+ group_commit_queue_busy= TRUE
+ unlock(LOCK_group_commit)
+ else
+ // Not first in queue -> wait for previous one to wake us up.
+ lock(thd->LOCK_commit_ordered)
+ while (!thd->group_commit_ready)
+ cond_wait(thd->LOCK_commit_ordered, thd->COND_commit_ordered)
+ unlock(thd->LOCK_commit_ordered)
+
+ if (!error) // Only if log_xid() was successful
+ lock(LOCK_commit_ordered)
+ ht->commit_ordered()
+ unlock(LOCK_commit_ordered)
+
+ // Wake up the next thread, and release queue in last.
+ next= thd->next_commit_ordered
+
+ if (next)
+ lock(next->LOCK_commit_ordered)
+ next->group_commit_ready= TRUE
+ cond_signal(next->COND_commit_ordered)
+ unlock(next->LOCK_commit_ordered)
+ else
+ lock(LOCK_group_commit)
+ group_commit_queue_busy= FALSE
+ unlock(LOCK_group_commit)
+---- END ALGORITHM ----
+
+There are a number of locks taken in the algorithm, but in the group_log_xid()
+case most of them should be uncontended most of the time. The
+LOCK_group_commit of course will be contended, as new threads queue up waiting
+for the previous group commit (and binlog fsync()) to finish so they can do
+the next group commit. This is the whole point of implementing group commit.
+
+The LOCK_prepare_ordered and LOCK_commit_ordered mutexes should be not much
+contended as long as handlers follow the intention of having the corresponding
+handler calls execute quickly.
+
+The per-thread LOCK_commit_ordered mutexes should not be contended; they are
+only used to wake up a sleeping thread.
+
+
+1.4 Optimisations when not using all three new calls
+
+
+The prepare_ordered(), group_log_xid(), and commit_ordered() methods are
+optional, and if not implemented by a particular handler/transaction
+coordinator, we can optimise the algorithm to take advantage of not having to
+keep ordering for the missing parts.
+
+If there is no prepare_ordered(), then we need not take the
+LOCK_prepare_ordered mutex.
+
+If there is no commit_ordered(), then we need not take the LOCK_commit_ordered
+mutex.
+
+If there is no group_log_xid(), then we only need the queue to ensure same
+ordering of transactions for commit_ordered() as for prepare_ordered(). Thus,
+if either of these (or both) are also not present, we do not need to use the
+queue at all.
+
+
+2. Binlog code changes (log.cc)
+
+
+The bulk of the work needed for the binary log is to extend the code to allow
+group commit to the log. Unlike InnoDB/XtraDB, there is no existing support
+inside the binlog code for group commit.
+
+The existing code runs most of the write + fsync to the binary log under the
+global LOCK_log mutex, preventing any group commit.
+
+To enable group commit, this code must be split into two parts:
+
+ - one part that runs per transaction, re-writing the embedded event positions
+ for the correct offset, and writing this into the in-memory log cache.
+
+ - another part that writes a set of transactions to the disk, and runs
+ fsync().
+
+Then in group_log_xid(), we can run the first part in a loop over all the
+transactions in the passed-in queue, and run the second part only once.
+
+The binlog code also has other code paths that write into the binlog,
+eg. non-transactional statements. These have to be adapted also to work with
+the new code.
+
+In order to get some group commit facility for these also, we change that part
+of the code in a similar way to ha_commit_trans. We keep another,
+binlog-internal queue of such non-transactional binlog writes, and such writes
+queue up here before sleeping on the LOCK_log mutex. Once a thread obtains the
+LOCK_log, it loops over the queue for the fast part, and does the slow part
+once, then finally wakes up the others in the queue.
+
+In the transactional case in group_log_xid(), before we run the passed-in
+queue, we add any members found in the binlog-internal queue. This allows
+these non-transactional writes to share the group commit.
+
+However, in the case where it is a non-transactional write that gets the
+LOCK_log, the transactional transactions from the ha_commit_trans() queue will
+not be able to take part (they will have to wait for their turn to do another
+fsync). It seems difficult to cleanly let the binlog code grab the queue from
+out of the ha_commit_trans() algorithm. I think the group commit is mostly
+useful in transactional workloads anyway (non-transactional engines will lose
+data anyway in case of crash, so why fsync() after each transaction?)
+
+
+3. XtraDB changes (ha_innodb.cc)
+
+The changes needed in XtraDB are comparatively simple, as XtraDB already
+implements group commit, it just needs to be enabled with the new
+commit_ordered() call.
+
+The existing commit() method already is logically in two parts. The first part
+runs under the prepare_commit_mutex() and must be run in same order as binlog
+commit. This part needs to be moved to commit_ordered(). The second part runs
+after releasing prepare_commit_mutex and does transaction log write+fsync; it
+can remain.
+
+Then the prepare_commit_mutex is removed (and the enable_unsafe_group_commit
+XtraDB option to disable it).
+
+There are two asserts that check that the thread running the first part of
+XtraDB commit is the same as the thread running the other operations for the
+transaction. These have to be removed (as commit_ordered() can run in a
+different thread). Also, error reporting with sql_print_error() has to be
+delayed until commit() time.
+
+
+4. Proof-of-concept implementation
+
+There is a proof-of-concept implementation of this architecture, in the form
+of a quilt patch series [3].
+
+A quick benchmark was done, with sync_binlog=1 and
+innodb_flush_log_at_trx_commit=1. 64 parallel threads doing single-row
+transactions against one table.
+
+Without the patch, we get only 25 queries per second.
+
+With the patch, we get 650 queries per second.
+
+
+5. Open issues/tasks
+
+5.1 XA / other prepare() and commit() call sites.
+
+Check that user-level XA is handled correctly and working. And covered
+sufficiently with tests. Also check that any other calls of ha->prepare() and
+ha->commit() outside of ha_commit_trans() are handled correctly.
+
+5.2 Testing
+
+This worklog needs additions to the test suite, including error inserts to
+check error handling, and synchronisation points to check thread parallelism
+correctness.
+
+
+6. Alternative implementations
+
+ - The binlog code maintains its own extra atomic transaction queue to handle
+ non-transactional commits in a good way together with transactional (with
+ respect to group commit). Alternatively, we could ignore this issue and
+ just give up on group commit for non-transactional statements, for some
+ code simplifications.
+
+ - The binlog code has two ways to prepare end_event and similar, one that
+ uses stack-allocation, and another for when stack allocation is not
+ possible that uses thd->mem_root. Probably the overhead of thd->mem_root is
+ so small that it would make sense to use the same code for both cases.
+
+ - Instead of adding extra fields to THD, we could allocate a separate
+ structure on the thd->mem_root() with the required extra fields (including
+ the THD pointer). Would seem to require initialising mutexes at every
+ commit though.
+
+ - It would probably be a good idea to implement TC_LOG_MMAP::group_log_xid()
+ (should not be hard).
+
+
+-----------------------------------------------------------------------
+
+References:
+
+[2] https://secure.wikimedia.org/wikipedia/en/wiki/ABA_problem
+
+[3] https://knielsen-hq.org/maria/patches.mwl116/
-=-=(Knielsen - Tue, 25 May 2010, 13:18)=-=-
High-Level Specification modified.
--- /tmp/wklog.116.old.14249 2010-05-25 13:18:34.000000000 +0000
+++ /tmp/wklog.116.new.14249 2010-05-25 13:18:34.000000000 +0000
@@ -1 +1,157 @@
+The basic idea in group commit is that multiple threads, each handling one
+transaction, prepare for commit and then queue up together waiting to do an
+fsync() on the transaction log. Then once the log is available, a single
+thread does the fsync() + other necessary book-keeping for all of the threads
+at once. After this, the single thread signals the other threads that it's
+done and they can finish up and return success (or failure) from the commit
+operation.
+
+So group commit has a parallel part, and a sequential part. So we need a
+facility for engines/binlog to participate in both the parallel and the
+sequential part.
+
+To do this, we add two new handlerton methods:
+
+ int (*prepare_ordered)(handlerton *hton, THD *thd, bool all);
+ void (*commit_ordered)(handlerton *hton, THD *thd, bool all);
+
+The idea is that the existing prepare() and commit() methods run in the
+parallel part of group commit, and the new prepare_ordered() and
+commit_ordered() run in the sequential part.
+
+The prepare_ordered() method is called after prepare(). The order of
+transactions that call into prepare_ordered() is guaranteed to be the same among
+all storage engines and binlog, and it is serialised so no two calls can be
+running inside the same engine at the same time.
+
+The commit_ordered() method is called before commit(), and similarly is
+guaranteed to have same transaction order in all participants, and to be
+serialised within one engine.
+
+As the prepare_ordered() and commit_ordered() calls are serialised, the idea
+is that handlers should do the minimum amount of work needed in these calls,
+relaying most of the work (eg. fsync() ...) to prepare() and commit().
+
+As a concrete example, for InnoDB the commit_ordered() method will do the
+first part of commit that fixes the commit order in the transaction log
+buffer, and the commit() method will write the log to disk and fsync()
+it. This split already exists inside the InnoDB code, running before and
+after the release of the prepare_commit_mutex, respectively.
+
+In addition, the XA transaction coordinator (TC_LOG) is special, since it is
+the one responsible for deciding whether to commit or rollback the
+transaction. For this we need an extra method, since this decision can be done
+only after we know that all prepare() and prepare_ordered() calls succeed, and
+must be done to know whether to call commit_ordered()/commit(), or do rollback.
+
+The existing method for this is TC_LOG::log_xid(). To make group commit
+simpler and more efficient to implement in a transaction coordinator, we
+introduce a new method:
+
+ void group_log_xid(THD *first_thd);
+
+This method runs in the sequential part of group commit. It receives a list of
+transactions to perform log_xid() on, in the correct commit order. (Note that
+TC_LOG can do parallel parts of group commit in its own prepare() and commit()
+methods).
+
+This method can make it easier to implement the group commit in TC_LOG, as it
+gets directly the list of transactions in the right order. Without it, it
+might need to compute such order anyway in a prepare_ordered() method, and the
+server has to create this ordered list anyway to implement the order guarantee
+for prepare_ordered() and commit_ordered().
+
+This group_log_xid() method also is more efficient, as it avoids some
+inter-thread synchronisation. Since group_log_xid() is serialised, we can run
+it together with all the commit_ordered() method calls and need only a single
+sequential code section. With the log_xid() methods, we would need first a
+sequential part for the prepare_ordered() calls, then a parallel part with
+log_xid() calls (to not lose group commit ability for log_xid()), then again
+a sequential part for the commit_ordered() method calls.
+
+The extra synchronisation is needed, as each commit_ordered() call will have
+to wait for log_xid() in one thread (if log_xid() fails then commit_ordered()
+should not be called), and also wait for commit_ordered() to finish in all
+threads handling earlier commits. In effect we will need to bounce the
+execution from one thread to the other among all participants in the group
+commit.
+
+As a consequence of the group_log_xid() optimisation, handlers must be aware
+that the commit_ordered() call can happen in another thread than the one
+running commit() (so thread local storage is not available). This should not
+be a big issue as the THD is available for storing any needed information.
+
+Since group_log_xid() runs for multiple transactions in a single thread, it
+can not do error reporting (my_error()) as that relies on thread local
+storage. Instead it sets an error code in THD::xid_error, and if there is an
+error then later another method will be called (in correct thread context) to
+actually report the error:
+
+ int xid_delayed_error(THD *thd)
+
+The three new methods prepare_ordered(), group_log_xid(), and commit_ordered()
+are optional (as is xid_delayed_error). A storage engine or transaction
+coordinator is free to not implement them if they are not needed. In this case
+there will be no order guarantee for the corresponding stage of group commit
+for that engine. For example, InnoDB needs no ordering of the prepare phase,
+so can omit implementing prepare_ordered(); TC_LOG_MMAP needs no ordering at
+all, so does not need to implement any of them.
+
+Note in particular that all existing engines (/binlog implementations if they
+exist) will work unmodified (and also without any change in group commit
+facilities or commit order guarantees).
+
+Using these new APIs, the work will be to
+
+ - In ha_commit_trans(), implement the correct semantics for the three new
+ calls.
+
+ - In XtraDB, use the new commit_ordered() call to remove the
+ prepare_commit_mutex (and resurrect group commit) without losing the
+ consistency with binlog commit order.
+
+ - In log.cc (binlog module), implement group_log_xid() to do group commit of
+ multiple transactions to the binlog with a single shared fsync() call.
+
+-----------------------------------------------------------------------
+Some possible alternative for this worklog:
+
+ - We could eliminate the group_log_xid() method for a simpler API, at the
+ cost of extra synchronisation between threads to do in-order
+ commit_ordered() method calls. This would also allow calling
+ commit_ordered() in the correct thread context.
+
+ - Alternatively, we could eliminate log_xid() and require that all
+ transaction coordinators implement group_log_xid() instead, again for some
+ moderate simplification.
+
+ - At the moment there is no plugin actually using prepare_ordered(), so, it
+ could be removed from the design. But it fits in well, is efficient to
+ implement, and could be useful later (eg. for the requested feature of
+ releasing locks early in InnoDB).
+
+-----------------------------------------------------------------------
+Some possible follow-up projects after this is implemented:
+
+ - Add statistics about how efficient group commit is (#fsyncs/#commits in
+ each engine and binlog).
+
+ - Implement an XtraDB prepare_ordered() method that can release row locks
+ early (Mark Callaghan from Facebook advocates this, but need to determine
+ exactly how to do this safely).
+
+ - Implement a new crash recovery algorithm that uses the consistent commit
+ ordering to need only fsync() for the binlog. At crash recovery, any
+ missing transactions in an engine are replayed from the correct point in the
+ binlog (this point must be stored transactionally inside the engine, as
+ XtraDB already does today).
+
+ - Implement that START TRANSACTION WITH CONSISTENT SNAPSHOT 1) really gets a
+ consistent snapshot, with the same set of committed and not committed
+ transactions in all engines, 2) returns a corresponding consistent binlog
+ position. This should be easy by piggybacking on the synchronisation
+ implemented for ha_commit_trans().
+
+ - Use this in XtraBackup to get consistent binlog position without having to
+ block all updates with FLUSH TABLES WITH READ LOCK.
-=-=(Knielsen - Tue, 25 May 2010, 13:18)=-=-
High Level Description modified.
--- /tmp/wklog.116.old.14234 2010-05-25 13:18:07.000000000 +0000
+++ /tmp/wklog.116.new.14234 2010-05-25 13:18:07.000000000 +0000
@@ -21,3 +21,69 @@
http://kristiannielsen.livejournal.com/12408.html
http://kristiannielsen.livejournal.com/12553.html
+----
+
+Implementing group commit in MySQL faces some challenges from the handler
+plugin architecture:
+
+1. Because storage engine handlers have separate transaction logs from the
+mysql binlog (and from each other), there are multiple fsync() calls per
+commit that need the group commit optimisation (2 per participating storage
+engine + 1 for binlog).
+
+2. The code handling commit is split in several places, in main server code
+and in storage engine code. With pluggable binlog it will be split even
+more. This requires a good abstract yet powerful API to be able to implement
+group commit simply and efficiently in plugins without the different parts
+having to rely on internals of the others.
+
+3. We want the order of commits to be the same in all engines participating in
+multiple transactions. This requirement is the reason that InnoDB currently
+breaks group commit with the infamous prepare_commit_mutex.
+
+While currently there is no server guarantee to get same commit order in
+engines and binlog (except for the InnoDB prepare_commit_mutex hack), there are
+several reasons why this could be desirable:
+
+ - InnoDB hot backup needs to be able to extract a binlog position that is
+ consistent with the hot backup to be able to provision a new slave, and
+ this is impossible without imposing at least partial consistent ordering
+ between InnoDB and binlog.
+
+ - Other backup methods could have similar needs, eg. XtraBackup or
+ `mysqldump --single-transaction`, to have consistent commit order between
+ binlog and storage engines without having to do FLUSH TABLES WITH READ LOCK
+ or similar expensive blocking operation. (other backup methods, like LVM
+ snapshot, don't need consistent commit order, as they can restore
+ out-of-order commits during crash recovery using XA).
+
+ - If we have consistent commit order, we can think about optimising commit to
+ need only one fsync (for binlog); lost commits in storage engines can then
+ be recovered from the binlog at crash recovery by re-playing against the
+ engine from a particular point in the binlog.
+
+ - With consistent commit order, we can get better semantics for START
+ TRANSACTION WITH CONSISTENT SNAPSHOT with multi-engine transactions (and we
+ could even get it to return also a matching binlog position). Currently,
+ this "CONSISTENT SNAPSHOT" can be inconsistent among multiple storage
+ engines.
+
+ - In InnoDB, the performance in the presence of hotspots can be improved if
+ we can release row locks early in the commit phase, but this requires that
+ we release them in the same order as commits in the binlog to ensure
+ consistency between master and slaves.
+
+ - There were some discussions around Galera [1] synchronous replication and
+ global transaction ID that it needed consistent commit order among
+ participating engines.
+
+ - I believe there could be other applications for guaranteed consistent
+ commit order, and that the architecture described in this worklog can
+ implement such a guarantee with reasonable overhead.
+
+
+References:
+
+[1] Galera: http://www.codership.com/products/galera_replication
+
-=-=(Knielsen - Tue, 25 May 2010, 08:28)=-=-
More thoughts on and changes to the architecture. Got to something now that I am satisfied with and
that seems to be able to handle all issues.
Implement new prepare_ordered and commit_ordered handler methods and the logic in ha_commit_trans().
Implement TC_LOG::group_log_xid() method and logic in ha_commit_trans().
Implement XtraDB part, using commit_ordered() rather than prepare_commit_mutex.
Fix test suite failures.
Proof-of-concept patch series complete now.
Do initial benchmark, getting good results. With 64 threads, see 26x improvement in queries-per-sec.
Next step: write up the architecture description.
Worked 21 hours and estimate 0 hours remain (original estimate increased by 21 hours).
-=-=(Knielsen - Wed, 12 May 2010, 06:41)=-=-
Started work on a Quilt patch series, refactoring the binlog code to prepare for implementing the
group commit, and working on the design of group commit in parallel.
Found and fixed several problems in error handling when writing to binlog.
Removed redundant table map version locking.
Split binlog writing into two parts in preparations for group commit. When ready to write to the
binlog, threads enter a queue, and the first thread in the queue handles the binlog writing for
everyone. When it obtains the LOCK_log, it first loops over all threads, executing the first part of
binlog writing (the write(2) syscall essentially). It then runs the second part (fsync(2)
essentially) only once, and then wakes up the remaining threads in the queue.
Still to be done:
Finish the proof-of-concept group commit patch, by 1) implementing the prepare_fast() and
commit_fast() callbacks in handler.cc 2) move the binlog thread enqueue from log_xid() to
binlog_prepare_fast(), 3) move fast part of InnoDB commit to innobase_commit_fast(), removing the
prepare_commit_mutex().
Write up the final design in this worklog.
Evaluate the design to see if we can do better/different.
Think about possible next steps, such as releasing innodb row locks early (in
innobase_prepare_fast), and doing crash recovery by replaying transactions from the binlog (removing
the need for engine durability and 2 of 3 fsync() in commit).
Worked 28 hours and estimate 0 hours remain (original estimate increased by 28 hours).
-=-=(Serg - Mon, 26 Apr 2010, 14:10)=-=-
Observers changed: Serg
DESCRIPTION:
Currently, in order to ensure that the server can recover after a crash to a
state in which storage engines and binary log are consistent with each other,
it is necessary to use XA with durable commits for both storage engines
(innodb_flush_log_at_trx_commit=1) and binary log (sync_binlog=1).
This is _very_ expensive, since the server needs to do three fsync() operations
for every commit, as there is no working group commit when the binary log is
enabled.
The idea is to
- Implement/fix group commit to work properly with the binary log enabled.
- (Optionally) avoid the need to fsync() in the engine, and instead rely on
replaying any lost transactions from the binary log against the engine
during crash recovery.
For background see these articles:
http://kristiannielsen.livejournal.com/12254.html
http://kristiannielsen.livejournal.com/12408.html
http://kristiannielsen.livejournal.com/12553.html
----
Implementing group commit in MySQL faces some challenges from the handler
plugin architecture:
1. Because storage engine handlers have separate transaction logs from the
mysql binlog (and from each other), there are multiple fsync() calls per
commit that need the group commit optimisation (2 per participating storage
engine + 1 for binlog).
2. The code handling commit is split in several places, in main server code
and in storage engine code. With pluggable binlog it will be split even
more. This requires a good abstract yet powerful API to be able to implement
group commit simply and efficiently in plugins without the different parts
having to rely on internals of the others.
3. We want the order of commits to be the same in all engines participating in
multiple transactions. This requirement is the reason that InnoDB currently
breaks group commit with the infamous prepare_commit_mutex.
While currently there is no server guarantee to get same commit order in
engines and binlog (except for the InnoDB prepare_commit_mutex hack), there are
several reasons why this could be desirable:
- InnoDB hot backup needs to be able to extract a binlog position that is
consistent with the hot backup to be able to provision a new slave, and
this is impossible without imposing at least partial consistent ordering
between InnoDB and binlog.
- Other backup methods could have similar needs, eg. XtraBackup or
`mysqldump --single-transaction`, to have consistent commit order between
binlog and storage engines without having to do FLUSH TABLES WITH READ LOCK
or similar expensive blocking operation. (other backup methods, like LVM
snapshot, don't need consistent commit order, as they can restore
out-of-order commits during crash recovery using XA).
- If we have consistent commit order, we can think about optimising commit to
need only one fsync (for binlog); lost commits in storage engines can then
be recovered from the binlog at crash recovery by re-playing against the
engine from a particular point in the binlog.
- With consistent commit order, we can get better semantics for START
TRANSACTION WITH CONSISTENT SNAPSHOT with multi-engine transactions (and we
could even get it to return also a matching binlog position). Currently,
this "CONSISTENT SNAPSHOT" can be inconsistent among multiple storage
engines.
- In InnoDB, the performance in the presence of hotspots can be improved if
we can release row locks early in the commit phase, but this requires that
we release them in the same order as commits in the binlog to ensure
consistency between master and slaves.
- There were some discussions around Galera [1] synchronous replication and
global transaction ID that it needed consistent commit order among
participating engines.
- I believe there could be other applications for guaranteed consistent
commit order, and that the architecture described in this worklog can
implement such a guarantee with reasonable overhead.
References:
[1] Galera: http://www.codership.com/products/galera_replication
HIGH-LEVEL SPECIFICATION:
The basic idea in group commit is that multiple threads, each handling one
transaction, prepare for commit and then queue up together waiting to do an
fsync() on the transaction log. Then once the log is available, a single
thread does the fsync() + other necessary book-keeping for all of the threads
at once. After this, the single thread signals the other threads that it's
done and they can finish up and return success (or failure) from the commit
operation.
So group commit has a parallel part, and a sequential part. So we need a
facility for engines/binlog to participate in both the parallel and the
sequential part.
To do this, we add two new handlerton methods:
int (*prepare_ordered)(handlerton *hton, THD *thd, bool all);
void (*commit_ordered)(handlerton *hton, THD *thd, bool all);
The idea is that the existing prepare() and commit() methods run in the
parallel part of group commit, and the new prepare_ordered() and
commit_ordered() run in the sequential part.
The prepare_ordered() method is called after prepare(). The order of
transactions that call into prepare_ordered() is guaranteed to be the same among
all storage engines and binlog, and it is serialised so no two calls can be
running inside the same engine at the same time.
The commit_ordered() method is called before commit(), and similarly is
guaranteed to have same transaction order in all participants, and to be
serialised within one engine.
As the prepare_ordered() and commit_ordered() calls are serialised, the idea
is that handlers should do the minimum amount of work needed in these calls,
relaying most of the work (eg. fsync() ...) to prepare() and commit().
As a concrete example, for InnoDB the commit_ordered() method will do the
first part of commit that fixes the commit order in the transaction log
buffer, and the commit() method will write the log to disk and fsync()
it. This split already exists inside the InnoDB code, running before and
after the release of the prepare_commit_mutex, respectively.
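To illustrate the intended split (the engine and its helper functions below
are invented; only the handlerton method signatures come from the text
above):

    /* Sketch of a hypothetical engine wiring up the new methods,
       assuming the server's handlerton/THD types from sql/handler.h.
       commit_ordered() runs serialised, in commit order, and must be
       quick; commit() does the slow log write + fsync() in parallel. */

    void my_engine_fix_commit_order_in_log_buffer(THD *thd);  /* invented */
    void my_engine_write_and_fsync_log(THD *thd);             /* invented */

    static void my_engine_commit_ordered(handlerton *hton, THD *thd, bool all)
    {
      my_engine_fix_commit_order_in_log_buffer(thd);
    }

    static int my_engine_commit(handlerton *hton, THD *thd, bool all)
    {
      my_engine_write_and_fsync_log(thd);
      return 0;
    }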
In addition, the XA transaction coordinator (TC_LOG) is special, since it is
the one responsible for deciding whether to commit or rollback the
transaction. For this we need an extra method, since this decision can be done
only after we know that all prepare() and prepare_ordered() calls succeed, and
must be done to know whether to call commit_ordered()/commit(), or do rollback.
The existing method for this is TC_LOG::log_xid(). To make group commit
simpler and more efficient to implement in a transaction coordinator, we
introduce a new method:
void group_log_xid(THD *first_thd);
This method runs in the sequential part of group commit. It receives a list of
transactions to perform log_xid() on, in the correct commit order. (Note that
TC_LOG can do parallel parts of group commit in its own prepare() and commit()
methods).
This method can make it easier to implement the group commit in TC_LOG, as it
gets directly the list of transactions in the right order. Without it, it
might need to compute such order anyway in a prepare_ordered() method, and the
server has to create this ordered list anyway to implement the order guarantee
for prepare_ordered() and commit_ordered().
This group_log_xid() method also is more efficient, as it avoids some
inter-thread synchronisation. Since group_log_xid() is serialised, we can run
it together with all the commit_ordered() method calls and need only a single
sequential code section. With the log_xid() methods, we would need first a
sequential part for the prepare_ordered() calls, then a parallel part with
log_xid() calls (to not lose group commit ability for log_xid()), then again
a sequential part for the commit_ordered() method calls.
The extra synchronisation is needed, as each commit_ordered() call will have
to wait for log_xid() in one thread (if log_xid() fails then commit_ordered()
should not be called), and also wait for commit_ordered() to finish in all
threads handling earlier commits. In effect we will need to bounce the
execution from one thread to the other among all participants in the group
commit.
As a consequence of the group_log_xid() optimisation, handlers must be aware
that the commit_ordered() call can happen in another thread than the one
running commit() (so thread local storage is not available). This should not
be a big issue as the THD is available for storing any needed information.
Since group_log_xid() runs for multiple transactions in a single thread, it
can not do error reporting (my_error()) as that relies on thread local
storage. Instead it sets an error code in THD::xid_error, and if there is an
error then later another method will be called (in correct thread context) to
actually report the error:
int xid_delayed_error(THD *thd)
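A minimal sketch of how the delayed reporting could look (the body is
invented for illustration; only the name and the THD::xid_error field come
from the text above):

    int xid_delayed_error(THD *thd)
    {
      /* Runs in thd's own thread, so my_error() and its thread-local
         storage are safe to use here. */
      my_error(ER_ERROR_DURING_COMMIT, MYF(0), thd->xid_error);
      return 1;
    }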
The three new methods prepare_ordered(), group_log_xid(), and commit_ordered()
are optional (as is xid_delayed_error). A storage engine or transaction
coordinator is free to not implement them if they are not needed. In this case
there will be no order guarantee for the corresponding stage of group commit
for that engine. For example, InnoDB needs no ordering of the prepare phase,
so can omit implementing prepare_ordered(); TC_LOG_MMAP needs no ordering at
all, so does not need to implement any of them.
Note in particular that all existing engines (/binlog implementations if they
exist) will work unmodified (and also without any change in group commit
facilities or commit order guarantees).
Using these new APIs, the work will be to
- In ha_commit_trans(), implement the correct semantics for the three new
calls.
- In XtraDB, use the new commit_ordered() call to remove the
prepare_commit_mutex (and resurrect group commit) without losing the
consistency with binlog commit order.
- In log.cc (binlog module), implement group_log_xid() to do group commit of
multiple transactions to the binlog with a single shared fsync() call.
-----------------------------------------------------------------------
Some possible alternative for this worklog:
- We could eliminate the group_log_xid() method for a simpler API, at the
cost of extra synchronisation between threads to do in-order
commit_ordered() method calls. This would also allow calling
commit_ordered() in the correct thread context.
- Alternatively, we could eliminate log_xid() and require that all
transaction coordinators implement group_log_xid() instead, again for some
moderate simplification.
- At the moment there is no plugin actually using prepare_ordered(), so, it
could be removed from the design. But it fits in well, is efficient to
implement, and could be useful later (eg. for the requested feature of
releasing locks early in InnoDB).
-----------------------------------------------------------------------
Some possible follow-up projects after this is implemented:
- Add statistics about how efficient group commit is (#fsyncs/#commits in
each engine and binlog).
- Implement an XtraDB prepare_ordered() method that can release row locks
early (Mark Callaghan from Facebook advocates this, but need to determine
exactly how to do this safely).
- Implement a new crash recovery algorithm that uses the consistent commit
ordering to need an fsync() only for the binlog. At crash recovery, any
missing transactions in an engine are replayed from the correct point in the
binlog (this point must be stored transactionally inside the engine, as
XtraDB already does today).
- Implement that START TRANSACTION WITH CONSISTENT SNAPSHOT 1) really gets a
consistent snapshot, with the same set of committed and uncommitted
transactions in all engines, 2) returns a corresponding consistent binlog
position. This should be easy by piggybacking on the synchronisation
implemented for ha_commit_trans().
- Use this in XtraBackup to get consistent binlog position without having to
block all updates with FLUSH TABLES WITH READ LOCK.
LOW-LEVEL DESIGN:
1. Changes for ha_commit_trans()
The guts of the commit code are in the function ha_commit_trans() (and in
commit_one_phase(), which is called from it). This must be extended to use the
new prepare_ordered(), group_log_xid(), and commit_ordered() calls.
1.1 Atomic queue of committing transactions
To keep the right commit order among participants, we put transactions into a
queue. The operations on the queue are non-locking:
- Insert THD at the head of the queue, and return old queue.
THD *enqueue_atomic(THD *thd)
- Fetch (and delete) the whole queue.
THD *atomic_grab_reverse_queue()
These are simple to implement with atomic compare-and-set. Note that there is
no ABA problem [2], as we do not delete individual elements from the queue, we
grab the whole queue and replace it with NULL.
A transaction enters the queue when it does prepare_ordered(). This way, the
scheduling order for prepare_ordered() calls is what determines the sequence
in the queue and effectively the commit order.
The queue is grabbed by the code doing group_log_xid() and commit_ordered()
calls. The queue is passed directly to group_log_xid(), and afterwards
iterated to do individual commit_ordered() calls.
Using a lock-free queue allows prepare_ordered() (for one transaction) to run
in parallel with commit_ordered() (for another transaction), increasing
potential parallelism.
The queue is simply a linked list of THD objects, linked through a
THD::next_commit_ordered field. Since we add at the head of the queue, the
list is actually in reverse order, so must be reversed when we grab and delete
it.
The reason that enqueue_atomic() returns the old queue is so that we can check
if an insert goes to the head of the queue. The thread at the head of the
queue will do the sequential part of group commit for everyone.
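For illustration, here is a minimal sketch of the two queue operations using
C++11 atomics, with a simplified Transaction struct standing in for THD (the
real server code would use its own atomics wrappers; all names here exist
only for this sketch):

  #include <atomic>

  struct Transaction                      // simplified stand-in for THD
  {
    Transaction *next_commit_ordered= nullptr;
  };

  static std::atomic<Transaction *> commit_queue{nullptr};

  // Insert at the head of the queue and return the old head, so the caller
  // can detect whether it became the group commit leader (old head == NULL).
  Transaction *enqueue_atomic(Transaction *trx)
  {
    Transaction *old_head= commit_queue.load(std::memory_order_relaxed);
    do
      trx->next_commit_ordered= old_head;
    while (!commit_queue.compare_exchange_weak(old_head, trx,
                                               std::memory_order_release,
                                               std::memory_order_relaxed));
    return old_head;
  }

  // Grab the whole queue, replacing it with NULL (hence no ABA problem),
  // and reverse the list to obtain commit order.
  Transaction *atomic_grab_reverse_queue()
  {
    Transaction *q= commit_queue.exchange(nullptr, std::memory_order_acquire);
    Transaction *reversed= nullptr;
    while (q)
    {
      Transaction *next= q->next_commit_ordered;
      q->next_commit_ordered= reversed;
      reversed= q;
      q= next;
    }
    return reversed;
  }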
1.2 Locks
1.2.1 Global LOCK_prepare_ordered
This lock is taken to serialise calls to prepare_ordered(). Note that
effectively, the commit order is decided by the order in which threads obtain
this lock.
1.2.2 Global LOCK_group_commit and COND_group_commit
This lock is used to protect the serial part of group commit. It is taken
around the code where we grab the queue, call group_log_xid() on the queue,
and call commit_ordered() on each element of the queue, to make sure they
happen serialised and in consistent order. It also protects the variable
group_commit_queue_busy, which is used when not using group_log_xid() to delay
running over a new queue until the first queue is completely done.
1.2.3 Global LOCK_commit_ordered
This lock is taken around calls to commit_ordered(), to ensure they happen
serialised.
1.2.4 Per-thread thd->LOCK_commit_ordered and thd->COND_commit_ordered
This lock protects the thd->group_commit_ready variable, as well as the
condition variable used to wake up threads after log_xid() and
commit_ordered() finish.
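As a sketch of this wait/wakeup handshake, using std::mutex and
std::condition_variable in place of the server's mysys primitives (ThdWait is
an illustrative stand-in for the new THD members):

  #include <condition_variable>
  #include <mutex>

  struct ThdWait                          // stand-in for the new THD members
  {
    std::mutex LOCK_commit_ordered;
    std::condition_variable COND_commit_ordered;
    bool group_commit_ready= false;
  };

  // Non-leader side: sleep until the leader has processed our commit.
  void wait_for_leader(ThdWait *thd)
  {
    std::unique_lock<std::mutex> lk(thd->LOCK_commit_ordered);
    thd->COND_commit_ordered.wait(lk, [thd] { return thd->group_commit_ready; });
  }

  // Leader side: mark one queued participant as done and wake it up.
  void wakeup_participant(ThdWait *other)
  {
    {
      std::lock_guard<std::mutex> lk(other->LOCK_commit_ordered);
      other->group_commit_ready= true;
    }
    other->COND_commit_ordered.notify_one();
  }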
1.2.5 Global LOCK_group_commit_queue
This is only used on platforms with no native compare-and-set operations, to
make the queue operations atomic.
1.3 Commit algorithm.
This is the basic algorithm, simplified by
- omitting some error handling
- omitting looping over all handlers when invoking handler methods
- omitting some possible optimisations when not all calls are needed (see
next section).
- omitting the case where group_log_xid() is not used (see below).
---- BEGIN ALGORITHM ----
ht->prepare()
// Call prepare_ordered() and enqueue in correct commit order
lock(LOCK_prepare_ordered)
ht->prepare_ordered()
thd->group_commit_ready= FALSE
old_queue= enqueue_atomic(thd)
is_group_commit_leader= (old_queue == NULL)
unlock(LOCK_prepare_ordered)
if (is_group_commit_leader)
// The first in queue handles group commit for everyone
lock(LOCK_group_commit)
// Wait while queue is busy, see below for when this occurs
while (group_commit_queue_busy)
cond_wait(COND_group_commit)
// Grab and reverse the queue to get correct order of transactions
queue= atomic_grab_reverse_queue()
// This call will set individual error codes in thd->xid_error
// It also sets the cookie for unlog() in thd->xid_cookie
group_log_xid(queue)
lock(LOCK_commit_ordered)
for (other IN queue)
if (!other->xid_error)
ht->commit_ordered()
unlock(LOCK_commit_ordered)
unlock(LOCK_group_commit)
// Now we are done, so wake up all the others.
for (other IN TAIL(queue))
lock(other->LOCK_commit_ordered)
other->group_commit_ready= TRUE
cond_signal(other->COND_commit_ordered)
unlock(other->LOCK_commit_ordered)
else
// If not the leader, just wait until leader did the work for us.
lock(thd->LOCK_commit_ordered)
while (!thd->group_commit_ready)
cond_wait(thd->LOCK_commit_ordered, thd->COND_commit_ordered)
unlock(thd->LOCK_commit_ordered)
// Finally do any error reporting now that we're back in own thread.
if (thd->xid_error)
xid_delayed_error(thd)
else
ht->commit(thd)
unlog(thd->xid_cookie, thd->xid)
---- END ALGORITHM ----
If the transaction coordinator does not support group_log_xid(), we have to do
things differently. In this case after the serialisation point at
prepare_ordered(), we have to parallelise again when running log_xid()
(otherwise we would lose group commit). But then when log_xid() is done, we
have to serialise again to check for any error and call commit_ordered() in
correct sequence for any transaction where log_xid() did not return error.
The central part of the algorithm in this case (when using log_xid()) is:
---- BEGIN ALGORITHM ----
cookie= log_xid(thd)
error= (cookie == 0)
if (is_group_commit_leader)
// The first to enqueue grabs the queue and runs first.
// But we must wait until a previous queue run is fully done.
lock(LOCK_group_commit)
while (group_commit_queue_busy)
cond_wait(COND_group_commit)
queue= atomic_grab_reverse_queue()
// The queue will be busy until last thread in it is done.
group_commit_queue_busy= TRUE
unlock(LOCK_group_commit)
else
// Not first in queue -> wait for previous one to wake us up.
lock(thd->LOCK_commit_ordered)
while (!thd->group_commit_ready)
cond_wait(thd->LOCK_commit_ordered, thd->COND_commit_ordered)
unlock(thd->LOCK_commit_ordered)
if (!error) // Only if log_xid() was successful
lock(LOCK_commit_ordered)
ht->commit_ordered()
unlock(LOCK_commit_ordered)
// Wake up the next thread, or release the queue if we are the last.
next= thd->next_commit_ordered
if (next)
lock(next->LOCK_commit_ordered)
next->group_commit_ready= TRUE
cond_signal(next->COND_commit_ordered)
unlock(next->LOCK_commit_ordered)
else
lock(LOCK_group_commit)
group_commit_queue_busy= FALSE
unlock(LOCK_group_commit)
---- END ALGORITHM ----
There are a number of locks taken in the algorithm, but in the group_log_xid()
case most of them should be uncontended most of the time. The
LOCK_group_commit of course will be contended, as new threads queue up waiting
for the previous group commit (and binlog fsync()) to finish so they can do
the next group commit. This is the whole point of implementing group commit.
The LOCK_prepare_ordered and LOCK_commit_ordered mutexes should not be much
contended as long as handlers follow the intention of having the corresponding
handler calls execute quickly.
The per-thread LOCK_commit_ordered mutexes should not be contended; they are
only used to wake up a sleeping thread.
1.4 Optimisations when not using all three new calls
The prepare_ordered(), group_log_xid(), and commit_ordered() methods are
optional, and if not implemented by a particular handler/transaction
coordinator, we can optimise the algorithm to take advantage of not having to
keep ordering for the missing parts.
If there is no prepare_ordered(), then we need not take the
LOCK_prepare_ordered mutex.
If there is no commit_ordered(), then we need not take the LOCK_commit_ordered
mutex.
If there is no group_log_xid(), then we only need the queue to ensure same
ordering of transactions for commit_ordered() as for prepare_ordered(). Thus,
if either of these (or both) are also not present, we do not need to use the
queue at all.
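One possible way to encode these rules is sketched below; the CommitPlan
struct and its flags are hypothetical, assumed to be computed once from which
participants implement which of the optional methods:

  // Hypothetical summary of which optional calls are present.
  struct CommitPlan
  {
    bool use_prepare_ordered;       // some participant has prepare_ordered()
    bool use_commit_ordered;        // some participant has commit_ordered()
    bool use_group_log_xid;         // the TC implements group_log_xid()

    bool need_prepare_mutex() const { return use_prepare_ordered; }
    bool need_commit_mutex() const  { return use_commit_ordered; }

    // The queue either feeds group_log_xid(), or carries the ordering from
    // prepare_ordered() over to commit_ordered(); otherwise it can be skipped.
    bool need_queue() const
    {
      return use_group_log_xid ||
             (use_prepare_ordered && use_commit_ordered);
    }
  };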
2. Binlog code changes (log.cc)
The bulk of the work needed for the binary log is to extend the code to allow
group commit to the log. Unlike InnoDB/XtraDB, there is no existing support
inside the binlog code for group commit.
The existing code runs most of the write + fsync to the binary log under the
global LOCK_log mutex, preventing any group commit.
To enable group commit, this code must be split into two parts:
- one part that runs per transaction, re-writing the embedded event positions
for the correct offset, and writing this into the in-memory log cache.
- another part that writes a set of transactions to the disk, and runs
fsync().
Then in group_log_xid(), we can run the first part in a loop over all the
transactions in the passed-in queue, and run the second part only once.
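In outline, the split could look like the sketch below;
write_transaction_to_log() and sync_log() are illustrative stand-ins for the
two halves, not the actual log.cc functions:

  struct Transaction                    // simplified stand-in for THD
  {
    Transaction *next_commit_ordered;
    int xid_error;
  };

  // Fast, per-transaction part: rewrite the embedded event positions and
  // copy the transaction's events from the in-memory log cache into the
  // binlog (stub body for this sketch).
  static int write_transaction_to_log(Transaction *) { return 0; }

  // Slow part: write the batch to disk and fsync(), once per group (stub).
  static void sync_log() {}

  void group_log_xid(Transaction *queue)
  {
    // Run the per-transaction part in commit order over the whole queue.
    for (Transaction *trx= queue; trx; trx= trx->next_commit_ordered)
      trx->xid_error= write_transaction_to_log(trx);

    // Then pay the fsync() cost only once for the group.
    sync_log();
  }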
The binlog code also has other code paths that write into the binlog,
eg. non-transactional statements. These also have to be adapted to work with
the new code.
In order to get some group commit facility for these also, we change that part
of the code in a similar way to ha_commit_trans(). We keep another,
binlog-internal queue of such non-transactional binlog writes, and such writes
queue up here before sleeping on the LOCK_log mutex. Once a thread obtains the
LOCK_log, it loops over the queue for the fast part, and does the slow part
once, then finally wakes up the others in the queue.
In the transactional case in group_log_xid(), before we run the passed-in
queue, we add any members found in the binlog-internal queue. This allows
these non-transactional writes to share the group commit.
However, in the case where it is a non-transactional write that gets the
LOCK_log, the transactions from the ha_commit_trans() queue will not be able
to take part (they will have to wait for their turn to do another fsync). It
seems difficult to cleanly let the binlog code grab the queue from out of the
ha_commit_trans() algorithm. I think the group commit is mostly useful in
transactional workloads anyway (non-transactional engines will lose data
anyway in case of crash, so why fsync() after each transaction?).
3. XtraDB changes (ha_innodb.cc)
The changes needed in XtraDB are comparatively simple, as XtraDB already
implements group commit; it just needs to be enabled with the new
commit_ordered() call.
The existing commit() method is already logically in two parts. The first
part runs under the prepare_commit_mutex and must run in the same order as
binlog commits. This part needs to be moved to commit_ordered(). The second
part runs after releasing prepare_commit_mutex and does the transaction log
write+fsync; it can remain.
Then the prepare_commit_mutex can be removed (along with the
enable_unsafe_group_commit XtraDB option that disables it).
There are two asserts that check that the thread running the first part of
XtraDB commit is the same as the thread running the other operations for the
transaction. These have to be removed (as commit_ordered() can run in a
different thread). Also, an error report done with sql_print_error() has to
be delayed until commit() time.
4. Proof-of-concept implementation
There is a proof-of-concept implementation of this architecture, in the form
of a quilt patch series [3].
A quick benchmark was done with sync_binlog=1 and
innodb_flush_log_at_trx_commit=1, using 64 parallel threads doing single-row
transactions against one table.
Without the patch, we get only 25 queries per second.
With the patch, we get 650 queries per second.
5. Open issues/tasks
5.1 XA / other prepare() and commit() call sites.
Check that user-level XA is handled correctly and works, and that it is
covered sufficiently with tests. Also check that any other calls of
ha->prepare() and ha->commit() outside of ha_commit_trans() are handled
correctly.
5.2 Testing
This worklog needs additions to the test suite, including error inserts to
check error handling, and synchronisation points to check thread parallelism
correctness.
6. Alternative implementations
- The binlog code maintains its own extra atomic transaction queue to handle
non-transactional commits in a good way together with transactional ones
(with respect to group commit). Alternatively, we could ignore this issue
and just give up on group commit for non-transactional statements, for some
code simplifications.
- The binlog code has two ways to prepare end_event and similar, one that
uses stack-allocation, and another for when stack allocation is not
possible that uses thd->mem_root. Probably the overhead of thd->mem_root is
so small that it would make sense to use the same code for both cases.
- Instead of adding extra fields to THD, we could allocate a separate
structure on the thd->mem_root with the required extra fields (including
the THD pointer). This would seem to require initialising mutexes at every
commit, though.
- It would probably be a good idea to implement TC_LOG_MMAP::group_log_xid()
(should not be hard).
-----------------------------------------------------------------------
References:
[2] https://secure.wikimedia.org/wikipedia/en/wiki/ABA_problem
[3] https://knielsen-hq.org/maria/patches.mwl116/
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v3.5.9)

[Maria-developers] Progress (by Knielsen): Efficient group commit for binary log (116)
by worklog-noreply@askmonty.org 07 Jun '10
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: Efficient group commit for binary log
CREATION DATE..: Mon, 26 Apr 2010, 13:28
SUPERVISOR.....: Knielsen
IMPLEMENTOR....: Knielsen
COPIES TO......: Serg
CATEGORY.......: Server-RawIdeaBin
TASK ID........: 116 (http://askmonty.org/worklog/?tid=116)
VERSION........: Server-9.x
STATUS.........: Assigned
PRIORITY.......: 60
WORKED HOURS...: 74
ESTIMATE.......: 0 (hours remain)
ORIG. ESTIMATE.: 0
PROGRESS NOTES:
-=-=(Knielsen - Mon, 07 Jun 2010, 07:16)=-=-
Missing hours in previous progress report.
Worked 3 hours and estimate 0 hours remain (original estimate increased by 3 hours).
-=-=(Knielsen - Mon, 07 Jun 2010, 07:15)=-=-
Some more benchmarking.
Blog about results and architecture.
Fix two bugs in the proof-of-concept patch, races that caused hangs in benchmarks.
Worked 11 hours and estimate 0 hours remain (original estimate increased by 11 hours).
-=-=(Guest - Tue, 01 Jun 2010, 14:20)=-=-
Status updated.
--- /tmp/wklog.116.old.32652 2010-06-01 14:20:15.000000000 +0000
+++ /tmp/wklog.116.new.32652 2010-06-01 14:20:15.000000000 +0000
@@ -1 +1 @@
-Un-Assigned
+Assigned
-=-=(Knielsen - Mon, 31 May 2010, 06:48)=-=-
Finish first architecture draft (changed my mind a number of times before I was satisfied).
Write up architecture in worklog.
Fix remaining test failures in proof-of-concept patch + implement xtradb part.
Run some benchmarks on proof-of-concept implementation.
Worked 11 hours and estimate 0 hours remain (original estimate increased by 11 hours).
-=-=(Knielsen - Tue, 25 May 2010, 13:19)=-=-
Low Level Design modified.
-=-=(Knielsen - Tue, 25 May 2010, 13:18)=-=-
High-Level Specification modified.
-=-=(Knielsen - Tue, 25 May 2010, 13:18)=-=-
High Level Description modified.
-=-=(Knielsen - Tue, 25 May 2010, 08:28)=-=-
More thoughts on and changes to the architecture. Got to something now that I
am satisfied with and that seems to be able to handle all issues.
Implement new prepare_ordered and commit_ordered handler methods and the logic in ha_commit_trans().
Implement TC_LOG::group_log_xid() method and logic in ha_commit_trans().
Implement XtraDB part, using commit_ordered() rather than prepare_commit_mutex.
Fix test suite failures.
Proof-of-concept patch series complete now.
Do initial benchmark, getting good results. With 64 threads, see 26x improvement in queries-per-sec.
Next step: write up the architecture description.
Worked 21 hours and estimate 0 hours remain (original estimate increased by 21 hours).
-=-=(Knielsen - Wed, 12 May 2010, 06:41)=-=-
Started work on a Quilt patch series, refactoring the binlog code to prepare for implementing the
group commit, and working on the design of group commit in parallel.
Found and fixed several problems in error handling when writing to binlog.
Removed redundant table map version locking.
Split binlog writing into two parts in preparations for group commit. When ready to write to the
binlog, threads enter a queue, and the first thread in the queue handles the binlog writing for
everyone. When it obtains the LOCK_log, it first loops over all threads, executing the first part of
binlog writing (the write(2) syscall essentially). It then runs the second part (fsync(2)
essentially) only once, and then wakes up the remaining threads in the queue.
Still to be done:
Finish the proof-of-concept group commit patch, by 1) implementing the prepare_fast() and
commit_fast() callbacks in handler.cc 2) move the binlog thread enqueue from log_xid() to
binlog_prepare_fast(), 3) move fast part of InnoDB commit to innobase_commit_fast(), removing the
prepare_commit_mutex().
Write up the final design in this worklog.
Evaluate the design to see if we can do better/different.
Think about possible next steps, such as releasing innodb row locks early (in
innobase_prepare_fast), and doing crash recovery by replaying transactions from the binlog (removing
the need for engine durability and 2 of 3 fsync() in commit).
Worked 28 hours and estimate 0 hours remain (original estimate increased by 28 hours).
-=-=(Serg - Mon, 26 Apr 2010, 14:10)=-=-
Observers changed: Serg
DESCRIPTION:
Currently, in order to ensure that the server can recover after a crash to a
state in which storage engines and binary log are consistent with each other,
it is necessary to use XA with durable commits for both storage engines
(innodb_flush_log_at_trx_commit=1) and binary log (sync_binlog=1).
This is _very_ expensive, since the server needs to do three fsync() operations
for every commit, as there is no working group commit when the binary log is
enabled.
The idea is to
- Implement/fix group commit to work properly with the binary log enabled.
- (Optionally) avoid the need to fsync() in the engine, and instead rely on
replaying any lost transactions from the binary log against the engine
during crash recovery.
For background see these articles:
http://kristiannielsen.livejournal.com/12254.html
http://kristiannielsen.livejournal.com/12408.html
http://kristiannielsen.livejournal.com/12553.html
----
Implementing group commit in MySQL faces some challenges from the handler
plugin architecture:
1. Because storage engine handlers have separate transaction logs from the
mysql binlog (and from each other), there are multiple fsync() calls per
commit that need the group commit optimisation (2 per participating storage
engine + 1 for binlog).
2. The code handling commit is split in several places, in main server code
and in storage engine code. With pluggable binlog it will be split even
more. This requires a good abstract yet powerful API to be able to implement
group commit simply and efficiently in plugins without the different parts
having to rely on internals of the others.
3. We want the order of commits to be the same in all engines participating in
multiple transactions. This requirement is the reason that InnoDB currently
breaks group commit with the infamous prepare_commit_mutex.
While currently there is no server guarantee to get the same commit order in
engines and binlog (except for the InnoDB prepare_commit_mutex hack), there
are several reasons why this could be desirable:
- InnoDB hot backup needs to be able to extract a binlog position that is
consistent with the hot backup to be able to provision a new slave, and
this is impossible without imposing at least partial consistent ordering
between InnoDB and binlog.
- Other backup methods could have similar needs, eg. XtraBackup or
`mysqldump --single-transaction`, to have consistent commit order between
binlog and storage engines without having to do FLUSH TABLES WITH READ LOCK
or similar expensive blocking operation. (other backup methods, like LVM
snapshot, don't need consistent commit order, as they can restore
out-of-order commits during crash recovery using XA).
- If we have consistent commit order, we can think about optimising commit to
need only one fsync (for binlog); lost commits in storage engines can then
be recovered from the binlog at crash recovery by re-playing against the
engine from a particular point in the binlog.
- With consistent commit order, we can get better semantics for START
TRANSACTION WITH CONSISTENT SNAPSHOT with multi-engine transactions (and we
could even get it to return also a matching binlog position). Currently,
this "CONSISTENT SNAPSHOT" can be inconsistent among multiple storage
engines.
- In InnoDB, the performance in the presence of hotspots can be improved if
we can release row locks early in the commit phase, but this requires that
we release them in the same order as commits in the binlog to ensure
consistency between master and slaves.
- There were some discussions around Galera [1] synchronous replication and
global transaction ID, indicating that it needs consistent commit order
among participating engines.
- I believe there could be other applications for guaranteed consistent
commit order, and that the architecture described in this worklog can
implement such guarantee with reasonable overhead.
References:
[1] Galera: http://www.codership.com/products/galera_replication
HIGH-LEVEL SPECIFICATION:
The basic idea in group commit is that multiple threads, each handling one
transaction, prepare for commit and then queue up together waiting to do an
fsync() on the transaction log. Then once the log is available, a single
thread does the fsync() + other necessary book-keeping for all of the threads
at once. After this, the single thread signals the other threads that it's
done and they can finish up and return success (or failure) from the commit
operation.
So group commit has a parallel part and a sequential part, and we need a
facility for engines/binlog to participate in both the parallel and the
sequential part.
To do this, we add two new handlerton methods:
int (*prepare_ordered)(handlerton *hton, THD *thd, bool all);
void (*commit_ordered)(handlerton *hton, THD *thd, bool all);
The idea is that the existing prepare() and commit() methods run in the
parallel part of group commit, and the new prepare_ordered() and
commit_ordered() run in the sequential part.
The prepare_ordered() method is called after prepare(). The order of
transactions that call into prepare_ordered() is guaranteed to be the same
among all storage engines and binlog, and it is serialised so no two calls
can be running inside the same engine at the same time.
The commit_ordered() method is called before commit(), and similarly is
guaranteed to have same transaction order in all participants, and to be
serialised within one engine.
As the prepare_ordered() and commit_ordered() calls are serialised, the idea
is that handlers should do the minimum amount of work needed in these calls,
leaving most of the work (eg. fsync() ...) to prepare() and commit().
As a concrete example, for InnoDB the commit_ordered() method will do the
first part of commit that fixes the commit order in the transaction log
buffer, and the commit() method will write the log to disk and fsync()
it. This split already exists inside the InnoDB code, running before and
after releasing the prepare_commit_mutex, respectively.
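As a hedged sketch of such an engine-side split (every engine-internal name
below is invented for illustration, not taken from the actual InnoDB source):

  struct THD;                                  // opaque server connection state
  struct handlerton;                           // opaque engine plugin descriptor

  // Hypothetical engine internals:
  static void engine_fix_commit_order(THD *)   // cheap log-buffer bookkeeping
  {}
  static int engine_write_and_sync_log(THD *)  // expensive write + fsync()
  { return 0; }

  // Sequential part: runs in binlog commit order, possibly in a different
  // thread than the one running the transaction.
  static void my_engine_commit_ordered(handlerton *, THD *thd, bool)
  {
    engine_fix_commit_order(thd);
  }

  // Parallel part: the expensive transaction log write + fsync() stays here.
  static int my_engine_commit(handlerton *, THD *thd, bool)
  {
    return engine_write_and_sync_log(thd);
  }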
In addition, the XA transaction coordinator (TC_LOG) is special, since it is
the one responsible for deciding whether to commit or rollback the
transaction. For this we need an extra method, since this decision can be
made only after we know that all prepare() and prepare_ordered() calls
succeeded, and it must be made in order to know whether to call
commit_ordered()/commit(), or do rollback.
The existing method for this is TC_LOG::log_xid(). To make group commit
simpler and more efficient to implement in a transaction coordinator, we
introduce a new method:
void group_log_xid(THD *first_thd);
This method runs in the sequential part of group commit. It receives a list of
transactions to perform log_xid() on, in the correct commit order. (Note that
TC_LOG can do parallel parts of group commit in its own prepare() and commit()
methods).
This method can make it easier to implement the group commit in TC_LOG, as it
gets directly the list of transactions in the right order. Without it, it
might need to compute such order anyway in a prepare_ordered() method, and the
server has to create this ordered list anyway to implement the order guarantee
for prepare_ordered() and commit_ordered().
This group_log_xid() method is also more efficient, as it avoids some
inter-thread synchronisation. Since group_log_xid() is serialised, we can run
it together with all the commit_ordered() method calls and need only a single
sequential code section. With the log_xid() methods, we would need first a
sequential part for the prepare_ordered() calls, then a parallel part with
log_xid() calls (to not lose group commit ability for log_xid()), then again
a sequential part for the commit_ordered() method calls.
The extra synchronisation is needed, as each commit_ordered() call will have
to wait for log_xid() in one thread (if log_xid() fails then commit_ordered()
should not be called), and also wait for commit_ordered() to finish in all
threads handling earlier commits. In effect we will need to bounce the
execution from one thread to the other among all participants in the group
commit.
As a consequence of the group_log_xid() optimisation, handlers must be aware
that the commit_ordered() call can happen in another thread than the one
running commit() (so thread local storage is not available). This should not
be a big issue as the THD is available for storing any needed information.
Since group_log_xid() runs for multiple transactions in a single thread, it
can not do error reporting (my_error()) as that relies on thread local
storage. Instead it sets an error code in THD::xid_error, and if there is an
error then later another method will be called (in correct thread context) to
actually report the error:
int xid_delayed_error(THD *thd)
The three new methods prepare_ordered(), group_log_xid(), and commit_ordered()
are optional (as is xid_delayed_error). A storage engine or transaction
coordinator is free to not implement them if they are not needed. In this case
there will be no order guarantee for the corresponding stage of group commit
for that engine. For example, InnoDB needs no ordering of the prepare phase,
so can omit implementing prepare_ordered(); TC_LOG_MMAP needs no ordering at
all, so does not need to implement any of them.
Note in particular that all existing engines (and binlog implementations, if
any) will work unmodified (and also without any change in group commit
facilities or commit order guarantees).
Using these new APIs, the work will be to
- In ha_commit_trans(), implement the correct semantics for the three new
calls.
- In XtraDB, use the new commit_ordered() call to remove the
prepare_commit_mutex (and resurrect group commit) without losing the
consistency with binlog commit order.
- In log.cc (binlog module), implement group_log_xid() to do group commit of
multiple transactions to the binlog with a single shared fsync() call.
-----------------------------------------------------------------------
Some possible alternatives for this worklog:
- We could eliminate the group_log_xid() method for a simpler API, at the
cost of extra synchronisation between threads to do in-order
commit_ordered() method calls. This would also allow calling
commit_ordered() in the correct thread context.
- Alternatively, we could eliminate log_xid() and require that all
transaction coordinators implement group_log_xid() instead, again for some
moderate simplification.
- At the moment there is no plugin actually using prepare_ordered(), so it
could be removed from the design. But it fits in well, is efficient to
implement, and could be useful later (eg. for the requested feature of
releasing locks early in InnoDB).
-----------------------------------------------------------------------
Some possible follow-up projects after this is implemented:
- Add statistics about how efficient group commit is (#fsyncs/#commits in
each engine and binlog).
- Implement an XtraDB prepare_ordered() method that can release row locks
early (Mark Callaghan from Facebook advocates this, but we need to determine
exactly how to do this safely).
- Implement a new crash recovery algorithm that uses the consistent commit
ordering to need an fsync() only for the binlog. At crash recovery, any
missing transactions in an engine are replayed from the correct point in the
binlog (this point must be stored transactionally inside the engine, as
XtraDB already does today).
- Implement that START TRANSACTION WITH CONSISTENT SNAPSHOT 1) really gets a
consistent snapshot, with the same set of committed and uncommitted
transactions in all engines, 2) returns a corresponding consistent binlog
position. This should be easy by piggybacking on the synchronisation
implemented for ha_commit_trans().
- Use this in XtraBackup to get consistent binlog position without having to
block all updates with FLUSH TABLES WITH READ LOCK.
LOW-LEVEL DESIGN:
1. Changes for ha_commit_trans()
The guts of the commit code are in the function ha_commit_trans() (and in
commit_one_phase(), which is called from it). This must be extended to use the
new prepare_ordered(), group_log_xid(), and commit_ordered() calls.
1.1 Atomic queue of committing transactions
To keep the right commit order among participants, we put transactions into a
queue. The operations on the queue are non-locking:
- Insert THD at the head of the queue, and return old queue.
THD *enqueue_atomic(THD *thd)
- Fetch (and delete) the whole queue.
THD *atomic_grab_reverse_queue()
These are simple to implement with atomic compare-and-set. Note that there is
no ABA problem [2], as we do not delete individual elements from the queue, we
grab the whole queue and replace it with NULL.
A transaction enters the queue when it does prepare_ordered(). This way, the
scheduling order for prepare_ordered() calls is what determines the sequence
in the queue and effectively the commit order.
The queue is grabbed by the code doing group_log_xid() and commit_ordered()
calls. The queue is passed directly to group_log_xid(), and afterwards
iterated to do individual commit_ordered() calls.
Using a lock-free queue allows prepare_ordered() (for one transaction) to run
in parallel with commit_ordered() (for another transaction), increasing
potential parallelism.
The queue is simply a linked list of THD objects, linked through a
THD::next_commit_ordered field. Since we add at the head of the queue, the
list is actually in reverse order, so must be reversed when we grab and delete
it.
The reason that enqueue_atomic() returns the old queue is so that we can check
if an insert goes to the head of the queue. The thread at the head of the
queue will do the sequential part of group commit for everyone.
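As a concrete illustration, here is a minimal sketch of the two queue
operations using standard C++ atomics. This is not the actual patch: the names
follow this worklog, but the real server code would use its own
compare-and-set wrappers, and THD is reduced here to just the link field.
  #include <atomic>
  struct THD
  {
    THD *next_commit_ordered;  // intrusive link; head insertion => reverse order
  };
  static std::atomic<THD *> group_commit_queue{nullptr};
  // Insert thd at the head of the queue, and return the old head so the
  // caller can detect whether it became group commit leader (old == NULL).
  THD *enqueue_atomic(THD *thd)
  {
    THD *old_head= group_commit_queue.load(std::memory_order_relaxed);
    do
      thd->next_commit_ordered= old_head;
    while (!group_commit_queue.compare_exchange_weak(
               old_head, thd,
               std::memory_order_release, std::memory_order_relaxed));
    return old_head;
  }
  // Grab the whole queue (replacing it with NULL), and reverse the list so
  // it comes out in commit order. There is no ABA problem because we never
  // pop individual nodes.
  THD *atomic_grab_reverse_queue()
  {
    THD *q= group_commit_queue.exchange(nullptr, std::memory_order_acquire);
    THD *reversed= nullptr;
    while (q)
    {
      THD *next= q->next_commit_ordered;
      q->next_commit_ordered= reversed;
      reversed= q;
      q= next;
    }
    return reversed;
  }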
1.2 Locks
1.2.1 Global LOCK_prepare_ordered
This lock is taken to serialise calls to prepare_ordered(). Note that
effectively, the commit order is decided by the order in which threads obtain
this lock.
1.2.2 Global LOCK_group_commit and COND_group_commit
This lock is used to protect the serial part of group commit. It is taken
around the code where we grab the queue, call group_log_xid() on the queue,
and call commit_ordered() on each element of the queue, to make sure they
happen serialised and in consistent order. It also protects the variable
group_commit_queue_busy, which is used when not using group_log_xid() to delay
running over a new queue until the first queue is completely done.
1.2.3 Global LOCK_commit_ordered
This lock is taken around calls to commit_ordered(), to ensure they happen
serialised.
1.2.4 Per-thread thd->LOCK_commit_ordered and thd->COND_commit_ordered
This lock protects the thd->group_commit_ready variable, as well as the
condition variable used to wake up threads after log_xid() and
commit_ordered() finishes.
1.2.5 Global LOCK_group_commit_queue
This is only used on platforms with no native compare-and-set operations, to
make the queue operations atomic.
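Taken together, the synchronisation state described above amounts to roughly
the following declarations (a sketch only; the exact wrapper types and the
placement in the real code may differ):
  /* Global, in the commit module: */
  pthread_mutex_t LOCK_prepare_ordered;    /* serialises prepare_ordered() */
  pthread_mutex_t LOCK_group_commit;       /* serial part of group commit */
  pthread_cond_t  COND_group_commit;
  pthread_mutex_t LOCK_commit_ordered;     /* serialises commit_ordered() */
  pthread_mutex_t LOCK_group_commit_queue; /* only without native CAS */
  bool group_commit_queue_busy;            /* protected by LOCK_group_commit */
  /* Per-thread, as new THD members: */
  pthread_mutex_t LOCK_commit_ordered;
  pthread_cond_t  COND_commit_ordered;
  bool            group_commit_ready;      /* protected by the above mutex */
  THD            *next_commit_ordered;     /* link for the atomic queue */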
1.3 Commit algorithm.
This is the basic algorithm, simplified by
- omitting some error handling
- omitting looping over all handlers when invoking handler methods
- omitting some possible optimisations when not all calls are needed (see
next section)
- omitting the case where no group_log_xid() is used (see below)
---- BEGIN ALGORITHM ----
  ht->prepare()

  // Call prepare_ordered() and enqueue in correct commit order
  lock(LOCK_prepare_ordered)
  ht->prepare_ordered()
  old_queue= enqueue_atomic(thd)
  thd->group_commit_ready= FALSE
  is_group_commit_leader= (old_queue == NULL)
  unlock(LOCK_prepare_ordered)

  if (is_group_commit_leader)
    // The first in queue handles group commit for everyone
    lock(LOCK_group_commit)
    // Wait while queue is busy, see below for when this occurs
    while (group_commit_queue_busy)
      cond_wait(COND_group_commit)

    // Grab and reverse the queue to get correct order of transactions
    queue= atomic_grab_reverse_queue()

    // This call will set individual error codes in thd->xid_error
    // It also sets the cookie for unlog() in thd->xid_cookie
    group_log_xid(queue)

    lock(LOCK_commit_ordered)
    for (other IN queue)
      if (!other->xid_error)
        ht->commit_ordered()
    unlock(LOCK_commit_ordered)

    unlock(LOCK_group_commit)

    // Now we are done, so wake up all the others.
    for (other IN TAIL(queue))
      lock(other->LOCK_commit_ordered)
      other->group_commit_ready= TRUE
      cond_signal(other->COND_commit_ordered)
      unlock(other->LOCK_commit_ordered)
  else
    // If not the leader, just wait until the leader has done the work for us.
    lock(thd->LOCK_commit_ordered)
    while (!thd->group_commit_ready)
      cond_wait(thd->LOCK_commit_ordered, thd->COND_commit_ordered)
    unlock(thd->LOCK_commit_ordered)

  // Finally do any error reporting now that we're back in our own thread.
  if (thd->xid_error)
    xid_delayed_error(thd)
  else
    ht->commit(thd)
    unlog(thd->xid_cookie, thd->xid)
---- END ALGORITHM ----
If the transaction coordinator does not support group_log_xid(), we have to do
things differently. In this case after the serialisation point at
prepare_ordered(), we have to parallelise again when running log_xid()
(otherwise we would lose group commit). But then when log_xid() is done, we
have to serialise again to check for any error and call commit_ordered() in
correct sequence for any transaction where log_xid() did not return error.
The central part of the algorithm in this case (when using log_xid()) is:
---- BEGIN ALGORITHM ----
  cookie= log_xid(thd)
  error= (cookie == 0)

  if (is_group_commit_leader)
    // The first to enqueue grabs the queue and runs first.
    // But we must wait until a previous queue run is fully done.
    lock(LOCK_group_commit)
    while (group_commit_queue_busy)
      cond_wait(COND_group_commit)
    queue= atomic_grab_reverse_queue()
    // The queue will be busy until the last thread in it is done.
    group_commit_queue_busy= TRUE
    unlock(LOCK_group_commit)
  else
    // Not first in queue -> wait for previous one to wake us up.
    lock(thd->LOCK_commit_ordered)
    while (!thd->group_commit_ready)
      cond_wait(thd->LOCK_commit_ordered, thd->COND_commit_ordered)
    unlock(thd->LOCK_commit_ordered)

  if (!error) // Only if log_xid() was successful
    lock(LOCK_commit_ordered)
    ht->commit_ordered()
    unlock(LOCK_commit_ordered)

  // Wake up the next thread; the last one releases the queue.
  next= thd->next_commit_ordered
  if (next)
    lock(next->LOCK_commit_ordered)
    next->group_commit_ready= TRUE
    cond_signal(next->COND_commit_ordered)
    unlock(next->LOCK_commit_ordered)
  else
    lock(LOCK_group_commit)
    group_commit_queue_busy= FALSE
    unlock(LOCK_group_commit)
---- END ALGORITHM ----
There are a number of locks taken in the algorithm, but in the group_log_xid()
case most of them should be uncontended most of the time. The
LOCK_group_commit of course will be contended, as new threads queue up waiting
for the previous group commit (and binlog fsync()) to finish so they can do
the next group commit. This is the whole point of implementing group commit.
The LOCK_prepare_ordered and LOCK_commit_ordered mutexes should not be much
contended as long as handlers follow the intention that the corresponding
handler calls execute quickly.
The per-thread LOCK_commit_ordered mutexes should not be contended; they are
only used to wake up a sleeping thread.
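That per-thread handshake is a standard condition variable pattern; a short
sketch follows (the function names are mine, not the patch's, and THD is
reduced to the fields involved). The while loop guards against spurious
wakeups.
  #include <pthread.h>
  struct THD
  {
    pthread_mutex_t LOCK_commit_ordered;
    pthread_cond_t  COND_commit_ordered;
    bool            group_commit_ready;
  };
  // Follower side: sleep until the leader has done the work for us.
  void thd_wait_for_group_commit(THD *thd)
  {
    pthread_mutex_lock(&thd->LOCK_commit_ordered);
    while (!thd->group_commit_ready)
      pthread_cond_wait(&thd->COND_commit_ordered, &thd->LOCK_commit_ordered);
    pthread_mutex_unlock(&thd->LOCK_commit_ordered);
  }
  // Leader side: wake one follower after log_xid()/commit_ordered() has
  // been done on its behalf.
  void thd_wake_for_group_commit(THD *other)
  {
    pthread_mutex_lock(&other->LOCK_commit_ordered);
    other->group_commit_ready= true;
    pthread_cond_signal(&other->COND_commit_ordered);
    pthread_mutex_unlock(&other->LOCK_commit_ordered);
  }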
1.4 Optimisations when not using all three new calls
The prepare_ordered(), group_log_xid(), and commit_ordered() methods are
optional, and if not implemented by a particular handler/transaction
coordinator, we can optimise the algorithm to take advantage of not having to
keep ordering for the missing parts.
If there is no prepare_ordered(), then we need not take the
LOCK_prepare_ordered mutex.
If there is no commit_ordered(), then we need not take the LOCK_commit_ordered
mutex.
If there is no group_log_xid(), then we only need the queue to ensure the same
ordering of transactions for commit_ordered() as for prepare_ordered(). Thus,
if either of these (or both) are also not present, we do not need to use the
queue at all.
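In code, these optimisations boil down to a few conditionals; a sketch of the
decision logic (the have_* flags are assumptions, meaning "at least one
participant implements the call"):
  bool need_prepare_mutex= have_prepare_ordered;
  bool need_commit_mutex=  have_commit_ordered;
  // Without group_log_xid(), the queue only carries the ordering from
  // prepare_ordered() over to commit_ordered(); it is pointless if either
  // end of that ordering is missing.
  bool need_queue= have_group_log_xid ||
                   (have_prepare_ordered && have_commit_ordered);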
2. Binlog code changes (log.cc)
The bulk of the work needed for the binary log is to extend the code to allow
group commit to the log. Unlike InnoDB/XtraDB, there is no existing support
inside the binlog code for group commit.
The existing code runs most of the write + fsync to the binary log under the
global LOCK_log mutex, preventing any group commit.
To enable group commit, this code must be split into two parts:
- one part that runs per transaction, re-writing the embedded event positions
for the correct offset, and writing this into the in-memory log cache.
- another part that writes a set of transactions to the disk, and runs
fsync().
Then in group_log_xid(), we can run the first part in a loop over all the
transactions in the passed-in queue, and run the second part only once.
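A sketch of the resulting shape of group_log_xid() (the two helper names are
illustrative stand-ins for the split described above, not functions in the
actual patch; setting the unlog() cookie in thd->xid_cookie is omitted):
  struct THD
  {
    THD *next_commit_ordered;
    int  xid_error;
  };
  int  binlog_write_transaction_cache(THD *thd); // assumed: fast part, fixes
                                                 // up event positions and
                                                 // copies the trx cache
  void binlog_flush_and_sync();                  // assumed: write + fsync()
  void group_log_xid(THD *queue)
  {
    // Fast part runs once per transaction, in commit order.
    for (THD *thd= queue; thd; thd= thd->next_commit_ordered)
      thd->xid_error= binlog_write_transaction_cache(thd);
    // Slow part runs once per group: a single write to disk and one
    // fsync() shared by all transactions in the queue.
    binlog_flush_and_sync();
  }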
The binlog code also has other code paths that write into the binlog,
eg. non-transactional statements. These also have to be adapted to work with
the new code.
In order to get some group commit facility for these also, we change that part
of the code in a similar way to ha_commit_trans(). We keep another,
binlog-internal queue of such non-transactional binlog writes, and such writes
queue up here before sleeping on the LOCK_log mutex. Once a thread obtains the
LOCK_log, it loops over the queue for the fast part, and does the slow part
once, then finally wakes up the others in the queue.
In the transactional case in group_log_xid(), before we run the passed-in
queue, we add any members found in the binlog-internal queue. This allows
these non-transactional writes to share the group commit.
However, in the case where it is a non-transactional write that gets the
LOCK_log, the transactions from the ha_commit_trans() queue will not be able
to take part (they will have to wait for their turn to do another fsync). It
seems difficult to cleanly let the binlog code grab the queue from out of the
ha_commit_trans() algorithm. I think group commit is mostly useful in
transactional workloads anyway (non-transactional engines will lose data
anyway in case of crash, so why fsync() after each transaction?)
3. XtraDB changes (ha_innodb.cc)
The changes needed in XtraDB are comparatively simple: XtraDB already
implements group commit, it just needs to be enabled with the new
commit_ordered() call.
The existing commit() method is already logically in two parts. The first part
runs under the prepare_commit_mutex and must be run in the same order as binlog
commit. This part needs to be moved to commit_ordered(). The second part runs
after releasing prepare_commit_mutex and does transaction log write+fsync; it
can remain.
Then the prepare_commit_mutex is removed (and the enable_unsafe_group_commit
XtraDB option to disable it).
There are two asserts that check that the thread running the first part of
XtraDB commit is the same as the thread running the other operations for the
transaction. These have to be removed (as commit_ordered() can run in a
different thread). Also, error reporting with sql_print_error() has to be
delayed until commit() time.
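A sketch of what the split looks like in handlerton terms (the helper names
are placeholders for the existing InnoDB code paths, not real functions):
  struct handlerton;
  struct THD;
  void innobase_commit_order_fix(THD *thd, bool all); // placeholder: the code
                                                      // formerly run under
                                                      // prepare_commit_mutex
  int  innobase_log_write_and_sync(THD *thd);         // placeholder: trx log
                                                      // write + fsync()
  // First part: runs serialised, in binlog commit order. Must not fsync()
  // and must stay short; may run in a different thread than commit().
  static void innobase_commit_ordered(handlerton *hton, THD *thd, bool all)
  {
    innobase_commit_order_fix(thd, all);
  }
  // Second part: runs in parallel between transactions, so InnoDB's own
  // group commit of the transaction log is preserved.
  static int innobase_commit(handlerton *hton, THD *thd, bool all)
  {
    return innobase_log_write_and_sync(thd);
  }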
4. Proof-of-concept implementation
There is a proof-of-concept implementation of this architecture, in the form
of a quilt patch series [3].
A quick benchmark was done, with sync_binlog=1 and
innodb_flush_log_at_trx_commit=1. 64 parallel threads doing single-row
transactions against one table.
Without the patch, we get only 25 queries per second.
With the patch, we get 650 queries per second.
5. Open issues/tasks
5.1 XA / other prepare() and commit() call sites.
Check that user-level XA is handled correctly and working, and is covered
sufficiently with tests. Also check that any other calls of ha->prepare() and
ha->commit() outside of ha_commit_trans() are handled correctly.
5.2 Testing
This worklog needs additions to the test suite, including error inserts to
check error handling, and synchronisation points to check thread parallelism
correctness.
6. Alternative implementations
- The binlog code maintains its own extra atomic transaction queue to handle
non-transactional commits in a good way together with transactional (with
respect to group commit). Alternatively, we could ignore this issue and
just give up on group commit for non-transactional statements, for some
code simplifications.
- The binlog code has two ways to prepare end_event and similar, one that
uses stack-allocation, and another for when stack allocation is not
possible that uses thd->mem_root. Probably the overhead of thd->mem_root is
so small that it would make sense to use the same code for both cases.
- Instead of adding extra fields to THD, we could allocate a separate
structure on the thd->mem_root with the required extra fields (including
the THD pointer). This would seem to require initialising mutexes at every
commit, though.
- It would probably be a good idea to implement TC_LOG_MMAP::group_log_xid()
(should not be hard).
-----------------------------------------------------------------------
References:
[2] https://secure.wikimedia.org/wikipedia/en/wiki/ABA_problem
[3] https://knielsen-hq.org/maria/patches.mwl116/
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------

[Maria-developers] Progress (by Knielsen): Efficient group commit for binary log (116)
by worklog-noreply@askmonty.org 07 Jun '10
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: Efficient group commit for binary log
CREATION DATE..: Mon, 26 Apr 2010, 13:28
SUPERVISOR.....: Knielsen
IMPLEMENTOR....: Knielsen
COPIES TO......: Serg
CATEGORY.......: Server-RawIdeaBin
TASK ID........: 116 (http://askmonty.org/worklog/?tid=116)
VERSION........: Server-9.x
STATUS.........: Assigned
PRIORITY.......: 60
WORKED HOURS...: 71
ESTIMATE.......: 0 (hours remain)
ORIG. ESTIMATE.: 0
PROGRESS NOTES:
-=-=(Knielsen - Mon, 07 Jun 2010, 07:15)=-=-
Some more benchmarking.
Blog about results and architecture.
Fix two bugs in the proof-of-concept patch, races that cause hangs in benchmarks.
Worked 11 hours and estimate 0 hours remain (original estimate increased by 11 hours).
-=-=(Guest - Tue, 01 Jun 2010, 14:20)=-=-
Status updated.
--- /tmp/wklog.116.old.32652 2010-06-01 14:20:15.000000000 +0000
+++ /tmp/wklog.116.new.32652 2010-06-01 14:20:15.000000000 +0000
@@ -1 +1 @@
-Un-Assigned
+Assigned
-=-=(Knielsen - Mon, 31 May 2010, 06:48)=-=-
Finish first architecture draft (changed my mind a number of times before I was satisfied).
Write up architecture in worklog.
Fix remaining test failures in proof-of-concept patch + implement xtradb part.
Run some benchmarks on proof-of-concept implementation.
Worked 11 hours and estimate 0 hours remain (original estimate increased by 11 hours).
-=-=(Knielsen - Tue, 25 May 2010, 13:19)=-=-
Low Level Design modified.
--- /tmp/wklog.116.old.14255 2010-05-25 13:19:00.000000000 +0000
+++ /tmp/wklog.116.new.14255 2010-05-25 13:19:00.000000000 +0000
@@ -1 +1,363 @@
+1. Changes for ha_commit_trans()
+
+The gut of the code for commit is in the function ha_commit_trans() (and in
+commit_one_phase() which is called from it). This must be extended to use the
+new prepare_ordered(), group_log_xid(), and commit_ordered() calls.
+
+1.1 Atomic queue of committing transactions
+
+To keep the right commit order among participants, we put transactions into a
+queue. The operations on the queue are non-locking:
+
+ - Insert THD at the head of the queue, and return old queue.
+
+ THD *enqueue_atomic(THD *thd)
+
+ - Fetch (and delete) the whole queue.
+
+ THD *atomic_grab_reverse_queue()
+
+These are simple to implement with atomic compare-and-set. Note that there is
+no ABA problem [2], as we do not delete individual elements from the queue, we
+grab the whole queue and replace it with NULL.
+
+A transaction enters the queue when it does prepare_ordered(). This way, the
+scheduling order for prepare_ordered() calls is what determines the sequence
+in the queue and effectively the commit order.
+
+The queue is grabbed by the code doing group_log_xid() and commit_ordered()
+calls. The queue is passed directly to group_log_xid(), and afterwards
+iterated to do individual commit_ordered() calls.
+
+Using a lock-free queue allows prepare_ordered() (for one transaction) to run
+in parallel with commit_ordered (in another transaction), increasing potential
+parallelism.
+
+The queue is simply a linked list of THD objects, linked through a
+THD::next_commit_ordered field. Since we add at the head of the queue, the
+list is actually in reverse order, so must be reversed when we grab and delete
+it.
+
+The reason that enqueue_atomic() returns the old queue is so that we can check
+if an insert goes to the head of the queue. The thread at the head of the
+queue will do the sequential part of group commit for everyone.
+
+
+1.2 Locks
+
+1.2.1 Global LOCK_prepare_ordered
+
+This lock is taken to serialise calls to prepare_ordered(). Note that
+effectively, the commit order is decided by the order in which threads obtain
+this lock.
+
+
+1.2.2 Global LOCK_group_commit and COND_group_commit
+
+This lock is used to protect the serial part of group commit. It is taken
+around the code where we grab the queue, call group_log_xid() on the queue,
+and call commit_ordered() on each element of the queue, to make sure they
+happen serialised and in consistent order. It also protects the variable
+group_commit_queue_busy, which is used when not using group_log_xid() to delay
+running over a new queue until the first queue is completely done.
+
+
+1.2.3 Global LOCK_commit_ordered
+
+This lock is taken around calls to commit_ordered(), to ensure they happen
+serialised.
+
+
+1.2.4 Per-thread thd->LOCK_commit_ordered and thd->COND_commit_ordered
+
+This lock protects the thd->group_commit_ready variable, as well as the
+condition variable used to wake up threads after log_xid() and
+commit_ordered() finishes.
+
+
+1.2.5 Global LOCK_group_commit_queue
+
+This is only used on platforms with no native compare-and-set operations, to
+make the queue operations atomic.
+
+
+1.3 Commit algorithm.
+
+This is the basic algorithm, simplified by
+
+ - omitting some error handling
+
+ - omitting looping over all handlers when invoking handler methods
+
+ - omitting some possible optimisations when not all calls needed (see next
+ section).
+
+ - Omitting the case where no group_log_xid() is used, see below.
+
+---- BEGIN ALGORITHM ----
+ ht->prepare()
+
+ // Call prepare_ordered() and enqueue in correct commit order
+ lock(LOCK_prepare_ordered)
+ ht->prepare_ordered()
+ old_queue= enqueue_atomic(thd)
+ thd->group_commit_ready= FALSE
+ is_group_commit_leader= (old_queue == NULL)
+ unlock(LOCK_prepare_ordered)
+
+ if (is_group_commit_leader)
+
+ // The first in queue handles group commit for everyone
+
+ lock(LOCK_group_commit)
+ // Wait while queue is busy, see below for when this occurs
+ while (group_commit_queue_busy)
+ cond_wait(COND_group_commit)
+
+ // Grab and reverse the queue to get correct order of transactions
+ queue= atomic_grab_reverse_queue()
+
+ // This call will set individual error codes in thd->xid_error
+ // It also sets the cookie for unlog() in thd->xid_cookie
+ group_log_xid(queue)
+
+ lock(LOCK_commit_ordered)
+ for (other IN queue)
+ if (!other->xid_error)
+ ht->commit_ordered()
+ unlock(LOCK_commit_ordered)
+
+ unlock(LOCK_group_commit)
+
+ // Now we are done, so wake up all the others.
+ for (other IN TAIL(queue))
+ lock(other->LOCK_commit_ordered)
+ other->group_commit_ready= TRUE
+ cond_signal(other->COND_commit_ordered)
+ unlock(other->LOCK_commit_ordered)
+ else
+ // If not the leader, just wait until leader did the work for us.
+ lock(thd->LOCK_commit_ordered)
+ while (!thd->group_commit_ready)
+ cond_wait(thd->LOCK_commit_ordered, thd->COND_commit_ordered)
+ unlock(other->LOCK_commit_ordered)
+
+ // Finally do any error reporting now that we're back in own thread.
+ if (thd->xid_error)
+ xid_delayed_error(thd)
+ else
+ ht->commit(thd)
+ unlog(thd->xid_cookie, thd->xid)
+---- END ALGORITHM ----
+
+If the transaction coordinator does not support group_log_xid(), we have to do
+things differently. In this case after the serialisation point at
+prepare_ordered(), we have to parallelise again when running log_xid()
+(otherwise we would loose group commit). But then when log_xid() is done, we
+have to serialise again to check for any error and call commit_ordered() in
+correct sequence for any transaction where log_xid() did not return error.
+
+The central part of the algorithm in this case (when using log_xid()) is:
+
+---- BEGIN ALGORITHM ----
+ cookie= log_xid(thd)
+ error= (cookie == 0)
+
+ if (is_group_commit_leader)
+
+ // The first to enqueue grabs the queue and runs first.
+ // But we must wait until a previous queue run is fully done.
+
+ lock(LOCK_group_commit)
+ while (group_commit_queue_busy)
+ cond_wait(COND_group_commit)
+ queue= atomic_grab_reverse_queue()
+ // The queue will be busy until last thread in it is done.
+ group_commit_queue_busy= TRUE
+ unlock(LOCK_group_commit)
+ else
+ // Not first in queue -> wait for previous one to wake us up.
+ lock(thd->LOCK_commit_ordered)
+ while (!thd->group_commit_ready)
+ cond_wait(thd->LOCK_commit_ordered, thd->COND_commit_ordered)
+ unlock(other->LOCK_commit_ordered)
+
+ if (!error) // Only if log_xid() was successful
+ lock(LOCK_commit_ordered)
+ ht->commit_ordered()
+ unlock(LOCK_commit_ordered)
+
+ // Wake up the next thread, and release queue in last.
+ next= thd->next_commit_ordered
+
+ if (next)
+ lock(next->LOCK_commit_ordered)
+ next->group_commit_ready= TRUE
+ cond_signal(next->COND_commit_ordered)
+ unlock(next->LOCK_commit_ordered)
+ else
+ lock(LOCK_group_commit)
+ group_commit_queue_busy= FALSE
+ unlock(LOCK_group_commit)
+---- END ALGORITHM ----
+
+There are a number of locks taken in the algorithm, but in the group_log_xid()
+case most of them should be uncontended most of the time. The
+LOCK_group_commit of course will be contended, as new threads queue up waiting
+for the previous group commit (and binlog fsync()) to finish so they can do
+the next group commit. This is the whole point of implementing group commit.
+
+The LOCK_prepare_ordered and LOCK_commit_ordered mutexes should be not much
+contended as long as handlers follow the intension of having the corresponding
+handler calls execute quickly.
+
+The per-thread LOCK_commit_ordered mutexes should not be contended; they are
+only used to wake up a sleeping thread.
+
+
+1.4 Optimisations when not using all three new calls
+
+
+The prepare_ordered(), group_log_xid(), and commit_ordered() methods are
+optional, and if not implemented by a particular handler/transaction
+coordinator, we can optimise the algorithm to take advantage of not having to
+keep ordering for the missing parts.
+
+If there is no prepare_ordered(), then we need not take the
+LOCK_prepare_ordered mutex.
+
+If there is no commit_ordered(), then we need not take the LOCK_commit_ordered
+mutex.
+
+If there is no group_log_xid(), then we only need the queue to ensure same
+ordering of transactions for commit_ordered() as for prepare_ordered(). Thus,
+if either of these (or both) are also not present, we do not need to use the
+queue at all.
+
+
+2. Binlog code changes (log.cc)
+
+
+The bulk of the work needed for the binary log is to extend the code to allow
+group commit to the log. Unlike InnoDB/XtraDB, there is no existing support
+inside the binlog code for group commit.
+
+The existing code runs most of the write + fsync to the binary lock under the
+global LOCK_log mutex, preventing any group commit.
+
+To enable group commit, this code must be split into two parts:
+
+ - one part that runs per transaction, re-writing the embedded event positions
+ for the correct offset, and writing this into the in-memory log cache.
+
+ - another part that writes a set of transactions to the disk, and runs
+ fsync().
+
+Then in group_log_xid(), we can run the first part in a loop over all the
+transactions in the passed-in queue, and run the second part only once.
+
+The binlog code also has other code paths that write into the binlog,
+eg. non-transactional statements. These have to be adapted also to work with
+the new code.
+
+In order to get some group commit facility for these also, we change that part
+of the code in a similar way to ha_commit_trans. We keep another,
+binlog-internal queue of such non-transactional binlog writes, and such writes
+queue up here before sleeping on the LOCK_log mutex. Once a thread obtains the
+LOCK_log, it loops over the queue for the fast part, and does the slow part
+once, then finally wakes up the others in the queue.
+
+In the transactional case in group_log_xid(), before we run the passed-in
+queue, we add any members found in the binlog-internal queue. This allows
+these non-transactional writes to share the group commit.
+
+However, in the case where it is a non-transactional write that gets the
+LOCK_log, the transactional transactions from the ha_commit_trans() queue will
+not be able to take part (they will have to wait for their turn to do another
+fsync). It seems difficult to cleanly let the binlog code grab the queue from
+out of the ha_commit_trans() algorithm. I think the group commit is mostly
+useful in transactional workloads anyway (non-transactional engines will loose
+data anyway in case of crash, so why fsync() after each transaction?)
+
+
+3. XtraDB changes (ha_innodb.cc)
+
+The changes needed in XtraDB are comparatively simple, as XtraDB already
+implements group commit, it just needs to be enabled with the new
+commit_ordered() call.
+
+The existing commit() method already is logically in two parts. The first part
+runs under the prepare_commit_mutex() and must be run in same order as binlog
+commit. This part needs to be moved to commit_ordered(). The second part runs
+after releasing prepare_commit_mutex and does transaction log write+fsync; it
+can remain.
+
+Then the prepare_commit_mutex is removed (and the enable_unsafe_group_commit
+XtraDB option to disable it).
+
+There are two asserts that check that the thread running the first part of
+XtraDB commit is the same as the thread running the other operations for the
+transaction. These have to be removed (as commit_ordered() can run in a
+different thread). Also an error reporting with sql_print_error() has to be
+delayed until commit() time.
+
+
+4. Proof-of-concept implementation
+
+There is a proof-of-concept implementation of this architecture, in the form
+of a quilt patch series [3].
+
+A quick benchmark was done, with sync_binlog=1 and
+innodb_flush_log_at_trx_commit=1. 64 parallel threads doing single-row
+transactions against one table.
+
+Without the patch, we get only 25 queries per second.
+
+With the patch, we get 650 queries per second.
+
+
+5. Open issues/tasks
+
+5.1 XA / other prepare() and commit() call sites.
+
+Check that user-level XA is handled correctly and working. And covered
+sufficiently with tests. Also check that any other calls of ha->prepare() and
+ha->commit() outside of ha_commit_trans() are handled correctly.
+
+5.2 Testing
+
+This worklog needs additions to the test suite, including error inserts to
+check error handling, and synchronisation points to check thread parallelism
+correctness.
+
+
+6. Alternative implementations
+
+ - The binlog code maintains its own extra atomic transaction queue to handle
+ non-transactional commits in a good way together with transactional (with
+ respect to group commit). Alternatively, we could ignore this issue and
+ just give up on group commit for non-transactional statements, for some
+ code simplifications.
+
+ - The binlog code has two ways to prepare end_event and similar, one that
+ uses stack-allocation, and another for when stack allocation is not
+ possible that uses thd->mem_root. Probably the overhead of thd->mem_root is
+ so small that it would make sense to use the same code for both cases.
+
+ - Instead of adding extra fields to THD, we could allocate a separate
+ structure on the thd->mem_root() with the required extra fields (including
+ the THD pointer). Would seem to require initialising mutexes at every
+ commit though.
+
+ - It would probably be a good idea to implement TC_LOG_MMAP::group_log_xid()
+ (should not be hard).
+
+
+-----------------------------------------------------------------------
+
+References:
+
+[2] https://secure.wikimedia.org/wikipedia/en/wiki/ABA_problem
+
+[3] https://knielsen-hq.org/maria/patches.mwl116/
-=-=(Knielsen - Tue, 25 May 2010, 13:18)=-=-
High-Level Specification modified.
--- /tmp/wklog.116.old.14249 2010-05-25 13:18:34.000000000 +0000
+++ /tmp/wklog.116.new.14249 2010-05-25 13:18:34.000000000 +0000
@@ -1 +1,157 @@
+The basic idea in group commit is that multiple threads, each handling one
+transaction, prepare for commit and then queue up together waiting to do an
+fsync() on the transaction log. Then once the log is available, a single
+thread does the fsync() + other necessary book-keeping for all of the threads
+at once. After this, the single thread signals the other threads that it's
+done and they can finish up and return success (or failure) from the commit
+operation.
+
+So group commit has a parallel part, and a sequential part. So we need a
+facility for engines/binlog to participate in both the parallel and the
+sequential part.
+
+To do this, we add two new handlerton methods:
+
+ int (*prepare_ordered)(handlerton *hton, THD *thd, bool all);
+ void (*commit_ordered)(handlerton *hton, THD *thd, bool all);
+
+The idea is that the existing prepare() and commit() methods run in the
+parallel part of group commit, and the new prepare_ordered() and
+commit_ordered() run in the sequential part.
+
+The prepare_ordered() method is called after prepare(). The order of
+tranctions that call into prepare_ordered() is guaranteed to be the same among
+all storage engines and binlog, and it is serialised so no two calls can be
+running inside the same engine at the same time.
+
+The commit_ordered() method is called before commit(), and similarly is
+guaranteed to have same transaction order in all participants, and to be
+serialised within one engine.
+
+As the prepare_ordered() and commit_ordered() calls are serialised, the idea
+is that handlers should do the minimum amount of work needed in these calls,
+relaying most of the work (eg. fsync() ...) to prepare() and commit().
+
+As a concrete example, for InnoDB the commit_ordered() method will do the
+first part of commit that fixed the commit order in the transaction log
+buffer, and the commit() method will write the log to disk and fsync()
+it. This split already exists inside the InnoDB code, running before
+respectively after releasing the prepare_commit_mutex.
+
+In addition, the XA transaction coordinator (TC_LOG) is special, since it is
+the one responsible for deciding whether to commit or rollback the
+transaction. For this we need an extra method, since this decision can be done
+only after we know that all prepare() and prepare_ordered() calls succeed, and
+must be done to know whether to call commit_ordered()/commit(), or do rollback.
+
+The existing method for this is TC_LOG::log_xid(). To make implementing group
+commit simpler to implement in a transaction coordinator and more efficient,
+we introduce a new method:
+
+ void group_log_xid(THD *first_thd);
+
+This method runs in the sequential part of group commit. It receives a list of
+transactions to perform log_xid() on, in the correct commit order. (Note that
+TC_LOG can do parallel parts of group commit in its own prepare() and commit()
+methods).
+
+This method can make it easier to implement the group commit in TC_LOG, as it
+gets directly the list of transactions in the right order. Without it, it
+might need to compute such order anyway in a prepare_ordered() method, and the
+server has to create this ordered list anyway to implement the order guarantee
+for prepare_ordered() and commit_ordered().
+
+This group_log_xid() method also is more efficient, as it avoids some
+inter-thread synchronisation. Since group_log_xid() is serialised, we can run
+it together with all the commit_ordered() method calls and need only a single
+sequential code section. With the log_xid() methods, we would need first a
+sequential part for the prepare_ordered() calls, then a parallel part with
+log_xid() calls (to not loose group commit ability for log_xid()), then again
+a sequential part for the commit_ordered() method calls.
+
+The extra synchronisation is needed, as each commit_ordered() call will have
+to wait for log_xid() in one thread (if log_xid() fails then commit_ordered()
+should not be called), and also wait for commit_ordered() to finish in all
+threads handling earlier commits. In effect we will need to bounce the
+execution from one thread to the other among all participants in the group
+commit.
+
+As a consequence of the group_log_xid() optimisation, handlers must be aware
+that the commit_ordered() call can happen in another thread than the one
+running commit() (so thread local storage is not available). This should not
+be a big issue as the THD is available for storing any needed information.
+
+Since group_log_xid() runs for multiple transactions in a single thread, it
+can not do error reporting (my_error()) as that relies on thread local
+storage. Instead it sets an error code in THD::xid_error, and if there is an
+error then later another method will be called (in correct thread context) to
+actually report the error:
+
+ int xid_delayed_error(THD *thd)
+
+The three new methods prepare_ordered(), group_log_xid(), and commit_ordered()
+are optional (as is xid_delayed_error). A storage engine or transaction
+coordinator is free to not implement them if they are not needed. In this case
+there will be no order guarantee for the corresponding stage of group commit
+for that engine. For example, InnoDB needs no ordering of the prepare phase,
+so can omit implementing prepare_ordered(); TC_LOG_MMAP needs no ordering at
+all, so does not need to implement any of them.
+
+Note in particular that all existing engines (/binlog implementations if they
+exist) will work unmodified (and also without any change in group commit
+facilities or commit order guaranteed).
+
+Using these new APIs, the work will be to
+
+ - In ha_commit_trans(), implement the correct semantics for the three new
+ calls.
+
+ - In XtraDB, use the new commit_ordered() call to remove the
+ prepare_commit_mutex (and resurrect group commit) without loosing the
+ consistency with binlog commit order.
+
+ - In log.cc (binlog module), implement group_log_xid() to do group commit of
+ multiple transactions to the binlog with a single shared fsync() call.
+
+-----------------------------------------------------------------------
+Some possible alternative for this worklog:
+
+ - We could eliminate the group_log_xid() method for a simpler API, at the
+ cost of extra synchronisation between threads to do in-order
+ commit_ordered() method calls. This would also allow to call
+ commit_ordered() in the correct thread context.
+
+ - Alternatively, we could eliminate log_xid() and require that all
+ transaction coordinators implement group_log_xid() instead, again for some
+ moderate simplification.
+
+ - At the moment there is no plugin actually using prepare_ordered(), so, it
+ could be removed from the design. But it fits in well, is efficient to
+ implement, and could be useful later (eg. for the requested feature of
+ releasing locks early in InnoDB).
+
+-----------------------------------------------------------------------
+Some possible follow-up projects after this is implemented:
+
+ - Add statistics about how efficient group commit is (#fsyncs/#commits in
+ each engine and binlog).
+
+ - Implement an XtraDB prepare_ordered() methods that can release row locks
+ early (Mark Callaghan from Facebook advocates this, but need to determine
+ exactly how to do this safely).
+
+ - Implement a new crash recovery algorithm that uses the consistent commit
+ ordering to need only fsync() for the binlog. At crash recovery, any
+ missing transactions in an engine is replayed from the correct point in the
+ binlog (this point must be stored transactionally inside the engine, as
+ XtraDB already does today).
+
+ - Implement that START TRANSACTION WITH CONSISTENT SNAPSHOT 1) really gets a
+ consistent snapshow, with same set of committed and not committed
+ transactions in all engines, 2) returns a corresponding consistent binlog
+ position. This should be easy by piggybacking on the synchronisation
+ implemented for ha_commit_trans().
+
+ - Use this in XtraBackup to get consistent binlog position without having to
+ block all updates with FLUSH TABLES WITH READ LOCK.
-=-=(Knielsen - Tue, 25 May 2010, 13:18)=-=-
High Level Description modified.
--- /tmp/wklog.116.old.14234 2010-05-25 13:18:07.000000000 +0000
+++ /tmp/wklog.116.new.14234 2010-05-25 13:18:07.000000000 +0000
@@ -21,3 +21,69 @@
http://kristiannielsen.livejournal.com/12408.html
http://kristiannielsen.livejournal.com/12553.html
+----
+
+Implementing group commit in MySQL faces some challenges from the handler
+plugin architecture:
+
+1. Because storage engine handlers have separate transaction log from the
+mysql binlog (and from each other), there are multiple fsync() calls per
+commit that need the group commit optimisation (2 per participating storage
+engine + 1 for binlog).
+
+2. The code handling commit is split in several places, in main server code
+and in storage engine code. With pluggable binlog it will be split even
+more. This requires a good abstract yet powerful API to be able to implement
+group commit simply and efficiently in plugins without the different parts
+having to rely on iternals of the others.
+
+3. We want the order of commits to be the same in all engines participating in
+multiple transactions. This requirement is the reason that InnoDB currently
+breaks group commit with the infamous prepare_commit_mutex.
+
+While currently there is no server guarantee to get same commit order in
+engines an binlog (except for the InnoDB prepare_commit_mutex hack), there are
+several reasons why this could be desirable:
+
+ - InnoDB hot backup needs to be able to extract a binlog position that is
+ consistent with the hot backup to be able to provision a new slave, and
+ this is impossible without imposing at least partial consistent ordering
+ between InnoDB and binlog.
+
+ - Other backup methods could have similar needs, eg. XtraBackup or
+ `mysqldump --single-transaction`, to have consistent commit order between
+ binlog and storage engines without having to do FLUSH TABLES WITH READ LOCK
+ or similar expensive blocking operation. (other backup methods, like LVM
+ snapshot, don't need consistent commit order, as they can restore
+ out-of-order commits during crash recovery using XA).
+
+ - If we have consistent commit order, we can think about optimising commit to
+ need only one fsync (for binlog); lost commits in storage engines can then
+ be recovered from the binlog at crash recovery by re-playing against the
+ engine from a particular point in the binlog.
+
+ - With consistent commit order, we can get better semantics for START
+ TRANSACTION WITH CONSISTENT SNAPSHOT with multi-engine transactions (and we
+ could even get it to return also a matching binlog position). Currently,
+ this "CONSISTENT SNAPSHOT" can be inconsistent among multiple storage
+ engines.
+
+ - In InnoDB, the performance in the presense of hotspots can be improved if
+ we can release row locks early in the commit phase, but this requires that we
+release them in
+ the same order as commits in the binlog to ensure consistency between
+ master and slaves.
+
+ - There was some discussions around Galera [1] synchroneous replication and
+ global transaction ID that it needed consistent commit order among
+ participating engines.
+
+ - I believe there could be other applications for guaranteed consistent
+ commit order, and that the architecture described in this worklog can
+ implement such guarantee with reasonable overhead.
+
+
+References:
+
+[1] Galera: http://www.codership.com/products/galera_replication
+
-=-=(Knielsen - Tue, 25 May 2010, 08:28)=-=-
More thoughts on and changes to the architecture. Got to something now that I am satisfied with and
that seems to be able to handle all issues.
Implement new prepare_ordered and commit_ordered handler methods and the logic in ha_commit_trans().
Implement TC_LOG::group_log_xid() method and logic in ha_commit_trans().
Implement XtraDB part, using commit_ordered() rather than prepare_commit_mutex.
Fix test suite failures.
Proof-of-concept patch series complete now.
Do initial benchmark, getting good results. With 64 threads, see 26x improvement in queries-per-sec.
Next step: write up the architecture description.
Worked 21 hours and estimate 0 hours remain (original estimate increased by 21 hours).
-=-=(Knielsen - Wed, 12 May 2010, 06:41)=-=-
Started work on a Quilt patch series, refactoring the binlog code to prepare for implementing the
group commit, and working on the design of group commit in parallel.
Found and fixed several problems in error handling when writing to binlog.
Removed redundant table map version locking.
Split binlog writing into two parts in preparations for group commit. When ready to write to the
binlog, threads enter a queue, and the first thread in the queue handles the binlog writing for
everyone. When it obtains the LOCK_log, it first loops over all threads, executing the first part of
binlog writing (the write(2) syscall essentially). It then runs the second part (fsync(2)
essentially) only once, and then wakes up the remaining threads in the queue.
Still to be done:
Finish the proof-of-concept group commit patch, by 1) implementing the prepare_fast() and
commit_fast() callbacks in handler.cc 2) move the binlog thread enqueue from log_xid() to
binlog_prepare_fast(), 3) move fast part of InnoDB commit to innobase_commit_fast(), removing the
prepare_commit_mutex().
Write up the final design in this worklog.
Evaluate the design to see if we can do better/different.
Think about possible next steps, such as releasing innodb row locks early (in
innobase_prepare_fast), and doing crash recovery by replaying transactions from the binlog (removing
the need for engine durability and 2 of 3 fsync() in commit).
Worked 28 hours and estimate 0 hours remain (original estimate increased by 28 hours).
-=-=(Serg - Mon, 26 Apr 2010, 14:10)=-=-
Observers changed: Serg
DESCRIPTION:
Currently, in order to ensure that the server can recover after a crash to a
state in which storage engines and binary log are consistent with each other,
it is necessary to use XA with durable commits for both storage engines
(innodb_flush_log_at_trx_commit=1) and binary log (sync_binlog=1).
This is _very_ expensive, since the server needs to do three fsync() operations
for every commit, as there is no working group commit when the binary log is
enabled.
The idea is to
- Implement/fix group commit to work properly with the binary log enabled.
- (Optionally) avoid the need to fsync() in the engine, and instead rely on
replaying any lost transactions from the binary log against the engine
during crash recovery.
For background see these articles:
http://kristiannielsen.livejournal.com/12254.html
http://kristiannielsen.livejournal.com/12408.html
http://kristiannielsen.livejournal.com/12553.html
----
Implementing group commit in MySQL faces some challenges from the handler
plugin architecture:
1. Because storage engine handlers have separate transaction logs from the
MySQL binlog (and from each other), there are multiple fsync() calls per
commit that need the group commit optimisation (2 per participating storage
engine + 1 for binlog).
2. The code handling commit is split in several places, in main server code
and in storage engine code. With pluggable binlog it will be split even
more. This requires a good abstract yet powerful API to be able to implement
group commit simply and efficiently in plugins without the different parts
having to rely on internals of the others.
3. We want the order of commits to be the same in all engines participating in
multiple transactions. This requirement is the reason that InnoDB currently
breaks group commit with the infamous prepare_commit_mutex.
While currently there is no server guarantee to get the same commit order in
engines and binlog (except for the InnoDB prepare_commit_mutex hack), there are
several reasons why this could be desirable:
- InnoDB hot backup needs to be able to extract a binlog position that is
consistent with the hot backup to be able to provision a new slave, and
this is impossible without imposing at least partial consistent ordering
between InnoDB and binlog.
- Other backup methods could have similar needs, eg. XtraBackup or
`mysqldump --single-transaction`, to have consistent commit order between
binlog and storage engines without having to do FLUSH TABLES WITH READ LOCK
or similar expensive blocking operation. (other backup methods, like LVM
snapshot, don't need consistent commit order, as they can restore
out-of-order commits during crash recovery using XA).
- If we have consistent commit order, we can think about optimising commit to
need only one fsync (for binlog); lost commits in storage engines can then
be recovered from the binlog at crash recovery by re-playing against the
engine from a particular point in the binlog.
- With consistent commit order, we can get better semantics for START
TRANSACTION WITH CONSISTENT SNAPSHOT with multi-engine transactions (and we
could even get it to return also a matching binlog position). Currently,
this "CONSISTENT SNAPSHOT" can be inconsistent among multiple storage
engines.
- In InnoDB, the performance in the presence of hotspots can be improved if
we can release row locks early in the commit phase, but this requires that
we release them in the same order as commits in the binlog to ensure
consistency between master and slaves.
- There have been some discussions around Galera [1] synchronous replication
and global transaction ID suggesting that it needs consistent commit order
among participating engines.
- I believe there could be other applications for guaranteed consistent
commit order, and that the architecture described in this worklog can
implement such guarantee with reasonable overhead.
References:
[1] Galera: http://www.codership.com/products/galera_replication
HIGH-LEVEL SPECIFICATION:
The basic idea in group commit is that multiple threads, each handling one
transaction, prepare for commit and then queue up together waiting to do an
fsync() on the transaction log. Then once the log is available, a single
thread does the fsync() + other necessary book-keeping for all of the threads
at once. After this, the single thread signals the other threads that it's
done and they can finish up and return success (or failure) from the commit
operation.
So group commit has a parallel part, and a sequential part. So we need a
facility for engines/binlog to participate in both the parallel and the
sequential part.
To do this, we add two new handlerton methods:
int (*prepare_ordered)(handlerton *hton, THD *thd, bool all);
void (*commit_ordered)(handlerton *hton, THD *thd, bool all);
The idea is that the existing prepare() and commit() methods run in the
parallel part of group commit, and the new prepare_ordered() and
commit_ordered() run in the sequential part.
The prepare_ordered() method is called after prepare(). The order of
transactions that call into prepare_ordered() is guaranteed to be the same among
all storage engines and binlog, and it is serialised so no two calls can be
running inside the same engine at the same time.
The commit_ordered() method is called before commit(), and similarly is
guaranteed to have same transaction order in all participants, and to be
serialised within one engine.
As the prepare_ordered() and commit_ordered() calls are serialised, the idea
is that handlers should do the minimum amount of work needed in these calls,
leaving most of the work (eg. fsync() ...) to prepare() and commit().
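For a storage engine, opting in is then just a matter of filling in the new
handlerton slots at initialisation time. A sketch with an imaginary engine
(the handlerton here is reduced to the four relevant slots, and all of the
example_* functions are invented):
  struct THD;
  struct handlerton
  {
    int  (*prepare)(handlerton *hton, THD *thd, bool all);
    int  (*commit)(handlerton *hton, THD *thd, bool all);
    int  (*prepare_ordered)(handlerton *hton, THD *thd, bool all);
    void (*commit_ordered)(handlerton *hton, THD *thd, bool all);
  };
  static int  example_prepare(handlerton *, THD *, bool)         { return 0; }
  static int  example_commit(handlerton *, THD *, bool)          { return 0; }
  static int  example_prepare_ordered(handlerton *, THD *, bool) { return 0; }
  static void example_commit_ordered(handlerton *, THD *, bool)  { }
  static int example_init(void *p)
  {
    handlerton *hton= (handlerton *) p;
    hton->prepare=         example_prepare;          // parallel part
    hton->commit=          example_commit;           // parallel part
    hton->prepare_ordered= example_prepare_ordered;  // sequential, keep short
    hton->commit_ordered=  example_commit_ordered;   // sequential, keep short
    return 0;
  }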
As a concrete example, for InnoDB the commit_ordered() method will do the
first part of commit that fixes the commit order in the transaction log
buffer, and the commit() method will write the log to disk and fsync()
it. This split already exists inside the InnoDB code, running before
respectively after releasing the prepare_commit_mutex.
In addition, the XA transaction coordinator (TC_LOG) is special, since it is
the one responsible for deciding whether to commit or rollback the
transaction. For this we need an extra method, since this decision can be done
only after we know that all prepare() and prepare_ordered() calls succeed, and
must be done to know whether to call commit_ordered()/commit(), or do rollback.
The existing method for this is TC_LOG::log_xid(). To make group commit
simpler and more efficient to implement in a transaction coordinator,
we introduce a new method:
void group_log_xid(THD *first_thd);
This method runs in the sequential part of group commit. It receives a list of
transactions to perform log_xid() on, in the correct commit order. (Note that
TC_LOG can do parallel parts of group commit in its own prepare() and commit()
methods).
This method can make it easier to implement the group commit in TC_LOG, as it
gets directly the list of transactions in the right order. Without it, it
might need to compute such order anyway in a prepare_ordered() method, and the
server has to create this ordered list anyway to implement the order guarantee
for prepare_ordered() and commit_ordered().
This group_log_xid() method also is more efficient, as it avoids some
inter-thread synchronisation. Since group_log_xid() is serialised, we can run
it together with all the commit_ordered() method calls and need only a single
sequential code section. With the log_xid() methods, we would need first a
sequential part for the prepare_ordered() calls, then a parallel part with
log_xid() calls (to not lose group commit ability for log_xid()), then again
a sequential part for the commit_ordered() method calls.
The extra synchronisation is needed, as each commit_ordered() call will have
to wait for log_xid() in one thread (if log_xid() fails then commit_ordered()
should not be called), and also wait for commit_ordered() to finish in all
threads handling earlier commits. In effect we will need to bounce the
execution from one thread to the other among all participants in the group
commit.
As a consequence of the group_log_xid() optimisation, handlers must be aware
that the commit_ordered() call can happen in another thread than the one
running commit() (so thread local storage is not available). This should not
be a big issue as the THD is available for storing any needed information.
Since group_log_xid() runs for multiple transactions in a single thread, it
cannot do error reporting (my_error()) as that relies on thread local
storage. Instead it sets an error code in THD::xid_error, and if there is an
error then later another method will be called (in correct thread context) to
actually report the error:
int xid_delayed_error(THD *thd)
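A sketch of that division of labour (log_one_xid() is an assumed stand-in for
the coordinator's per-transaction logging, and THD is reduced to the fields
involved):
  struct THD
  {
    THD *next_commit_ordered;
    int  xid_error;
  };
  int log_one_xid(THD *thd);  // assumed: logs one XID, returns 0 on success
  // Runs once, in the leader thread, for all transactions in the group.
  void group_log_xid(THD *first_thd)
  {
    for (THD *thd= first_thd; thd; thd= thd->next_commit_ordered)
      thd->xid_error= log_one_xid(thd);  // no my_error() allowed here
  }
  // Runs later, in each transaction's own thread, where thread local
  // storage (and thus my_error()) is available again.
  int xid_delayed_error(THD *thd)
  {
    // Report thd->xid_error via my_error() here.
    return thd->xid_error;
  }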
The three new methods prepare_ordered(), group_log_xid(), and commit_ordered()
are optional (as is xid_delayed_error). A storage engine or transaction
coordinator is free to not implement them if they are not needed. In this case
there will be no order guarantee for the corresponding stage of group commit
for that engine. For example, InnoDB needs no ordering of the prepare phase,
so can omit implementing prepare_ordered(); TC_LOG_MMAP needs no ordering at
all, so does not need to implement any of them.
Note in particular that all existing engines (/binlog implementations if they
exist) will work unmodified (and also without any change in group commit
facilities or commit order guarantees).
Using these new APIs, the work will be to
- In ha_commit_trans(), implement the correct semantics for the three new
calls.
- In XtraDB, use the new commit_ordered() call to remove the
prepare_commit_mutex (and resurrect group commit) without losing the
consistency with binlog commit order.
- In log.cc (binlog module), implement group_log_xid() to do group commit of
multiple transactions to the binlog with a single shared fsync() call.
-----------------------------------------------------------------------
Some possible alternatives for this worklog:
- We could eliminate the group_log_xid() method for a simpler API, at the
cost of extra synchronisation between threads to do in-order
commit_ordered() method calls. This would also allow calling
commit_ordered() in the correct thread context.
- Alternatively, we could eliminate log_xid() and require that all
transaction coordinators implement group_log_xid() instead, again for some
moderate simplification.
- At the moment there is no plugin actually using prepare_ordered(), so it
could be removed from the design. But it fits in well, is efficient to
implement, and could be useful later (eg. for the requested feature of
releasing locks early in InnoDB).
-----------------------------------------------------------------------
Some possible follow-up projects after this is implemented:
- Add statistics about how efficient group commit is (#fsyncs/#commits in
each engine and binlog).
- Implement an XtraDB prepare_ordered() method that can release row locks
early (Mark Callaghan from Facebook advocates this, but we need to determine
exactly how to do this safely).
- Implement a new crash recovery algorithm that uses the consistent commit
ordering to need fsync() only for the binlog. At crash recovery, any
missing transactions in an engine are replayed from the correct point in the
binlog (this point must be stored transactionally inside the engine, as
XtraDB already does today).
- Implement that START TRANSACTION WITH CONSISTENT SNAPSHOT 1) really gets a
consistent snapshot, with the same set of committed and uncommitted
transactions in all engines, and 2) returns a corresponding consistent binlog
position. This should be easy by piggybacking on the synchronisation
implemented for ha_commit_trans().
- Use this in XtraBackup to get consistent binlog position without having to
block all updates with FLUSH TABLES WITH READ LOCK.
LOW-LEVEL DESIGN:
1. Changes for ha_commit_trans()
The core of the commit code is in the function ha_commit_trans() (and in
commit_one_phase() which is called from it). This must be extended to use the
new prepare_ordered(), group_log_xid(), and commit_ordered() calls.
1.1 Atomic queue of committing transactions
To keep the right commit order among participants, we put transactions into a
queue. The operations on the queue are non-locking:
- Insert THD at the head of the queue, and return old queue.
THD *enqueue_atomic(THD *thd)
- Fetch (and delete) the whole queue.
THD *atomic_grab_reverse_queue()
These are simple to implement with atomic compare-and-set. Note that there is
no ABA problem [2], as we do not delete individual elements from the queue; we
grab the whole queue and replace it with NULL.
A transaction enters the queue when it does prepare_ordered(). This way, the
scheduling order for prepare_ordered() calls is what determines the sequence
in the queue and effectively the commit order.
The queue is grabbed by the code doing group_log_xid() and commit_ordered()
calls. The queue is passed directly to group_log_xid(), and afterwards
iterated to do individual commit_ordered() calls.
Using a lock-free queue allows prepare_ordered() (for one transaction) to run
in parallel with commit_ordered() (for another transaction), increasing
potential parallelism.
The queue is simply a linked list of THD objects, linked through a
THD::next_commit_ordered field. Since we add at the head of the queue, the
list is actually in reverse order, so must be reversed when we grab and delete
it.
The reason that enqueue_atomic() returns the old queue is so that we can check
if an insert goes to the head of the queue. The thread at the head of the
queue will do the sequential part of group commit for everyone.
1.2 Locks
1.2.1 Global LOCK_prepare_ordered
This lock is taken to serialise calls to prepare_ordered(). Note that
effectively, the commit order is decided by the order in which threads obtain
this lock.
1.2.2 Global LOCK_group_commit and COND_group_commit
This lock is used to protect the serial part of group commit. It is taken
around the code where we grab the queue, call group_log_xid() on the queue,
and call commit_ordered() on each element of the queue, to make sure they
happen serialised and in consistent order. It also protects the variable
group_commit_queue_busy, which is used when not using group_log_xid() to delay
running over a new queue until the first queue is completely done.
1.2.3 Global LOCK_commit_ordered
This lock is taken around calls to commit_ordered(), to ensure they happen
serialised.
1.2.4 Per-thread thd->LOCK_commit_ordered and thd->COND_commit_ordered
This lock protects the thd->group_commit_ready variable, as well as the
condition variable used to wake up threads after log_xid() and
commit_ordered() finishes.
1.2.5 Global LOCK_group_commit_queue
This is only used on platforms with no native compare-and-set operations, to
make the queue operations atomic.
1.3 Commit algorithm.
This is the basic algorithm, simplified by
- omitting some error handling
- omitting looping over all handlers when invoking handler methods
- omitting some possible optimisations when not all calls are needed (see
next section)
- omitting the case where no group_log_xid() is used (see below)
---- BEGIN ALGORITHM ----
ht->prepare()
// Call prepare_ordered() and enqueue in correct commit order
lock(LOCK_prepare_ordered)
ht->prepare_ordered()
old_queue= enqueue_atomic(thd)
thd->group_commit_ready= FALSE
is_group_commit_leader= (old_queue == NULL)
unlock(LOCK_prepare_ordered)
if (is_group_commit_leader)
// The first in queue handles group commit for everyone
lock(LOCK_group_commit)
// Wait while queue is busy, see below for when this occurs
while (group_commit_queue_busy)
cond_wait(COND_group_commit)
// Grab and reverse the queue to get correct order of transactions
queue= atomic_grab_reverse_queue()
// This call will set individual error codes in thd->xid_error
// It also sets the cookie for unlog() in thd->xid_cookie
group_log_xid(queue)
lock(LOCK_commit_ordered)
for (other IN queue)
if (!other->xid_error)
ht->commit_ordered()
unlock(LOCK_commit_ordered)
unlock(LOCK_group_commit)
// Now we are done, so wake up all the others.
for (other IN TAIL(queue))
lock(other->LOCK_commit_ordered)
other->group_commit_ready= TRUE
cond_signal(other->COND_commit_ordered)
unlock(other->LOCK_commit_ordered)
else
// If not the leader, just wait until leader did the work for us.
lock(thd->LOCK_commit_ordered)
while (!thd->group_commit_ready)
cond_wait(thd->LOCK_commit_ordered, thd->COND_commit_ordered)
unlock(thd->LOCK_commit_ordered)
// Finally do any error reporting now that we're back in own thread.
if (thd->xid_error)
xid_delayed_error(thd)
else
ht->commit(thd)
unlog(thd->xid_cookie, thd->xid)
---- END ALGORITHM ----
If the transaction coordinator does not support group_log_xid(), we have to do
things differently. In this case after the serialisation point at
prepare_ordered(), we have to parallelise again when running log_xid()
(otherwise we would lose group commit). But then when log_xid() is done, we
have to serialise again to check for any error and call commit_ordered() in
correct sequence for any transaction where log_xid() did not return error.
The central part of the algorithm in this case (when using log_xid()) is:
---- BEGIN ALGORITHM ----
cookie= log_xid(thd)
error= (cookie == 0)
if (is_group_commit_leader)
// The first to enqueue grabs the queue and runs first.
// But we must wait until a previous queue run is fully done.
lock(LOCK_group_commit)
while (group_commit_queue_busy)
cond_wait(COND_group_commit)
queue= atomic_grab_reverse_queue()
// The queue will be busy until last thread in it is done.
group_commit_queue_busy= TRUE
unlock(LOCK_group_commit)
else
// Not first in queue -> wait for previous one to wake us up.
lock(thd->LOCK_commit_ordered)
while (!thd->group_commit_ready)
cond_wait(thd->LOCK_commit_ordered, thd->COND_commit_ordered)
unlock(thd->LOCK_commit_ordered)
if (!error) // Only if log_xid() was successful
lock(LOCK_commit_ordered)
ht->commit_ordered()
unlock(LOCK_commit_ordered)
// Wake up the next thread, and release the queue if we are the last.
next= thd->next_commit_ordered
if (next)
lock(next->LOCK_commit_ordered)
next->group_commit_ready= TRUE
cond_signal(next->COND_commit_ordered)
unlock(next->LOCK_commit_ordered)
else
lock(LOCK_group_commit)
group_commit_queue_busy= FALSE
unlock(LOCK_group_commit)
---- END ALGORITHM ----
There are a number of locks taken in the algorithm, but in the group_log_xid()
case most of them should be uncontended most of the time. The
LOCK_group_commit of course will be contended, as new threads queue up waiting
for the previous group commit (and binlog fsync()) to finish so they can do
the next group commit. This is the whole point of implementing group commit.
The LOCK_prepare_ordered and LOCK_commit_ordered mutexes should not be much
contended as long as handlers follow the intention of having the corresponding
handler calls execute quickly.
The per-thread LOCK_commit_ordered mutexes should not be contended; they are
only used to wake up a sleeping thread.
1.4 Optimisations when not using all three new calls
The prepare_ordered(), group_log_xid(), and commit_ordered() methods are
optional, and if not implemented by a particular handler/transaction
coordinator, we can optimise the algorithm to take advantage of not having to
keep ordering for the missing parts.
If there is no prepare_ordered(), then we need not take the
LOCK_prepare_ordered mutex.
If there is no commit_ordered(), then we need not take the LOCK_commit_ordered
mutex.
If there is no group_log_xid(), then we only need the queue to ensure same
ordering of transactions for commit_ordered() as for prepare_ordered(). Thus,
if either of these (or both) are also not present, we do not need to use the
queue at all.
2. Binlog code changes (log.cc)
The bulk of the work needed for the binary log is to extend the code to allow
group commit to the log. Unlike InnoDB/XtraDB, there is no existing support
inside the binlog code for group commit.
The existing code runs most of the write + fsync to the binary log under the
global LOCK_log mutex, preventing any group commit.
To enable group commit, this code must be split into two parts:
- one part that runs per transaction, re-writing the embedded event positions
for the correct offset, and writing this into the in-memory log cache.
- another part that writes a set of transactions to the disk, and runs
fsync().
Then in group_log_xid(), we can run the first part in a loop over all the
transactions in the passed-in queue, and run the second part only once.
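Schematically, the binlog's group_log_xid() would then look like this
(binlog_write_trx_cache() and binlog_flush_and_sync() are hypothetical names
for the two parts described above; the real refactored functions in log.cc
will differ):
  void binlog_group_log_xid(THD *queue)   /* sketch only */
  {
    /* Part one, per transaction: fix up the embedded event positions and
       write the transaction's in-memory log cache into the binlog. */
    for (THD *thd= queue; thd; thd= thd->next_commit_ordered)
      thd->xid_error= binlog_write_trx_cache(thd);
    /* Part two, once for the whole group: write to disk and fsync(). */
    binlog_flush_and_sync();
  }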
The binlog code also has other code paths that write into the binlog,
eg. non-transactional statements. These have to be adapted also to work with
the new code.
In order to get some group commit facility for these also, we change that part
of the code in a similar way to ha_commit_trans. We keep another,
binlog-internal queue of such non-transactional binlog writes, and such writes
queue up here before sleeping on the LOCK_log mutex. Once a thread obtains the
LOCK_log, it loops over the queue for the fast part, and does the slow part
once, then finally wakes up the others in the queue.
In the transactional case in group_log_xid(), before we run the passed-in
queue, we add any members found in the binlog-internal queue. This allows
these non-transactional writes to share the group commit.
However, in the case where it is a non-transactional write that gets the
LOCK_log, the transactions from the ha_commit_trans() queue will not be
able to take part (they will have to wait for their turn to do another
fsync). It seems difficult to cleanly let the binlog code grab the queue from
out of the ha_commit_trans() algorithm. I think the group commit is mostly
useful in transactional workloads anyway (non-transactional engines will lose
data anyway in case of crash, so why fsync() after each transaction?)
3. XtraDB changes (ha_innodb.cc)
The changes needed in XtraDB are comparatively simple, as XtraDB already
implements group commit; it just needs to be enabled with the new
commit_ordered() call.
The existing commit() method is already logically in two parts. The first part
runs under the prepare_commit_mutex and must run in the same order as binlog
commit. This part needs to be moved to commit_ordered(). The second part runs
after releasing prepare_commit_mutex and does the transaction log write+fsync;
it can remain in commit().
Then the prepare_commit_mutex is removed (along with the
enable_unsafe_group_commit XtraDB option that was used to disable it).
There are two asserts that check that the thread running the first part of
XtraDB commit is the same as the thread running the other operations for the
transaction. These have to be removed (as commit_ordered() can run in a
different thread). Also, error reporting with sql_print_error() has to be
delayed until commit() time.
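Schematically, the resulting split would look as follows
(innobase_fix_commit_order() and innobase_write_and_sync_log() are
hypothetical names for the two existing halves, not the actual XtraDB code):
  /* First half of the old commit(), formerly run under
     prepare_commit_mutex: fix the commit order in the transaction log
     buffer. Called serialised, possibly from a different thread. */
  void innobase_commit_ordered(handlerton *hton, THD *thd, bool all)
  {
    innobase_fix_commit_order(thd);
  }
  /* Second half stays in commit(): transaction log write + fsync(),
     running in parallel with other committing transactions. */
  int innobase_commit(handlerton *hton, THD *thd, bool all)
  {
    return innobase_write_and_sync_log(thd);
  }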
4. Proof-of-concept implementation
There is a proof-of-concept implementation of this architecture, in the form
of a quilt patch series [3].
A quick benchmark was done, with sync_binlog=1 and
innodb_flush_log_at_trx_commit=1, using 64 parallel threads doing single-row
transactions against one table.
Without the patch, we get only 25 queries per second.
With the patch, we get 650 queries per second, a 26x improvement: without
group commit every transaction pays for its own fsync() calls, so throughput
is capped by the disk's fsync rate, while with the patch all queued
transactions share each fsync().
5. Open issues/tasks
5.1 XA / other prepare() and commit() call sites.
Check that user-level XA is handled correctly and working. And covered
sufficiently with tests. Also check that any other calls of ha->prepare() and
ha->commit() outside of ha_commit_trans() are handled correctly.
5.2 Testing
This worklog needs additions to the test suite, including error inserts to
check error handling, and synchronisation points to check thread parallelism
correctness.
6. Alternative implementations
- The binlog code maintains its own extra atomic transaction queue so that
non-transactional binlog writes can share group commit with transactional
ones. Alternatively, we could ignore this issue and just give up on group
commit for non-transactional statements, for some code simplification.
- The binlog code has two ways to prepare end_event and similar, one that
uses stack-allocation, and another for when stack allocation is not
possible that uses thd->mem_root. Probably the overhead of thd->mem_root is
so small that it would make sense to use the same code for both cases.
- Instead of adding extra fields to THD, we could allocate a separate
structure on the thd->mem_root with the required extra fields (including
the THD pointer). This would seem to require initialising mutexes at every
commit though.
- It would probably be a good idea to implement TC_LOG_MMAP::group_log_xid()
(should not be hard).
-----------------------------------------------------------------------
References:
[2] https://secure.wikimedia.org/wikipedia/en/wiki/ABA_problem
[3] https://knielsen-hq.org/maria/patches.mwl116/
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v3.5.9)
[Maria-developers] Progress (by Knielsen): Efficient group commit for binary log (116)
by worklog-noreply@askmonty.org 07 Jun '10
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: Efficient group commit for binary log
CREATION DATE..: Mon, 26 Apr 2010, 13:28
SUPERVISOR.....: Knielsen
IMPLEMENTOR....: Knielsen
COPIES TO......: Serg
CATEGORY.......: Server-RawIdeaBin
TASK ID........: 116 (http://askmonty.org/worklog/?tid=116)
VERSION........: Server-9.x
STATUS.........: Assigned
PRIORITY.......: 60
WORKED HOURS...: 71
ESTIMATE.......: 0 (hours remain)
ORIG. ESTIMATE.: 0
PROGRESS NOTES:
-=-=(Knielsen - Mon, 07 Jun 2010, 07:15)=-=-
Some more benchmarking.
Blog about results and architecture.
Fix two bugs in the proof-of-concept patch, races that causes hangs in benchmarks.
Worked 11 hours and estimate 0 hours remain (original estimate increased by 11 hours).
-=-=(Guest - Tue, 01 Jun 2010, 14:20)=-=-
Status updated.
--- /tmp/wklog.116.old.32652 2010-06-01 14:20:15.000000000 +0000
+++ /tmp/wklog.116.new.32652 2010-06-01 14:20:15.000000000 +0000
@@ -1 +1 @@
-Un-Assigned
+Assigned
-=-=(Knielsen - Mon, 31 May 2010, 06:48)=-=-
Finish first architecture draft (changed my mind a number of times before I was satisfied).
Write up architecture in worklog.
Fix remaining test failures in proof-of-concept patch + implement xtradb part.
Run some benchmarks on proof-of-concept implementation.
Worked 11 hours and estimate 0 hours remain (original estimate increased by 11 hours).
-=-=(Knielsen - Tue, 25 May 2010, 13:19)=-=-
Low Level Design modified.
--- /tmp/wklog.116.old.14255 2010-05-25 13:19:00.000000000 +0000
+++ /tmp/wklog.116.new.14255 2010-05-25 13:19:00.000000000 +0000
@@ -1 +1,363 @@
+1. Changes for ha_commit_trans()
+
+The gut of the code for commit is in the function ha_commit_trans() (and in
+commit_one_phase() which is called from it). This must be extended to use the
+new prepare_ordered(), group_log_xid(), and commit_ordered() calls.
+
+1.1 Atomic queue of committing transactions
+
+To keep the right commit order among participants, we put transactions into a
+queue. The operations on the queue are non-locking:
+
+ - Insert THD at the head of the queue, and return old queue.
+
+ THD *enqueue_atomic(THD *thd)
+
+ - Fetch (and delete) the whole queue.
+
+ THD *atomic_grab_reverse_queue()
+
+These are simple to implement with atomic compare-and-set. Note that there is
+no ABA problem [2], as we do not delete individual elements from the queue, we
+grab the whole queue and replace it with NULL.
+
+A transaction enters the queue when it does prepare_ordered(). This way, the
+scheduling order for prepare_ordered() calls is what determines the sequence
+in the queue and effectively the commit order.
+
+The queue is grabbed by the code doing group_log_xid() and commit_ordered()
+calls. The queue is passed directly to group_log_xid(), and afterwards
+iterated to do individual commit_ordered() calls.
+
+Using a lock-free queue allows prepare_ordered() (for one transaction) to run
+in parallel with commit_ordered (in another transaction), increasing potential
+parallelism.
+
+The queue is simply a linked list of THD objects, linked through a
+THD::next_commit_ordered field. Since we add at the head of the queue, the
+list is actually in reverse order, so must be reversed when we grab and delete
+it.
+
+The reason that enqueue_atomic() returns the old queue is so that we can check
+if an insert goes to the head of the queue. The thread at the head of the
+queue will do the sequential part of group commit for everyone.
+
+
+1.2 Locks
+
+1.2.1 Global LOCK_prepare_ordered
+
+This lock is taken to serialise calls to prepare_ordered(). Note that
+effectively, the commit order is decided by the order in which threads obtain
+this lock.
+
+
+1.2.2 Global LOCK_group_commit and COND_group_commit
+
+This lock is used to protect the serial part of group commit. It is taken
+around the code where we grab the queue, call group_log_xid() on the queue,
+and call commit_ordered() on each element of the queue, to make sure they
+happen serialised and in consistent order. It also protects the variable
+group_commit_queue_busy, which is used when not using group_log_xid() to delay
+running over a new queue until the first queue is completely done.
+
+
+1.2.3 Global LOCK_commit_ordered
+
+This lock is taken around calls to commit_ordered(), to ensure they happen
+serialised.
+
+
+1.2.4 Per-thread thd->LOCK_commit_ordered and thd->COND_commit_ordered
+
+This lock protects the thd->group_commit_ready variable, as well as the
+condition variable used to wake up threads after log_xid() and
+commit_ordered() finishes.
+
+
+1.2.5 Global LOCK_group_commit_queue
+
+This is only used on platforms with no native compare-and-set operations, to
+make the queue operations atomic.
+
+
+1.3 Commit algorithm.
+
+This is the basic algorithm, simplified by
+
+ - omitting some error handling
+
+ - omitting looping over all handlers when invoking handler methods
+
+ - omitting some possible optimisations when not all calls needed (see next
+ section).
+
+ - Omitting the case where no group_log_xid() is used, see below.
+
+---- BEGIN ALGORITHM ----
+ ht->prepare()
+
+ // Call prepare_ordered() and enqueue in correct commit order
+ lock(LOCK_prepare_ordered)
+ ht->prepare_ordered()
+ old_queue= enqueue_atomic(thd)
+ thd->group_commit_ready= FALSE
+ is_group_commit_leader= (old_queue == NULL)
+ unlock(LOCK_prepare_ordered)
+
+ if (is_group_commit_leader)
+
+ // The first in queue handles group commit for everyone
+
+ lock(LOCK_group_commit)
+ // Wait while queue is busy, see below for when this occurs
+ while (group_commit_queue_busy)
+ cond_wait(COND_group_commit)
+
+ // Grab and reverse the queue to get correct order of transactions
+ queue= atomic_grab_reverse_queue()
+
+ // This call will set individual error codes in thd->xid_error
+ // It also sets the cookie for unlog() in thd->xid_cookie
+ group_log_xid(queue)
+
+ lock(LOCK_commit_ordered)
+ for (other IN queue)
+ if (!other->xid_error)
+ ht->commit_ordered()
+ unlock(LOCK_commit_ordered)
+
+ unlock(LOCK_group_commit)
+
+ // Now we are done, so wake up all the others.
+ for (other IN TAIL(queue))
+ lock(other->LOCK_commit_ordered)
+ other->group_commit_ready= TRUE
+ cond_signal(other->COND_commit_ordered)
+ unlock(other->LOCK_commit_ordered)
+ else
+ // If not the leader, just wait until leader did the work for us.
+ lock(thd->LOCK_commit_ordered)
+ while (!thd->group_commit_ready)
+ cond_wait(thd->LOCK_commit_ordered, thd->COND_commit_ordered)
+    unlock(thd->LOCK_commit_ordered)
+
+ // Finally do any error reporting now that we're back in own thread.
+ if (thd->xid_error)
+ xid_delayed_error(thd)
+ else
+ ht->commit(thd)
+ unlog(thd->xid_cookie, thd->xid)
+---- END ALGORITHM ----
+
+If the transaction coordinator does not support group_log_xid(), we have to do
+things differently. In this case after the serialisation point at
+prepare_ordered(), we have to parallelise again when running log_xid()
+(otherwise we would lose group commit). But then when log_xid() is done, we
+have to serialise again to check for any error and call commit_ordered() in
+correct sequence for any transaction where log_xid() did not return error.
+
+The central part of the algorithm in this case (when using log_xid()) is:
+
+---- BEGIN ALGORITHM ----
+ cookie= log_xid(thd)
+ error= (cookie == 0)
+
+ if (is_group_commit_leader)
+
+ // The first to enqueue grabs the queue and runs first.
+ // But we must wait until a previous queue run is fully done.
+
+ lock(LOCK_group_commit)
+ while (group_commit_queue_busy)
+ cond_wait(COND_group_commit)
+ queue= atomic_grab_reverse_queue()
+ // The queue will be busy until last thread in it is done.
+ group_commit_queue_busy= TRUE
+ unlock(LOCK_group_commit)
+ else
+ // Not first in queue -> wait for previous one to wake us up.
+ lock(thd->LOCK_commit_ordered)
+ while (!thd->group_commit_ready)
+ cond_wait(thd->LOCK_commit_ordered, thd->COND_commit_ordered)
+    unlock(thd->LOCK_commit_ordered)
+
+ if (!error) // Only if log_xid() was successful
+ lock(LOCK_commit_ordered)
+ ht->commit_ordered()
+ unlock(LOCK_commit_ordered)
+
+  // Wake up the next thread, and release the queue if we are the last.
+ next= thd->next_commit_ordered
+
+ if (next)
+ lock(next->LOCK_commit_ordered)
+ next->group_commit_ready= TRUE
+ cond_signal(next->COND_commit_ordered)
+ unlock(next->LOCK_commit_ordered)
+ else
+ lock(LOCK_group_commit)
+ group_commit_queue_busy= FALSE
+ unlock(LOCK_group_commit)
+---- END ALGORITHM ----
+
+There are a number of locks taken in the algorithm, but in the group_log_xid()
+case most of them should be uncontended most of the time. The
+LOCK_group_commit of course will be contended, as new threads queue up waiting
+for the previous group commit (and binlog fsync()) to finish so they can do
+the next group commit. This is the whole point of implementing group commit.
+
+The LOCK_prepare_ordered and LOCK_commit_ordered mutexes should not be much
+contended as long as handlers follow the intention of having the corresponding
+handler calls execute quickly.
+
+The per-thread LOCK_commit_ordered mutexes should not be contended; they are
+only used to wake up a sleeping thread.
+
+
+1.4 Optimisations when not using all three new calls
+
+
+The prepare_ordered(), group_log_xid(), and commit_ordered() methods are
+optional, and if not implemented by a particular handler/transaction
+coordinator, we can optimise the algorithm to take advantage of not having to
+keep ordering for the missing parts.
+
+If there is no prepare_ordered(), then we need not take the
+LOCK_prepare_ordered mutex.
+
+If there is no commit_ordered(), then we need not take the LOCK_commit_ordered
+mutex.
+
+If there is no group_log_xid(), then we only need the queue to ensure same
+ordering of transactions for commit_ordered() as for prepare_ordered(). Thus,
+if either of these (or both) are also not present, we do not need to use the
+queue at all.
+
+
+2. Binlog code changes (log.cc)
+
+
+The bulk of the work needed for the binary log is to extend the code to allow
+group commit to the log. Unlike InnoDB/XtraDB, there is no existing support
+inside the binlog code for group commit.
+
+The existing code runs most of the write + fsync to the binary log under the
+global LOCK_log mutex, preventing any group commit.
+
+To enable group commit, this code must be split into two parts:
+
+ - one part that runs per transaction, re-writing the embedded event positions
+ for the correct offset, and writing this into the in-memory log cache.
+
+ - another part that writes a set of transactions to the disk, and runs
+ fsync().
+
+Then in group_log_xid(), we can run the first part in a loop over all the
+transactions in the passed-in queue, and run the second part only once.
+
+The binlog code also has other code paths that write into the binlog,
+eg. non-transactional statements. These have to be adapted also to work with
+the new code.
+
+In order to get some group commit facility for these also, we change that part
+of the code in a similar way to ha_commit_trans. We keep another,
+binlog-internal queue of such non-transactional binlog writes, and such writes
+queue up here before sleeping on the LOCK_log mutex. Once a thread obtains the
+LOCK_log, it loops over the queue for the fast part, and does the slow part
+once, then finally wakes up the others in the queue.
+
+In the transactional case in group_log_xid(), before we run the passed-in
+queue, we add any members found in the binlog-internal queue. This allows
+these non-transactional writes to share the group commit.
+
+However, in the case where it is a non-transactional write that gets the
+LOCK_log, the transactions from the ha_commit_trans() queue will not be
+able to take part (they will have to wait for their turn to do another
+fsync). It seems difficult to cleanly let the binlog code grab the queue from
+out of the ha_commit_trans() algorithm. I think the group commit is mostly
+useful in transactional workloads anyway (non-transactional engines will lose
+data anyway in case of crash, so why fsync() after each transaction?)
+
+
+3. XtraDB changes (ha_innodb.cc)
+
+The changes needed in XtraDB are comparatively simple, as XtraDB already
+implements group commit, it just needs to be enabled with the new
+commit_ordered() call.
+
+The existing commit() method already is logically in two parts. The first part
+runs under the prepare_commit_mutex() and must be run in same order as binlog
+commit. This part needs to be moved to commit_ordered(). The second part runs
+after releasing prepare_commit_mutex and does transaction log write+fsync; it
+can remain.
+
+Then the prepare_commit_mutex is removed (and the enable_unsafe_group_commit
+XtraDB option to disable it).
+
+There are two asserts that check that the thread running the first part of
+XtraDB commit is the same as the thread running the other operations for the
+transaction. These have to be removed (as commit_ordered() can run in a
+different thread). Also an error reporting with sql_print_error() has to be
+delayed until commit() time.
+
+
+4. Proof-of-concept implementation
+
+There is a proof-of-concept implementation of this architecture, in the form
+of a quilt patch series [3].
+
+A quick benchmark was done, with sync_binlog=1 and
+innodb_flush_log_at_trx_commit=1. 64 parallel threads doing single-row
+transactions against one table.
+
+Without the patch, we get only 25 queries per second.
+
+With the patch, we get 650 queries per second.
+
+
+5. Open issues/tasks
+
+5.1 XA / other prepare() and commit() call sites.
+
+Check that user-level XA is handled correctly and working. And covered
+sufficiently with tests. Also check that any other calls of ha->prepare() and
+ha->commit() outside of ha_commit_trans() are handled correctly.
+
+5.2 Testing
+
+This worklog needs additions to the test suite, including error inserts to
+check error handling, and synchronisation points to check thread parallelism
+correctness.
+
+
+6. Alternative implementations
+
+ - The binlog code maintains its own extra atomic transaction queue to handle
+ non-transactional commits in a good way together with transactional (with
+ respect to group commit). Alternatively, we could ignore this issue and
+ just give up on group commit for non-transactional statements, for some
+ code simplifications.
+
+ - The binlog code has two ways to prepare end_event and similar, one that
+ uses stack-allocation, and another for when stack allocation is not
+ possible that uses thd->mem_root. Probably the overhead of thd->mem_root is
+ so small that it would make sense to use the same code for both cases.
+
+ - Instead of adding extra fields to THD, we could allocate a separate
+ structure on the thd->mem_root() with the required extra fields (including
+ the THD pointer). Would seem to require initialising mutexes at every
+ commit though.
+
+ - It would probably be a good idea to implement TC_LOG_MMAP::group_log_xid()
+ (should not be hard).
+
+
+-----------------------------------------------------------------------
+
+References:
+
+[2] https://secure.wikimedia.org/wikipedia/en/wiki/ABA_problem
+
+[3] https://knielsen-hq.org/maria/patches.mwl116/
-=-=(Knielsen - Tue, 25 May 2010, 13:18)=-=-
High-Level Specification modified.
--- /tmp/wklog.116.old.14249 2010-05-25 13:18:34.000000000 +0000
+++ /tmp/wklog.116.new.14249 2010-05-25 13:18:34.000000000 +0000
@@ -1 +1,157 @@
+The basic idea in group commit is that multiple threads, each handling one
+transaction, prepare for commit and then queue up together waiting to do an
+fsync() on the transaction log. Then once the log is available, a single
+thread does the fsync() + other necessary book-keeping for all of the threads
+at once. After this, the single thread signals the other threads that it's
+done and they can finish up and return success (or failure) from the commit
+operation.
+
+So group commit has a parallel part, and a sequential part. So we need a
+facility for engines/binlog to participate in both the parallel and the
+sequential part.
+
+To do this, we add two new handlerton methods:
+
+ int (*prepare_ordered)(handlerton *hton, THD *thd, bool all);
+ void (*commit_ordered)(handlerton *hton, THD *thd, bool all);
+
+The idea is that the existing prepare() and commit() methods run in the
+parallel part of group commit, and the new prepare_ordered() and
+commit_ordered() run in the sequential part.
+
+The prepare_ordered() method is called after prepare(). The order of
+transactions that call into prepare_ordered() is guaranteed to be the same among
+all storage engines and binlog, and it is serialised so no two calls can be
+running inside the same engine at the same time.
+
+The commit_ordered() method is called before commit(), and similarly is
+guaranteed to have same transaction order in all participants, and to be
+serialised within one engine.
+
+As the prepare_ordered() and commit_ordered() calls are serialised, the idea
+is that handlers should do the minimum amount of work needed in these calls,
+relaying most of the work (eg. fsync() ...) to prepare() and commit().
+
+As a concrete example, for InnoDB the commit_ordered() method will do the
+first part of commit that fixed the commit order in the transaction log
+buffer, and the commit() method will write the log to disk and fsync()
+it. This split already exists inside the InnoDB code, running before
+respectively after releasing the prepare_commit_mutex.
+
+In addition, the XA transaction coordinator (TC_LOG) is special, since it is
+the one responsible for deciding whether to commit or rollback the
+transaction. For this we need an extra method, since this decision can be done
+only after we know that all prepare() and prepare_ordered() calls succeed, and
+must be done to know whether to call commit_ordered()/commit(), or do rollback.
+
+The existing method for this is TC_LOG::log_xid(). To make group commit
+simpler and more efficient to implement in a transaction coordinator,
+we introduce a new method:
+
+ void group_log_xid(THD *first_thd);
+
+This method runs in the sequential part of group commit. It receives a list of
+transactions to perform log_xid() on, in the correct commit order. (Note that
+TC_LOG can do parallel parts of group commit in its own prepare() and commit()
+methods).
+
+This method can make it easier to implement the group commit in TC_LOG, as it
+gets directly the list of transactions in the right order. Without it, it
+might need to compute such order anyway in a prepare_ordered() method, and the
+server has to create this ordered list anyway to implement the order guarantee
+for prepare_ordered() and commit_ordered().
+
+This group_log_xid() method also is more efficient, as it avoids some
+inter-thread synchronisation. Since group_log_xid() is serialised, we can run
+it together with all the commit_ordered() method calls and need only a single
+sequential code section. With the log_xid() methods, we would need first a
+sequential part for the prepare_ordered() calls, then a parallel part with
+log_xid() calls (to not lose group commit ability for log_xid()), then again
+a sequential part for the commit_ordered() method calls.
+
+The extra synchronisation is needed, as each commit_ordered() call will have
+to wait for log_xid() in one thread (if log_xid() fails then commit_ordered()
+should not be called), and also wait for commit_ordered() to finish in all
+threads handling earlier commits. In effect we will need to bounce the
+execution from one thread to the other among all participants in the group
+commit.
+
+As a consequence of the group_log_xid() optimisation, handlers must be aware
+that the commit_ordered() call can happen in another thread than the one
+running commit() (so thread local storage is not available). This should not
+be a big issue as the THD is available for storing any needed information.
+
+Since group_log_xid() runs for multiple transactions in a single thread, it
+can not do error reporting (my_error()) as that relies on thread local
+storage. Instead it sets an error code in THD::xid_error, and if there is an
+error then later another method will be called (in correct thread context) to
+actually report the error:
+
+ int xid_delayed_error(THD *thd)
+
+The three new methods prepare_ordered(), group_log_xid(), and commit_ordered()
+are optional (as is xid_delayed_error). A storage engine or transaction
+coordinator is free to not implement them if they are not needed. In this case
+there will be no order guarantee for the corresponding stage of group commit
+for that engine. For example, InnoDB needs no ordering of the prepare phase,
+so can omit implementing prepare_ordered(); TC_LOG_MMAP needs no ordering at
+all, so does not need to implement any of them.
+
+Note in particular that all existing engines (/binlog implementations if they
+exist) will work unmodified (and also without any change in group commit
+facilities or commit order guarantees).
+
+Using these new APIs, the work will be to
+
+ - In ha_commit_trans(), implement the correct semantics for the three new
+ calls.
+
+ - In XtraDB, use the new commit_ordered() call to remove the
+   prepare_commit_mutex (and resurrect group commit) without losing the
+ consistency with binlog commit order.
+
+ - In log.cc (binlog module), implement group_log_xid() to do group commit of
+ multiple transactions to the binlog with a single shared fsync() call.
+
+-----------------------------------------------------------------------
+Some possible alternative for this worklog:
+
+ - We could eliminate the group_log_xid() method for a simpler API, at the
+ cost of extra synchronisation between threads to do in-order
+ commit_ordered() method calls. This would also allow to call
+ commit_ordered() in the correct thread context.
+
+ - Alternatively, we could eliminate log_xid() and require that all
+ transaction coordinators implement group_log_xid() instead, again for some
+ moderate simplification.
+
+ - At the moment there is no plugin actually using prepare_ordered(), so, it
+ could be removed from the design. But it fits in well, is efficient to
+ implement, and could be useful later (eg. for the requested feature of
+ releasing locks early in InnoDB).
+
+-----------------------------------------------------------------------
+Some possible follow-up projects after this is implemented:
+
+ - Add statistics about how efficient group commit is (#fsyncs/#commits in
+ each engine and binlog).
+
+ - Implement an XtraDB prepare_ordered() method that can release row locks
+ early (Mark Callaghan from Facebook advocates this, but need to determine
+ exactly how to do this safely).
+
+ - Implement a new crash recovery algorithm that uses the consistent commit
+ ordering to need only fsync() for the binlog. At crash recovery, any
+   missing transactions in an engine are replayed from the correct point in the
+ binlog (this point must be stored transactionally inside the engine, as
+ XtraDB already does today).
+
+ - Implement that START TRANSACTION WITH CONSISTENT SNAPSHOT 1) really gets a
+   consistent snapshot, with the same set of committed and not committed
+ transactions in all engines, 2) returns a corresponding consistent binlog
+ position. This should be easy by piggybacking on the synchronisation
+ implemented for ha_commit_trans().
+
+ - Use this in XtraBackup to get consistent binlog position without having to
+ block all updates with FLUSH TABLES WITH READ LOCK.
-=-=(Knielsen - Tue, 25 May 2010, 13:18)=-=-
High Level Description modified.
--- /tmp/wklog.116.old.14234 2010-05-25 13:18:07.000000000 +0000
+++ /tmp/wklog.116.new.14234 2010-05-25 13:18:07.000000000 +0000
@@ -21,3 +21,69 @@
http://kristiannielsen.livejournal.com/12408.html
http://kristiannielsen.livejournal.com/12553.html
+----
+
+Implementing group commit in MySQL faces some challenges from the handler
+plugin architecture:
+
+1. Because storage engine handlers have separate transaction log from the
+mysql binlog (and from each other), there are multiple fsync() calls per
+commit that need the group commit optimisation (2 per participating storage
+engine + 1 for binlog).
+
+2. The code handling commit is split in several places, in main server code
+and in storage engine code. With pluggable binlog it will be split even
+more. This requires a good abstract yet powerful API to be able to implement
+group commit simply and efficiently in plugins without the different parts
+having to rely on internals of the others.
+
+3. We want the order of commits to be the same in all engines participating in
+multiple transactions. This requirement is the reason that InnoDB currently
+breaks group commit with the infamous prepare_commit_mutex.
+
+While currently there is no server guarantee to get same commit order in
+engines and binlog (except for the InnoDB prepare_commit_mutex hack), there are
+several reasons why this could be desirable:
+
+ - InnoDB hot backup needs to be able to extract a binlog position that is
+ consistent with the hot backup to be able to provision a new slave, and
+ this is impossible without imposing at least partial consistent ordering
+ between InnoDB and binlog.
+
+ - Other backup methods could have similar needs, eg. XtraBackup or
+ `mysqldump --single-transaction`, to have consistent commit order between
+ binlog and storage engines without having to do FLUSH TABLES WITH READ LOCK
+ or similar expensive blocking operation. (other backup methods, like LVM
+ snapshot, don't need consistent commit order, as they can restore
+ out-of-order commits during crash recovery using XA).
+
+ - If we have consistent commit order, we can think about optimising commit to
+ need only one fsync (for binlog); lost commits in storage engines can then
+ be recovered from the binlog at crash recovery by re-playing against the
+ engine from a particular point in the binlog.
+
+ - With consistent commit order, we can get better semantics for START
+ TRANSACTION WITH CONSISTENT SNAPSHOT with multi-engine transactions (and we
+ could even get it to return also a matching binlog position). Currently,
+ this "CONSISTENT SNAPSHOT" can be inconsistent among multiple storage
+ engines.
+
+ - In InnoDB, the performance in the presence of hotspots can be improved if
+   we can release row locks early in the commit phase, but this requires that
+   we release them in the same order as commits in the binlog to ensure
+   consistency between master and slaves.
+
+ - There was some discussion around Galera [1] synchronous replication and
+ global transaction ID that it needed consistent commit order among
+ participating engines.
+
+ - I believe there could be other applications for guaranteed consistent
+ commit order, and that the architecture described in this worklog can
+ implement such guarantee with reasonable overhead.
+
+
+References:
+
+[1] Galera: http://www.codership.com/products/galera_replication
+
-=-=(Knielsen - Tue, 25 May 2010, 08:28)=-=-
More thoughts on and changes to the architecture. Got to something now that I am satisfied with and
that seems to be able to handle all issues.
Implement new prepare_ordered and commit_ordered handler methods and the logic in ha_commit_trans().
Implement TC_LOG::group_log_xid() method and logic in ha_commit_trans().
Implement XtraDB part, using commit_ordered() rather than prepare_commit_mutex.
Fix test suite failures.
Proof-of-concept patch series complete now.
Do initial benchmark, getting good results. With 64 threads, see 26x improvement in queries-per-sec.
Next step: write up the architecture description.
Worked 21 hours and estimate 0 hours remain (original estimate increased by 21 hours).
-=-=(Knielsen - Wed, 12 May 2010, 06:41)=-=-
Started work on a Quilt patch series, refactoring the binlog code to prepare for implementing the
group commit, and working on the design of group commit in parallel.
Found and fixed several problems in error handling when writing to binlog.
Removed redundant table map version locking.
Split binlog writing into two parts in preparations for group commit. When ready to write to the
binlog, threads enter a queue, and the first thread in the queue handles the binlog writing for
everyone. When it obtains the LOCK_log, it first loops over all threads, executing the first part of
binlog writing (the write(2) syscall essentially). It then runs the second part (fsync(2)
essentially) only once, and then wakes up the remaining threads in the queue.
Still to be done:
Finish the proof-of-concept group commit patch, by 1) implementing the prepare_fast() and
commit_fast() callbacks in handler.cc 2) move the binlog thread enqueue from log_xid() to
binlog_prepare_fast(), 3) move fast part of InnoDB commit to innobase_commit_fast(), removing the
prepare_commit_mutex().
Write up the final design in this worklog.
Evaluate the design to see if we can do better/different.
Think about possible next steps, such as releasing innodb row locks early (in
innobase_prepare_fast), and doing crash recovery by replaying transactions from the binlog (removing
the need for engine durability and 2 of 3 fsync() in commit).
Worked 28 hours and estimate 0 hours remain (original estimate increased by 28 hours).
-=-=(Serg - Mon, 26 Apr 2010, 14:10)=-=-
Observers changed: Serg
DESCRIPTION:
Currently, in order to ensure that the server can recover after a crash to a
state in which storage engines and binary log are consistent with each other,
it is necessary to use XA with durable commits for both storage engines
(innodb_flush_log_at_trx_commit=1) and binary log (sync_binlog=1).
This is _very_ expensive, since the server needs to do three fsync() operations
for every commit, as there is no working group commit when the binary log is
enabled.
The idea is to
- Implement/fix group commit to work properly with the binary log enabled.
- (Optionally) avoid the need to fsync() in the engine, and instead rely on
replaying any lost transactions from the binary log against the engine
during crash recovery.
For background see these articles:
http://kristiannielsen.livejournal.com/12254.html
http://kristiannielsen.livejournal.com/12408.html
http://kristiannielsen.livejournal.com/12553.html
----
Implementing group commit in MySQL faces some challenges from the handler
plugin architecture:
1. Because storage engine handlers have separate transaction log from the
mysql binlog (and from each other), there are multiple fsync() calls per
commit that need the group commit optimisation (2 per participating storage
engine + 1 for binlog).
2. The code handling commit is split in several places, in main server code
and in storage engine code. With pluggable binlog it will be split even
more. This requires a good abstract yet powerful API to be able to implement
group commit simply and efficiently in plugins without the different parts
having to rely on internals of the others.
3. We want the order of commits to be the same in all engines participating in
multiple transactions. This requirement is the reason that InnoDB currently
breaks group commit with the infamous prepare_commit_mutex.
While currently there is no server guarantee to get same commit order in
engines and binlog (except for the InnoDB prepare_commit_mutex hack), there are
several reasons why this could be desirable:
- InnoDB hot backup needs to be able to extract a binlog position that is
consistent with the hot backup to be able to provision a new slave, and
this is impossible without imposing at least partial consistent ordering
between InnoDB and binlog.
- Other backup methods could have similar needs, eg. XtraBackup or
`mysqldump --single-transaction`, to have consistent commit order between
binlog and storage engines without having to do FLUSH TABLES WITH READ LOCK
or similar expensive blocking operation. (other backup methods, like LVM
snapshot, don't need consistent commit order, as they can restore
out-of-order commits during crash recovery using XA).
- If we have consistent commit order, we can think about optimising commit to
need only one fsync (for binlog); lost commits in storage engines can then
be recovered from the binlog at crash recovery by re-playing against the
engine from a particular point in the binlog.
- With consistent commit order, we can get better semantics for START
TRANSACTION WITH CONSISTENT SNAPSHOT with multi-engine transactions (and we
could even get it to return also a matching binlog position). Currently,
this "CONSISTENT SNAPSHOT" can be inconsistent among multiple storage
engines.
- In InnoDB, the performance in the presence of hotspots can be improved if
we can release row locks early in the commit phase, but this requires that we
release them in the same order as commits in the binlog to ensure consistency
between master and slaves.
- There was some discussion around Galera [1] synchronous replication and
global transaction ID that it needed consistent commit order among
participating engines.
- I believe there could be other applications for guaranteed consistent
commit order, and that the architecture described in this worklog can
implement such guarantee with reasonable overhead.
References:
[1] Galera: http://www.codership.com/products/galera_replication
HIGH-LEVEL SPECIFICATION:
The basic idea in group commit is that multiple threads, each handling one
transaction, prepare for commit and then queue up together waiting to do an
fsync() on the transaction log. Then once the log is available, a single
thread does the fsync() + other necessary book-keeping for all of the threads
at once. After this, the single thread signals the other threads that it's
done and they can finish up and return success (or failure) from the commit
operation.
So group commit has a parallel part, and a sequential part. So we need a
facility for engines/binlog to participate in both the parallel and the
sequential part.
To do this, we add two new handlerton methods:
int (*prepare_ordered)(handlerton *hton, THD *thd, bool all);
void (*commit_ordered)(handlerton *hton, THD *thd, bool all);
The idea is that the existing prepare() and commit() methods run in the
parallel part of group commit, and the new prepare_ordered() and
commit_ordered() run in the sequential part.
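For illustration, an engine opts in simply by filling in the new handlerton
slots during initialisation (the implementation functions named here are
hypothetical):
  static int  my_engine_prepare_ordered(handlerton *hton, THD *thd, bool all);
  static void my_engine_commit_ordered(handlerton *hton, THD *thd, bool all);
  /* In the engine's init function: */
  hton->prepare_ordered= my_engine_prepare_ordered;   /* optional */
  hton->commit_ordered=  my_engine_commit_ordered;    /* optional */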
The prepare_ordered() method is called after prepare(). The order of
transactions that call into prepare_ordered() is guaranteed to be the same among
all storage engines and binlog, and it is serialised so no two calls can be
running inside the same engine at the same time.
The commit_ordered() method is called before commit(), and similarly is
guaranteed to have same transaction order in all participants, and to be
serialised within one engine.
As the prepare_ordered() and commit_ordered() calls are serialised, the idea
is that handlers should do the minimum amount of work needed in these calls,
leaving most of the work (eg. fsync() ...) to prepare() and commit().
As a concrete example, for InnoDB the commit_ordered() method will do the
first part of commit that fixes the commit order in the transaction log
buffer, and the commit() method will write the log to disk and fsync()
it. This split already exists inside the InnoDB code, running before,
respectively after, releasing the prepare_commit_mutex.
In addition, the XA transaction coordinator (TC_LOG) is special, since it is
the one responsible for deciding whether to commit or rollback the
transaction. For this we need an extra method, since this decision can be done
only after we know that all prepare() and prepare_ordered() calls succeed, and
must be done to know whether to call commit_ordered()/commit(), or do rollback.
The existing method for this is TC_LOG::log_xid(). To make group commit
simpler and more efficient to implement in a transaction coordinator,
we introduce a new method:
void group_log_xid(THD *first_thd);
This method runs in the sequential part of group commit. It receives a list of
transactions to perform log_xid() on, in the correct commit order. (Note that
TC_LOG can do parallel parts of group commit in its own prepare() and commit()
methods).
This method can make it easier to implement the group commit in TC_LOG, as it
gets directly the list of transactions in the right order. Without it, it
might need to compute such order anyway in a prepare_ordered() method, and the
server has to create this ordered list anyway to implement the order guarantee
for prepare_ordered() and commit_ordered().
This group_log_xid() method also is more efficient, as it avoids some
inter-thread synchronisation. Since group_log_xid() is serialised, we can run
it together with all the commit_ordered() method calls and need only a single
sequential code section. With the log_xid() methods, we would need first a
sequential part for the prepare_ordered() calls, then a parallel part with
log_xid() calls (to not lose group commit ability for log_xid()), then again
a sequential part for the commit_ordered() method calls.
The extra synchronisation is needed, as each commit_ordered() call will have
to wait for log_xid() in one thread (if log_xid() fails then commit_ordered()
should not be called), and also wait for commit_ordered() to finish in all
threads handling earlier commits. In effect we will need to bounce the
execution from one thread to the other among all participants in the group
commit.
As a consequence of the group_log_xid() optimisation, handlers must be aware
that the commit_ordered() call can happen in another thread than the one
running commit() (so thread local storage is not available). This should not
be a big issue as the THD is available for storing any needed information.
Since group_log_xid() runs for multiple transactions in a single thread, it
can not do error reporting (my_error()) as that relies on thread local
storage. Instead it sets an error code in THD::xid_error, and if there is an
error then later another method will be called (in correct thread context) to
actually report the error:
int xid_delayed_error(THD *thd)
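Schematically, a coordinator's group_log_xid() walks the queue and records a
per-transaction result, to be acted on later in each owner thread
(log_one_xid() is a hypothetical stand-in for the coordinator's actual
logging step):
  void TC_LOG_xxx::group_log_xid(THD *first_thd)   /* sketch only */
  {
    for (THD *thd= first_thd; thd; thd= thd->next_commit_ordered)
    {
      int cookie= log_one_xid(thd);   /* hypothetical per-trx helper */
      if (cookie)
        thd->xid_cookie= cookie;   /* saved for the later unlog() call */
      else
        thd->xid_error= 1;   /* reported later via xid_delayed_error(thd) */
    }
  }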
The three new methods prepare_ordered(), group_log_xid(), and commit_ordered()
are optional (as is xid_delayed_error). A storage engine or transaction
coordinator is free to not implement them if they are not needed. In this case
there will be no order guarantee for the corresponding stage of group commit
for that engine. For example, InnoDB needs no ordering of the prepare phase,
so can omit implementing prepare_ordered(); TC_LOG_MMAP needs no ordering at
all, so does not need to implement any of them.
Note in particular that all existing engines (/binlog implementations if they
exist) will work unmodified (and also without any change in group commit
facilities or commit order guarantees).
Using these new APIs, the work will be to
- In ha_commit_trans(), implement the correct semantics for the three new
calls.
- In XtraDB, use the new commit_ordered() call to remove the
prepare_commit_mutex (and resurrect group commit) without losing the
consistency with binlog commit order.
- In log.cc (binlog module), implement group_log_xid() to do group commit of
multiple transactions to the binlog with a single shared fsync() call.
-----------------------------------------------------------------------
Some possible alternative for this worklog:
- We could eliminate the group_log_xid() method for a simpler API, at the
cost of extra synchronisation between threads to do in-order
commit_ordered() method calls. This would also allow to call
commit_ordered() in the correct thread context.
- Alternatively, we could eliminate log_xid() and require that all
transaction coordinators implement group_log_xid() instead, again for some
moderate simplification.
- At the moment there is no plugin actually using prepare_ordered(), so, it
could be removed from the design. But it fits in well, is efficient to
implement, and could be useful later (eg. for the requested feature of
releasing locks early in InnoDB).
-----------------------------------------------------------------------
Some possible follow-up projects after this is implemented:
- Add statistics about how efficient group commit is (#fsyncs/#commits in
each engine and binlog).
- Implement an XtraDB prepare_ordered() method that can release row locks
early (Mark Callaghan from Facebook advocates this, but we need to determine
exactly how to do this safely).
- Implement a new crash recovery algorithm that uses the consistent commit
ordering to need only fsync() for the binlog. At crash recovery, any
missing transactions in an engine are replayed from the correct point in the
binlog (this point must be stored transactionally inside the engine, as
XtraDB already does today).
- Implement that START TRANSACTION WITH CONSISTENT SNAPSHOT 1) really gets a
consistent snapshot, with the same set of committed and not committed
transactions in all engines, 2) returns a corresponding consistent binlog
position. This should be easy by piggybacking on the synchronisation
implemented for ha_commit_trans().
- Use this in XtraBackup to get consistent binlog position without having to
block all updates with FLUSH TABLES WITH READ LOCK.
LOW-LEVEL DESIGN:
1. Changes for ha_commit_trans()
The gut of the code for commit is in the function ha_commit_trans() (and in
commit_one_phase() which is called from it). This must be extended to use the
new prepare_ordered(), group_log_xid(), and commit_ordered() calls.
1.1 Atomic queue of committing transactions
To keep the right commit order among participants, we put transactions into a
queue. The operations on the queue are non-locking:
- Insert THD at the head of the queue, and return old queue.
THD *enqueue_atomic(THD *thd)
- Fetch (and delete) the whole queue.
THD *atomic_grab_reverse_queue()
These are simple to implement with atomic compare-and-set. Note that there is
no ABA problem [2], as we do not delete individual elements from the queue, we
grab the whole queue and replace it with NULL.
A transaction enters the queue when it does prepare_ordered(). This way, the
scheduling order for prepare_ordered() calls is what determines the sequence
in the queue and effectively the commit order.
The queue is grabbed by the code doing group_log_xid() and commit_ordered()
calls. The queue is passed directly to group_log_xid(), and afterwards
iterated to do individual commit_ordered() calls.
Using a lock-free queue allows prepare_ordered() (for one transaction) to run
in parallel with commit_ordered (in another transaction), increasing potential
parallelism.
The queue is simply a linked list of THD objects, linked through a
THD::next_commit_ordered field. Since we add at the head of the queue, the
list is actually in reverse order, so must be reversed when we grab and delete
it.
The reason that enqueue_atomic() returns the old queue is so that we can check
if an insert goes to the head of the queue. The thread at the head of the
queue will do the sequential part of group commit for everyone.
1.2 Locks
1.2.1 Global LOCK_prepare_ordered
This lock is taken to serialise calls to prepare_ordered(). Note that
effectively, the commit order is decided by the order in which threads obtain
this lock.
1.2.2 Global LOCK_group_commit and COND_group_commit
This lock is used to protect the serial part of group commit. It is taken
around the code where we grab the queue, call group_log_xid() on the queue,
and call commit_ordered() on each element of the queue, to make sure they
happen serialised and in consistent order. It also protects the variable
group_commit_queue_busy, which is used when not using group_log_xid() to delay
running over a new queue until the first queue is completely done.
1.2.3 Global LOCK_commit_ordered
This lock is taken around calls to commit_ordered(), to ensure they happen
serialised.
1.2.4 Per-thread thd->LOCK_commit_ordered and thd->COND_commit_ordered
This lock protects the thd->group_commit_ready variable, as well as the
condition variable used to wake up threads after log_xid() and
commit_ordered() finishes.
1.2.5 Global LOCK_group_commit_queue
This is only used on platforms with no native compare-and-set operations, to
make the queue operations atomic.
1.3 Commit algorithm.
This is the basic algorithm, simplified by
- omitting some error handling
- omitting looping over all handlers when invoking handler methods
- omitting some possible optimisations when not all calls needed (see next
section).
- Omitting the case where no group_log_xid() is used, see below.
---- BEGIN ALGORITHM ----
ht->prepare()
// Call prepare_ordered() and enqueue in correct commit order
lock(LOCK_prepare_ordered)
ht->prepare_ordered()
old_queue= enqueue_atomic(thd)
thd->group_commit_ready= FALSE
is_group_commit_leader= (old_queue == NULL)
unlock(LOCK_prepare_ordered)
if (is_group_commit_leader)
// The first in queue handles group commit for everyone
lock(LOCK_group_commit)
// Wait while queue is busy, see below for when this occurs
while (group_commit_queue_busy)
cond_wait(COND_group_commit)
// Grab and reverse the queue to get correct order of transactions
queue= atomic_grab_reverse_queue()
// This call will set individual error codes in thd->xid_error
// It also sets the cookie for unlog() in thd->xid_cookie
group_log_xid(queue)
lock(LOCK_commit_ordered)
for (other IN queue)
if (!other->xid_error)
ht->commit_ordered()
unlock(LOCK_commit_ordered)
unlock(LOCK_group_commit)
// Now we are done, so wake up all the others.
for (other IN TAIL(queue))
lock(other->LOCK_commit_ordered)
other->group_commit_ready= TRUE
cond_signal(other->COND_commit_ordered)
unlock(other->LOCK_commit_ordered)
else
// If not the leader, just wait until leader did the work for us.
lock(thd->LOCK_commit_ordered)
while (!thd->group_commit_ready)
cond_wait(thd->LOCK_commit_ordered, thd->COND_commit_ordered)
unlock(thd->LOCK_commit_ordered)
// Finally do any error reporting now that we're back in own thread.
if (thd->xid_error)
xid_delayed_error(thd)
else
ht->commit(thd)
unlog(thd->xid_cookie, thd->xid)
---- END ALGORITHM ----
If the transaction coordinator does not support group_log_xid(), we have to do
things differently. In this case after the serialisation point at
prepare_ordered(), we have to parallelise again when running log_xid()
(otherwise we would lose group commit). But then when log_xid() is done, we
have to serialise again to check for any error and call commit_ordered() in
correct sequence for any transaction where log_xid() did not return error.
The central part of the algorithm in this case (when using log_xid()) is:
---- BEGIN ALGORITHM ----
cookie= log_xid(thd)
error= (cookie == 0)
if (is_group_commit_leader)
// The first to enqueue grabs the queue and runs first.
// But we must wait until a previous queue run is fully done.
lock(LOCK_group_commit)
while (group_commit_queue_busy)
cond_wait(COND_group_commit)
queue= atomic_grab_reverse_queue()
// The queue will be busy until last thread in it is done.
group_commit_queue_busy= TRUE
unlock(LOCK_group_commit)
else
// Not first in queue -> wait for previous one to wake us up.
lock(thd->LOCK_commit_ordered)
while (!thd->group_commit_ready)
cond_wait(thd->LOCK_commit_ordered, thd->COND_commit_ordered)
unlock(thd->LOCK_commit_ordered)
if (!error) // Only if log_xid() was successful
lock(LOCK_commit_ordered)
ht->commit_ordered()
unlock(LOCK_commit_ordered)
// Wake up the next thread in the queue; the last one releases the queue.
next= thd->next_commit_ordered
if (next)
lock(next->LOCK_commit_ordered)
next->group_commit_ready= TRUE
cond_signal(next->COND_commit_ordered)
unlock(next->LOCK_commit_ordered)
else
lock(LOCK_group_commit)
group_commit_queue_busy= FALSE
unlock(LOCK_group_commit)
---- END ALGORITHM ----
There are a number of locks taken in the algorithm, but in the group_log_xid()
case most of them should be uncontended most of the time. The
LOCK_group_commit of course will be contended, as new threads queue up waiting
for the previous group commit (and binlog fsync()) to finish so they can do
the next group commit. This is the whole point of implementing group commit.
The LOCK_prepare_ordered and LOCK_commit_ordered mutexes should not be much
contended, as long as handlers follow the intention that the corresponding
handler calls execute quickly.
The per-thread LOCK_commit_ordered mutexes should not be contended; they are
only used to wake up a sleeping thread.
1.4 Optimisations when not using all three new calls
The prepare_ordered(), group_log_xid(), and commit_ordered() methods are
optional, and if not implemented by a particular handler/transaction
coordinator, we can optimise the algorithm to take advantage of not having to
keep ordering for the missing parts.
If there is no prepare_ordered(), then we need not take the
LOCK_prepare_ordered mutex.
If there is no commit_ordered(), then we need not take the LOCK_commit_ordered
mutex.
If there is no group_log_xid(), then we only need the queue to ensure the
same ordering of transactions for commit_ordered() as for prepare_ordered().
Thus, if either of these (or both) is also not present, we do not need to
use the queue at all.
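The checks themselves can be simple capability tests; a sketch (with
hypothetical member names for the handlerton ht and transaction
coordinator tc):
  need_prepare_ordered= (ht->prepare_ordered != NULL)
  need_commit_ordered=  (ht->commit_ordered != NULL)
  // Without group_log_xid(), the queue only carries the ordering from
  // prepare_ordered() over to commit_ordered(), so it is needed only
  // when both of those are present.
  need_queue= (tc->group_log_xid != NULL) ||
              (need_prepare_ordered && need_commit_ordered)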
2. Binlog code changes (log.cc)
The bulk of the work needed for the binary log is to extend the code to allow
group commit to the log. Unlike InnoDB/XtraDB, there is no existing support
inside the binlog code for group commit.
The existing code runs most of the write + fsync to the binary log under the
global LOCK_log mutex, preventing any group commit.
To enable group commit, this code must be split into two parts:
- one part that runs per transaction, re-writing the embedded event positions
for the correct offset, and writing this into the in-memory log cache.
- another part that writes a set of transactions to the disk, and runs
fsync().
Then in group_log_xid(), we can run the first part in a loop over all the
transactions in the passed-in queue, and run the second part only once.
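In sketch form (pseudocode in the style of section 1.3; the helper names
here are hypothetical, not the actual patch's functions):
  group_log_xid(queue)
    lock(LOCK_log)
    for (thd IN queue)
      // Fast per-transaction part: fix up embedded event positions and
      // copy the transaction's in-memory log cache into the binlog.
      thd->xid_cookie= binlog_write_transaction(thd)
    // Slow part, once for the whole group: one write() and one fsync().
    binlog_flush_to_disk()
    binlog_fsync()
    unlock(LOCK_log)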
The binlog code also has other code paths that write into the binlog,
eg. non-transactional statements. These have to be adapted also to work with
the new code.
In order to get some group commit facility for these also, we change that part
of the code in a similar way to ha_commit_trans. We keep another,
binlog-internal queue of such non-transactional binlog writes, and such writes
queue up here before sleeping on the LOCK_log mutex. Once a thread obtains the
LOCK_log, it loops over the queue for the fast part, and does the slow part
once, then finally wakes up the others in the queue.
In the transactional case in group_log_xid(), before we run the passed-in
queue, we add any members found in the binlog-internal queue. This allows
these non-transactional writes to share the group commit.
However, in the case where it is a non-transactional write that gets the
LOCK_log, the transactions from the ha_commit_trans() queue will not be able
to take part (they will have to wait for their turn to do another fsync). It
seems difficult to cleanly let the binlog code grab the queue from out of
the ha_commit_trans() algorithm. I think group commit is mostly useful in
transactional workloads anyway (non-transactional engines will lose data
anyway in case of a crash, so why fsync() after each transaction?)
3. XtraDB changes (ha_innodb.cc)
The changes needed in XtraDB are comparatively simple, as XtraDB already
implements group commit, it just needs to be enabled with the new
commit_ordered() call.
The existing commit() method is already logically in two parts. The first
part runs under the prepare_commit_mutex and must run in the same order as
binlog commit. This part needs to be moved to commit_ordered(). The second
part runs after releasing the prepare_commit_mutex and does the transaction
log write+fsync; it can remain.
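A sketch of the resulting split, using function names from the existing
InnoDB/XtraDB code (check_trx_exists(), innobase_commit_low(),
trx_commit_complete_for_mysql()); the exact factoring is up to the
implementation:
  static void innobase_commit_ordered(handlerton *hton, THD *thd, bool all)
  {
    /* First part: must run in binlog commit order. The server layer
       serialises these calls under LOCK_commit_ordered, which replaces
       the prepare_commit_mutex. */
    trx_t *trx= check_trx_exists(thd);
    innobase_commit_low(trx);            /* make the commit visible */
  }
  static int innobase_commit(handlerton *hton, THD *thd, bool all)
  {
    /* Second part: transaction log write + fsync. No ordering
       requirement; can group commit internally as before. */
    trx_t *trx= check_trx_exists(thd);
    trx_commit_complete_for_mysql(trx);  /* flush the InnoDB log */
    return 0;
  }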
The prepare_commit_mutex can then be removed (along with the
enable_unsafe_group_commit XtraDB option that exists to disable it).
There are two asserts that check that the thread running the first part of
the XtraDB commit is the same as the thread running the other operations for
the transaction. These have to be removed (as commit_ordered() can run in a
different thread). Also, error reporting via sql_print_error() has to be
delayed until commit() time.
4. Proof-of-concept implementation
There is a proof-of-concept implementation of this architecture, in the form
of a quilt patch series [3].
A quick benchmark was done with sync_binlog=1 and
innodb_flush_log_at_trx_commit=1, using 64 parallel threads doing single-row
transactions against one table.
Without the patch, we get only 25 queries per second.
With the patch, we get 650 queries per second.
5. Open issues/tasks
5.1 XA / other prepare() and commit() call sites.
Check that user-level XA is handled correctly and working, and is covered
sufficiently with tests. Also check that any other calls of ha->prepare()
and ha->commit() outside of ha_commit_trans() are handled correctly.
5.2 Testing
This worklog needs additions to the test suite, including error inserts to
check error handling, and synchronisation points to check thread parallelism
correctness.
6. Alternative implementations
- The binlog code maintains its own extra atomic transaction queue to handle
non-transactional commits in a good way together with transactional ones
(with respect to group commit). Alternatively, we could ignore this issue
and just give up on group commit for non-transactional statements, for
some code simplifications.
- The binlog code has two ways to prepare end_event and similar, one that
uses stack-allocation, and another for when stack allocation is not
possible that uses thd->mem_root. Probably the overhead of thd->mem_root is
so small that it would make sense to use the same code for both cases.
- Instead of adding extra fields to THD, we could allocate a separate
structure on the thd->mem_root with the required extra fields (including
the THD pointer). This would seem to require initialising mutexes at every
commit, though.
- It would probably be a good idea to implement TC_LOG_MMAP::group_log_xid()
(should not be hard).
-----------------------------------------------------------------------
References:
[2] https://secure.wikimedia.org/wikipedia/en/wiki/ABA_problem
[3] https://knielsen-hq.org/maria/patches.mwl116/
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v3.5.9)

[Maria-developers] Progress (by Knielsen): Store in binlog text of statements that caused RBR events (47)
by worklog-noreply@askmonty.org 07 Jun '10
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: Store in binlog text of statements that caused RBR events
CREATION DATE..: Sat, 15 Aug 2009, 23:48
SUPERVISOR.....: Monty
IMPLEMENTOR....:
COPIES TO......: Knielsen, Serg
CATEGORY.......: Server-Sprint
TASK ID........: 47 (http://askmonty.org/worklog/?tid=47)
VERSION........: Server-9.x
STATUS.........: Code-Review
PRIORITY.......: 60
WORKED HOURS...: 41
ESTIMATE.......: 0 (hours remain)
ORIG. ESTIMATE.: 35
PROGRESS NOTES:
-=-=(Knielsen - Mon, 07 Jun 2010, 07:13)=-=-
Help debug some test failures seen in Buildbot.
Worked 6 hours and estimate 0 hours remain (original estimate increased by 6 hours).
-=-=(Knielsen - Mon, 31 May 2010, 06:49)=-=-
Help Alexi debug+fix some test problems in the patch.
Worked 4 hours and estimate 0 hours remain (original estimate unchanged).
-=-=(Knielsen - Tue, 25 May 2010, 08:29)=-=-
Help debug strange problem in mysqlbinlog.test.
Worked 1 hour and estimate 4 hours remain (original estimate unchanged).
-=-=(Knielsen - Mon, 17 May 2010, 08:45)=-=-
Merge with latest trunk and run Buildbot tests.
Worked 1 hour and estimate 5 hours remain (original estimate unchanged).
-=-=(Knielsen - Wed, 05 May 2010, 13:53)=-=-
Review of fixes to first review done. No new issues found.
Worked 2 hours and estimate 6 hours remain (original estimate unchanged).
-=-=(Knielsen - Fri, 23 Apr 2010, 12:51)=-=-
Status updated.
--- /tmp/wklog.47.old.28747 2010-04-23 12:51:36.000000000 +0000
+++ /tmp/wklog.47.new.28747 2010-04-23 12:51:36.000000000 +0000
@@ -1 +1 @@
-In-Progress
+Code-Review
-=-=(Knielsen - Tue, 06 Apr 2010, 15:26)=-=-
Code review (mailed to maria-developers@).
Worked 7 hours and estimate 8 hours remain (original estimate unchanged).
-=-=(Knielsen - Tue, 06 Apr 2010, 15:25)=-=-
Status updated.
--- /tmp/wklog.47.old.12734 2010-04-06 15:25:54.000000000 +0000
+++ /tmp/wklog.47.new.12734 2010-04-06 15:25:54.000000000 +0000
@@ -1 +1 @@
-Code-Review
+In-Progress
-=-=(Knielsen - Mon, 29 Mar 2010, 10:59)=-=-
Status updated.
--- /tmp/wklog.47.old.27790 2010-03-29 10:59:53.000000000 +0000
+++ /tmp/wklog.47.new.27790 2010-03-29 10:59:53.000000000 +0000
@@ -1 +1 @@
-In-Progress
+Code-Review
-=-=(Alexi - Thu, 18 Feb 2010, 19:29)=-=-
Worked 20 hours (alexi)
Worked 20 hours and estimate 15 hours remain (original estimate unchanged).
------------------------------------------------------------
-=-=(View All Progress Notes, 33 total)=-=-
http://askmonty.org/worklog/index.pl?tid=47&nolimit=1
DESCRIPTION:
Store in binlog (and show in mysqlbinlog output) texts of statements that
caused RBR events
This is needed for (list from Monty):
- Easier to understand why updates happened
- Would make it easier to find out where in application things went
wrong (as you can search for exact strings)
- Allow one to filter things based on comments in the statement.
The cost of this can be that the binlog will be approximately 2x in size
(especially inserts of big blobs would be a bit painful), so this should
be an optional feature.
HIGH-LEVEL SPECIFICATION:
Content
~~~~~~~
1. Annotate_rows_log_event
2. Server option: --binlog-annotate-rows-events
3. Server option: --replicate-annotate-rows-events
4. mysqlbinlog option: --print-annotate-rows-events
5. mysqlbinlog output
1. Annotate_rows_log_event [ ANNOTATE_ROWS_EVENT ]
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Describes the query which caused the corresponding rows events. It has an
empty post-header and contains the query text in its data part. Example:
************************
ANNOTATE_ROWS_EVENT
************************
00000220 | B6 A0 2C 4B | time_when = 1261215926
00000224 | 33 | event_type = 51
00000225 | 64 00 00 00 | server_id = 100
00000229 | 36 00 00 00 | event_len = 54
0000022D | 56 02 00 00 | log_pos = 00000256
00000231 | 00 00 | flags = <none>
------------------------
00000233 | 49 4E 53 45 | query = "INSERT INTO t1 VALUES (1), (2), (3)"
00000237 | 52 54 20 49 |
0000023B | 4E 54 4F 20 |
0000023F | 74 31 20 56 |
00000243 | 41 4C 55 45 |
00000247 | 53 20 28 31 |
0000024B | 29 2C 20 28 |
0000024F | 32 29 2C 20 |
00000253 | 28 33 29 |
************************
In the binary log, the Annotate_rows event follows the (possible) 'BEGIN'
Query event and precedes the first of the Table map events which accompany
the corresponding rows events. (See the example in the "mysqlbinlog output"
section below.)
2. Server option: --binlog-annotate-rows-events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Tells the master to write Annotate_rows events to the binary log.
* Variable Name: binlog_annotate_rows_events
* Scope: Global & Session
* Access Type: Dynamic
* Data Type: bool
* Default Value: OFF
NOTE. The session value allows annotating only selected statements:
...
SET SESSION binlog_annotate_rows_events=ON;
... statements to be annotated ...
SET SESSION binlog_annotate_rows_events=OFF;
... statements not to be annotated ...
3. Server option: --replicate-annotate-rows-events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Tells the slave to reproduce Annotate_rows events received from the master
in its own binary log (sensible only in combination with the
log-slave-updates option).
* Variable Name: replicate_annotate_rows_events
* Scope: Global
* Access Type: Read only
* Data Type: bool
* Default Value: OFF
NOTE. Why do we additionally need this 'replicate' option? Why not make the
slave reproduce these events whenever its binlog-annotate-rows-events global
value is ON? Because, for example, we may want to configure a slave that
reproduces Annotate_rows events but has the global
binlog-annotate-rows-events = OFF, meaning that to be the default value for
the client threads (see also "How slave treats replicate-annotate-rows-events
option" in the LLD part).
4. mysqlbinlog option: --print-annotate-rows-events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
With this option, mysqlbinlog prints the content of Annotate_rows events (if
the binary log contains them). Without this option (i.e. by default),
mysqlbinlog skips Annotate_rows events.
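For example (the binlog file name here is illustrative):
  shell> mysqlbinlog --print-annotate-rows-events master-bin.000001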
5. mysqlbinlog output
~~~~~~~~~~~~~~~~~~~~~
With --print-annotate-rows-events, mysqlbinlog outputs Annotate_rows events
in a form like this:
...
# at 1646
#091219 12:45:26 server id 100 end_log_pos 1714 Query thread_id=1
exec_time=0 error_code=0
SET TIMESTAMP=1261215926/*!*/;
BEGIN
/*!*/;
# at 1714
# at 1812
# at 1853
# at 1894
# at 1938
#091219 12:45:26 server id 100 end_log_pos 1812 Query: `DELETE t1, t2 FROM
t1 INNER JOIN t2 INNER JOIN t3 WHERE t1.a=t2.a AND t2.a=t3.a`
#091219 12:45:26 server id 100 end_log_pos 1853 Table_map: `test`.`t1`
mapped to number 16
#091219 12:45:26 server id 100 end_log_pos 1894 Table_map: `test`.`t2`
mapped to number 17
#091219 12:45:26 server id 100 end_log_pos 1938 Delete_rows: table id 16
#091219 12:45:26 server id 100 end_log_pos 1982 Delete_rows: table id 17
flags: STMT_END_F
...
LOW-LEVEL DESIGN:
Content
~~~~~~~
1. Annotate_rows event number
2. Outline of Annotate_rows event behavior
3. How Master writes Annotate_rows events to the binary log
4. How slave treats replicate-annotate-rows-events option
5. How slave IO thread requests Annotate_rows events
6. How master executes the request
7. How slave SQL thread processes Annotate_rows events
8. General remarks
1. Annotate_rows event number
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
To avoid possible event number conflicts with MySQL/Sun, we leave a gap
between the last MySQL event number and the Annotate_rows event number:
enum Log_event_type
{ ...
INCIDENT_EVENT= 26,
// New MySQL event numbers are to be added here
MYSQL_EVENTS_END,
MARIA_EVENTS_BEGIN= 51,
// New Maria event numbers start from here
ANNOTATE_ROWS_EVENT= 51,
ENUM_END_EVENT
};
together with the corresponding extension of the 'post_header_len' array in
the Format description event. (This extension does not affect the
compatibility of the binary log.) Here is how the Format description event
looks with this extension:
************************
FORMAT_DESCRIPTION_EVENT
************************
00000004 | A1 A0 2C 4B | time_when = 1261215905
00000008 | 0F | event_type = 15
00000009 | 64 00 00 00 | server_id = 100
0000000D | 7F 00 00 00 | event_len = 127
00000011 | 83 00 00 00 | log_pos = 00000083
00000015 | 01 00 | flags = LOG_EVENT_BINLOG_IN_USE_F
------------------------
00000017 | 04 00 | binlog_ver = 4
00000019 | 35 2E 32 2E | server_ver = 5.2.0-MariaDB-alpha-debug-log
..... ...
0000004B | A1 A0 2C 4B | time_created = 1261215905
0000004F | 13 | common_header_len = 19
------------------------
post_header_len
------------------------
00000050 | 38 | 56 - START_EVENT_V3 [1]
..... ...
00000069 | 02 | 2 - INCIDENT_EVENT [26]
0000006A | 00 | 0 - RESERVED [27]
..... ...
00000081 | 00 | 0 - RESERVED [50]
00000082 | 00 | 0 - ANNOTATE_ROWS_EVENT [51]
************************
2. Outline of Annotate_rows event behavior
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Each Annotate_rows_log_event object has two private members describing the
corresponding query:
char *m_query_txt;
uint m_query_len;
When the object is created for writing to a binary log, this query is taken
from 'thd' (for short, below we omit the 'Annotate_rows_log_event::' prefix
as well as other implementation details):
Annotate_rows_log_event(THD *thd)
{
m_query_txt = thd->query();
m_query_len = thd->query_length();
}
When the object is read from a binary log, the query is taken from the buffer
containing the binary log representation of the event (this buffer is
allocated in the Log_event object from which all log events are derived):
Annotate_rows_log_event(char *buf, uint event_len,
Format_description_log_event *desc)
{
m_query_len = event_len - desc->common_header_len;
m_query_txt = buf + desc->common_header_len;
}
The events are written to the binary log by the Log_event::write() member,
which calls the virtual write_data_header() and write_data_body() members
("data header" and "post header" are synonyms in replication terminology).
In our case, the data header is empty and the data body is just the query:
bool write_data_body(IO_CACHE *file)
{
return my_b_safe_write(file, (uchar*) m_query_txt, m_query_len);
}
Printing the event is just printing the query:
void Annotate_rows_log_event::print(FILE *file, PRINT_EVENT_INFO *pinfo)
{
my_b_printf(&pinfo->head_cache, "\tQuery: `%s`\n", m_query_txt);
}
3. How Master writes Annotate_rows events to the binary log
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The event is written to the binary log just before the group of Table_map
events which precede corresponding Rows events (one query may generate
several Table map events in the binary log, but the corresponding
Annotate_rows event must be written only once before the first Table map
event; hence the boolean variable 'with_annotate' below):
int write_locked_table_maps(THD *thd)
{ ...
bool with_annotate= thd->variables.binlog_annotate_rows_events;
...
for (uint i= 0; i < ... <number of tables> ...; ++i)
{ ...
thd->binlog_write_table_map(table, ..., with_annotate);
with_annotate= 0; // write the Annotate_rows event no more than once
...
}
...
}
int THD::binlog_write_table_map(TABLE *table, ..., bool with_annotate)
{ ...
Table_map_log_event the_event(...);
...
if (with_annotate)
{
Annotate_rows_log_event anno(this);
mysql_bin_log.write(&anno);
}
mysql_bin_log.write(&the_event);
...
}
4. How slave treats replicate-annotate-rows-events option
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The replicate-annotate-rows-events option is treated just as the session
value of the binlog_annotate_rows_events variable for the slave IO and
SQL threads. This setting is done during initialization of these threads:
pthread_handler_t handle_slave_io(void *arg)
{
THD *thd= new THD;
...
init_slave_thread(thd, SLAVE_THD_IO);
...
}
pthread_handler_t handle_slave_sql(void *arg)
{
THD *thd= new THD;
...
init_slave_thread(thd, SLAVE_THD_SQL);
...
}
int init_slave_thread(THD* thd, SLAVE_THD_TYPE thd_type)
{ ...
thd->variables.binlog_annotate_rows_events=
opt_replicate_annotate_rows_events;
...
}
5. How slave IO thread requests Annotate_rows events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
If the replicate-annotate-rows-events option is not set on a slave, there
is no need for the master to send Annotate_rows events to this slave. The
slave (or mysqlbinlog in the remote case), before requesting a binlog dump
via the COM_BINLOG_DUMP command, informs the master whether it should send
these events by executing the newly added COM_BINLOG_DUMP_OPTIONS_EXT
server command:
case COM_BINLOG_DUMP_OPTIONS_EXT:
thd->binlog_dump_flags_ext= packet[0];
my_ok(thd);
break;
Note. We add this new command and don't use COM_BINLOG_DUMP to avoid possible
conflicts with MySQL/Sun.
6. How master executes the request
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
case COM_BINLOG_DUMP:
{ ...
flags= uint2korr(packet + 4);
...
mysql_binlog_send(thd, ..., flags);
...
}
void mysql_binlog_send(THD* thd, ..., ushort flags)
{ ...
Log_event::read_log_event(&log, packet, ...);
...
if ((*packet)[EVENT_TYPE_OFFSET + 1] != ANNOTATE_ROWS_EVENT ||
flags & BINLOG_SEND_ANNOTATE_ROWS_EVENT)
{
my_net_write(net, packet->ptr(), packet->length());
}
...
}
7. How slave SQL thread processes Annotate_rows events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The slave processes each received event by "applying" it, i.e. by
calling the Log_event::apply_event() function which in turn calls
the virtual do_apply_event() member specific for each type of the
event.
int exec_relay_log_event(THD* thd, Relay_log_info* rli)
{ ...
Log_event *ev = next_event(rli);
...
apply_event_and_update_pos(ev, ...);
if (ev->get_type_code() != FORMAT_DESCRIPTION_EVENT)
delete ev;
...
}
int apply_event_and_update_pos(Log_event *ev, ...)
{ ...
ev->apply_event(...);
...
}
int Log_event::apply_event(...)
{
return do_apply_event(...);
}
What does it mean to "apply" an Annotate_rows event? It means to set the
current thd query to the one described by the event, i.e. to the query which
caused the subsequent Rows events (see "How Master writes Annotate_rows
events to the binary log" to follow what happens further when the subsequent
Rows events are applied):
int Annotate_rows_log_event::do_apply_event(...)
{
thd->set_query(m_query_txt, m_query_len);
}
NOTE. I am not sure, but possibly the current values of thd->query and
thd->query_length should be saved before calling set_query() and restored
when the Annotate_rows_log_event object is deleted.
Is it really needed?
After calling this do_apply_event() function we must not delete the
Annotate_rows_log_event object immediately (see exec_relay_log_event()
above), because thd->query now points to the string inside this object.
We may keep the pointer to this object in the Relay_log_info:
class Relay_log_info
{
public:
...
void set_annotate_event(Annotate_rows_log_event*);
Annotate_rows_log_event* get_annotate_event();
void free_annotate_event();
...
private:
Annotate_rows_log_event* m_annotate_event;
};
The saved Annotate_rows object should be deleted when all corresponding
Rows events have been processed:
int exec_relay_log_event(THD* thd, Relay_log_info* rli)
{ ...
Log_event *ev= next_event(rli);
...
apply_event_and_update_pos(ev, ...);
if (rli->get_annotate_event() && is_last_rows_event(ev))
rli->free_annotate_event();
else if (ev->get_type_code() == ANNOTATE_ROWS_EVENT)
rli->set_annotate_event((Annotate_rows_log_event*) ev);
else if (ev->get_type_code() != FORMAT_DESCRIPTION_EVENT)
delete ev;
...
}
where
bool is_last_rows_event(Log_event* ev)
{
Log_event_type type= ev->get_type_code();
if (IS_ROWS_EVENT_TYPE(type))
{
Rows_log_event* rows= (Rows_log_event*)ev;
return rows->get_flags(Rows_log_event::STMT_END_F);
}
return 0;
}
#define IS_ROWS_EVENT_TYPE(type) ((type) == WRITE_ROWS_EVENT || \
(type) == UPDATE_ROWS_EVENT || \
(type) == DELETE_ROWS_EVENT)
8. General remarks
~~~~~~~~~~~~~~~~~~
Kristian noticed that introducing a new log event type should be coordinated
somehow with MySQL/Sun:
Kristian: The numeric code for this event must be assigned carefully.
It should be coordinated with MySQL/Sun, otherwise we can get into a
situation where MySQL uses the same numeric code for one event that
MariaDB uses for ANNOTATE_ROWS_EVENT, which would make merging the two
impossible.
Alex: I reserved about 20 numbers to avoid possible conflicts
with MySQL.
Kristian: Still, I think it would be appropriate to send a polite email
to internals@lists.mysql.com about this, suggesting to reserve the
event number.
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v3.5.9)
1
0

[Maria-developers] Progress (by Knielsen): Store in binlog text of statements that caused RBR events (47)
by worklog-noreply@askmonty.org 07 Jun '10
by worklog-noreply@askmonty.org 07 Jun '10
07 Jun '10
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: Store in binlog text of statements that caused RBR events
CREATION DATE..: Sat, 15 Aug 2009, 23:48
SUPERVISOR.....: Monty
IMPLEMENTOR....:
COPIES TO......: Knielsen, Serg
CATEGORY.......: Server-Sprint
TASK ID........: 47 (http://askmonty.org/worklog/?tid=47)
VERSION........: Server-9.x
STATUS.........: Code-Review
PRIORITY.......: 60
WORKED HOURS...: 41
ESTIMATE.......: 0 (hours remain)
ORIG. ESTIMATE.: 35
PROGRESS NOTES:
-=-=(Knielsen - Mon, 07 Jun 2010, 07:13)=-=-
Help debug some test failures seen in Buildbot.
Worked 6 hours and estimate 0 hours remain (original estimate increased by 6 hours).
-=-=(Knielsen - Mon, 31 May 2010, 06:49)=-=-
Help Alexi debug+fix some test problems in the patch.
Worked 4 hours and estimate 0 hours remain (original estimate unchanged).
-=-=(Knielsen - Tue, 25 May 2010, 08:29)=-=-
Help debug strange problem in mysqlbinlog.test.
Worked 1 hour and estimate 4 hours remain (original estimate unchanged).
-=-=(Knielsen - Mon, 17 May 2010, 08:45)=-=-
Merge with latest trunk and run Buildbot tests.
Worked 1 hour and estimate 5 hours remain (original estimate unchanged).
-=-=(Knielsen - Wed, 05 May 2010, 13:53)=-=-
Review of fixes to first review done. No new issues found.
Worked 2 hours and estimate 6 hours remain (original estimate unchanged).
-=-=(Knielsen - Fri, 23 Apr 2010, 12:51)=-=-
Status updated.
--- /tmp/wklog.47.old.28747 2010-04-23 12:51:36.000000000 +0000
+++ /tmp/wklog.47.new.28747 2010-04-23 12:51:36.000000000 +0000
@@ -1 +1 @@
-In-Progress
+Code-Review
-=-=(Knielsen - Tue, 06 Apr 2010, 15:26)=-=-
Code review (mailed to maria-developers@).
Worked 7 hours and estimate 8 hours remain (original estimate unchanged).
-=-=(Knielsen - Tue, 06 Apr 2010, 15:25)=-=-
Status updated.
--- /tmp/wklog.47.old.12734 2010-04-06 15:25:54.000000000 +0000
+++ /tmp/wklog.47.new.12734 2010-04-06 15:25:54.000000000 +0000
@@ -1 +1 @@
-Code-Review
+In-Progress
-=-=(Knielsen - Mon, 29 Mar 2010, 10:59)=-=-
Status updated.
--- /tmp/wklog.47.old.27790 2010-03-29 10:59:53.000000000 +0000
+++ /tmp/wklog.47.new.27790 2010-03-29 10:59:53.000000000 +0000
@@ -1 +1 @@
-In-Progress
+Code-Review
-=-=(Alexi - Thu, 18 Feb 2010, 19:29)=-=-
Worked 20 hours (alexi)
Worked 20 hours and estimate 15 hours remain (original estimate unchanged).
------------------------------------------------------------
-=-=(View All Progress Notes, 33 total)=-=-
http://askmonty.org/worklog/index.pl?tid=47&nolimit=1
DESCRIPTION:
Store in binlog (and show in mysqlbinlog output) texts of statements that
caused RBR events
This is needed for (list from Monty):
- Easier to understand why updates happened
- Would make it easier to find out where in application things went
wrong (as you can search for exact strings)
- Allow one to filter things based on comments in the statement.
The cost of this can be that the binlog will be approximately 2x in size
(especially insert of big blob's would be a bit painful), so this should
be an optional feature.
HIGH-LEVEL SPECIFICATION:
Content
~~~~~~~
1. Annotate_rows_log_event
2. Server option: --binlog-annotate-rows-events
3. Server option: --replicate-annotate-rows-events
4. mysqlbinlog option: --print-annotate-rows-events
5. mysqlbinlog output
1. Annotate_rows_log_event [ ANNOTATE_ROWS_EVENT ]
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Describes the query which caused the corresponding rows events. Has empty
post-header and contains the query text in its data part. Example:
************************
ANNOTATE_ROWS_EVENT
************************
00000220 | B6 A0 2C 4B | time_when = 1261215926
00000224 | 33 | event_type = 51
00000225 | 64 00 00 00 | server_id = 100
00000229 | 36 00 00 00 | event_len = 54
0000022D | 56 02 00 00 | log_pos = 00000256
00000231 | 00 00 | flags = <none>
------------------------
00000233 | 49 4E 53 45 | query = "INSERT INTO t1 VALUES (1), (2), (3)"
00000237 | 52 54 20 49 |
0000023B | 4E 54 4F 20 |
0000023F | 74 31 20 56 |
00000243 | 41 4C 55 45 |
00000247 | 53 20 28 31 |
0000024B | 29 2C 20 28 |
0000024F | 32 29 2C 20 |
00000253 | 28 33 29 |
************************
In binary log, Annotate_rows event follows the (possible) 'BEGIN' Query event
and precedes the first of Table map events which accompany the corresponding
rows events. (See example in the "mysqlbinlog output" section below.)
2. Server option: --binlog-annotate-rows-events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Tells the master to write Annotate_rows events to the binary log.
* Variable Name: binlog_annotate_rows_events
* Scope: Global & Session
* Access Type: Dynamic
* Data Type: bool
* Default Value: OFF
NOTE. Session values allows to annotate only some selected statements:
...
SET SESSION binlog_annotate_rows_events=ON;
... statements to be annotated ...
SET SESSION binlog_annotate_rows_events=OFF;
... statements not to be annotated ...
3. Server option: --replicate-annotate-rows-events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Tells the slave to reproduce Annotate_rows events recieved from the master
in its own binary log (sensible only in pair with log-slave-updates option).
* Variable Name: replicate_annotate_rows_events
* Scope: Global
* Access Type: Read only
* Data Type: bool
* Default Value: OFF
NOTE. Why do we additionally need this 'replicate' option? Why not to make
the slave to reproduce this events when its binlog-annotate-rows-events
global value is ON? Well, because, for example, we may want to configure
the slave which should reproduce Annotate_rows events but has global
binlog-annotate-rows-events = OFF meaning this to be the default value for
the client threads (see also "How slave treats replicate-annotate-rows-events
option" in LLD part).
4. mysqlbinlog option: --print-annotate-rows-events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
With this option, mysqlbinlog prints the content of Annotate_rows events (if
the binary log does contain them). Without this option (i.e. by default),
mysqlbinlog skips Annotate_rows events.
5. mysqlbinlog output
~~~~~~~~~~~~~~~~~~~~~
With --print-annotate-rows-events, mysqlbinlog outputs Annotate_rows events
in a form like this:
...
# at 1646
#091219 12:45:26 server id 100 end_log_pos 1714 Query thread_id=1
exec_time=0 error_code=0
SET TIMESTAMP=1261215926/*!*/;
BEGIN
/*!*/;
# at 1714
# at 1812
# at 1853
# at 1894
# at 1938
#091219 12:45:26 server id 100 end_log_pos 1812 Query: `DELETE t1, t2 FROM
t1 INNER JOIN t2 INNER JOIN t3 WHERE t1.a=t2.a AND t2.a=t3.a`
#091219 12:45:26 server id 100 end_log_pos 1853 Table_map: `test`.`t1`
mapped to number 16
#091219 12:45:26 server id 100 end_log_pos 1894 Table_map: `test`.`t2`
mapped to number 17
#091219 12:45:26 server id 100 end_log_pos 1938 Delete_rows: table id 16
#091219 12:45:26 server id 100 end_log_pos 1982 Delete_rows: table id 17
flags: STMT_END_F
...
LOW-LEVEL DESIGN:
Content
~~~~~~~
1. Annotate_rows event number
2. Outline of Annotate_rows event behavior
3. How Master writes Annotate_rows events to the binary log
4. How slave treats replicate-annotate-rows-events option
5. How slave IO thread requests Annotate_rows events
6. How master executes the request
7. How slave SQL thread processes Annotate_rows events
8. General remarks
1. Annotate_rows event number
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
To avoid possible event numbers conflict with MySQL/Sun, we leave a gap
between the last MySQL event number and the Annotate_rows event number:
enum Log_event_type
{ ...
INCIDENT_EVENT= 26,
// New MySQL event numbers are to be added here
MYSQL_EVENTS_END,
MARIA_EVENTS_BEGIN= 51,
// New Maria event numbers start from here
ANNOTATE_ROWS_EVENT= 51,
ENUM_END_EVENT
};
together with the corresponding extension of 'post_header_len' array in the
Format description event. (This extension does not affect the compatibility
of the binary log). Here is how Format description event looks like with
this extension:
************************
FORMAT_DESCRIPTION_EVENT
************************
00000004 | A1 A0 2C 4B | time_when = 1261215905
00000008 | 0F | event_type = 15
00000009 | 64 00 00 00 | server_id = 100
0000000D | 7F 00 00 00 | event_len = 127
00000011 | 83 00 00 00 | log_pos = 00000083
00000015 | 01 00 | flags = LOG_EVENT_BINLOG_IN_USE_F
------------------------
00000017 | 04 00 | binlog_ver = 4
00000019 | 35 2E 32 2E | server_ver = 5.2.0-MariaDB-alpha-debug-log
..... ...
0000004B | A1 A0 2C 4B | time_created = 1261215905
0000004F | 13 | common_header_len = 19
------------------------
post_header_len
------------------------
00000050 | 38 | 56 - START_EVENT_V3 [1]
..... ...
00000069 | 02 | 2 - INCIDENT_EVENT [26]
0000006A | 00 | 0 - RESERVED [27]
..... ...
00000081 | 00 | 0 - RESERVED [50]
00000082 | 00 | 0 - ANNOTATE_ROWS_EVENT [51]
************************
2. Outline of Annotate_rows event behavior
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Each Annotate_rows_log_event object has two private members describing the
corresponding query:
char *m_query_txt;
uint m_query_len;
When the object is created for writing to a binary log, this query is taken
from 'thd' (for short, below we omit the 'Annotate_rows_log_event::' prefix
as well as other implementation details):
Annotate_rows_log_event(THD *thd)
{
m_query_txt = thd->query();
m_query_len = thd->query_length();
}
When the object is read from a binary log, the query is taken from the buffer
containing the binary log representation of the event (this buffer is allocated
in Log_event object from which all Log events are derived):
Annotate_rows_log_event(char *buf, uint event_len,
Format_description_log_event *desc)
{
m_query_len = event_len - desc->common_header_len;
m_query_txt = buf + desc->common_header_len;
}
The events are written to the binary log by the Log_event::write() member
which calls virtual write_data_header() and write_data_body() members
("data header" and "post header" are synonym in replication terminology).
In our case, data header is empty and data body is just the query:
bool write_data_body(IO_CACHE *file)
{
return my_b_safe_write(file, (uchar*) m_query_txt, m_query_len);
}
Printing the event is just printing the query:
void Annotate_rows_log_event::print(FILE *file, PRINT_EVENT_INFO *pinfo)
{
my_b_printf(&pinfo->head_cache, "\tQuery: `%s`\n", m_query_txt);
}
3. How Master writes Annotate_rows events to the binary log
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The event is written to the binary log just before the group of Table_map
events which precede corresponding Rows events (one query may generate
several Table map events in the binary log, but the corresponding
Annotate_rows event must be written only once before the first Table map
event; hence the boolean variable 'with_annotate' below):
int write_locked_table_maps(THD *thd)
{ ...
bool with_annotate= thd->variables.binlog_annotate_rows_events;
...
for (uint i= 0; i < ... <number of tables> ...; ++i)
{ ...
thd->binlog_write_table_map(table, ..., with_annotate);
with_annotate= 0; // write Annotate_event not more than once
...
}
...
}
int THD::binlog_write_table_map(TABLE *table, ..., bool with_annotate)
{ ...
Table_map_log_event the_event(...);
...
if (with_annotate)
{
Annotate_rows_log_event anno(this);
mysql_bin_log.write(&anno);
}
mysql_bin_log.write(&the_event);
...
}
4. How slave treats replicate-annotate-rows-events option
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The replicate-annotate-rows-events option is treated just as the session
value of the binlog_annotate_rows_events variable for the slave IO and
SQL threads. This setting is done during initialization of these threads:
pthread_handler_t handle_slave_io(void *arg)
{
THD *thd= new THD;
...
init_slave_thread(thd, SLAVE_THD_IO);
...
}
pthread_handler_t handle_slave_sql(void *arg)
{
THD *thd= new THD;
...
init_slave_thread(thd, SLAVE_THD_SQL);
...
}
int init_slave_thread(THD* thd, SLAVE_THD_TYPE thd_type)
{ ...
thd->variables.binlog_annotate_rows_events=
opt_replicate_annotate_rows_events;
...
}
5. How slave IO thread requests Annotate_rows events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
If the replicate-annotate-rows-events option is not set on a slave, there
is no need for master to send Annotate_rows events to this slave. The slave
(or mysqlbinlog in remote case), before requesting binlog dump via the
COM_BINLOG_DUMP command, informs the master whether it should send these
events by executing the newly added COM_BINLOG_DUMP_OPTIONS_EXT server
command:
case COM_BINLOG_DUMP_OPTIONS_EXT:
thd->binlog_dump_flags_ext= packet[0];
my_ok(thd);
break;
Note. We add this new command and don't use COM_BINLOG_DUMP to avoid possible
conflicts with MySQL/Sun.
6. How master executes the request
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
case COM_BINLOG_DUMP:
{ ...
flags= uint2korr(packet + 4);
...
mysql_binlog_send(thd, ..., flags);
...
}
void mysql_binlog_send(THD* thd, ..., ushort flags)
{ ...
Log_event::read_log_event(&log, packet, ...);
...
if ((*packet)[EVENT_TYPE_OFFSET + 1] != ANNOTATE_ROWS_EVENT ||
flags & BINLOG_SEND_ANNOTATE_ROWS_EVENT)
{
my_net_write(net, packet->ptr(), packet->length());
}
...
}
7. How slave SQL thread processes Annotate_rows events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The slave processes each recieved event by "applying" it, i.e. by
calling the Log_event::apply_event() function which in turn calls
the virtual do_apply_event() member specific for each type of the
event.
int exec_relay_log_event(THD* thd, Relay_log_info* rli)
{ ...
Log_event *ev = next_event(rli);
...
apply_event_and_update_pos(ev, ...);
if (ev->get_type_code() != FORMAT_DESCRIPTION_EVENT)
delete ev;
...
}
int apply_event_and_update_pos(Log_event *ev, ...)
{ ...
ev->apply_event(...);
...
}
int Log_event::apply_event(...)
{
return do_apply_event(...);
}
What does it mean to "apply" an Annotate_rows event? It means to set current
thd query to that of the described by the event, i.e. to the query which
caused the subsequent Rows events (see "How Master writes Annotate_rows
events to the binary log" to follow what happens further when the subsequent
Rows events are applied):
int Annotate_rows_log_event::do_apply_event(...)
{
thd->set_query(m_query_txt, m_query_len);
}
NOTE. I am not sure, but possibly current values of thd->query and
thd->query_length should be saved before calling set_query() and to be
restored on the Annotate_rows_log_event object deletion.
Is it really needed ?
After calling this do_apply_event() function we may not delete the
Annotate_rows_log_event object immediatedly (see exec_relay_log_event()
above) because thd->query now points to the string inside this object.
We may keep the pointer to this object in the Relay_log_info:
class Relay_log_info
{
public:
...
void set_annotate_event(Annotate_rows_log_event*);
Annotate_rows_log_event* get_annotate_event();
void free_annotate_event();
...
private:
Annotate_rows_log_event* m_annotate_event;
};
The saved Annotate_rows object should be deleted when all corresponding
Rows events will be processed:
int exec_relay_log_event(THD* thd, Relay_log_info* rli)
{ ...
Log_event *ev= next_event(rli);
...
apply_event_and_update_pos(ev, ...);
if (rli->get_annotate_event() && is_last_rows_event(ev))
rli->free_annotate_event();
else if (ev->get_type_code() == ANNOTATE_ROWS_EVENT)
rli->set_annotate_event((Annotate_rows_log_event*) ev);
else if (ev->get_type_code() != FORMAT_DESCRIPTION_EVENT)
delete ev;
...
}
where
bool is_last_rows_event(Log_event* ev)
{
Log_event_type type= ev->get_type_code();
if (IS_ROWS_EVENT_TYPE(type))
{
Rows_log_event* rows= (Rows_log_event*)ev;
return rows->get_flags(Rows_log_event::STMT_END_F);
}
return 0;
}
#define IS_ROWS_EVENT_TYPE(type) ((type) == WRITE_ROWS_EVENT || \
(type) == UPDATE_ROWS_EVENT || \
(type) == DELETE_ROWS_EVENT)
8. General remarks
~~~~~~~~~~~~~~~~~~
Kristian noticed that introducing new log event type should be coordinated
somehow with MySQL/Sun:
Kristian: The numeric code for this event must be assigned carefully.
It should be coordinated with MySQL/Sun, otherwise we can get into a
situation where MySQL uses the same numeric code for one event that
MariaDB uses for ANNOTATE_ROWS_EVENT, which would make merging the two
impossible.
Alex: I reserved about 20 numbers not to have possible conflicts
with MySQL.
Kristian: Still, I think it would be appropriate to send a polite email
to internals(a)lists.mysql.com about this and suggesting to reserve the
event number.
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v3.5.9)
1
0

[Maria-developers] Progress (by Knielsen): Store in binlog text of statements that caused RBR events (47)
by worklog-noreply@askmonty.org 07 Jun '10
by worklog-noreply@askmonty.org 07 Jun '10
07 Jun '10
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: Store in binlog text of statements that caused RBR events
CREATION DATE..: Sat, 15 Aug 2009, 23:48
SUPERVISOR.....: Monty
IMPLEMENTOR....:
COPIES TO......: Knielsen, Serg
CATEGORY.......: Server-Sprint
TASK ID........: 47 (http://askmonty.org/worklog/?tid=47)
VERSION........: Server-9.x
STATUS.........: Code-Review
PRIORITY.......: 60
WORKED HOURS...: 41
ESTIMATE.......: 0 (hours remain)
ORIG. ESTIMATE.: 35
PROGRESS NOTES:
-=-=(Knielsen - Mon, 07 Jun 2010, 07:13)=-=-
Help debug some test failures seen in Buildbot.
Worked 6 hours and estimate 0 hours remain (original estimate increased by 6 hours).
-=-=(Knielsen - Mon, 31 May 2010, 06:49)=-=-
Help Alexi debug+fix some test problems in the patch.
Worked 4 hours and estimate 0 hours remain (original estimate unchanged).
-=-=(Knielsen - Tue, 25 May 2010, 08:29)=-=-
Help debug strange problem in mysqlbinlog.test.
Worked 1 hour and estimate 4 hours remain (original estimate unchanged).
-=-=(Knielsen - Mon, 17 May 2010, 08:45)=-=-
Merge with latest trunk and run Buildbot tests.
Worked 1 hour and estimate 5 hours remain (original estimate unchanged).
-=-=(Knielsen - Wed, 05 May 2010, 13:53)=-=-
Review of fixes to first review done. No new issues found.
Worked 2 hours and estimate 6 hours remain (original estimate unchanged).
-=-=(Knielsen - Fri, 23 Apr 2010, 12:51)=-=-
Status updated.
--- /tmp/wklog.47.old.28747 2010-04-23 12:51:36.000000000 +0000
+++ /tmp/wklog.47.new.28747 2010-04-23 12:51:36.000000000 +0000
@@ -1 +1 @@
-In-Progress
+Code-Review
-=-=(Knielsen - Tue, 06 Apr 2010, 15:26)=-=-
Code review (mailed to maria-developers@).
Worked 7 hours and estimate 8 hours remain (original estimate unchanged).
-=-=(Knielsen - Tue, 06 Apr 2010, 15:25)=-=-
Status updated.
--- /tmp/wklog.47.old.12734 2010-04-06 15:25:54.000000000 +0000
+++ /tmp/wklog.47.new.12734 2010-04-06 15:25:54.000000000 +0000
@@ -1 +1 @@
-Code-Review
+In-Progress
-=-=(Knielsen - Mon, 29 Mar 2010, 10:59)=-=-
Status updated.
--- /tmp/wklog.47.old.27790 2010-03-29 10:59:53.000000000 +0000
+++ /tmp/wklog.47.new.27790 2010-03-29 10:59:53.000000000 +0000
@@ -1 +1 @@
-In-Progress
+Code-Review
-=-=(Alexi - Thu, 18 Feb 2010, 19:29)=-=-
Worked 20 hours (alexi)
Worked 20 hours and estimate 15 hours remain (original estimate unchanged).
------------------------------------------------------------
-=-=(View All Progress Notes, 33 total)=-=-
http://askmonty.org/worklog/index.pl?tid=47&nolimit=1
DESCRIPTION:
Store in binlog (and show in mysqlbinlog output) texts of statements that
caused RBR events
This is needed for (list from Monty):
- Easier to understand why updates happened
- Would make it easier to find out where in application things went
wrong (as you can search for exact strings)
- Allow one to filter things based on comments in the statement.
The cost of this can be that the binlog will be approximately 2x in size
(especially insert of big blob's would be a bit painful), so this should
be an optional feature.
HIGH-LEVEL SPECIFICATION:
Content
~~~~~~~
1. Annotate_rows_log_event
2. Server option: --binlog-annotate-rows-events
3. Server option: --replicate-annotate-rows-events
4. mysqlbinlog option: --print-annotate-rows-events
5. mysqlbinlog output
1. Annotate_rows_log_event [ ANNOTATE_ROWS_EVENT ]
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Describes the query which caused the corresponding rows events. Has empty
post-header and contains the query text in its data part. Example:
************************
ANNOTATE_ROWS_EVENT
************************
00000220 | B6 A0 2C 4B | time_when = 1261215926
00000224 | 33 | event_type = 51
00000225 | 64 00 00 00 | server_id = 100
00000229 | 36 00 00 00 | event_len = 54
0000022D | 56 02 00 00 | log_pos = 00000256
00000231 | 00 00 | flags = <none>
------------------------
00000233 | 49 4E 53 45 | query = "INSERT INTO t1 VALUES (1), (2), (3)"
00000237 | 52 54 20 49 |
0000023B | 4E 54 4F 20 |
0000023F | 74 31 20 56 |
00000243 | 41 4C 55 45 |
00000247 | 53 20 28 31 |
0000024B | 29 2C 20 28 |
0000024F | 32 29 2C 20 |
00000253 | 28 33 29 |
************************
In binary log, Annotate_rows event follows the (possible) 'BEGIN' Query event
and precedes the first of Table map events which accompany the corresponding
rows events. (See example in the "mysqlbinlog output" section below.)
2. Server option: --binlog-annotate-rows-events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Tells the master to write Annotate_rows events to the binary log.
* Variable Name: binlog_annotate_rows_events
* Scope: Global & Session
* Access Type: Dynamic
* Data Type: bool
* Default Value: OFF
NOTE. Session values allows to annotate only some selected statements:
...
SET SESSION binlog_annotate_rows_events=ON;
... statements to be annotated ...
SET SESSION binlog_annotate_rows_events=OFF;
... statements not to be annotated ...
3. Server option: --replicate-annotate-rows-events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Tells the slave to reproduce Annotate_rows events recieved from the master
in its own binary log (sensible only in pair with log-slave-updates option).
* Variable Name: replicate_annotate_rows_events
* Scope: Global
* Access Type: Read only
* Data Type: bool
* Default Value: OFF
NOTE. Why do we additionally need this 'replicate' option? Why not to make
the slave to reproduce this events when its binlog-annotate-rows-events
global value is ON? Well, because, for example, we may want to configure
the slave which should reproduce Annotate_rows events but has global
binlog-annotate-rows-events = OFF meaning this to be the default value for
the client threads (see also "How slave treats replicate-annotate-rows-events
option" in LLD part).
4. mysqlbinlog option: --print-annotate-rows-events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
With this option, mysqlbinlog prints the content of Annotate_rows events (if
the binary log does contain them). Without this option (i.e. by default),
mysqlbinlog skips Annotate_rows events.
5. mysqlbinlog output
~~~~~~~~~~~~~~~~~~~~~
With --print-annotate-rows-events, mysqlbinlog outputs Annotate_rows events
in a form like this:
...
# at 1646
#091219 12:45:26 server id 100 end_log_pos 1714 Query thread_id=1
exec_time=0 error_code=0
SET TIMESTAMP=1261215926/*!*/;
BEGIN
/*!*/;
# at 1714
# at 1812
# at 1853
# at 1894
# at 1938
#091219 12:45:26 server id 100 end_log_pos 1812 Query: `DELETE t1, t2 FROM
t1 INNER JOIN t2 INNER JOIN t3 WHERE t1.a=t2.a AND t2.a=t3.a`
#091219 12:45:26 server id 100 end_log_pos 1853 Table_map: `test`.`t1`
mapped to number 16
#091219 12:45:26 server id 100 end_log_pos 1894 Table_map: `test`.`t2`
mapped to number 17
#091219 12:45:26 server id 100 end_log_pos 1938 Delete_rows: table id 16
#091219 12:45:26 server id 100 end_log_pos 1982 Delete_rows: table id 17
flags: STMT_END_F
...
LOW-LEVEL DESIGN:
Content
~~~~~~~
1. Annotate_rows event number
2. Outline of Annotate_rows event behavior
3. How Master writes Annotate_rows events to the binary log
4. How slave treats replicate-annotate-rows-events option
5. How slave IO thread requests Annotate_rows events
6. How master executes the request
7. How slave SQL thread processes Annotate_rows events
8. General remarks
1. Annotate_rows event number
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
To avoid possible event numbers conflict with MySQL/Sun, we leave a gap
between the last MySQL event number and the Annotate_rows event number:
enum Log_event_type
{ ...
INCIDENT_EVENT= 26,
// New MySQL event numbers are to be added here
MYSQL_EVENTS_END,
MARIA_EVENTS_BEGIN= 51,
// New Maria event numbers start from here
ANNOTATE_ROWS_EVENT= 51,
ENUM_END_EVENT
};
together with the corresponding extension of 'post_header_len' array in the
Format description event. (This extension does not affect the compatibility
of the binary log). Here is how Format description event looks like with
this extension:
************************
FORMAT_DESCRIPTION_EVENT
************************
00000004 | A1 A0 2C 4B | time_when = 1261215905
00000008 | 0F | event_type = 15
00000009 | 64 00 00 00 | server_id = 100
0000000D | 7F 00 00 00 | event_len = 127
00000011 | 83 00 00 00 | log_pos = 00000083
00000015 | 01 00 | flags = LOG_EVENT_BINLOG_IN_USE_F
------------------------
00000017 | 04 00 | binlog_ver = 4
00000019 | 35 2E 32 2E | server_ver = 5.2.0-MariaDB-alpha-debug-log
..... ...
0000004B | A1 A0 2C 4B | time_created = 1261215905
0000004F | 13 | common_header_len = 19
------------------------
post_header_len
------------------------
00000050 | 38 | 56 - START_EVENT_V3 [1]
..... ...
00000069 | 02 | 2 - INCIDENT_EVENT [26]
0000006A | 00 | 0 - RESERVED [27]
..... ...
00000081 | 00 | 0 - RESERVED [50]
00000082 | 00 | 0 - ANNOTATE_ROWS_EVENT [51]
************************
2. Outline of Annotate_rows event behavior
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Each Annotate_rows_log_event object has two private members describing the
corresponding query:
char *m_query_txt;
uint m_query_len;
When the object is created for writing to a binary log, this query is taken
from 'thd' (for short, below we omit the 'Annotate_rows_log_event::' prefix
as well as other implementation details):
Annotate_rows_log_event(THD *thd)
{
m_query_txt = thd->query();
m_query_len = thd->query_length();
}
When the object is read from a binary log, the query is taken from the buffer
containing the binary log representation of the event (this buffer is allocated
in Log_event object from which all Log events are derived):
Annotate_rows_log_event(char *buf, uint event_len,
Format_description_log_event *desc)
{
m_query_len = event_len - desc->common_header_len;
m_query_txt = buf + desc->common_header_len;
}
The events are written to the binary log by the Log_event::write() member
which calls virtual write_data_header() and write_data_body() members
("data header" and "post header" are synonym in replication terminology).
In our case, data header is empty and data body is just the query:
bool write_data_body(IO_CACHE *file)
{
return my_b_safe_write(file, (uchar*) m_query_txt, m_query_len);
}
Printing the event is just printing the query:
void Annotate_rows_log_event::print(FILE *file, PRINT_EVENT_INFO *pinfo)
{
my_b_printf(&pinfo->head_cache, "\tQuery: `%s`\n", m_query_txt);
}
3. How Master writes Annotate_rows events to the binary log
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The event is written to the binary log just before the group of Table_map
events which precede corresponding Rows events (one query may generate
several Table map events in the binary log, but the corresponding
Annotate_rows event must be written only once before the first Table map
event; hence the boolean variable 'with_annotate' below):
int write_locked_table_maps(THD *thd)
{ ...
bool with_annotate= thd->variables.binlog_annotate_rows_events;
...
for (uint i= 0; i < ... <number of tables> ...; ++i)
{ ...
thd->binlog_write_table_map(table, ..., with_annotate);
with_annotate= 0; // write Annotate_event not more than once
...
}
...
}
int THD::binlog_write_table_map(TABLE *table, ..., bool with_annotate)
{ ...
Table_map_log_event the_event(...);
...
if (with_annotate)
{
Annotate_rows_log_event anno(this);
mysql_bin_log.write(&anno);
}
mysql_bin_log.write(&the_event);
...
}
4. How slave treats replicate-annotate-rows-events option
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The replicate-annotate-rows-events option is treated just as the session
value of the binlog_annotate_rows_events variable for the slave IO and
SQL threads. This setting is done during initialization of these threads:
pthread_handler_t handle_slave_io(void *arg)
{
THD *thd= new THD;
...
init_slave_thread(thd, SLAVE_THD_IO);
...
}
pthread_handler_t handle_slave_sql(void *arg)
{
THD *thd= new THD;
...
init_slave_thread(thd, SLAVE_THD_SQL);
...
}
int init_slave_thread(THD* thd, SLAVE_THD_TYPE thd_type)
{ ...
thd->variables.binlog_annotate_rows_events=
opt_replicate_annotate_rows_events;
...
}
5. How slave IO thread requests Annotate_rows events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
If the replicate-annotate-rows-events option is not set on a slave, there
is no need for master to send Annotate_rows events to this slave. The slave
(or mysqlbinlog in remote case), before requesting binlog dump via the
COM_BINLOG_DUMP command, informs the master whether it should send these
events by executing the newly added COM_BINLOG_DUMP_OPTIONS_EXT server
command:
case COM_BINLOG_DUMP_OPTIONS_EXT:
thd->binlog_dump_flags_ext= packet[0];
my_ok(thd);
break;
Note. We add this new command and don't use COM_BINLOG_DUMP to avoid possible
conflicts with MySQL/Sun.
6. How master executes the request
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
case COM_BINLOG_DUMP:
{ ...
flags= uint2korr(packet + 4);
...
mysql_binlog_send(thd, ..., flags);
...
}
void mysql_binlog_send(THD* thd, ..., ushort flags)
{ ...
Log_event::read_log_event(&log, packet, ...);
...
if ((*packet)[EVENT_TYPE_OFFSET + 1] != ANNOTATE_ROWS_EVENT ||
flags & BINLOG_SEND_ANNOTATE_ROWS_EVENT)
{
my_net_write(net, packet->ptr(), packet->length());
}
...
}
7. How slave SQL thread processes Annotate_rows events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The slave processes each recieved event by "applying" it, i.e. by
calling the Log_event::apply_event() function which in turn calls
the virtual do_apply_event() member specific for each type of the
event.
int exec_relay_log_event(THD* thd, Relay_log_info* rli)
{ ...
Log_event *ev = next_event(rli);
...
apply_event_and_update_pos(ev, ...);
if (ev->get_type_code() != FORMAT_DESCRIPTION_EVENT)
delete ev;
...
}
int apply_event_and_update_pos(Log_event *ev, ...)
{ ...
ev->apply_event(...);
...
}
int Log_event::apply_event(...)
{
return do_apply_event(...);
}
What does it mean to "apply" an Annotate_rows event? It means to set current
thd query to that of the described by the event, i.e. to the query which
caused the subsequent Rows events (see "How Master writes Annotate_rows
events to the binary log" to follow what happens further when the subsequent
Rows events are applied):
int Annotate_rows_log_event::do_apply_event(...)
{
thd->set_query(m_query_txt, m_query_len);
}
NOTE. I am not sure, but possibly current values of thd->query and
thd->query_length should be saved before calling set_query() and to be
restored on the Annotate_rows_log_event object deletion.
Is it really needed ?
After calling this do_apply_event() function we may not delete the
Annotate_rows_log_event object immediatedly (see exec_relay_log_event()
above) because thd->query now points to the string inside this object.
We may keep the pointer to this object in the Relay_log_info:
class Relay_log_info
{
public:
...
void set_annotate_event(Annotate_rows_log_event*);
Annotate_rows_log_event* get_annotate_event();
void free_annotate_event();
...
private:
Annotate_rows_log_event* m_annotate_event;
};
The saved Annotate_rows object should be deleted when all corresponding
Rows events will be processed:
int exec_relay_log_event(THD* thd, Relay_log_info* rli)
{ ...
Log_event *ev= next_event(rli);
...
apply_event_and_update_pos(ev, ...);
if (rli->get_annotate_event() && is_last_rows_event(ev))
rli->free_annotate_event();
else if (ev->get_type_code() == ANNOTATE_ROWS_EVENT)
rli->set_annotate_event((Annotate_rows_log_event*) ev);
else if (ev->get_type_code() != FORMAT_DESCRIPTION_EVENT)
delete ev;
...
}
where
bool is_last_rows_event(Log_event* ev)
{
Log_event_type type= ev->get_type_code();
if (IS_ROWS_EVENT_TYPE(type))
{
Rows_log_event* rows= (Rows_log_event*)ev;
return rows->get_flags(Rows_log_event::STMT_END_F);
}
return 0;
}
#define IS_ROWS_EVENT_TYPE(type) ((type) == WRITE_ROWS_EVENT || \
(type) == UPDATE_ROWS_EVENT || \
(type) == DELETE_ROWS_EVENT)
8. General remarks
~~~~~~~~~~~~~~~~~~
Kristian noticed that introducing a new log event type should be
coordinated with MySQL/Sun:
Kristian: The numeric code for this event must be assigned carefully.
It should be coordinated with MySQL/Sun, otherwise we can get into a
situation where MySQL uses the same numeric code for one event that
MariaDB uses for ANNOTATE_ROWS_EVENT, which would make merging the two
impossible.
Alex: I reserved about 20 numbers to avoid possible conflicts
with MySQL.
Kristian: Still, I think it would be appropriate to send a polite email
to internals@lists.mysql.com about this, suggesting to reserve the
event number.
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v3.5.9)
[Maria-developers] Progress (by Knielsen): Store in binlog text of statements that caused RBR events (47)
by worklog-noreply@askmonty.org 07 Jun '10
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: Store in binlog text of statements that caused RBR events
CREATION DATE..: Sat, 15 Aug 2009, 23:48
SUPERVISOR.....: Monty
IMPLEMENTOR....:
COPIES TO......: Knielsen, Serg
CATEGORY.......: Server-Sprint
TASK ID........: 47 (http://askmonty.org/worklog/?tid=47)
VERSION........: Server-9.x
STATUS.........: Code-Review
PRIORITY.......: 60
WORKED HOURS...: 41
ESTIMATE.......: 0 (hours remain)
ORIG. ESTIMATE.: 35
PROGRESS NOTES:
-=-=(Knielsen - Mon, 07 Jun 2010, 07:13)=-=-
Help debug some test failures seen in Buildbot.
Worked 6 hours and estimate 0 hours remain (original estimate increased by 6 hours).
-=-=(Knielsen - Mon, 31 May 2010, 06:49)=-=-
Help Alexi debug+fix some test problems in the patch.
Worked 4 hours and estimate 0 hours remain (original estimate unchanged).
-=-=(Knielsen - Tue, 25 May 2010, 08:29)=-=-
Help debug strange problem in mysqlbinlog.test.
Worked 1 hour and estimate 4 hours remain (original estimate unchanged).
-=-=(Knielsen - Mon, 17 May 2010, 08:45)=-=-
Merge with latest trunk and run Buildbot tests.
Worked 1 hour and estimate 5 hours remain (original estimate unchanged).
-=-=(Knielsen - Wed, 05 May 2010, 13:53)=-=-
Review of fixes to first review done. No new issues found.
Worked 2 hours and estimate 6 hours remain (original estimate unchanged).
-=-=(Knielsen - Fri, 23 Apr 2010, 12:51)=-=-
Status updated.
--- /tmp/wklog.47.old.28747 2010-04-23 12:51:36.000000000 +0000
+++ /tmp/wklog.47.new.28747 2010-04-23 12:51:36.000000000 +0000
@@ -1 +1 @@
-In-Progress
+Code-Review
-=-=(Knielsen - Tue, 06 Apr 2010, 15:26)=-=-
Code review (mailed to maria-developers@).
Worked 7 hours and estimate 8 hours remain (original estimate unchanged).
-=-=(Knielsen - Tue, 06 Apr 2010, 15:25)=-=-
Status updated.
--- /tmp/wklog.47.old.12734 2010-04-06 15:25:54.000000000 +0000
+++ /tmp/wklog.47.new.12734 2010-04-06 15:25:54.000000000 +0000
@@ -1 +1 @@
-Code-Review
+In-Progress
-=-=(Knielsen - Mon, 29 Mar 2010, 10:59)=-=-
Status updated.
--- /tmp/wklog.47.old.27790 2010-03-29 10:59:53.000000000 +0000
+++ /tmp/wklog.47.new.27790 2010-03-29 10:59:53.000000000 +0000
@@ -1 +1 @@
-In-Progress
+Code-Review
-=-=(Alexi - Thu, 18 Feb 2010, 19:29)=-=-
Worked 20 hours (alexi)
Worked 20 hours and estimate 15 hours remain (original estimate unchanged).
------------------------------------------------------------
-=-=(View All Progress Notes, 33 total)=-=-
http://askmonty.org/worklog/index.pl?tid=47&nolimit=1
DESCRIPTION:
Store in binlog (and show in mysqlbinlog output) texts of statements that
caused RBR events
This is needed for (list from Monty):
- Easier to understand why updates happened
- Would make it easier to find out where in the application things went
wrong (as you can search for exact strings)
- Allow one to filter things based on comments in the statement.
The cost of this can be that the binlog becomes approximately 2x in size
(especially inserts of big blobs would be a bit painful), so this should
be an optional feature.
HIGH-LEVEL SPECIFICATION:
Content
~~~~~~~
1. Annotate_rows_log_event
2. Server option: --binlog-annotate-rows-events
3. Server option: --replicate-annotate-rows-events
4. mysqlbinlog option: --print-annotate-rows-events
5. mysqlbinlog output
1. Annotate_rows_log_event [ ANNOTATE_ROWS_EVENT ]
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Describes the query which caused the corresponding rows events. It has an
empty post-header and contains the query text in its data part. Example:
************************
ANNOTATE_ROWS_EVENT
************************
00000220 | B6 A0 2C 4B | time_when = 1261215926
00000224 | 33 | event_type = 51
00000225 | 64 00 00 00 | server_id = 100
00000229 | 36 00 00 00 | event_len = 54
0000022D | 56 02 00 00 | log_pos = 00000256
00000231 | 00 00 | flags = <none>
------------------------
00000233 | 49 4E 53 45 | query = "INSERT INTO t1 VALUES (1), (2), (3)"
00000237 | 52 54 20 49 |
0000023B | 4E 54 4F 20 |
0000023F | 74 31 20 56 |
00000243 | 41 4C 55 45 |
00000247 | 53 20 28 31 |
0000024B | 29 2C 20 28 |
0000024F | 32 29 2C 20 |
00000253 | 28 33 29 |
************************
In the binary log, the Annotate_rows event follows the (possible) 'BEGIN'
Query event and precedes the first of the Table map events which accompany
the corresponding rows events. (See the example in the "mysqlbinlog output"
section below.)
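For a single annotated statement the resulting event stream thus looks
roughly like this (an illustrative sketch based on the example below, not
actual binlog contents):
Query event: BEGIN
Annotate_rows event: "DELETE t1, t2 FROM ..."
Table_map event: `test`.`t1`
Table_map event: `test`.`t2`
Delete_rows event: table t1
Delete_rows event: table t2 (with STMT_END_F set)
Query or Xid event: COMMIT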
2. Server option: --binlog-annotate-rows-events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Tells the master to write Annotate_rows events to the binary log.
* Variable Name: binlog_annotate_rows_events
* Scope: Global & Session
* Access Type: Dynamic
* Data Type: bool
* Default Value: OFF
NOTE. The session value allows annotating only selected statements:
...
SET SESSION binlog_annotate_rows_events=ON;
... statements to be annotated ...
SET SESSION binlog_annotate_rows_events=OFF;
... statements not to be annotated ...
3. Server option: --replicate-annotate-rows-events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Tells the slave to reproduce Annotate_rows events received from the master
in its own binary log (sensible only in combination with the
log-slave-updates option).
* Variable Name: replicate_annotate_rows_events
* Scope: Global
* Access Type: Read only
* Data Type: bool
* Default Value: OFF
NOTE. Why do we additionally need this 'replicate' option? Why not make
the slave reproduce these events when its binlog-annotate-rows-events
global value is ON? Because, for example, we may want to configure a
slave which reproduces Annotate_rows events but keeps the global
binlog-annotate-rows-events = OFF as the default value for the client
threads (see also "How slave treats replicate-annotate-rows-events
option" in the LLD part).
4. mysqlbinlog option: --print-annotate-rows-events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
With this option, mysqlbinlog prints the content of Annotate_rows events
(if the binary log contains them). Without this option (i.e. by default),
mysqlbinlog skips Annotate_rows events.
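For example (an illustrative invocation; the binlog file name is a
placeholder):
mysqlbinlog --print-annotate-rows-events master-bin.000001
The same command without the option silently skips the annotations.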
5. mysqlbinlog output
~~~~~~~~~~~~~~~~~~~~~
With --print-annotate-rows-events, mysqlbinlog outputs Annotate_rows events
in a form like this:
...
# at 1646
#091219 12:45:26 server id 100 end_log_pos 1714 Query thread_id=1
exec_time=0 error_code=0
SET TIMESTAMP=1261215926/*!*/;
BEGIN
/*!*/;
# at 1714
# at 1812
# at 1853
# at 1894
# at 1938
#091219 12:45:26 server id 100 end_log_pos 1812 Query: `DELETE t1, t2 FROM
t1 INNER JOIN t2 INNER JOIN t3 WHERE t1.a=t2.a AND t2.a=t3.a`
#091219 12:45:26 server id 100 end_log_pos 1853 Table_map: `test`.`t1`
mapped to number 16
#091219 12:45:26 server id 100 end_log_pos 1894 Table_map: `test`.`t2`
mapped to number 17
#091219 12:45:26 server id 100 end_log_pos 1938 Delete_rows: table id 16
#091219 12:45:26 server id 100 end_log_pos 1982 Delete_rows: table id 17
flags: STMT_END_F
...
LOW-LEVEL DESIGN:
Content
~~~~~~~
1. Annotate_rows event number
2. Outline of Annotate_rows event behavior
3. How Master writes Annotate_rows events to the binary log
4. How slave treats replicate-annotate-rows-events option
5. How slave IO thread requests Annotate_rows events
6. How master executes the request
7. How slave SQL thread processes Annotate_rows events
8. General remarks
1. Annotate_rows event number
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
To avoid a possible event number conflict with MySQL/Sun, we leave a gap
between the last MySQL event number and the Annotate_rows event number:
enum Log_event_type
{ ...
INCIDENT_EVENT= 26,
// New MySQL event numbers are to be added here
MYSQL_EVENTS_END,
MARIA_EVENTS_BEGIN= 51,
// New Maria event numbers start from here
ANNOTATE_ROWS_EVENT= 51,
ENUM_END_EVENT
};
together with the corresponding extension of the 'post_header_len' array
in the Format description event. (This extension does not affect the
compatibility of the binary log.) Here is what the Format description
event looks like with this extension:
************************
FORMAT_DESCRIPTION_EVENT
************************
00000004 | A1 A0 2C 4B | time_when = 1261215905
00000008 | 0F | event_type = 15
00000009 | 64 00 00 00 | server_id = 100
0000000D | 7F 00 00 00 | event_len = 127
00000011 | 83 00 00 00 | log_pos = 00000083
00000015 | 01 00 | flags = LOG_EVENT_BINLOG_IN_USE_F
------------------------
00000017 | 04 00 | binlog_ver = 4
00000019 | 35 2E 32 2E | server_ver = 5.2.0-MariaDB-alpha-debug-log
..... ...
0000004B | A1 A0 2C 4B | time_created = 1261215905
0000004F | 13 | common_header_len = 19
------------------------
post_header_len
------------------------
00000050 | 38 | 56 - START_EVENT_V3 [1]
..... ...
00000069 | 02 | 2 - INCIDENT_EVENT [26]
0000006A | 00 | 0 - RESERVED [27]
..... ...
00000081 | 00 | 0 - RESERVED [50]
00000082 | 00 | 0 - ANNOTATE_ROWS_EVENT [51]
************************
2. Outline of Annotate_rows event behavior
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Each Annotate_rows_log_event object has two private members describing the
corresponding query:
char *m_query_txt;
uint m_query_len;
When the object is created for writing to a binary log, this query is taken
from 'thd' (for short, below we omit the 'Annotate_rows_log_event::' prefix
as well as other implementation details):
Annotate_rows_log_event(THD *thd)
{
m_query_txt = thd->query();
m_query_len = thd->query_length();
}
When the object is read from a binary log, the query is taken from the buffer
containing the binary log representation of the event (this buffer is
allocated in the Log_event object from which all Log events are derived):
Annotate_rows_log_event(char *buf, uint event_len,
Format_description_log_event *desc)
{
m_query_len = event_len - desc->common_header_len;
m_query_txt = buf + desc->common_header_len;
}
The events are written to the binary log by the Log_event::write() member
which calls the virtual write_data_header() and write_data_body() members
("data header" and "post header" are synonyms in replication terminology).
In our case, the data header is empty and the data body is just the query:
bool write_data_body(IO_CACHE *file)
{
return my_b_safe_write(file, (uchar*) m_query_txt, m_query_len);
}
Printing the event is just printing the query:
void Annotate_rows_log_event::print(FILE *file, PRINT_EVENT_INFO *pinfo)
{
my_b_printf(&pinfo->head_cache, "\tQuery: `%s`\n", m_query_txt);
}
3. How Master writes Annotate_rows events to the binary log
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The event is written to the binary log just before the group of Table_map
events which precede the corresponding Rows events (one query may generate
several Table map events in the binary log, but the corresponding
Annotate_rows event must be written only once, before the first Table map
event; hence the boolean variable 'with_annotate' below):
int write_locked_table_maps(THD *thd)
{ ...
bool with_annotate= thd->variables.binlog_annotate_rows_events;
...
for (uint i= 0; i < ... <number of tables> ...; ++i)
{ ...
thd->binlog_write_table_map(table, ..., with_annotate);
with_annotate= 0; // write the Annotate_rows event at most once
...
}
...
}
int THD::binlog_write_table_map(TABLE *table, ..., bool with_annotate)
{ ...
Table_map_log_event the_event(...);
...
if (with_annotate)
{
Annotate_rows_log_event anno(this);
mysql_bin_log.write(&anno);
}
mysql_bin_log.write(&the_event);
...
}
4. How slave treats replicate-annotate-rows-events option
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The replicate-annotate-rows-events option is treated just as the session
value of the binlog_annotate_rows_events variable for the slave IO and
SQL threads. This setting is done during initialization of these threads:
pthread_handler_t handle_slave_io(void *arg)
{
THD *thd= new THD;
...
init_slave_thread(thd, SLAVE_THD_IO);
...
}
pthread_handler_t handle_slave_sql(void *arg)
{
THD *thd= new THD;
...
init_slave_thread(thd, SLAVE_THD_SQL);
...
}
int init_slave_thread(THD* thd, SLAVE_THD_TYPE thd_type)
{ ...
thd->variables.binlog_annotate_rows_events=
opt_replicate_annotate_rows_events;
...
}
5. How slave IO thread requests Annotate_rows events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
If the replicate-annotate-rows-events option is not set on a slave, there
is no need for the master to send Annotate_rows events to this slave. The
slave (or mysqlbinlog in the remote case), before requesting a binlog dump
via the
COM_BINLOG_DUMP command, informs the master whether it should send these
events by executing the newly added COM_BINLOG_DUMP_OPTIONS_EXT server
command:
case COM_BINLOG_DUMP_OPTIONS_EXT:
thd->binlog_dump_flags_ext= packet[0];
my_ok(thd);
break;
Note. We add this new command and don't use COM_BINLOG_DUMP to avoid possible
conflicts with MySQL/Sun.
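On the requesting side this might look as follows (a hypothetical
client-side sketch: BINLOG_SEND_ANNOTATE_ROWS_EVENT is the flag checked by
the master in mysql_binlog_send() below, and simple_command() is the
standard client helper for one-packet commands):
uchar options_ext= opt_replicate_annotate_rows_events ?
BINLOG_SEND_ANNOTATE_ROWS_EVENT : 0;
/* inform the master before issuing COM_BINLOG_DUMP */
simple_command(mysql, COM_BINLOG_DUMP_OPTIONS_EXT, &options_ext, 1, 0);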
6. How master executes the request
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
case COM_BINLOG_DUMP:
{ ...
flags= uint2korr(packet + 4);
...
mysql_binlog_send(thd, ..., flags);
...
}
void mysql_binlog_send(THD* thd, ..., ushort flags)
{ ...
Log_event::read_log_event(&log, packet, ...);
...
if ((*packet)[EVENT_TYPE_OFFSET + 1] != ANNOTATE_ROWS_EVENT ||
flags & BINLOG_SEND_ANNOTATE_ROWS_EVENT)
{
my_net_write(net, packet->ptr(), packet->length());
}
...
}
7. How slave SQL thread processes Annotate_rows events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The slave processes each received event by "applying" it, i.e. by
calling the Log_event::apply_event() function which in turn calls
the virtual do_apply_event() member specific for each type of the
event.
int exec_relay_log_event(THD* thd, Relay_log_info* rli)
{ ...
Log_event *ev = next_event(rli);
...
apply_event_and_update_pos(ev, ...);
if (ev->get_type_code() != FORMAT_DESCRIPTION_EVENT)
delete ev;
...
}
int apply_event_and_update_pos(Log_event *ev, ...)
{ ...
ev->apply_event(...);
...
}
int Log_event::apply_event(...)
{
return do_apply_event(...);
}
What does it mean to "apply" an Annotate_rows event? It means to set the
current thd query to the one described by the event, i.e. to the query which
caused the subsequent Rows events (see "How Master writes Annotate_rows
events to the binary log" to follow what happens further when the subsequent
Rows events are applied):
int Annotate_rows_log_event::do_apply_event(...)
{
thd->set_query(m_query_txt, m_query_len);
}
NOTE. I am not sure, but possibly the current values of thd->query and
thd->query_length should be saved before calling set_query() and
restored when the Annotate_rows_log_event object is deleted.
Is this really needed?
After calling this do_apply_event() function we must not delete the
Annotate_rows_log_event object immediately (see exec_relay_log_event()
above) because thd->query now points to the string inside this object.
We may keep the pointer to this object in the Relay_log_info:
class Relay_log_info
{
public:
...
void set_annotate_event(Annotate_rows_log_event*);
Annotate_rows_log_event* get_annotate_event();
void free_annotate_event();
...
private:
Annotate_rows_log_event* m_annotate_event;
};
The saved Annotate_rows object should be deleted when all corresponding
Rows events have been processed:
int exec_relay_log_event(THD* thd, Relay_log_info* rli)
{ ...
Log_event *ev= next_event(rli);
...
apply_event_and_update_pos(ev, ...);
if (rli->get_annotate_event() && is_last_rows_event(ev))
rli->free_annotate_event();
else if (ev->get_type_code() == ANNOTATE_ROWS_EVENT)
rli->set_annotate_event((Annotate_rows_log_event*) ev);
else if (ev->get_type_code() != FORMAT_DESCRIPTION_EVENT)
delete ev;
...
}
where
bool is_last_rows_event(Log_event* ev)
{
Log_event_type type= ev->get_type_code();
if (IS_ROWS_EVENT_TYPE(type))
{
Rows_log_event* rows= (Rows_log_event*)ev;
return rows->get_flags(Rows_log_event::STMT_END_F);
}
return 0;
}
#define IS_ROWS_EVENT_TYPE(type) ((type) == WRITE_ROWS_EVENT || \
(type) == UPDATE_ROWS_EVENT || \
(type) == DELETE_ROWS_EVENT)
8. General remarks
~~~~~~~~~~~~~~~~~~
Kristian noticed that introducing a new log event type should be
coordinated with MySQL/Sun:
Kristian: The numeric code for this event must be assigned carefully.
It should be coordinated with MySQL/Sun, otherwise we can get into a
situation where MySQL uses the same numeric code for one event that
MariaDB uses for ANNOTATE_ROWS_EVENT, which would make merging the two
impossible.
Alex: I reserved about 20 numbers to avoid possible conflicts
with MySQL.
Kristian: Still, I think it would be appropriate to send a polite email
to internals@lists.mysql.com about this, suggesting to reserve the
event number.
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v3.5.9)
[Maria-developers] Progress (by Knielsen): Add Sphinx storage engine to MariaDB (42)
by worklog-noreply@askmonty.org 07 Jun '10
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: Add Sphinx storage engine to MariaDB
CREATION DATE..: Mon, 10 Aug 2009, 23:57
SUPERVISOR.....: Monty
IMPLEMENTOR....: Knielsen
COPIES TO......:
CATEGORY.......: Server-Sprint
TASK ID........: 42 (http://askmonty.org/worklog/?tid=42)
VERSION........: Server-5.2
STATUS.........: Assigned
PRIORITY.......: 60
WORKED HOURS...: 9
ESTIMATE.......: 7 (hours remain)
ORIG. ESTIMATE.: 16
PROGRESS NOTES:
-=-=(Knielsen - Mon, 07 Jun 2010, 07:12)=-=-
Help Andrew with the integration.
Worked 2 hours and estimate 7 hours remain (original estimate unchanged).
-=-=(Knielsen - Mon, 31 May 2010, 06:49)=-=-
Wrote patch that allows to test SphinxSE in mysql-test-run, using external Sphinx daemon.
Worked 7 hours and estimate 9 hours remain (original estimate unchanged).
-=-=(Knielsen - Fri, 28 May 2010, 07:49)=-=-
High-Level Specification modified.
--- /tmp/wklog.42.old.4369 2010-05-28 07:49:13.000000000 +0000
+++ /tmp/wklog.42.new.4369 2010-05-28 07:49:13.000000000 +0000
@@ -49,6 +49,10 @@
might be possible to pre-generate the necessary data/index files and store
them in the source tree.
+I pushed a proof-of-concept patch for this here:
+
+ lp:~knielsen/maria/5.2-sphinxse
+
Here is a sample test case using this:
--source include/have_sphinx.inc
-=-=(Knielsen - Fri, 28 May 2010, 06:31)=-=-
High-Level Specification modified.
--- /tmp/wklog.42.old.32746 2010-05-28 06:31:24.000000000 +0000
+++ /tmp/wklog.42.new.32746 2010-05-28 06:31:24.000000000 +0000
@@ -1 +1,63 @@
+Code
+----
+
+Andrew Aksyonoff from Sphinx is helping to integrate the SphinxSE plugin into
+the MariaDB tree.
+
+It is a plugin, so it can be added to the tree just by including the
+sub-directory storage/sphinx/.
+
+The Sphinx plugin is already of some maturity, having been used with MySQL for
+some time.
+
+
+Testing
+-------
+
+To get testing in the mysql-test-run framework, some extensions are needed.
+
+To use the Sphinx storage engine, the external Sphinx search daemon needs to
+be running with some data directory containing indexed data. It also needs to
+be allocated a port.
+
+This is the indended approach:
+
+1. Testing will use an external Sphinx setup installed on the machine. Sphinx
+binaries will be searched in typical locations (eg. /usr/bin, /usr/local/bin),
+or can be specified explicitly in the environment with SPHINXSEARCH_INDEXER
+and SPHINXSEARCH_SEARCHD for the two required binaries. If the external Sphinx
+binaries can not be found, then Sphinx tests will be disabled (using some
+--source include/have_sphinx.inc in the test cases).
+
+2. The mysql-test-run framework will install Sphinx search data and start/stop
+the Sphinx search daemon for the test cases, similarly how it is done for the
+other servers mysqld, ndbd, etc. We will run the Sphinx search daemon with
+options --console, --config, and --pidfile.
+
+3. The mysql-test-run framework will generate a Sphinx config file from a
+template in mysql-test/suite/sphinx/my.cnf. This config file will allocate
+ports and data directories appropriate for avoiding conflicts between multiple
+simultaneous mysql-test-run executions. The Sphinx config file is sufficiently
+similar to MySQL my.cnf that we can use the existing framework for generating
+config file, with just a slightly modified variant of the code writing the
+file to disk.
+
+4. The mysql-test-run framework will pre-load the mysql database with tables
+and data for Sphinx to index. It will then run the `indexer` program to
+generate the indexes, and then start the `searchd` daemon. These three steps
+must be done in order, as each step depends on the previous. ALTERNATIVE: it
+might be possible to pre-generate the necessary data/index files and store
+them in the source tree.
+
+Here is a sample test case using this:
+
+--source include/have_sphinx.inc
+--source include/have_sphinxse.inc
+
+--replace_result $SPHINXSEARCH_PORT SPHINXSEARCH_PORT
+eval create table ts ( id int unsigned not null, w int not null, q varchar(255)
+not null, index(q) ) engine=sphinx
+connection="sphinx://127.0.0.1:$SPHINXSEARCH_PORT/*";
+select * from ts where q='test';
+drop table ts;
-=-=(Knielsen - Fri, 28 May 2010, 06:07)=-=-
Version updated.
--- /tmp/wklog.42.old.32184 2010-05-28 06:07:00.000000000 +0000
+++ /tmp/wklog.42.new.32184 2010-05-28 06:07:00.000000000 +0000
@@ -1 +1 @@
-9.x
+Server-5.2
-=-=(Knielsen - Fri, 28 May 2010, 06:06)=-=-
Category updated.
--- /tmp/wklog.42.old.32171 2010-05-28 06:06:23.000000000 +0000
+++ /tmp/wklog.42.new.32171 2010-05-28 06:06:23.000000000 +0000
@@ -1 +1 @@
-Maria-BackLog
+Server-Sprint
-=-=(Knielsen - Fri, 28 May 2010, 06:06)=-=-
Version updated.
--- /tmp/wklog.42.old.32171 2010-05-28 06:06:23.000000000 +0000
+++ /tmp/wklog.42.new.32171 2010-05-28 06:06:23.000000000 +0000
@@ -1 +1 @@
-Maria-2.0
+9.x
-=-=(Knielsen - Fri, 28 May 2010, 06:06)=-=-
Status updated.
--- /tmp/wklog.42.old.32171 2010-05-28 06:06:23.000000000 +0000
+++ /tmp/wklog.42.new.32171 2010-05-28 06:06:23.000000000 +0000
@@ -1 +1 @@
-Un-Assigned
+Assigned
-=-=(Guest - Tue, 15 Sep 2009, 02:25)=-=-
no
Reported zero hours worked. Estimate unchanged.
-=-=(Guest - Tue, 15 Sep 2009, 02:24)=-=-
Version updated.
--- /tmp/wklog.42.old.13241 2009-09-15 02:24:07.000000000 +0300
+++ /tmp/wklog.42.new.13241 2009-09-15 02:24:07.000000000 +0300
@@ -1 +1 @@
-Connector/.NET-5.1
+Maria-2.0
DESCRIPTION:
Add the Sphinx storage engine to the MariaDB tree
HIGH-LEVEL SPECIFICATION:
Code
----
Andrew Aksyonoff from Sphinx is helping to integrate the SphinxSE plugin into
the MariaDB tree.
It is a plugin, so it can be added to the tree just by including the
sub-directory storage/sphinx/.
The Sphinx plugin is already of some maturity, having been used with MySQL for
some time.
Testing
-------
To get testing in the mysql-test-run framework, some extensions are needed.
To use the Sphinx storage engine, the external Sphinx search daemon needs to
be running with some data directory containing indexed data. It also needs to
be allocated a port.
This is the intended approach:
1. Testing will use an external Sphinx setup installed on the machine. Sphinx
binaries will be searched for in typical locations (e.g. /usr/bin,
/usr/local/bin), or can be specified explicitly in the environment with
SPHINXSEARCH_INDEXER and SPHINXSEARCH_SEARCHD for the two required binaries.
If the external Sphinx binaries cannot be found, Sphinx tests will be
disabled (via --source include/have_sphinx.inc in the test cases).
2. The mysql-test-run framework will install Sphinx search data and start/stop
the Sphinx search daemon for the test cases, similarly to how it is done for
the other servers (mysqld, ndbd, etc.). We will run the Sphinx search daemon
with the options --console, --config, and --pidfile.
3. The mysql-test-run framework will generate a Sphinx config file from a
template in mysql-test/suite/sphinx/my.cnf (see the sketch after this list).
This config file will allocate ports and data directories appropriate for
avoiding conflicts between multiple simultaneous mysql-test-run executions.
The Sphinx config file is sufficiently similar to MySQL my.cnf that we can
use the existing framework for generating config files, with just a slightly
modified variant of the code writing the file to disk.
4. The mysql-test-run framework will pre-load the mysql database with tables
and data for Sphinx to index. It will then run the `indexer` program to
generate the indexes, and then start the `searchd` daemon. These three steps
must be done in order, as each step depends on the previous. ALTERNATIVE: it
might be possible to pre-generate the necessary data/index files and store
them in the source tree.
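For illustration, the generated config could look roughly like this (a
minimal sketch in the Sphinx 0.9.x config format; all names, paths, and the
port are placeholders that mysql-test-run would fill in from the template):
source mtr_src
{
type = mysql
sql_host = 127.0.0.1
sql_db = test
sql_query = SELECT id, w, q FROM sphinx_source_data
}
index mtr_idx
{
source = mtr_src
path = /path/to/vardir/sphinx/mtr_idx
}
searchd
{
port = 9312
pid_file = /path/to/vardir/sphinx/searchd.pid
log = /path/to/vardir/sphinx/searchd.log
}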
I pushed a proof-of-concept patch for this here:
lp:~knielsen/maria/5.2-sphinxse
Here is a sample test case using this:
--source include/have_sphinx.inc
--source include/have_sphinxse.inc
--replace_result $SPHINXSEARCH_PORT SPHINXSEARCH_PORT
eval create table ts ( id int unsigned not null, w int not null, q varchar(255)
not null, index(q) ) engine=sphinx
connection="sphinx://127.0.0.1:$SPHINXSEARCH_PORT/*";
select * from ts where q='test';
drop table ts;
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v3.5.9)
[Maria-developers] Rev 2795: Bugfixes in file:///home/bell/maria/bzr/work-maria-5.3-scache/
by sanja@askmonty.org 05 Jun '10
At file:///home/bell/maria/bzr/work-maria-5.3-scache/
------------------------------------------------------------
revno: 2795
revision-id: sanja@askmonty.org-20100605195727-7rrc5k75lr0a4o9z
parent: sanja@askmonty.org-20100527182744-1tu96cgyiaodzs32
committer: sanja@askmonty.org
branch nick: work-maria-5.3-scache
timestamp: Sat 2010-06-05 22:57:27 +0300
message:
Bugfixes
=== modified file 'mysql-test/r/myisam_mrr.result'
--- a/mysql-test/r/myisam_mrr.result 2010-03-11 21:43:31 +0000
+++ b/mysql-test/r/myisam_mrr.result 2010-06-05 19:57:27 +0000
@@ -394,7 +394,7 @@
# - engine_condition_pushdown does not affect ICP
select @@optimizer_switch;
@@optimizer_switch
-index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on
+index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
create table t0 (a int);
insert into t0 values (0),(1),(2),(3),(4),(5),(6),(7),(8),(9);
create table t1 (a int, b int, key(a));
=== modified file 'mysql-test/r/subquery_cache.result'
--- a/mysql-test/r/subquery_cache.result 2010-05-27 17:41:38 +0000
+++ b/mysql-test/r/subquery_cache.result 2010-06-05 19:57:27 +0000
@@ -588,4 +588,28 @@
Subquery_cache_hit 0
Subquery_cache_miss 4
drop table t1;
+#test of sql_big_tables switch and outer table reference in subquery with grouping
+set option sql_big_tables=1;
+CREATE TABLE t1 (a INT PRIMARY KEY, b INT);
+INSERT INTO t1 VALUES (1,1),(2,1),(3,2),(4,2),(5,3),(6,3);
+SELECT (SELECT t1_outer.a FROM t1 AS t1_inner GROUP BY b LIMIT 1) FROM t1 AS t1_outer;
+(SELECT t1_outer.a FROM t1 AS t1_inner GROUP BY b LIMIT 1)
+1
+2
+3
+4
+5
+6
+drop table t1;
+set option sql_big_tables=0;
+#test of function reference to outer query
+set local group_concat_max_len=400;
+create table t2 (a int, b int);
+insert into t2 values (1,1), (2,2);
+select b x, (select group_concat(x) from t2) from t2;
+x (select group_concat(x) from t2)
+1 1,1
+2 2,2
+drop table t2;
+set local group_concat_max_len=default;
set optimizer_switch='subquery_cache=default';
=== modified file 'mysql-test/t/subquery_cache.test'
--- a/mysql-test/t/subquery_cache.test 2010-05-27 17:41:38 +0000
+++ b/mysql-test/t/subquery_cache.test 2010-06-05 19:57:27 +0000
@@ -201,4 +201,20 @@
show status like "subquery_cache%";
drop table t1;
+--echo #test of sql_big_tables switch and outer table reference in subquery with grouping
+set option sql_big_tables=1;
+CREATE TABLE t1 (a INT PRIMARY KEY, b INT);
+INSERT INTO t1 VALUES (1,1),(2,1),(3,2),(4,2),(5,3),(6,3);
+SELECT (SELECT t1_outer.a FROM t1 AS t1_inner GROUP BY b LIMIT 1) FROM t1 AS t1_outer;
+drop table t1;
+set option sql_big_tables=0;
+
+--echo #test of function reference to outer query
+set local group_concat_max_len=400;
+create table t2 (a int, b int);
+insert into t2 values (1,1), (2,2);
+select b x, (select group_concat(x) from t2) from t2;
+drop table t2;
+set local group_concat_max_len=default;
+
set optimizer_switch='subquery_cache=default';
=== modified file 'sql/item.cc'
--- a/sql/item.cc 2010-05-27 17:41:38 +0000
+++ b/sql/item.cc 2010-06-05 19:57:27 +0000
@@ -5110,6 +5110,19 @@
}
+/**
+ Saves one Fields of an Item of in other Field
+
+ @param from Field to copy value from
+ @param null_value reference on item null_value to set it if it is needed
+ @param to Field to cope value to
+ @param no_conversions how to deal with NULL value (see
+ set_field_to_null_with_conversions())
+
+ @retval FALSE OK
+ @retval TRUE Error
+*/
+
static int save_field_in_field(Field *from, my_bool *null_value,
Field *to, bool no_conversions)
{
@@ -5139,6 +5152,10 @@
int Item_field::save_in_field(Field *to, bool no_conversions)
{
+ /* if it is external field */
+ if (unlikely(depended_from))
+ return save_field_in_field(field, &null_value, to, no_conversions);
+
return save_field_in_field(result_field, &null_value, to, no_conversions);
}
@@ -6346,7 +6363,7 @@
int Item_ref::save_in_field(Field *to, bool no_conversions)
{
int res;
- if (result_field)
+ if (result_field && !depended_from)
return save_field_in_field(result_field, &null_value, to, no_conversions);
res= (*ref)->save_in_field(to, no_conversions);
null_value= (*ref)->null_value;
=== modified file 'sql/item_subselect.cc'
--- a/sql/item_subselect.cc 2010-05-25 18:29:14 +0000
+++ b/sql/item_subselect.cc 2010-06-05 19:57:27 +0000
@@ -1,4 +1,4 @@
-/* Copyrigh (C) 2000 MySQL AB
+/* Copyright (C) 2000 MySQL AB
This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
@@ -818,6 +818,12 @@
exec();
}
+/**
+ Checks subquery cache for value
+
+ @retval NULL nothing found
+ @retval reference on item representing value found in the cache
+*/
Item *Item_subselect::check_cache()
{
=== modified file 'sql/item_subselect.h'
--- a/sql/item_subselect.h 2010-05-24 17:29:56 +0000
+++ b/sql/item_subselect.h 2010-06-05 19:57:27 +0000
@@ -95,7 +95,10 @@
st_select_lex *parent_select;
/**
- List of items subquery depends on (externally resolved);
+ List of references on items subquery depends on (externally resolved);
+
+ @note We can't store direct links on Items because it could be
+ substituted with other item (for example for grouping).
*/
List<Item*> depends_on;
=== modified file 'sql/sql_subquery_cache.cc'
--- a/sql/sql_subquery_cache.cc 2010-05-27 18:27:44 +0000
+++ b/sql/sql_subquery_cache.cc 2010-06-05 19:57:27 +0000
@@ -96,6 +96,10 @@
/**
Creates equalities expression.
+ @note For some type of fields index lookup do not return failure but set
+ pointer on the next record. To check exact match we use expression like:
+ field1=value1 and field2=value2 ...
+
@retval FALSE OK
@retval TRUE Error
*/
@@ -111,6 +115,7 @@
for (uint i= 1 /* skip result filed */; (ref= li++); i++)
{
Field *fld= cache_table->field[i];
+ /* Only some field types should be checked after lookup */
if (fld->type() == MYSQL_TYPE_VARCHAR ||
fld->type() == MYSQL_TYPE_TINY_BLOB ||
fld->type() == MYSQL_TYPE_MEDIUM_BLOB ||
@@ -140,11 +145,22 @@
}
+/**
+ Enumerates all fields in field number order.
+
+ @param arg reference on current field number
+
+ @return field number
+*/
+
static uint field_enumerator(uchar *arg)
{
return ((uint*)arg)[0]++;
}
+/**
+ Initializes temporary table and index for this cache
+*/
void Subquery_cache_tmptable::init()
{
@@ -182,8 +198,10 @@
if (!(cache_table= create_tmp_table(table_thd, &cache_table_param,
items, (ORDER*) NULL,
FALSE, FALSE,
- (table_thd->options |
- TMP_TABLE_ALL_COLUMNS),
+ ((table_thd->options |
+ TMP_TABLE_ALL_COLUMNS) &
+ ~(OPTION_BIG_TABLES |
+ TMP_TABLE_FORCE_MYISAM)),
HA_POS_ERROR,
(char *)"subquery-cache-table")))
{
@@ -191,14 +209,16 @@
DBUG_VOID_RETURN;
}
- if (cache_table->s->blob_fields)
+ if (cache_table->s->db_type() != heap_hton)
{
- DBUG_PRINT("error", ("we do not need blobs"));
+ DBUG_PRINT("error", ("we need only heap table"));
goto error;
}
+ /* first field in the table is result value, so we skip it */
li_items++;
field_counter=1;
+
if (cache_table->alloc_keys(1) ||
(cache_table->add_tmp_key(0, items.elements - 1,
&field_enumerator,
@@ -224,6 +244,7 @@
DBUG_PRINT("error", ("Creating Item_field failed"));
goto error;
}
+
if (make_equalities())
{
DBUG_PRINT("error", ("Creating equalities failed"));
@@ -247,11 +268,26 @@
}
+/**
+ Checks if current key present in the cache and returns value if it is true
+
+ @param value assigned Item with value from the cache if key
+ is found
+ @return result of the key lookup
+*/
+
Subquery_cache::result Subquery_cache_tmptable::check_value(Item **value)
{
int res;
DBUG_ENTER("Subquery_cache_tmptable::check_value");
+ /*
+ We delay cache initialization to get item references which should be
+ used at the moment of query execution. I.e. we store reference on item
+ reference at the moment of class creation but for table creation and
+ index supply structures (join_tab) we need real Items which used at the
+ moment of execution so we can resolve reference only at this point.
+ */
if (!inited)
init();
@@ -275,6 +311,15 @@
}
+/**
+ Puts given value in the cache
+
+ @param value Value to put in the cache
+
+ @retval FALSE OK
+ @retval TRUE Error
+*/
+
my_bool Subquery_cache_tmptable::put_value(Item *value)
{
int error;
@@ -313,9 +358,3 @@
cache_table= NULL;
DBUG_RETURN(TRUE);
}
-
-
-void Subquery_cache_tmptable::cleanup()
-{
- cache_table->file->ha_delete_all_rows();
-}
=== modified file 'sql/sql_subquery_cache.h'
--- a/sql/sql_subquery_cache.h 2010-05-25 10:45:36 +0000
+++ b/sql/sql_subquery_cache.h 2010-06-05 19:57:27 +0000
@@ -23,10 +23,6 @@
Puts value into this cache (key should be taken from cache owner)
*/
virtual my_bool put_value(Item *value)= 0;
- /**
- Cleans up and reset cache before reusing
- */
- virtual void cleanup()= 0;
};
struct st_table_ref;
@@ -45,10 +41,9 @@
virtual ~Subquery_cache_tmptable();
virtual result check_value(Item **value);
virtual my_bool put_value(Item *value);
- virtual void cleanup();
+
+private:
void init();
-
-private:
bool make_equalities();
/* tmp table parameters */
=== modified file 'sql/table.cc'
--- a/sql/table.cc 2010-05-27 17:41:38 +0000
+++ b/sql/table.cc 2010-06-05 19:57:27 +0000
@@ -5187,10 +5187,16 @@
key_part_info->store_length= key_part_info->length;
if ((*reg_field)->real_maybe_null())
+ {
key_part_info->store_length+= HA_KEY_NULL_LENGTH;
+ keyinfo->key_length+= HA_KEY_NULL_LENGTH;
+ }
if ((*reg_field)->type() == MYSQL_TYPE_BLOB ||
(*reg_field)->real_type() == MYSQL_TYPE_VARCHAR)
+ {
key_part_info->store_length+= HA_KEY_BLOB_LENGTH;
+ keyinfo->key_length+= HA_KEY_BLOB_LENGTH; // ???
+ }
key_part_info->type= (uint8) (*reg_field)->key_type();
key_part_info->key_type =
From dispatch_command() in sql_parse.cc, net_end_statement() is called after
ha_autocommit_or_rollback() but before close_thread_tables(). What can go
wrong in the call to close_thread_tables() after the response to the client?
Commit or rollback was done before a response was sent to the client.
/* If commit fails, we should be able to reset the OK status. */
thd->main_da.can_overwrite_status= TRUE;
ha_autocommit_or_rollback(thd, thd->is_error());
thd->main_da.can_overwrite_status= FALSE;
thd->transaction.stmt.reset();
net_end_statement(thd);
query_cache_end_of_result(thd);
thd->proc_info= "closing tables";
/* Free tables */
close_thread_tables(thd);
--
Mark Callaghan
mdcallag@gmail.com
Re: [Maria-developers] [Bug 314570] Re: update is not changing internal auto increment value
by Sergei Golubchik 05 Jun '10
Hi, Michael!
On Jun 04, Michael Widenius wrote:
>
> hi!
>
> >>>>> "Sergei" == Sergei <sergii@pisem.net> writes:
>
> Sergei> ** Changed in: maria
> Sergei> Importance: Undecided => Low
>
> Sergei> --
> Sergei> update is not changing internal auto increment value
> Sergei> https://bugs.launchpad.net/bugs/314570
>
> Why low ?
>
> Looks like a serious issue that we should get Percona to fix at once!
Because Heikki said it's not a bug, but intentional InnoDB behavior.
I'm not sure we should fix it at all. Heikki is certainly not fixing
it.
Regards,
Sergei
04 Jun '10
All,
MariaDB 5.2.1 is getting closer to being released so I've started
filling out the Release Notes and Changelog pages:
http://askmonty.org/wiki/Manual:MariaDB_5.2.1_Release_Notes
http://askmonty.org/wiki/Manual:MariaDB_5.2.1_Changelog
On the documentation TODO list for this release is a page on the OQGraph
storage engine for the manual. Any volunteers? :) (I'll get to it next
week, but if someone wants to put something up right away I wouldn't
object.)
Thanks.
--
Daniel Bartholomew
Monty Program - http://askmonty.org
Re: [Maria-developers] [Commits] Rev 2802: few small MySQL bugs/issues that impact the engines, as discussed in the SE summit in http://bazaar.launchpad.net/~maria-captains/maria/5.2/
by Sergei Golubchik 03 Jun '10
Hi, Monty!
Thanks for the review!
See my replies below.
On Jun 03, Michael Widenius wrote:
>
> > At http://bazaar.launchpad.net/~maria-captains/maria/5.2/
> > ------------------------------------------------------------
> > revno: 2802
>
> > few small MySQL bugs/issues that impact the engines, as discussed in the SE summit
> > * remove handler::index_read_last()
> > * create handler::keyread_read_time() (was get_index_only_read_time() in opt_range.cc)
> > * ha_show_status() allows engine's show_status() to fail
> > * remove HTON_FLUSH_AFTER_RENAME
> > * fix key_cmp_if_same() to work for floats and doubles
> > * set table->status in the server, don't force engines to do it
> > * increment status vars in the server, don't force engines to do it
>
> > +++ b/mysql-test/r/status_user.result 2010-06-01 22:39:29 +0000
> > @@ -100,8 +100,8 @@ Handler_commit 19
> > Handler_delete 1
> > Handler_discover 0
> > Handler_prepare 18
> > -Handler_read_first 1
> > -Handler_read_key 8
> > +Handler_read_first 0
> > +Handler_read_key 3
>
> Any explanation why this change happened (as the test didn't change
> and I can't understand how the values could suddenly be less now).
This change is correct. Before my commit, calls were counted in the
handler, say, in the index_first() and index_next().
And ha_innobase::rnd_next() is implemented by calling
index_first/index_next.
So, innodb was incrementing Handler_read_first and Handler_read_next for
table scans (and of course it was incrementing Handler_read_rnd_next too
- double counting).
It was wrong - first, it was double counting. Second, Handler_*
should count handler calls as done by mysql, not expose internal
implementation of the engine. For example, mi_rfirst() calls mi_rnext()
internally, but we don't count it as Handler_read_next. The same should
be true for any engine, even if it mixes implementation levels.
> By the way, it would be nice if the file comments would be part of the
> commit email (as I assume you documented this issue there).
I have not :(
But I will, when I recommit.
> > +++ b/mysql-test/r/partition_pruning.result 2010-06-01 22:39:29 +0000
> > @@ -2373,7 +2373,7 @@ flush status;
> > update t1 set a=100 where a+1=5+1;
> > show status like 'Handler_read_rnd_next';
> > Variable_name Value
> > -Handler_read_rnd_next 10
> > +Handler_read_rnd_next 19
>
> Any explanation why this change happened (as the test didn't change)
> Is it because we don't anymore count rows read in 'show' commands?
This is a questionable change, I wanted to discuss it.
ha_partition::index_next (for example) calls underlying engine's
file->ha_index_next(), not file->index_next().
After my change Handler_read_key_next is incremented for both
ha_partition::index_next and file->index_next(). Double counting.
Before my change when a partition was pruned, Handler_read_key* counters
were not incremented at all (as ha_partition::index_read did not call
file->index_read() at all). Now it is incremented - that's why the
numbers are increased.
Possible solutions:
* do not increment Handler_read* counters for ha_partition methods,
only count calls to the underlying engines.
* do not increment Handler_read* counters for underlying engines - only
count calls from the upper layer into the handler, this is logical but
counters won't show partition pruning or handler call overhead caused by
many partitions. this can be solved by adding special set of counters
Handler_partition_read_* (or something).
> > === modified file 'sql/handler.cc'
> > --- a/sql/handler.cc 2010-06-01 19:52:20 +0000
> > +++ b/sql/handler.cc 2010-06-01 22:39:29 +0000
> > @@ -2131,8 +2125,6 @@ int handler::read_first_row(uchar * buf,
> > register int error;
> > DBUG_ENTER("handler::read_first_row");
> >
> > - ha_statistic_increment(&SSV::ha_read_first_count);
>
> The above is wrong; We are later calling 'index_first()' in this
> function, not ha_index_first(), so we miss one increment (which was
> shown in the test cases). Note that we do also call rnd_next() in
> this function, without any counting of rows so we need to fix other
> things in this function too!
That's fine, the counter is incremented in ha_read_first_row() wrapper.
If anything, the old code was wrong as it was incrementing
ha_read_first_count twice (once here and once in index_first).
> Simplest solution is to change to call ha_index_first / ha_rnd_next()
> in this function. This will also fix the 'table->status' variable that
> your are not counting anymore.
> This should be ok as we very seldom use 'handler::read_first_row()'
see above. I think it's ok to just increment ha_read_first_count in the
wrapper. Especially because read_first_row() is rarely used.
> Note that we should do same change in other functions that are calling
> handler functions directly:
>
> handler::read_range_first
> - This calls index_first() and index_read_map()
> get_auto_increment()
> - This calls index_last() and index_read_map()
> index_read_idx_map()
> - This calls index_read_map()
> - Note that we can't trivially change this to call ha_index_read_map()
> as we increment things statistics in ha_index_read_idx_map()
> - We need to update table->status in this function!
Yes. But read_range_first() for example has no dedicated counter, so
either it increments Handler_read_key* counters in the default
implementation, or it increments nothing at all when any engine provides
its own implementation :(
> > +/*
> > + Calculate cost of 'index only' scan for given index and number of records.
> > +
> > + SYNOPSIS
> > + handler->keyread_read_time()
> > + param parameters structure
> > + records #of records to read
> > + keynr key to read
> > +
> > + NOTES
> > + It is assumed that we will read trough the whole key range and that all
> > + key blocks are half full (normally things are much better). It is also
> > + assumed that each time we read the next key from the index, the handler
> > + performs a random seek, thus the cost is proportional to the number of
> > + blocks read.
> > +*/
> > +
> > +double handler::keyread_read_time(uint index, uint ranges, ha_rows rows)
> > +{
> > + double read_time;
> > + uint keys_per_block= (stats.block_size/2/
> > + (table->key_info[index].key_length + ref_length) + 1);
> > + read_time=((double) (rows+keys_per_block-1)/ (double) keys_per_block);
> > + return read_time;
> > +}
>
> Do we really need the 'ranges' argument ?
> (It's always '1' in the current code and you are not using it)
I don't know :)
I've copied it from the handler::read_time(), just to have the
interface the same for consistency. After all - logically - if the
read_time() may depend on the number of ranges, keyread_read_time()
certainly can do too.
> > === modified file 'sql/key.cc'
> > --- a/sql/key.cc 2008-10-10 10:01:01 +0000
> > +++ b/sql/key.cc 2010-06-01 22:39:29 +0000
> > @@ -278,8 +278,10 @@ bool key_cmp_if_same(TABLE *table,const
> > key++;
> > store_length--;
> > }
> > - if (key_part->key_part_flag & (HA_BLOB_PART | HA_VAR_LENGTH_PART |
> > - HA_BIT_PART))
> > + if ((key_part->key_part_flag & (HA_BLOB_PART | HA_VAR_LENGTH_PART |
> > + HA_BIT_PART)) ||
> > + key_part->type == HA_KEYTYPE_FLOAT ||
> > + key_part->type == HA_KEYTYPE_DOUBLE)
> > {
> > if (key_part->field->key_cmp(key, key_part->length))
> > return 1;
>
> I understand that for float and double there is some extraordinary
> cases where memcmp() is not same as =, but who has had a problem with
> this?
there was a bug report in mysql bugdb.
http://bugs.mysql.com/bug.php?id=44372
> As a separate note, I think it would be better to add to key_part_flag
> HA_NO_CMP_WITH_MEMCMP for key_parts of type FLOAT or DOUBLE
> when we open the table. This would simplify this test a bit.
I'll try to
> > === modified file 'sql/table.h'
> > --- a/sql/table.h 2010-06-01 19:52:20 +0000
> > +++ b/sql/table.h 2010-06-01 22:39:29 +0000
> > @@ -13,6 +13,8 @@
> > along with this program; if not, write to the Free Software
> > Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA */
> >
> > +#ifndef SQL_TABLE_INCLUDED
> > +#define SQL_TABLE_INCLUDED
>
> Do we really need this one as it's automatically included by mysql_priv.h ?
> Anyway, it should probably be MYSQL_TABLE_H to be similar our other defines.
This is actually an unrelated change: I tried to include table.h into
handler.h (to solve the problem of inline handler methods needing TABLE)
and had to add include guards. Later I solved the problem differently but
kept the guards as they're a good thing anyway.
As for the name of the guard, it's new (~1 yr old) MySQL style. As I
personally don't care about the name of the guards, as long as they all
use a consistent style, I use the MySQL naming style here.
> > === modified file 'storage/myisam/ha_myisam.cc'
>
> <cut>
>
> > int ha_myisam::index_next(uchar *buf)
> > {
> > DBUG_ASSERT(inited==INDEX);
> > - ha_statistic_increment(&SSV::ha_read_next_count);
> > int error=mi_rnext(file,buf,active_index);
> > table->status=error ? STATUS_NOT_FOUND: 0;
>
> you should probably remove the setting of table->status here
Neither updating table->status nor ha_statistic_increment() can
hurt here, and as you have seen I've not updated any other engine at
all. I only did it in MyISAM to check that the change works, the code
compiles, test results don't change, and so on.
But I'll remove table->status updates from MyISAM.
> The whole function can the be changed to:
>
> return mi_rnext(file,buf,active_index);
>
> Same goes for all other instances of setting table->status in this file
>
Regards,
Sergei
02 Jun '10
Hi,
I was looking today at some optimizer code, and bumped again
into sql_select.cc:find_best(). We have been using the greedy
optimizer for years, and this function has been dead code for
a while. Isn't it time to remove it?
The less code, the better.
Timour
Hello Kristian,
Thursday, May 27, 2010, 1:20:59 PM, you wrote:
KN> [Cc:ed maria-developers@ for general interest, hope that's ok]
That's fine. Seems you bcc-ed though. ;)
>> Mine was 1.10. Downgrading to 1.9.6 did the trick, thanks.
KN> Ok, good, at some point we can get someone to help sort out what
KN> the problem might be with 1.10.
Or at least add some check that'd *clearly* complain about an improper
version. Fighting with 1.10 was... emotional.
>> 1. Use prebuilt searchd binary, sphinx.conf file and test index
KN> My idea is that mysql-test-run.pl will look for an already
KN> installed searchd and indexer binary (in eg. SPHINXSEARCH_INDEXER,
KN> SPHINXSEARCH_SEARCHD, and maybe $PATH). If not found, sphinxse
That's also good. For some reason I thought everything should be
self-contained (i.e. work immediately out of a bzr clone).
KN> tests will be skipped, if found mysql-test-run.pl will generate a
KN> simple .conf and start the daemon for the test. There is already a
Hmm, why *generate* that? I'd just bundle .conf and source .xml data
for indexer. Maybe prebuilt .sp* indexes too. Indexes are binary but
test ones can be kept tiny.
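Something along these lines would do (a minimal sketch; the paths and the
port are placeholders for whatever the test framework substitutes):

  source test_src
  {
    type            = xmlpipe2
    xmlpipe_command = cat /path/to/suite/sphinx/testdata.xml
  }

  index test_idx
  {
    source = test_src
    path   = /path/to/var/sphinx/test_idx
  }

  searchd
  {
    listen   = 9312
    pid_file = /path/to/var/sphinx/searchd.pid
    log      = /path/to/var/sphinx/searchd.log
  }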
--
Best regards,
Andrew mailto:shodan@shodan.ru
[Maria-developers] Updated (by Guest): Efficient group commit for binary log (116)
by worklog-noreply@askmonty.org 01 Jun '10
by worklog-noreply@askmonty.org 01 Jun '10
01 Jun '10
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: Efficient group commit for binary log
CREATION DATE..: Mon, 26 Apr 2010, 13:28
SUPERVISOR.....: Knielsen
IMPLEMENTOR....: Knielsen
COPIES TO......: Serg
CATEGORY.......: Server-RawIdeaBin
TASK ID........: 116 (http://askmonty.org/worklog/?tid=116)
VERSION........: Server-9.x
STATUS.........: Assigned
PRIORITY.......: 60
WORKED HOURS...: 60
ESTIMATE.......: 0 (hours remain)
ORIG. ESTIMATE.: 0
PROGRESS NOTES:
-=-=(Guest - Tue, 01 Jun 2010, 14:20)=-=-
Status updated.
--- /tmp/wklog.116.old.32652 2010-06-01 14:20:15.000000000 +0000
+++ /tmp/wklog.116.new.32652 2010-06-01 14:20:15.000000000 +0000
@@ -1 +1 @@
-Un-Assigned
+Assigned
-=-=(Knielsen - Mon, 31 May 2010, 06:48)=-=-
Finish first architecture draft (changed my mind a number of times before I was satisfied).
Write up architecture in worklog.
Fix remaining test failures in proof-of-concept patch + implement xtradb part.
Run some benchmarks on proof-of-concept implementation.
Worked 11 hours and estimate 0 hours remain (original estimate increased by 11 hours).
-=-=(Knielsen - Tue, 25 May 2010, 13:19)=-=-
Low Level Design modified.
--- /tmp/wklog.116.old.14255 2010-05-25 13:19:00.000000000 +0000
+++ /tmp/wklog.116.new.14255 2010-05-25 13:19:00.000000000 +0000
@@ -1 +1,363 @@
+1. Changes for ha_commit_trans()
+
+The guts of the code for commit are in the function ha_commit_trans() (and in
+commit_one_phase() which is called from it). This must be extended to use the
+new prepare_ordered(), group_log_xid(), and commit_ordered() calls.
+
+1.1 Atomic queue of committing transactions
+
+To keep the right commit order among participants, we put transactions into a
+queue. The operations on the queue are non-locking:
+
+ - Insert THD at the head of the queue, and return old queue.
+
+ THD *enqueue_atomic(THD *thd)
+
+ - Fetch (and delete) the whole queue.
+
+ THD *atomic_grab_reverse_queue()
+
+These are simple to implement with atomic compare-and-set. Note that there is
+no ABA problem [2], as we do not delete individual elements from the queue, we
+grab the whole queue and replace it with NULL.
+
+A transaction enters the queue when it does prepare_ordered(). This way, the
+scheduling order for prepare_ordered() calls is what determines the sequence
+in the queue and effectively the commit order.
+
+The queue is grabbed by the code doing group_log_xid() and commit_ordered()
+calls. The queue is passed directly to group_log_xid(), and afterwards
+iterated to do individual commit_ordered() calls.
+
+Using a lock-free queue allows prepare_ordered() (for one transaction) to run
+in parallel with commit_ordered (in another transaction), increasing potential
+parallelism.
+
+The queue is simply a linked list of THD objects, linked through a
+THD::next_commit_ordered field. Since we add at the head of the queue, the
+list is actually in reverse order, so must be reversed when we grab and delete
+it.
+
+The reason that enqueue_atomic() returns the old queue is so that we can check
+if an insert goes to the head of the queue. The thread at the head of the
+queue will do the sequential part of group commit for everyone.
+
+
+1.2 Locks
+
+1.2.1 Global LOCK_prepare_ordered
+
+This lock is taken to serialise calls to prepare_ordered(). Note that
+effectively, the commit order is decided by the order in which threads obtain
+this lock.
+
+
+1.2.2 Global LOCK_group_commit and COND_group_commit
+
+This lock is used to protect the serial part of group commit. It is taken
+around the code where we grab the queue, call group_log_xid() on the queue,
+and call commit_ordered() on each element of the queue, to make sure they
+happen serialised and in consistent order. It also protects the variable
+group_commit_queue_busy, which is used when not using group_log_xid() to delay
+running over a new queue until the first queue is completely done.
+
+
+1.2.3 Global LOCK_commit_ordered
+
+This lock is taken around calls to commit_ordered(), to ensure they happen
+serialised.
+
+
+1.2.4 Per-thread thd->LOCK_commit_ordered and thd->COND_commit_ordered
+
+This lock protects the thd->group_commit_ready variable, as well as the
+condition variable used to wake up threads after log_xid() and
+commit_ordered() finishes.
+
+
+1.2.5 Global LOCK_group_commit_queue
+
+This is only used on platforms with no native compare-and-set operations, to
+make the queue operations atomic.
+
+
+1.3 Commit algorithm.
+
+This is the basic algorithm, simplified by
+
+ - omitting some error handling
+
+ - omitting looping over all handlers when invoking handler methods
+
+ - omitting some possible optimisations when not all calls needed (see next
+ section).
+
+ - Omitting the case where no group_log_xid() is used, see below.
+
+---- BEGIN ALGORITHM ----
+ ht->prepare()
+
+ // Call prepare_ordered() and enqueue in correct commit order
+ lock(LOCK_prepare_ordered)
+ ht->prepare_ordered()
+ old_queue= enqueue_atomic(thd)
+ thd->group_commit_ready= FALSE
+ is_group_commit_leader= (old_queue == NULL)
+ unlock(LOCK_prepare_ordered)
+
+ if (is_group_commit_leader)
+
+ // The first in queue handles group commit for everyone
+
+ lock(LOCK_group_commit)
+ // Wait while queue is busy, see below for when this occurs
+ while (group_commit_queue_busy)
+ cond_wait(COND_group_commit)
+
+ // Grab and reverse the queue to get correct order of transactions
+ queue= atomic_grab_reverse_queue()
+
+ // This call will set individual error codes in thd->xid_error
+ // It also sets the cookie for unlog() in thd->xid_cookie
+ group_log_xid(queue)
+
+ lock(LOCK_commit_ordered)
+ for (other IN queue)
+ if (!other->xid_error)
+ ht->commit_ordered()
+ unlock(LOCK_commit_ordered)
+
+ unlock(LOCK_group_commit)
+
+ // Now we are done, so wake up all the others.
+ for (other IN TAIL(queue))
+ lock(other->LOCK_commit_ordered)
+ other->group_commit_ready= TRUE
+ cond_signal(other->COND_commit_ordered)
+ unlock(other->LOCK_commit_ordered)
+ else
+ // If not the leader, just wait until leader did the work for us.
+ lock(thd->LOCK_commit_ordered)
+ while (!thd->group_commit_ready)
+ cond_wait(thd->LOCK_commit_ordered, thd->COND_commit_ordered)
+    unlock(thd->LOCK_commit_ordered)
+
+ // Finally do any error reporting now that we're back in own thread.
+ if (thd->xid_error)
+ xid_delayed_error(thd)
+ else
+ ht->commit(thd)
+ unlog(thd->xid_cookie, thd->xid)
+---- END ALGORITHM ----
+
+If the transaction coordinator does not support group_log_xid(), we have to do
+things differently. In this case after the serialisation point at
+prepare_ordered(), we have to parallelise again when running log_xid()
+(otherwise we would lose group commit). But then when log_xid() is done, we
+have to serialise again to check for any error and call commit_ordered() in
+correct sequence for any transaction where log_xid() did not return error.
+
+The central part of the algorithm in this case (when using log_xid()) is:
+
+---- BEGIN ALGORITHM ----
+ cookie= log_xid(thd)
+ error= (cookie == 0)
+
+ if (is_group_commit_leader)
+
+ // The first to enqueue grabs the queue and runs first.
+ // But we must wait until a previous queue run is fully done.
+
+ lock(LOCK_group_commit)
+ while (group_commit_queue_busy)
+ cond_wait(COND_group_commit)
+ queue= atomic_grab_reverse_queue()
+ // The queue will be busy until last thread in it is done.
+ group_commit_queue_busy= TRUE
+ unlock(LOCK_group_commit)
+ else
+ // Not first in queue -> wait for previous one to wake us up.
+ lock(thd->LOCK_commit_ordered)
+ while (!thd->group_commit_ready)
+ cond_wait(thd->LOCK_commit_ordered, thd->COND_commit_ordered)
+    unlock(thd->LOCK_commit_ordered)
+
+ if (!error) // Only if log_xid() was successful
+ lock(LOCK_commit_ordered)
+ ht->commit_ordered()
+ unlock(LOCK_commit_ordered)
+
+ // Wake up the next thread, and release queue in last.
+ next= thd->next_commit_ordered
+
+ if (next)
+ lock(next->LOCK_commit_ordered)
+ next->group_commit_ready= TRUE
+ cond_signal(next->COND_commit_ordered)
+ unlock(next->LOCK_commit_ordered)
+ else
+ lock(LOCK_group_commit)
+ group_commit_queue_busy= FALSE
+ unlock(LOCK_group_commit)
+---- END ALGORITHM ----
+
+There are a number of locks taken in the algorithm, but in the group_log_xid()
+case most of them should be uncontended most of the time. The
+LOCK_group_commit of course will be contended, as new threads queue up waiting
+for the previous group commit (and binlog fsync()) to finish so they can do
+the next group commit. This is the whole point of implementing group commit.
+
+The LOCK_prepare_ordered and LOCK_commit_ordered mutexes should not be much
+contended as long as handlers follow the intention of having the corresponding
+handler calls execute quickly.
+
+The per-thread LOCK_commit_ordered mutexes should not be contended; they are
+only used to wake up a sleeping thread.
+
+
+1.4 Optimisations when not using all three new calls
+
+
+The prepare_ordered(), group_log_xid(), and commit_ordered() methods are
+optional, and if not implemented by a particular handler/transaction
+coordinator, we can optimise the algorithm to take advantage of not having to
+keep ordering for the missing parts.
+
+If there is no prepare_ordered(), then we need not take the
+LOCK_prepare_ordered mutex.
+
+If there is no commit_ordered(), then we need not take the LOCK_commit_ordered
+mutex.
+
+If there is no group_log_xid(), then we only need the queue to ensure same
+ordering of transactions for commit_ordered() as for prepare_ordered(). Thus,
+if either of these (or both) are also not present, we do not need to use the
+queue at all.
+
+
+2. Binlog code changes (log.cc)
+
+
+The bulk of the work needed for the binary log is to extend the code to allow
+group commit to the log. Unlike InnoDB/XtraDB, there is no existing support
+inside the binlog code for group commit.
+
+The existing code runs most of the write + fsync to the binary log under the
+global LOCK_log mutex, preventing any group commit.
+
+To enable group commit, this code must be split into two parts:
+
+ - one part that runs per transaction, re-writing the embedded event positions
+ for the correct offset, and writing this into the in-memory log cache.
+
+ - another part that writes a set of transactions to the disk, and runs
+ fsync().
+
+Then in group_log_xid(), we can run the first part in a loop over all the
+transactions in the passed-in queue, and run the second part only once.
+
+The binlog code also has other code paths that write into the binlog,
+eg. non-transactional statements. These have to be adapted also to work with
+the new code.
+
+In order to get some group commit facility for these also, we change that part
+of the code in a similar way to ha_commit_trans. We keep another,
+binlog-internal queue of such non-transactional binlog writes, and such writes
+queue up here before sleeping on the LOCK_log mutex. Once a thread obtains the
+LOCK_log, it loops over the queue for the fast part, and does the slow part
+once, then finally wakes up the others in the queue.
+
+In the transactional case in group_log_xid(), before we run the passed-in
+queue, we add any members found in the binlog-internal queue. This allows
+these non-transactional writes to share the group commit.
+
+However, in the case where it is a non-transactional write that gets the
+LOCK_log, the transactional transactions from the ha_commit_trans() queue will
+not be able to take part (they will have to wait for their turn to do another
+fsync). It seems difficult to cleanly let the binlog code grab the queue from
+out of the ha_commit_trans() algorithm. I think the group commit is mostly
+useful in transactional workloads anyway (non-transactional engines will lose
+data anyway in case of crash, so why fsync() after each transaction?)
+
+
+3. XtraDB changes (ha_innodb.cc)
+
+The changes needed in XtraDB are comparatively simple, as XtraDB already
+implements group commit, it just needs to be enabled with the new
+commit_ordered() call.
+
+The existing commit() method already is logically in two parts. The first part
+runs under the prepare_commit_mutex() and must be run in same order as binlog
+commit. This part needs to be moved to commit_ordered(). The second part runs
+after releasing prepare_commit_mutex and does transaction log write+fsync; it
+can remain.
+
+Then the prepare_commit_mutex is removed (and the enable_unsafe_group_commit
+XtraDB option to disable it).
+
+There are two asserts that check that the thread running the first part of
+XtraDB commit is the same as the thread running the other operations for the
+transaction. These have to be removed (as commit_ordered() can run in a
+different thread). Also an error reporting with sql_print_error() has to be
+delayed until commit() time.
+
+
+4. Proof-of-concept implementation
+
+There is a proof-of-concept implementation of this architecture, in the form
+of a quilt patch series [3].
+
+A quick benchmark was done, with sync_binlog=1 and
+innodb_flush_log_at_trx_commit=1. 64 parallel threads doing single-row
+transactions against one table.
+
+Without the patch, we get only 25 queries per second.
+
+With the patch, we get 650 queries per second.
+
+
+5. Open issues/tasks
+
+5.1 XA / other prepare() and commit() call sites.
+
+Check that user-level XA is handled correctly and working. And covered
+sufficiently with tests. Also check that any other calls of ha->prepare() and
+ha->commit() outside of ha_commit_trans() are handled correctly.
+
+5.2 Testing
+
+This worklog needs additions to the test suite, including error inserts to
+check error handling, and synchronisation points to check thread parallelism
+correctness.
+
+
+6. Alternative implementations
+
+ - The binlog code maintains its own extra atomic transaction queue to handle
+ non-transactional commits in a good way together with transactional (with
+ respect to group commit). Alternatively, we could ignore this issue and
+ just give up on group commit for non-transactional statements, for some
+ code simplifications.
+
+ - The binlog code has two ways to prepare end_event and similar, one that
+ uses stack-allocation, and another for when stack allocation is not
+ possible that uses thd->mem_root. Probably the overhead of thd->mem_root is
+ so small that it would make sense to use the same code for both cases.
+
+ - Instead of adding extra fields to THD, we could allocate a separate
+ structure on the thd->mem_root() with the required extra fields (including
+ the THD pointer). Would seem to require initialising mutexes at every
+ commit though.
+
+ - It would probably be a good idea to implement TC_LOG_MMAP::group_log_xid()
+ (should not be hard).
+
+
+-----------------------------------------------------------------------
+
+References:
+
+[2] https://secure.wikimedia.org/wikipedia/en/wiki/ABA_problem
+
+[3] https://knielsen-hq.org/maria/patches.mwl116/
-=-=(Knielsen - Tue, 25 May 2010, 13:18)=-=-
High-Level Specification modified.
--- /tmp/wklog.116.old.14249 2010-05-25 13:18:34.000000000 +0000
+++ /tmp/wklog.116.new.14249 2010-05-25 13:18:34.000000000 +0000
@@ -1 +1,157 @@
+The basic idea in group commit is that multiple threads, each handling one
+transaction, prepare for commit and then queue up together waiting to do an
+fsync() on the transaction log. Then once the log is available, a single
+thread does the fsync() + other necessary book-keeping for all of the threads
+at once. After this, the single thread signals the other threads that it's
+done and they can finish up and return success (or failure) from the commit
+operation.
+
+So group commit has a parallel part, and a sequential part. So we need a
+facility for engines/binlog to participate in both the parallel and the
+sequential part.
+
+To do this, we add two new handlerton methods:
+
+ int (*prepare_ordered)(handlerton *hton, THD *thd, bool all);
+ void (*commit_ordered)(handlerton *hton, THD *thd, bool all);
+
+The idea is that the existing prepare() and commit() methods run in the
+parallel part of group commit, and the new prepare_ordered() and
+commit_ordered() run in the sequential part.
+
+The prepare_ordered() method is called after prepare(). The order of
+transactions that call into prepare_ordered() is guaranteed to be the same among
+all storage engines and binlog, and it is serialised so no two calls can be
+running inside the same engine at the same time.
+
+The commit_ordered() method is called before commit(), and similarly is
+guaranteed to have same transaction order in all participants, and to be
+serialised within one engine.
+
+As the prepare_ordered() and commit_ordered() calls are serialised, the idea
+is that handlers should do the minimum amount of work needed in these calls,
+relaying most of the work (eg. fsync() ...) to prepare() and commit().
+
+As a concrete example, for InnoDB the commit_ordered() method will do the
+first part of commit that fixes the commit order in the transaction log
+buffer, and the commit() method will write the log to disk and fsync()
+it. This split already exists inside the InnoDB code, running before
+respectively after releasing the prepare_commit_mutex.
+
+In addition, the XA transaction coordinator (TC_LOG) is special, since it is
+the one responsible for deciding whether to commit or rollback the
+transaction. For this we need an extra method, since this decision can be done
+only after we know that all prepare() and prepare_ordered() calls succeed, and
+must be done to know whether to call commit_ordered()/commit(), or do rollback.
+
+The existing method for this is TC_LOG::log_xid(). To make group commit
+simpler and more efficient to implement in a transaction coordinator,
+we introduce a new method:
+
+ void group_log_xid(THD *first_thd);
+
+This method runs in the sequential part of group commit. It receives a list of
+transactions to perform log_xid() on, in the correct commit order. (Note that
+TC_LOG can do parallel parts of group commit in its own prepare() and commit()
+methods).
+
+This method can make it easier to implement the group commit in TC_LOG, as it
+gets directly the list of transactions in the right order. Without it, it
+might need to compute such order anyway in a prepare_ordered() method, and the
+server has to create this ordered list anyway to implement the order guarantee
+for prepare_ordered() and commit_ordered().
+
+This group_log_xid() method also is more efficient, as it avoids some
+inter-thread synchronisation. Since group_log_xid() is serialised, we can run
+it together with all the commit_ordered() method calls and need only a single
+sequential code section. With the log_xid() methods, we would need first a
+sequential part for the prepare_ordered() calls, then a parallel part with
+log_xid() calls (to not lose group commit ability for log_xid()), then again
+a sequential part for the commit_ordered() method calls.
+
+The extra synchronisation is needed, as each commit_ordered() call will have
+to wait for log_xid() in one thread (if log_xid() fails then commit_ordered()
+should not be called), and also wait for commit_ordered() to finish in all
+threads handling earlier commits. In effect we will need to bounce the
+execution from one thread to the other among all participants in the group
+commit.
+
+As a consequence of the group_log_xid() optimisation, handlers must be aware
+that the commit_ordered() call can happen in another thread than the one
+running commit() (so thread local storage is not available). This should not
+be a big issue as the THD is available for storing any needed information.
+
+Since group_log_xid() runs for multiple transactions in a single thread, it
+can not do error reporting (my_error()) as that relies on thread local
+storage. Instead it sets an error code in THD::xid_error, and if there is an
+error then later another method will be called (in correct thread context) to
+actually report the error:
+
+ int xid_delayed_error(THD *thd)
+
+The three new methods prepare_ordered(), group_log_xid(), and commit_ordered()
+are optional (as is xid_delayed_error). A storage engine or transaction
+coordinator is free to not implement them if they are not needed. In this case
+there will be no order guarantee for the corresponding stage of group commit
+for that engine. For example, InnoDB needs no ordering of the prepare phase,
+so can omit implementing prepare_ordered(); TC_LOG_MMAP needs no ordering at
+all, so does not need to implement any of them.
+
+Note in particular that all existing engines (/binlog implementations if they
+exist) will work unmodified (and also without any change in group commit
+facilities or commit order guarantees).
+
+Using these new APIs, the work will be to
+
+ - In ha_commit_trans(), implement the correct semantics for the three new
+ calls.
+
+ - In XtraDB, use the new commit_ordered() call to remove the
+   prepare_commit_mutex (and resurrect group commit) without losing the
+ consistency with binlog commit order.
+
+ - In log.cc (binlog module), implement group_log_xid() to do group commit of
+ multiple transactions to the binlog with a single shared fsync() call.
+
+-----------------------------------------------------------------------
+Some possible alternative for this worklog:
+
+ - We could eliminate the group_log_xid() method for a simpler API, at the
+ cost of extra synchronisation between threads to do in-order
+ commit_ordered() method calls. This would also allow to call
+ commit_ordered() in the correct thread context.
+
+ - Alternatively, we could eliminate log_xid() and require that all
+ transaction coordinators implement group_log_xid() instead, again for some
+ moderate simplification.
+
+ - At the moment there is no plugin actually using prepare_ordered(), so, it
+ could be removed from the design. But it fits in well, is efficient to
+ implement, and could be useful later (eg. for the requested feature of
+ releasing locks early in InnoDB).
+
+-----------------------------------------------------------------------
+Some possible follow-up projects after this is implemented:
+
+ - Add statistics about how efficient group commit is (#fsyncs/#commits in
+ each engine and binlog).
+
+ - Implement an XtraDB prepare_ordered() method that can release row locks
+ early (Mark Callaghan from Facebook advocates this, but need to determine
+ exactly how to do this safely).
+
+ - Implement a new crash recovery algorithm that uses the consistent commit
+ ordering to need only fsync() for the binlog. At crash recovery, any
+   missing transactions in an engine are replayed from the correct point in the
+ binlog (this point must be stored transactionally inside the engine, as
+ XtraDB already does today).
+
+ - Implement that START TRANSACTION WITH CONSISTENT SNAPSHOT 1) really gets a
+   consistent snapshot, with the same set of committed and not committed
+ transactions in all engines, 2) returns a corresponding consistent binlog
+ position. This should be easy by piggybacking on the synchronisation
+ implemented for ha_commit_trans().
+
+ - Use this in XtraBackup to get consistent binlog position without having to
+ block all updates with FLUSH TABLES WITH READ LOCK.
-=-=(Knielsen - Tue, 25 May 2010, 13:18)=-=-
High Level Description modified.
--- /tmp/wklog.116.old.14234 2010-05-25 13:18:07.000000000 +0000
+++ /tmp/wklog.116.new.14234 2010-05-25 13:18:07.000000000 +0000
@@ -21,3 +21,69 @@
http://kristiannielsen.livejournal.com/12408.html
http://kristiannielsen.livejournal.com/12553.html
+----
+
+Implementing group commit in MySQL faces some challenges from the handler
+plugin architecture:
+
+1. Because storage engine handlers have separate transaction log from the
+mysql binlog (and from each other), there are multiple fsync() calls per
+commit that need the group commit optimisation (2 per participating storage
+engine + 1 for binlog).
+
+2. The code handling commit is split in several places, in main server code
+and in storage engine code. With pluggable binlog it will be split even
+more. This requires a good abstract yet powerful API to be able to implement
+group commit simply and efficiently in plugins without the different parts
+having to rely on internals of the others.
+
+3. We want the order of commits to be the same in all engines participating in
+multiple transactions. This requirement is the reason that InnoDB currently
+breaks group commit with the infamous prepare_commit_mutex.
+
+While currently there is no server guarantee to get same commit order in
+engines and binlog (except for the InnoDB prepare_commit_mutex hack), there are
+several reasons why this could be desirable:
+
+ - InnoDB hot backup needs to be able to extract a binlog position that is
+ consistent with the hot backup to be able to provision a new slave, and
+ this is impossible without imposing at least partial consistent ordering
+ between InnoDB and binlog.
+
+ - Other backup methods could have similar needs, eg. XtraBackup or
+ `mysqldump --single-transaction`, to have consistent commit order between
+ binlog and storage engines without having to do FLUSH TABLES WITH READ LOCK
+ or similar expensive blocking operation. (other backup methods, like LVM
+ snapshot, don't need consistent commit order, as they can restore
+ out-of-order commits during crash recovery using XA).
+
+ - If we have consistent commit order, we can think about optimising commit to
+ need only one fsync (for binlog); lost commits in storage engines can then
+ be recovered from the binlog at crash recovery by re-playing against the
+ engine from a particular point in the binlog.
+
+ - With consistent commit order, we can get better semantics for START
+ TRANSACTION WITH CONSISTENT SNAPSHOT with multi-engine transactions (and we
+ could even get it to return also a matching binlog position). Currently,
+ this "CONSISTENT SNAPSHOT" can be inconsistent among multiple storage
+ engines.
+
+ - In InnoDB, the performance in the presence of hotspots can be improved if
+   we can release row locks early in the commit phase, but this requires that
+   we release them in the same order as commits in the binlog to ensure
+   consistency between master and slaves.
+
+ - There were some discussions around Galera [1] synchronous replication and
+ global transaction ID that it needed consistent commit order among
+ participating engines.
+
+ - I believe there could be other applications for guaranteed consistent
+ commit order, and that the architecture described in this worklog can
+ implement such guarantee with reasonable overhead.
+
+
+References:
+
+[1] Galera: http://www.codership.com/products/galera_replication
+
-=-=(Knielsen - Tue, 25 May 2010, 08:28)=-=-
More thoughts on and changes to the architecture. Got to something now that I am satisfied with and
that seems to be able to handle all issues.
Implement new prepare_ordered and commit_ordered handler methods and the logic in ha_commit_trans().
Implement TC_LOG::group_log_xid() method and logic in ha_commit_trans().
Implement XtraDB part, using commit_ordered() rather than prepare_commit_mutex.
Fix test suite failures.
Proof-of-concept patch series complete now.
Do initial benchmark, getting good results. With 64 threads, see 26x improvement in queries-per-sec.
Next step: write up the architecture description.
Worked 21 hours and estimate 0 hours remain (original estimate increased by 21 hours).
-=-=(Knielsen - Wed, 12 May 2010, 06:41)=-=-
Started work on a Quilt patch series, refactoring the binlog code to prepare for implementing the
group commit, and working on the design of group commit in parallel.
Found and fixed several problems in error handling when writing to binlog.
Removed redundant table map version locking.
Split binlog writing into two parts in preparations for group commit. When ready to write to the
binlog, threads enter a queue, and the first thread in the queue handles the binlog writing for
everyone. When it obtains the LOCK_log, it first loops over all threads, executing the first part of
binlog writing (the write(2) syscall essentially). It then runs the second part (fsync(2)
essentially) only once, and then wakes up the remaining threads in the queue.
Still to be done:
Finish the proof-of-concept group commit patch, by 1) implementing the prepare_fast() and
commit_fast() callbacks in handler.cc 2) move the binlog thread enqueue from log_xid() to
binlog_prepare_fast(), 3) move fast part of InnoDB commit to innobase_commit_fast(), removing the
prepare_commit_mutex().
Write up the final design in this worklog.
Evaluate the design to see if we can do better/different.
Think about possible next steps, such as releasing innodb row locks early (in
innobase_prepare_fast), and doing crash recovery by replaying transactions from the binlog (removing
the need for engine durability and 2 of 3 fsync() in commit).
Worked 28 hours and estimate 0 hours remain (original estimate increased by 28 hours).
-=-=(Serg - Mon, 26 Apr 2010, 14:10)=-=-
Observers changed: Serg
DESCRIPTION:
Currently, in order to ensure that the server can recover after a crash to a
state in which storage engines and binary log are consistent with each other,
it is necessary to use XA with durable commits for both storage engines
(innodb_flush_log_at_trx_commit=1) and binary log (sync_binlog=1).
This is _very_ expensive, since the server needs to do three fsync() operations
for every commit, as there is no working group commit when the binary log is
enabled.
The idea is to
- Implement/fix group commit to work properly with the binary log enabled.
- (Optionally) avoid the need to fsync() in the engine, and instead rely on
replaying any lost transactions from the binary log against the engine
during crash recovery.
For background see these articles:
http://kristiannielsen.livejournal.com/12254.html
http://kristiannielsen.livejournal.com/12408.html
http://kristiannielsen.livejournal.com/12553.html
----
Implementing group commit in MySQL faces some challenges from the handler
plugin architecture:
1. Because storage engine handlers have separate transaction log from the
mysql binlog (and from each other), there are multiple fsync() calls per
commit that need the group commit optimisation (2 per participating storage
engine + 1 for binlog).
2. The code handling commit is split in several places, in main server code
and in storage engine code. With pluggable binlog it will be split even
more. This requires a good abstract yet powerful API to be able to implement
group commit simply and efficiently in plugins without the different parts
having to rely on internals of the others.
3. We want the order of commits to be the same in all engines participating in
multiple transactions. This requirement is the reason that InnoDB currently
breaks group commit with the infamous prepare_commit_mutex.
While currently there is no server guarantee to get same commit order in
engines and binlog (except for the InnoDB prepare_commit_mutex hack), there are
several reasons why this could be desirable:
- InnoDB hot backup needs to be able to extract a binlog position that is
consistent with the hot backup to be able to provision a new slave, and
this is impossible without imposing at least partial consistent ordering
between InnoDB and binlog.
- Other backup methods could have similar needs, eg. XtraBackup or
`mysqldump --single-transaction`, to have consistent commit order between
binlog and storage engines without having to do FLUSH TABLES WITH READ LOCK
or similar expensive blocking operation. (other backup methods, like LVM
snapshot, don't need consistent commit order, as they can restore
out-of-order commits during crash recovery using XA).
- If we have consistent commit order, we can think about optimising commit to
need only one fsync (for binlog); lost commits in storage engines can then
be recovered from the binlog at crash recovery by re-playing against the
engine from a particular point in the binlog.
- With consistent commit order, we can get better semantics for START
TRANSACTION WITH CONSISTENT SNAPSHOT with multi-engine transactions (and we
could even get it to return also a matching binlog position). Currently,
this "CONSISTENT SNAPSHOT" can be inconsistent among multiple storage
engines.
- In InnoDB, the performance in the presence of hotspots can be improved if
we can release row locks early in the commit phase, but this requires that
we release them in the same order as commits in the binlog to ensure
consistency between master and slaves.
- There were some discussions around Galera [1] synchronous replication and
global transaction ID that it needed consistent commit order among
participating engines.
- I believe there could be other applications for guaranteed consistent
commit order, and that the architecture described in this worklog can
implement such guarantee with reasonable overhead.
References:
[1] Galera: http://www.codership.com/products/galera_replication
HIGH-LEVEL SPECIFICATION:
The basic idea in group commit is that multiple threads, each handling one
transaction, prepare for commit and then queue up together waiting to do an
fsync() on the transaction log. Then once the log is available, a single
thread does the fsync() + other necessary book-keeping for all of the threads
at once. After this, the single thread signals the other threads that it's
done and they can finish up and return success (or failure) from the commit
operation.
So group commit has a parallel part, and a sequential part. So we need a
facility for engines/binlog to participate in both the parallel and the
sequential part.
To do this, we add two new handlerton methods:
int (*prepare_ordered)(handlerton *hton, THD *thd, bool all);
void (*commit_ordered)(handlerton *hton, THD *thd, bool all);
The idea is that the existing prepare() and commit() methods run in the
parallel part of group commit, and the new prepare_ordered() and
commit_ordered() run in the sequential part.
The prepare_ordered() method is called after prepare(). The order of
transactions that call into prepare_ordered() is guaranteed to be the same among
all storage engines and binlog, and it is serialised so no two calls can be
running inside the same engine at the same time.
The commit_ordered() method is called before commit(), and similarly is
guaranteed to have same transaction order in all participants, and to be
serialised within one engine.
As the prepare_ordered() and commit_ordered() calls are serialised, the idea
is that handlers should do the minimum amount of work needed in these calls,
relaying most of the work (eg. fsync() ...) to prepare() and commit().
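For illustration, an engine that only needs commit ordering could wire the
new methods up like this (a sketch against the proposed API; the engine
internals are placeholders):

  static void example_commit_ordered(handlerton *hton, THD *thd, bool all)
  {
    /* Runs serialised, in commit order: do only the cheap order-fixing
       work here, and leave the log write + fsync() to commit(). */
  }

  static int example_init(void *p)
  {
    handlerton *hton= (handlerton *)p;
    hton->commit_ordered= example_commit_ordered;
    /* prepare_ordered stays NULL here: all the new methods are optional. */
    return 0;
  }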
As a concrete example, for InnoDB the commit_ordered() method will do the
first part of commit that fixes the commit order in the transaction log
buffer, and the commit() method will write the log to disk and fsync()
it. This split already exists inside the InnoDB code, running before
respectively after releasing the prepare_commit_mutex.
In addition, the XA transaction coordinator (TC_LOG) is special, since it is
the one responsible for deciding whether to commit or rollback the
transaction. For this we need an extra method, since this decision can be done
only after we know that all prepare() and prepare_ordered() calls succeed, and
must be done to know whether to call commit_ordered()/commit(), or do rollback.
The existing method for this is TC_LOG::log_xid(). To make group commit
simpler and more efficient to implement in a transaction coordinator,
we introduce a new method:
void group_log_xid(THD *first_thd);
This method runs in the sequential part of group commit. It receives a list of
transactions to perform log_xid() on, in the correct commit order. (Note that
TC_LOG can do parallel parts of group commit in its own prepare() and commit()
methods).
This method can make it easier to implement the group commit in TC_LOG, as it
gets directly the list of transactions in the right order. Without it, it
might need to compute such order anyway in a prepare_ordered() method, and the
server has to create this ordered list anyway to implement the order guarantee
for prepare_ordered() and commit_ordered().
This group_log_xid() method also is more efficient, as it avoids some
inter-thread synchronisation. Since group_log_xid() is serialised, we can run
it together with all the commit_ordered() method calls and need only a single
sequential code section. With the log_xid() methods, we would need first a
sequential part for the prepare_ordered() calls, then a parallel part with
log_xid() calls (to not lose group commit ability for log_xid()), then again
a sequential part for the commit_ordered() method calls.
The extra synchronisation is needed, as each commit_ordered() call will have
to wait for log_xid() in one thread (if log_xid() fails then commit_ordered()
should not be called), and also wait for commit_ordered() to finish in all
threads handling earlier commits. In effect we will need to bounce the
execution from one thread to the other among all participants in the group
commit.
As a consequence of the group_log_xid() optimisation, handlers must be aware
that the commit_ordered() call can happen in another thread than the one
running commit() (so thread local storage is not available). This should not
be a big issue as the THD is available for storing any needed information.
Since group_log_xid() runs for multiple transactions in a single thread, it
can not do error reporting (my_error()) as that relies on thread local
storage. Instead it sets an error code in THD::xid_error, and if there is an
error then later another method will be called (in correct thread context) to
actually report the error:
int xid_delayed_error(THD *thd)
The three new methods prepare_ordered(), group_log_xid(), and commit_ordered()
are optional (as is xid_delayed_error). A storage engine or transaction
coordinator is free to not implement them if they are not needed. In this case
there will be no order guarantee for the corresponding stage of group commit
for that engine. For example, InnoDB needs no ordering of the prepare phase,
so can omit implementing prepare_ordered(); TC_LOG_MMAP needs no ordering at
all, so does not need to implement any of them.
Note in particular that all existing engines (/binlog implementations if they
exist) will work unmodified (and also without any change in group commit
facilities or commit order guarantees).
Using these new APIs, the work will be to
- In ha_commit_trans(), implement the correct semantics for the three new
calls.
- In XtraDB, use the new commit_ordered() call to remove the
prepare_commit_mutex (and resurrect group commit) without losing the
consistency with binlog commit order.
- In log.cc (binlog module), implement group_log_xid() to do group commit of
multiple transactions to the binlog with a single shared fsync() call.
-----------------------------------------------------------------------
Some possible alternative for this worklog:
- We could eliminate the group_log_xid() method for a simpler API, at the
cost of extra synchronisation between threads to do in-order
commit_ordered() method calls. This would also allow to call
commit_ordered() in the correct thread context.
- Alternatively, we could eliminate log_xid() and require that all
transaction coordinators implement group_log_xid() instead, again for some
moderate simplification.
- At the moment there is no plugin actually using prepare_ordered(), so, it
could be removed from the design. But it fits in well, is efficient to
implement, and could be useful later (eg. for the requested feature of
releasing locks early in InnoDB).
-----------------------------------------------------------------------
Some possible follow-up projects after this is implemented:
- Add statistics about how efficient group commit is (#fsyncs/#commits in
each engine and binlog).
- Implement an XtraDB prepare_ordered() method that can release row locks
early (Mark Callaghan from Facebook advocates this, but need to determine
exactly how to do this safely).
- Implement a new crash recovery algorithm that uses the consistent commit
ordering to need only fsync() for the binlog. At crash recovery, any
missing transactions in an engine are replayed from the correct point in the
binlog (this point must be stored transactionally inside the engine, as
XtraDB already does today).
- Implement that START TRANSACTION WITH CONSISTENT SNAPSHOT 1) really gets a
consistent snapshot, with the same set of committed and not committed
transactions in all engines, 2) returns a corresponding consistent binlog
position. This should be easy by piggybacking on the synchronisation
implemented for ha_commit_trans().
- Use this in XtraBackup to get consistent binlog position without having to
block all updates with FLUSH TABLES WITH READ LOCK.
LOW-LEVEL DESIGN:
1. Changes for ha_commit_trans()
The guts of the code for commit are in the function ha_commit_trans() (and in
commit_one_phase() which is called from it). This must be extended to use the
new prepare_ordered(), group_log_xid(), and commit_ordered() calls.
1.1 Atomic queue of committing transactions
To keep the right commit order among participants, we put transactions into a
queue. The operations on the queue are non-locking:
- Insert THD at the head of the queue, and return old queue.
THD *enqueue_atomic(THD *thd)
- Fetch (and delete) the whole queue.
THD *atomic_grab_reverse_queue()
These are simple to implement with atomic compare-and-set. Note that there is
no ABA problem [2], as we do not delete individual elements from the queue, we
grab the whole queue and replace it with NULL.
A transaction enters the queue when it does prepare_ordered(). This way, the
scheduling order for prepare_ordered() calls is what determines the sequence
in the queue and effectively the commit order.
The queue is grabbed by the code doing group_log_xid() and commit_ordered()
calls. The queue is passed directly to group_log_xid(), and afterwards
iterated to do individual commit_ordered() calls.
Using a lock-free queue allows prepare_ordered() (for one transaction) to run
in parallel with commit_ordered (in another transaction), increasing potential
parallelism.
The queue is simply a linked list of THD objects, linked through a
THD::next_commit_ordered field. Since we add at the head of the queue, the
list is actually in reverse order, so must be reversed when we grab and delete
it.
The reason that enqueue_atomic() returns the old queue is so that we can check
if an insert goes to the head of the queue. The thread at the head of the
queue will do the sequential part of group commit for everyone.
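A sketch of the two operations, using GCC atomic builtins for illustration
(the real code would use the server's own atomic wrappers):

  static THD *group_commit_queue= NULL;

  THD *enqueue_atomic(THD *thd)
  {
    THD *old;
    do
    {
      old= group_commit_queue;
      thd->next_commit_ordered= old;     /* link in front of the old head */
    } while (!__sync_bool_compare_and_swap(&group_commit_queue, old, thd));
    return old;                          /* NULL means we are the leader */
  }

  THD *atomic_grab_reverse_queue()
  {
    THD *q, *rev= NULL;
    /* Atomically take the whole list, replacing it with NULL. */
    do
    {
      q= group_commit_queue;
    } while (!__sync_bool_compare_and_swap(&group_commit_queue, q, NULL));
    /* Reverse in place so the result is in commit order. */
    while (q)
    {
      THD *next= q->next_commit_ordered;
      q->next_commit_ordered= rev;
      rev= q;
      q= next;
    }
    return rev;
  }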
1.2 Locks
1.2.1 Global LOCK_prepare_ordered
This lock is taken to serialise calls to prepare_ordered(). Note that
effectively, the commit order is decided by the order in which threads obtain
this lock.
1.2.2 Global LOCK_group_commit and COND_group_commit
This lock is used to protect the serial part of group commit. It is taken
around the code where we grab the queue, call group_log_xid() on the queue,
and call commit_ordered() on each element of the queue, to make sure they
happen serialised and in consistent order. It also protects the variable
group_commit_queue_busy, which is used when not using group_log_xid() to delay
running over a new queue until the first queue is completely done.
1.2.3 Global LOCK_commit_ordered
This lock is taken around calls to commit_ordered(), to ensure they happen
serialised.
1.2.4 Per-thread thd->LOCK_commit_ordered and thd->COND_commit_ordered
This lock protects the thd->group_commit_ready variable, as well as the
condition variable used to wake up threads after log_xid() and
commit_ordered() finishes.
1.2.5 Global LOCK_group_commit_queue
This is only used on platforms with no native compare-and-set operations, to
make the queue operations atomic.
1.3 Commit algorithm.
This is the basic algorithm, simplified by
- omitting some error handling
- omitting looping over all handlers when invoking handler methods
- omitting some possible optimisations when not all calls needed (see next
section).
- Omitting the case where no group_log_xid() is used, see below.
---- BEGIN ALGORITHM ----
ht->prepare()
// Call prepare_ordered() and enqueue in correct commit order
lock(LOCK_prepare_ordered)
ht->prepare_ordered()
old_queue= enqueue_atomic(thd)
thd->group_commit_ready= FALSE
is_group_commit_leader= (old_queue == NULL)
unlock(LOCK_prepare_ordered)
if (is_group_commit_leader)
// The first in queue handles group commit for everyone
lock(LOCK_group_commit)
// Wait while queue is busy, see below for when this occurs
while (group_commit_queue_busy)
cond_wait(COND_group_commit)
// Grab and reverse the queue to get correct order of transactions
queue= atomic_grab_reverse_queue()
// This call will set individual error codes in thd->xid_error
// It also sets the cookie for unlog() in thd->xid_cookie
group_log_xid(queue)
lock(LOCK_commit_ordered)
for (other IN queue)
if (!other->xid_error)
ht->commit_ordered()
unlock(LOCK_commit_ordered)
unlock(LOCK_group_commit)
// Now we are done, so wake up all the others.
for (other IN TAIL(queue))
lock(other->LOCK_commit_ordered)
other->group_commit_ready= TRUE
cond_signal(other->COND_commit_ordered)
unlock(other->LOCK_commit_ordered)
else
// If not the leader, just wait until leader did the work for us.
lock(thd->LOCK_commit_ordered)
while (!thd->group_commit_ready)
cond_wait(thd->LOCK_commit_ordered, thd->COND_commit_ordered)
unlock(thd->LOCK_commit_ordered)
// Finally do any error reporting now that we're back in own thread.
if (thd->xid_error)
xid_delayed_error(thd)
else
ht->commit(thd)
unlog(thd->xid_cookie, thd->xid)
---- END ALGORITHM ----
If the transaction coordinator does not support group_log_xid(), we have to do
things differently. In this case after the serialisation point at
prepare_ordered(), we have to parallelise again when running log_xid()
(otherwise we would lose group commit). But then when log_xid() is done, we
have to serialise again to check for any error and call commit_ordered() in
correct sequence for any transaction where log_xid() did not return error.
The central part of the algorithm in this case (when using log_xid()) is:
---- BEGIN ALGORITHM ----
cookie= log_xid(thd)
error= (cookie == 0)
if (is_group_commit_leader)
// The first to enqueue grabs the queue and runs first.
// But we must wait until a previous queue run is fully done.
lock(LOCK_group_commit)
while (group_commit_queue_busy)
cond_wait(COND_group_commit)
queue= atomic_grab_reverse_queue()
// The queue will be busy until last thread in it is done.
group_commit_queue_busy= TRUE
unlock(LOCK_group_commit)
else
// Not first in queue -> wait for previous one to wake us up.
lock(thd->LOCK_commit_ordered)
while (!thd->group_commit_ready)
cond_wait(thd->LOCK_commit_ordered, thd->COND_commit_ordered)
unlock(thd->LOCK_commit_ordered)
if (!error) // Only if log_xid() was successful
lock(LOCK_commit_ordered)
ht->commit_ordered()
unlock(LOCK_commit_ordered)
// Wake up the next thread, and release queue in last.
next= thd->next_commit_ordered
if (next)
lock(next->LOCK_commit_ordered)
next->group_commit_ready= TRUE
cond_signal(next->COND_commit_ordered)
unlock(next->LOCK_commit_ordered)
else
lock(LOCK_group_commit)
group_commit_queue_busy= FALSE
unlock(LOCK_group_commit)
---- END ALGORITHM ----
There are a number of locks taken in the algorithm, but in the group_log_xid()
case most of them should be uncontended most of the time. The
LOCK_group_commit of course will be contended, as new threads queue up waiting
for the previous group commit (and binlog fsync()) to finish so they can do
the next group commit. This is the whole point of implementing group commit.
The LOCK_prepare_ordered and LOCK_commit_ordered mutexes should not be much
contended as long as handlers follow the intention of having the corresponding
handler calls execute quickly.
The per-thread LOCK_commit_ordered mutexes should not be contended; they are
only used to wake up a sleeping thread.
1.4 Optimisations when not using all three new calls
The prepare_ordered(), group_log_xid(), and commit_ordered() methods are
optional, and if not implemented by a particular handler/transaction
coordinator, we can optimise the algorithm to take advantage of not having to
keep ordering for the missing parts.
If there is no prepare_ordered(), then we need not take the
LOCK_prepare_ordered mutex.
If there is no commit_ordered(), then we need not take the LOCK_commit_ordered
mutex.
If there is no group_log_xid(), then we only need the queue to ensure same
ordering of transactions for commit_ordered() as for prepare_ordered(). Thus,
if either of these (or both) are also not present, we do not need to use the
queue at all.
2. Binlog code changes (log.cc)
The bulk of the work needed for the binary log is to extend the code to allow
group commit to the log. Unlike InnoDB/XtraDB, there is no existing support
inside the binlog code for group commit.
The existing code runs most of the write + fsync to the binary log under the
global LOCK_log mutex, preventing any group commit.
To enable group commit, this code must be split into two parts:
- one part that runs per transaction, re-writing the embedded event positions
for the correct offset, and writing this into the in-memory log cache.
- another part that writes a set of transactions to the disk, and runs
fsync().
Then in group_log_xid(), we can run the first part in a loop over all the
transactions in the passed-in queue, and run the second part only once.
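In sketch form (the two helper functions are hypothetical stand-ins for the
fast and slow parts described above; xid_cookie and xid_error are the THD
fields introduced earlier):

  void group_log_xid(THD *first_thd)
  {
    /* Fast part: once per transaction, in commit order. */
    for (THD *thd= first_thd; thd; thd= thd->next_commit_ordered)
    {
      thd->xid_cookie= write_transaction_to_binlog(thd);  /* hypothetical */
      thd->xid_error= (thd->xid_cookie == 0);
    }
    /* Slow part: one write of the accumulated data plus one fsync(),
       shared by the whole group. */
    flush_and_sync_binlog();                               /* hypothetical */
  }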
The binlog code also has other code paths that write into the binlog,
eg. non-transactional statements. These have to be adapted also to work with
the new code.
In order to get some group commit facility for these also, we change that part
of the code in a similar way to ha_commit_trans. We keep another,
binlog-internal queue of such non-transactional binlog writes, and such writes
queue up here before sleeping on the LOCK_log mutex. Once a thread obtains the
LOCK_log, it loops over the queue for the fast part, and does the slow part
once, then finally wakes up the others in the queue.
In the transactional case in group_log_xid(), before we run the passed-in
queue, we add any members found in the binlog-internal queue. This allows
these non-transactional writes to share the group commit.
However, in the case where it is a non-transactional write that gets the
LOCK_log, the transactional transactions from the ha_commit_trans() queue will
not be able to take part (they will have to wait for their turn to do another
fsync). It seems difficult to cleanly let the binlog code grab the queue from
out of the ha_commit_trans() algorithm. I think the group commit is mostly
useful in transactional workloads anyway (non-transactional engines will lose
data anyway in case of crash, so why fsync() after each transaction?)
3. XtraDB changes (ha_innodb.cc)
The changes needed in XtraDB are comparatively simple, as XtraDB already
implements group commit, it just needs to be enabled with the new
commit_ordered() call.
The existing commit() method already is logically in two parts. The first part
runs under the prepare_commit_mutex() and must be run in same order as binlog
commit. This part needs to be moved to commit_ordered(). The second part runs
after releasing prepare_commit_mutex and does transaction log write+fsync; it
can remain.
Then the prepare_commit_mutex is removed (and the enable_unsafe_group_commit
XtraDB option to disable it).
There are two asserts that check that the thread running the first part of
XtraDB commit is the same as the thread running the other operations for the
transaction. These have to be removed (as commit_ordered() can run in a
different thread). Also an error reporting with sql_print_error() has to be
delayed until commit() time.
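In rough outline (a sketch only; trx_of() is a hypothetical stand-in for the
real transaction lookup in ha_innodb.cc):

  static void innobase_commit_ordered(handlerton *hton, THD *thd, bool all)
  {
    /* Former first half of innobase_commit(): runs serialised, in binlog
       commit order, and fixes the commit order in the log buffer.
       No prepare_commit_mutex needed any more. */
    trx_t *trx= trx_of(thd);
    innobase_commit_low(trx);
  }

  static int innobase_commit(handlerton *hton, THD *thd, bool all)
  {
    /* Remaining second half: transaction log write + fsync(); runs in
       the transaction's own thread and may group across transactions. */
    trx_t *trx= trx_of(thd);
    trx_commit_complete_for_mysql(trx);
    return 0;
  }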
4. Proof-of-concept implementation
There is a proof-of-concept implementation of this architecture, in the form
of a quilt patch series [3].
A quick benchmark was done, with sync_binlog=1 and
innodb_flush_log_at_trx_commit=1. 64 parallel threads doing single-row
transactions against one table.
Without the patch, we get only 25 queries per second.
With the patch, we get 650 queries per second.
5. Open issues/tasks
5.1 XA / other prepare() and commit() call sites.
Check that user-level XA is handled correctly and working. And covered
sufficiently with tests. Also check that any other calls of ha->prepare() and
ha->commit() outside of ha_commit_trans() are handled correctly.
5.2 Testing
This worklog needs additions to the test suite, including error inserts to
check error handling, and synchronisation points to check thread parallelism
correctness.
6. Alternative implementations
- The binlog code maintains its own extra atomic transaction queue to handle
non-transactional commits in a good way together with transactional (with
respect to group commit). Alternatively, we could ignore this issue and
just give up on group commit for non-transactional statements, for some
code simplifications.
- The binlog code has two ways to prepare end_event and similar, one that
uses stack-allocation, and another for when stack allocation is not
possible that uses thd->mem_root. Probably the overhead of thd->mem_root is
so small that it would make sense to use the same code for both cases.
- Instead of adding extra fields to THD, we could allocate a separate
structure on the thd->mem_root() with the required extra fields (including
the THD pointer). Would seem to require initialising mutexes at every
commit though.
- It would probably be a good idea to implement TC_LOG_MMAP::group_log_xid()
(should not be hard).
-----------------------------------------------------------------------
References:
[2] https://secure.wikimedia.org/wikipedia/en/wiki/ABA_problem
[3] https://knielsen-hq.org/maria/patches.mwl116/
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v3.5.9)
1
0

[Maria-developers] Updated (by Guest): Efficient group commit for binary log (116)
by worklog-noreply@askmonty.org 01 Jun '10
by worklog-noreply@askmonty.org 01 Jun '10
01 Jun '10
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: Efficient group commit for binary log
CREATION DATE..: Mon, 26 Apr 2010, 13:28
SUPERVISOR.....: Knielsen
IMPLEMENTOR....: Knielsen
COPIES TO......: Serg
CATEGORY.......: Server-RawIdeaBin
TASK ID........: 116 (http://askmonty.org/worklog/?tid=116)
VERSION........: Server-9.x
STATUS.........: Assigned
PRIORITY.......: 60
WORKED HOURS...: 60
ESTIMATE.......: 0 (hours remain)
ORIG. ESTIMATE.: 0
PROGRESS NOTES:
-=-=(Guest - Tue, 01 Jun 2010, 14:20)=-=-
Status updated.
--- /tmp/wklog.116.old.32652 2010-06-01 14:20:15.000000000 +0000
+++ /tmp/wklog.116.new.32652 2010-06-01 14:20:15.000000000 +0000
@@ -1 +1 @@
-Un-Assigned
+Assigned
-=-=(Knielsen - Mon, 31 May 2010, 06:48)=-=-
Finish first architecture draft (changed my mind a number of times before I was satisfied).
Write up architecture in worklog.
Fix remaining test failures in proof-of-concept patch + implement xtradb part.
Run some benchmarks on proof-of-concept implementation.
Worked 11 hours and estimate 0 hours remain (original estimate increased by 11 hours).
-=-=(Knielsen - Tue, 25 May 2010, 13:19)=-=-
Low Level Design modified.
--- /tmp/wklog.116.old.14255 2010-05-25 13:19:00.000000000 +0000
+++ /tmp/wklog.116.new.14255 2010-05-25 13:19:00.000000000 +0000
@@ -1 +1,363 @@
+1. Changes for ha_commit_trans()
+
+The guts of the commit code are in the function ha_commit_trans() (and in
+commit_one_phase() which is called from it). This must be extended to use the
+new prepare_ordered(), group_log_xid(), and commit_ordered() calls.
+
+1.1 Atomic queue of committing transactions
+
+To keep the right commit order among participants, we put transactions into a
+queue. The operations on the queue are non-locking:
+
+ - Insert THD at the head of the queue, and return old queue.
+
+ THD *enqueue_atomic(THD *thd)
+
+ - Fetch (and delete) the whole queue.
+
+ THD *atomic_grab_reverse_queue()
+
+These are simple to implement with atomic compare-and-set. Note that there is
+no ABA problem [2], as we never delete individual elements from the queue; we
+grab the whole queue and replace it with NULL.
+
+A transaction enters the queue when it does prepare_ordered(). This way, the
+scheduling order for prepare_ordered() calls is what determines the sequence
+in the queue and effectively the commit order.
+
+The queue is grabbed by the code doing group_log_xid() and commit_ordered()
+calls. The queue is passed directly to group_log_xid(), and afterwards
+iterated to do individual commit_ordered() calls.
+
+Using a lock-free queue allows prepare_ordered() (for one transaction) to run
+in parallel with commit_ordered (in another transaction), increasing potential
+parallelism.
+
+The queue is simply a linked list of THD objects, linked through a
+THD::next_commit_ordered field. Since we add at the head of the queue, the
+list is actually in reverse order, so must be reversed when we grab and delete
+it.
+
+The reason that enqueue_atomic() returns the old queue is so that we can check
+if an insert goes to the head of the queue. The thread at the head of the
+queue will do the sequential part of group commit for everyone.
+
+
+1.2 Locks
+
+1.2.1 Global LOCK_prepare_ordered
+
+This lock is taken to serialise calls to prepare_ordered(). Note that
+effectively, the commit order is decided by the order in which threads obtain
+this lock.
+
+
+1.2.2 Global LOCK_group_commit and COND_group_commit
+
+This lock is used to protect the serial part of group commit. It is taken
+around the code where we grab the queue, call group_log_xid() on the queue,
+and call commit_ordered() on each element of the queue, to make sure they
+happen serialised and in consistent order. It also protects the variable
+group_commit_queue_busy, which is used when not using group_log_xid() to delay
+running over a new queue until the first queue is completely done.
+
+
+1.2.3 Global LOCK_commit_ordered
+
+This lock is taken around calls to commit_ordered(), to ensure they happen
+serialised.
+
+
+1.2.4 Per-thread thd->LOCK_commit_ordered and thd->COND_commit_ordered
+
+This lock protects the thd->group_commit_ready variable, as well as the
+condition variable used to wake up threads after log_xid() and
+commit_ordered() finishes.
+
+
+1.2.5 Global LOCK_group_commit_queue
+
+This is only used on platforms with no native compare-and-set operations, to
+make the queue operations atomic.
+
+
+1.3 Commit algorithm.
+
+This is the basic algorithm, simplified by
+
+ - omitting some error handling
+
+ - omitting looping over all handlers when invoking handler methods
+
+ - omitting some possible optimisations when not all calls are needed (see
+   next section)
+
+ - omitting the case where no group_log_xid() is used (see below).
+
+---- BEGIN ALGORITHM ----
+ ht->prepare()
+
+ // Call prepare_ordered() and enqueue in correct commit order
+ lock(LOCK_prepare_ordered)
+ ht->prepare_ordered()
+ old_queue= enqueue_atomic(thd)
+ thd->group_commit_ready= FALSE
+ is_group_commit_leader= (old_queue == NULL)
+ unlock(LOCK_prepare_ordered)
+
+ if (is_group_commit_leader)
+
+ // The first in queue handles group commit for everyone
+
+ lock(LOCK_group_commit)
+ // Wait while queue is busy, see below for when this occurs
+ while (group_commit_queue_busy)
+ cond_wait(COND_group_commit)
+
+ // Grab and reverse the queue to get correct order of transactions
+ queue= atomic_grab_reverse_queue()
+
+ // This call will set individual error codes in thd->xid_error
+ // It also sets the cookie for unlog() in thd->xid_cookie
+ group_log_xid(queue)
+
+ lock(LOCK_commit_ordered)
+ for (other IN queue)
+ if (!other->xid_error)
+ ht->commit_ordered()
+ unlock(LOCK_commit_ordered)
+
+ unlock(LOCK_group_commit)
+
+ // Now we are done, so wake up all the others.
+ for (other IN TAIL(queue))
+ lock(other->LOCK_commit_ordered)
+ other->group_commit_ready= TRUE
+ cond_signal(other->COND_commit_ordered)
+ unlock(other->LOCK_commit_ordered)
+ else
+ // If not the leader, just wait until leader did the work for us.
+ lock(thd->LOCK_commit_ordered)
+ while (!thd->group_commit_ready)
+ cond_wait(thd->LOCK_commit_ordered, thd->COND_commit_ordered)
+    unlock(thd->LOCK_commit_ordered)
+
+ // Finally do any error reporting now that we're back in own thread.
+ if (thd->xid_error)
+ xid_delayed_error(thd)
+ else
+ ht->commit(thd)
+ unlog(thd->xid_cookie, thd->xid)
+---- END ALGORITHM ----
+
+If the transaction coordinator does not support group_log_xid(), we have to do
+things differently. In this case after the serialisation point at
+prepare_ordered(), we have to parallelise again when running log_xid()
+(otherwise we would lose group commit). But then when log_xid() is done, we
+have to serialise again to check for any error and call commit_ordered() in
+correct sequence for any transaction where log_xid() did not return error.
+
+The central part of the algorithm in this case (when using log_xid()) is:
+
+---- BEGIN ALGORITHM ----
+ cookie= log_xid(thd)
+ error= (cookie == 0)
+
+ if (is_group_commit_leader)
+
+ // The first to enqueue grabs the queue and runs first.
+ // But we must wait until a previous queue run is fully done.
+
+ lock(LOCK_group_commit)
+ while (group_commit_queue_busy)
+ cond_wait(COND_group_commit)
+ queue= atomic_grab_reverse_queue()
+ // The queue will be busy until last thread in it is done.
+ group_commit_queue_busy= TRUE
+ unlock(LOCK_group_commit)
+ else
+ // Not first in queue -> wait for previous one to wake us up.
+ lock(thd->LOCK_commit_ordered)
+ while (!thd->group_commit_ready)
+ cond_wait(thd->LOCK_commit_ordered, thd->COND_commit_ordered)
+    unlock(thd->LOCK_commit_ordered)
+
+ if (!error) // Only if log_xid() was successful
+ lock(LOCK_commit_ordered)
+ ht->commit_ordered()
+ unlock(LOCK_commit_ordered)
+
+ // Wake up the next thread, and release queue in last.
+ next= thd->next_commit_ordered
+
+ if (next)
+ lock(next->LOCK_commit_ordered)
+ next->group_commit_ready= TRUE
+ cond_signal(next->COND_commit_ordered)
+ unlock(next->LOCK_commit_ordered)
+ else
+ lock(LOCK_group_commit)
+ group_commit_queue_busy= FALSE
+ unlock(LOCK_group_commit)
+---- END ALGORITHM ----
+
+There are a number of locks taken in the algorithm, but in the group_log_xid()
+case most of them should be uncontended most of the time. The
+LOCK_group_commit of course will be contended, as new threads queue up waiting
+for the previous group commit (and binlog fsync()) to finish so they can do
+the next group commit. This is the whole point of implementing group commit.
+
+The LOCK_prepare_ordered and LOCK_commit_ordered mutexes should not be much
+contended as long as handlers follow the intention of having the corresponding
+handler calls execute quickly.
+
+The per-thread LOCK_commit_ordered mutexes should not be contended; they are
+only used to wake up a sleeping thread.
+
+
+1.4 Optimisations when not using all three new calls
+
+
+The prepare_ordered(), group_log_xid(), and commit_ordered() methods are
+optional, and if not implemented by a particular handler/transaction
+coordinator, we can optimise the algorithm to take advantage of not having to
+keep ordering for the missing parts.
+
+If there is no prepare_ordered(), then we need not take the
+LOCK_prepare_ordered mutex.
+
+If there is no commit_ordered(), then we need not take the LOCK_commit_ordered
+mutex.
+
+If there is no group_log_xid(), then we only need the queue to ensure same
+ordering of transactions for commit_ordered() as for prepare_ordered(). Thus,
+if either of these (or both) are also not present, we do not need to use the
+queue at all.
+
+
+2. Binlog code changes (log.cc)
+
+
+The bulk of the work needed for the binary log is to extend the code to allow
+group commit to the log. Unlike InnoDB/XtraDB, there is no existing support
+inside the binlog code for group commit.
+
+The existing code runs most of the write + fsync to the binary log under the
+global LOCK_log mutex, preventing any group commit.
+
+To enable group commit, this code must be split into two parts:
+
+ - one part that runs per transaction, re-writing the embedded event positions
+ for the correct offset, and writing this into the in-memory log cache.
+
+ - another part that writes a set of transactions to the disk, and runs
+ fsync().
+
+Then in group_log_xid(), we can run the first part in a loop over all the
+transactions in the passed-in queue, and run the second part only once.
+
+The binlog code also has other code paths that write into the binlog,
+eg. non-transactional statements. These have to be adapted also to work with
+the new code.
+
+In order to get some group commit facility for these also, we change that part
+of the code in a similar way to ha_commit_trans. We keep another,
+binlog-internal queue of such non-transactional binlog writes, and such writes
+queue up here before sleeping on the LOCK_log mutex. Once a thread obtains the
+LOCK_log, it loops over the queue for the fast part, and does the slow part
+once, then finally wakes up the others in the queue.
+
+In the transactional case in group_log_xid(), before we run the passed-in
+queue, we add any members found in the binlog-internal queue. This allows
+these non-transactional writes to share the group commit.
+
+However, in the case where it is a non-transactional write that gets the
+LOCK_log, the transactions from the ha_commit_trans() queue will not be able
+to take part (they will have to wait for their turn to do another fsync). It
+seems difficult to cleanly let the binlog code grab the queue from out of the
+ha_commit_trans() algorithm. I think group commit is mostly useful in
+transactional workloads anyway (non-transactional engines will lose data
+anyway in case of a crash, so why fsync() after each transaction?)
+
+
+3. XtraDB changes (ha_innodb.cc)
+
+The changes needed in XtraDB are comparatively simple, as XtraDB already
+implements group commit, it just needs to be enabled with the new
+commit_ordered() call.
+
+The existing commit() method already is logically in two parts. The first part
+runs under the prepare_commit_mutex() and must be run in same order as binlog
+commit. This part needs to be moved to commit_ordered(). The second part runs
+after releasing prepare_commit_mutex and does transaction log write+fsync; it
+can remain.
+
+Then the prepare_commit_mutex is removed (and the enable_unsafe_group_commit
+XtraDB option to disable it).
+
+There are two asserts that check that the thread running the first part of
+XtraDB commit is the same as the thread running the other operations for the
+transaction. These have to be removed (as commit_ordered() can run in a
+different thread). Also, error reporting with sql_print_error() has to be
+delayed until commit() time.
+
+
+4. Proof-of-concept implementation
+
+There is a proof-of-concept implementation of this architecture, in the form
+of a quilt patch series [3].
+
+A quick benchmark was done, with sync_binlog=1 and
+innodb_flush_log_at_trx_commit=1. 64 parallel threads doing single-row
+transactions against one table.
+
+Without the patch, we get only 25 queries per second.
+
+With the patch, we get 650 queries per second.
+
+
+5. Open issues/tasks
+
+5.1 XA / other prepare() and commit() call sites.
+
+Check that user-level XA is handled correctly and works, and that it is
+covered sufficiently with tests. Also check that any other calls of
+ha->prepare() and ha->commit() outside of ha_commit_trans() are handled
+correctly.
+
+5.2 Testing
+
+This worklog needs additions to the test suite, including error inserts to
+check error handling, and synchronisation points to check thread parallelism
+correctness.
+
+
+6. Alternative implementations
+
+ - The binlog code maintains its own extra atomic transaction queue so that
+   non-transactional commits can participate in group commit together with
+   transactional ones. Alternatively, we could ignore this issue and just
+   give up on group commit for non-transactional statements, in exchange for
+   some code simplification.
+
+ - The binlog code has two ways to prepare end_event and similar, one that
+   uses stack allocation, and another, used when stack allocation is not
+   possible, that uses thd->mem_root. Probably the overhead of thd->mem_root
+   is so small that it would make sense to use the same code for both cases.
+
+ - Instead of adding extra fields to THD, we could allocate a separate
+   structure on the thd->mem_root with the required extra fields (including
+   the THD pointer). This would seem to require initialising mutexes at
+   every commit, though.
+
+ - It would probably be a good idea to implement TC_LOG_MMAP::group_log_xid()
+ (should not be hard).
+
+
+-----------------------------------------------------------------------
+
+References:
+
+[2] https://secure.wikimedia.org/wikipedia/en/wiki/ABA_problem
+
+[3] https://knielsen-hq.org/maria/patches.mwl116/
-=-=(Knielsen - Tue, 25 May 2010, 13:18)=-=-
High-Level Specification modified.
--- /tmp/wklog.116.old.14249 2010-05-25 13:18:34.000000000 +0000
+++ /tmp/wklog.116.new.14249 2010-05-25 13:18:34.000000000 +0000
@@ -1 +1,157 @@
+The basic idea in group commit is that multiple threads, each handling one
+transaction, prepare for commit and then queue up together waiting to do an
+fsync() on the transaction log. Then once the log is available, a single
+thread does the fsync() + other necessary book-keeping for all of the threads
+at once. After this, the single thread signals the other threads that it's
+done and they can finish up and return success (or failure) from the commit
+operation.
+
+So group commit has a parallel part, and a sequential part. We therefore need a
+facility for engines/binlog to participate in both the parallel and the
+sequential part.
+
+To do this, we add two new handlerton methods:
+
+ int (*prepare_ordered)(handlerton *hton, THD *thd, bool all);
+ void (*commit_ordered)(handlerton *hton, THD *thd, bool all);
+
+The idea is that the existing prepare() and commit() methods run in the
+parallel part of group commit, and the new prepare_ordered() and
+commit_ordered() run in the sequential part.
+
+The prepare_ordered() method is called after prepare(). The order of
+transactions that call into prepare_ordered() is guaranteed to be the same among
+all storage engines and binlog, and it is serialised so no two calls can be
+running inside the same engine at the same time.
+
+The commit_ordered() method is called before commit(), and similarly is
+guaranteed to have same transaction order in all participants, and to be
+serialised within one engine.
+
+As the prepare_ordered() and commit_ordered() calls are serialised, the idea
+is that handlers should do the minimum amount of work needed in these calls,
+deferring most of the work (eg. fsync() ...) to prepare() and commit().
+
+As a concrete example, for InnoDB the commit_ordered() method will do the
+first part of commit that fixes the commit order in the transaction log
+buffer, and the commit() method will write the log to disk and fsync()
+it. This split already exists inside the InnoDB code, running before
+respectively after releasing the prepare_commit_mutex.
+
+In addition, the XA transaction coordinator (TC_LOG) is special, since it is
+the one responsible for deciding whether to commit or rollback the
+transaction. For this we need an extra method, since this decision can be made
+only after we know that all prepare() and prepare_ordered() calls succeeded,
+and must be made to know whether to call commit_ordered()/commit() or roll back.
+
+The existing method for this is TC_LOG::log_xid(). To make group commit
+simpler and more efficient to implement in a transaction coordinator, we
+introduce a new method:
+
+ void group_log_xid(THD *first_thd);
+
+This method runs in the sequential part of group commit. It receives a list of
+transactions to perform log_xid() on, in the correct commit order. (Note that
+TC_LOG can do parallel parts of group commit in its own prepare() and commit()
+methods).
+
+This method can make it easier to implement group commit in TC_LOG, as it
+directly receives the list of transactions in the right order. Without it, it
+might need to compute such order anyway in a prepare_ordered() method, and the
+server has to create this ordered list anyway to implement the order guarantee
+for prepare_ordered() and commit_ordered().
+
+This group_log_xid() method also is more efficient, as it avoids some
+inter-thread synchronisation. Since group_log_xid() is serialised, we can run
+it together with all the commit_ordered() method calls and need only a single
+sequential code section. With the log_xid() methods, we would need first a
+sequential part for the prepare_ordered() calls, then a parallel part with
+log_xid() calls (to not lose group commit ability for log_xid()), then again
+a sequential part for the commit_ordered() method calls.
+
+The extra synchronisation is needed, as each commit_ordered() call will have
+to wait for log_xid() in one thread (if log_xid() fails then commit_ordered()
+should not be called), and also wait for commit_ordered() to finish in all
+threads handling earlier commits. In effect we will need to bounce the
+execution from one thread to the other among all participants in the group
+commit.
+
+As a consequence of the group_log_xid() optimisation, handlers must be aware
+that the commit_ordered() call can happen in another thread than the one
+running commit() (so thread local storage is not available). This should not
+be a big issue as the THD is available for storing any needed information.
+
+Since group_log_xid() runs for multiple transactions in a single thread, it
+can not do error reporting (my_error()) as that relies on thread local
+storage. Instead it sets an error code in THD::xid_error, and if there is an
+error then later another method will be called (in correct thread context) to
+actually report the error:
+
+ int xid_delayed_error(THD *thd)
+
+The three new methods prepare_ordered(), group_log_xid(), and commit_ordered()
+are optional (as is xid_delayed_error). A storage engine or transaction
+coordinator is free to not implement them if they are not needed. In this case
+there will be no order guarantee for the corresponding stage of group commit
+for that engine. For example, InnoDB needs no ordering of the prepare phase,
+so can omit implementing prepare_ordered(); TC_LOG_MMAP needs no ordering at
+all, so does not need to implement any of them.
+
+Note in particular that all existing engines (/binlog implementations if they
+exist) will work unmodified (and also without any change in group commit
+facilities or commit order guarantees).
+
+Using these new APIs, the work will be to
+
+ - In ha_commit_trans(), implement the correct semantics for the three new
+ calls.
+
+ - In XtraDB, use the new commit_ordered() call to remove the
+   prepare_commit_mutex (and resurrect group commit) without losing the
+ consistency with binlog commit order.
+
+ - In log.cc (binlog module), implement group_log_xid() to do group commit of
+ multiple transactions to the binlog with a single shared fsync() call.
+
+-----------------------------------------------------------------------
+Some possible alternative for this worklog:
+
+ - We could eliminate the group_log_xid() method for a simpler API, at the
+ cost of extra synchronisation between threads to do in-order
+ commit_ordered() method calls. This would also allow to call
+ commit_ordered() in the correct thread context.
+
+ - Alternatively, we could eliminate log_xid() and require that all
+ transaction coordinators implement group_log_xid() instead, again for some
+ moderate simplification.
+
+ - At the moment there is no plugin actually using prepare_ordered(), so it
+ could be removed from the design. But it fits in well, is efficient to
+ implement, and could be useful later (eg. for the requested feature of
+ releasing locks early in InnoDB).
+
+-----------------------------------------------------------------------
+Some possible follow-up projects after this is implemented:
+
+ - Add statistics about how efficient group commit is (#fsyncs/#commits in
+ each engine and binlog).
+
+ - Implement an XtraDB prepare_ordered() method that can release row locks
+ early (Mark Callaghan from Facebook advocates this, but need to determine
+ exactly how to do this safely).
+
+ - Implement a new crash recovery algorithm that uses the consistent commit
+ ordering to need only fsync() for the binlog. At crash recovery, any
+   missing transactions in an engine are replayed from the correct point in the
+ binlog (this point must be stored transactionally inside the engine, as
+ XtraDB already does today).
+
+ - Implement that START TRANSACTION WITH CONSISTENT SNAPSHOT 1) really gets a
+   consistent snapshot, with the same set of committed and not committed
+ transactions in all engines, 2) returns a corresponding consistent binlog
+ position. This should be easy by piggybacking on the synchronisation
+ implemented for ha_commit_trans().
+
+ - Use this in XtraBackup to get consistent binlog position without having to
+ block all updates with FLUSH TABLES WITH READ LOCK.
-=-=(Knielsen - Tue, 25 May 2010, 13:18)=-=-
High Level Description modified.
--- /tmp/wklog.116.old.14234 2010-05-25 13:18:07.000000000 +0000
+++ /tmp/wklog.116.new.14234 2010-05-25 13:18:07.000000000 +0000
@@ -21,3 +21,69 @@
http://kristiannielsen.livejournal.com/12408.html
http://kristiannielsen.livejournal.com/12553.html
+----
+
+Implementing group commit in MySQL faces some challenges from the handler
+plugin architecture:
+
+1. Because storage engine handlers have separate transaction log from the
+mysql binlog (and from each other), there are multiple fsync() calls per
+commit that need the group commit optimisation (2 per participating storage
+engine + 1 for binlog).
+
+2. The code handling commit is split in several places, in main server code
+and in storage engine code. With pluggable binlog it will be split even
+more. This requires a good abstract yet powerful API to be able to implement
+group commit simply and efficiently in plugins without the different parts
+having to rely on internals of the others.
+
+3. We want the order of commits to be the same in all engines participating in
+multiple transactions. This requirement is the reason that InnoDB currently
+breaks group commit with the infamous prepare_commit_mutex.
+
+While currently there is no server guarantee to get same commit order in
+engines and binlog (except for the InnoDB prepare_commit_mutex hack), there are
+several reasons why this could be desirable:
+
+ - InnoDB hot backup needs to be able to extract a binlog position that is
+ consistent with the hot backup to be able to provision a new slave, and
+ this is impossible without imposing at least partial consistent ordering
+ between InnoDB and binlog.
+
+ - Other backup methods could have similar needs, eg. XtraBackup or
+ `mysqldump --single-transaction`, to have consistent commit order between
+ binlog and storage engines without having to do FLUSH TABLES WITH READ LOCK
+ or similar expensive blocking operation. (other backup methods, like LVM
+ snapshot, don't need consistent commit order, as they can restore
+ out-of-order commits during crash recovery using XA).
+
+ - If we have consistent commit order, we can think about optimising commit to
+ need only one fsync (for binlog); lost commits in storage engines can then
+ be recovered from the binlog at crash recovery by re-playing against the
+ engine from a particular point in the binlog.
+
+ - With consistent commit order, we can get better semantics for START
+ TRANSACTION WITH CONSISTENT SNAPSHOT with multi-engine transactions (and we
+ could even get it to return also a matching binlog position). Currently,
+ this "CONSISTENT SNAPSHOT" can be inconsistent among multiple storage
+ engines.
+
+ - In InnoDB, the performance in the presence of hotspots can be improved
+   if we can release row locks early in the commit phase, but this
+   requires that we release them in the same order as commits in the
+   binlog to ensure consistency between master and slaves.
+
+ - There were some discussions around Galera [1] synchronous replication and
+   global transaction ID, which needed consistent commit order among
+   participating engines.
+
+ - I believe there could be other applications for guaranteed consistent
+ commit order, and that the architecture described in this worklog can
+ implement such guarantee with reasonable overhead.
+
+
+References:
+
+[1] Galera: http://www.codership.com/products/galera_replication
+
-=-=(Knielsen - Tue, 25 May 2010, 08:28)=-=-
More thoughts on and changes to the architecture. Got to something now that I am satisfied with and
that seems to be able to handle all issues.
Implement new prepare_ordered and commit_ordered handler methods and the logic in ha_commit_trans().
Implement TC_LOG::group_log_xid() method and logic in ha_commit_trans().
Implement XtraDB part, using commit_ordered() rather than prepare_commit_mutex.
Fix test suite failures.
Proof-of-concept patch series complete now.
Do initial benchmark, getting good results. With 64 threads, see 26x improvement in queries-per-sec.
Next step: write up the architecture description.
Worked 21 hours and estimate 0 hours remain (original estimate increased by 21 hours).
-=-=(Knielsen - Wed, 12 May 2010, 06:41)=-=-
Started work on a Quilt patch series, refactoring the binlog code to prepare for implementing the
group commit, and working on the design of group commit in parallel.
Found and fixed several problems in error handling when writing to binlog.
Removed redundant table map version locking.
Split binlog writing into two parts in preparations for group commit. When ready to write to the
binlog, threads enter a queue, and the first thread in the queue handles the binlog writing for
everyone. When it obtains the LOCK_log, it first loops over all threads, executing the first part of
binlog writing (the write(2) syscall essentially). It then runs the second part (fsync(2)
essentially) only once, and then wakes up the remaining threads in the queue.
Still to be done:
Finish the proof-of-concept group commit patch, by 1) implementing the prepare_fast() and
commit_fast() callbacks in handler.cc 2) move the binlog thread enqueue from log_xid() to
binlog_prepare_fast(), 3) move fast part of InnoDB commit to innobase_commit_fast(), removing the
prepare_commit_mutex().
Write up the final design in this worklog.
Evaluate the design to see if we can do better/different.
Think about possible next steps, such as releasing innodb row locks early (in
innobase_prepare_fast), and doing crash recovery by replaying transactions from the binlog (removing
the need for engine durability and 2 of 3 fsync() in commit).
Worked 28 hours and estimate 0 hours remain (original estimate increased by 28 hours).
-=-=(Serg - Mon, 26 Apr 2010, 14:10)=-=-
Observers changed: Serg
DESCRIPTION:
Currently, in order to ensure that the server can recover after a crash to a
state in which storage engines and binary log are consistent with each other,
it is necessary to use XA with durable commits for both storage engines
(innodb_flush_log_at_trx_commit=1) and binary log (sync_binlog=1).
This is _very_ expensive, since the server needs to do three fsync() operations
for every commit, as there is no working group commit when the binary log is
enabled.
The idea is to
- Implement/fix group commit to work properly with the binary log enabled.
- (Optionally) avoid the need to fsync() in the engine, and instead rely on
replaying any lost transactions from the binary log against the engine
during crash recovery.
For background see these articles:
http://kristiannielsen.livejournal.com/12254.html
http://kristiannielsen.livejournal.com/12408.html
http://kristiannielsen.livejournal.com/12553.html
----
Implementing group commit in MySQL faces some challenges from the handler
plugin architecture:
1. Because storage engine handlers have separate transaction log from the
mysql binlog (and from each other), there are multiple fsync() calls per
commit that need the group commit optimisation (2 per participating storage
engine + 1 for binlog).
2. The code handling commit is split in several places, in main server code
and in storage engine code. With pluggable binlog it will be split even
more. This requires a good abstract yet powerful API to be able to implement
group commit simply and efficiently in plugins without the different parts
having to rely on internals of the others.
3. We want the order of commits to be the same in all engines participating in
multiple transactions. This requirement is the reason that InnoDB currently
breaks group commit with the infamous prepare_commit_mutex.
While currently there is no server guarantee to get same commit order in
engines and binlog (except for the InnoDB prepare_commit_mutex hack), there are
several reasons why this could be desirable:
- InnoDB hot backup needs to be able to extract a binlog position that is
consistent with the hot backup to be able to provision a new slave, and
this is impossible without imposing at least partial consistent ordering
between InnoDB and binlog.
- Other backup methods could have similar needs, eg. XtraBackup or
`mysqldump --single-transaction`, to have consistent commit order between
binlog and storage engines without having to do FLUSH TABLES WITH READ LOCK
or similar expensive blocking operation. (other backup methods, like LVM
snapshot, don't need consistent commit order, as they can restore
out-of-order commits during crash recovery using XA).
- If we have consistent commit order, we can think about optimising commit to
need only one fsync (for binlog); lost commits in storage engines can then
be recovered from the binlog at crash recovery by re-playing against the
engine from a particular point in the binlog.
- With consistent commit order, we can get better semantics for START
TRANSACTION WITH CONSISTENT SNAPSHOT with multi-engine transactions (and we
could even get it to return also a matching binlog position). Currently,
this "CONSISTENT SNAPSHOT" can be inconsistent among multiple storage
engines.
- In InnoDB, the performance in the presence of hotspots can be improved if
we can release row locks early in the commit phase, but this requires that
we release them in the same order as commits in the binlog to ensure
consistency between master and slaves.
- There were some discussions around Galera [1] synchronous replication and
global transaction ID, which needed consistent commit order among
participating engines.
- I believe there could be other applications for guaranteed consistent
commit order, and that the architecture described in this worklog can
implement such guarantee with reasonable overhead.
References:
[1] Galera: http://www.codership.com/products/galera_replication
HIGH-LEVEL SPECIFICATION:
The basic idea in group commit is that multiple threads, each handling one
transaction, prepare for commit and then queue up together waiting to do an
fsync() on the transaction log. Then once the log is available, a single
thread does the fsync() + other necessary book-keeping for all of the threads
at once. After this, the single thread signals the other threads that it's
done and they can finish up and return success (or failure) from the commit
operation.
So group commit has a parallel part, and a sequential part. We therefore need a
facility for engines/binlog to participate in both the parallel and the
sequential part.
To do this, we add two new handlerton methods:
int (*prepare_ordered)(handlerton *hton, THD *thd, bool all);
void (*commit_ordered)(handlerton *hton, THD *thd, bool all);
The idea is that the existing prepare() and commit() methods run in the
parallel part of group commit, and the new prepare_ordered() and
commit_ordered() run in the sequential part.
The prepare_ordered() method is called after prepare(). The order of
transactions that call into prepare_ordered() is guaranteed to be the same among
all storage engines and binlog, and it is serialised so no two calls can be
running inside the same engine at the same time.
The commit_ordered() method is called before commit(), and similarly is
guaranteed to have same transaction order in all participants, and to be
serialised within one engine.
As the prepare_ordered() and commit_ordered() calls are serialised, the idea
is that handlers should do the minimum amount of work needed in these calls,
deferring most of the work (eg. fsync() ...) to prepare() and commit().
As a concrete example, for InnoDB the commit_ordered() method will do the
first part of commit that fixes the commit order in the transaction log
buffer, and the commit() method will write the log to disk and fsync()
it. This split already exists inside the InnoDB code, running before
respectively after releasing the prepare_commit_mutex.
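As an illustration, an engine-side split along these lines might look like
the sketch below, using the handlerton method signatures quoted above. All
example_* helper names are hypothetical stand-ins for engine internals, not
actual InnoDB/XtraDB code.

  // Sequential part: runs serialised, in binlog commit order, and
  // possibly in a different thread than the transaction's own.
  static void example_commit_ordered(handlerton *hton, THD *thd, bool all)
  {
    // Only the quick step: fix the transaction's place in the
    // engine's log buffer (hypothetical helper).
    example_fix_commit_order_in_log_buffer(thd, all);
  }

  // Parallel part: the expensive log write + fsync() can proceed
  // concurrently with other transactions' commits.
  static int example_commit(handlerton *hton, THD *thd, bool all)
  {
    return example_write_and_sync_log(thd, all);  // hypothetical helper
  }

  // At plugin initialisation (other handlerton members omitted):
  //   hton->commit_ordered= example_commit_ordered;
  //   hton->commit= example_commit;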
In addition, the XA transaction coordinator (TC_LOG) is special, since it is
the one responsible for deciding whether to commit or rollback the
transaction. For this we need an extra method, since this decision can be made
only after we know that all prepare() and prepare_ordered() calls succeeded,
and must be made to know whether to call commit_ordered()/commit() or roll back.
The existing method for this is TC_LOG::log_xid(). To make group commit
simpler and more efficient to implement in a transaction coordinator, we
introduce a new method:
void group_log_xid(THD *first_thd);
This method runs in the sequential part of group commit. It receives a list of
transactions to perform log_xid() on, in the correct commit order. (Note that
TC_LOG can do parallel parts of group commit in its own prepare() and commit()
methods).
This method can make it easier to implement group commit in TC_LOG, as it
directly receives the list of transactions in the right order. Without it, it
might need to compute such order anyway in a prepare_ordered() method, and the
server has to create this ordered list anyway to implement the order guarantee
for prepare_ordered() and commit_ordered().
This group_log_xid() method also is more efficient, as it avoids some
inter-thread synchronisation. Since group_log_xid() is serialised, we can run
it together with all the commit_ordered() method calls and need only a single
sequential code section. With the log_xid() methods, we would need first a
sequential part for the prepare_ordered() calls, then a parallel part with
log_xid() calls (to not lose group commit ability for log_xid()), then again
a sequential part for the commit_ordered() method calls.
The extra synchronisation is needed, as each commit_ordered() call will have
to wait for log_xid() in one thread (if log_xid() fails then commit_ordered()
should not be called), and also wait for commit_ordered() to finish in all
threads handling earlier commits. In effect we will need to bounce the
execution from one thread to the other among all participants in the group
commit.
As a consequence of the group_log_xid() optimisation, handlers must be aware
that the commit_ordered() call can happen in another thread than the one
running commit() (so thread local storage is not available). This should not
be a big issue as the THD is available for storing any needed information.
Since group_log_xid() runs for multiple transactions in a single thread, it
can not do error reporting (my_error()) as that relies on thread local
storage. Instead it sets an error code in THD::xid_error, and if there is an
error then later another method will be called (in correct thread context) to
actually report the error:
int xid_delayed_error(THD *thd)
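A minimal sketch of this delayed-error pattern, in terms of the proposed
THD::xid_error field; log_one_xid() and sync_log_file() are hypothetical
helpers, and the my_error() call is only illustrative:

  void example_group_log_xid(THD *first_thd)
  {
    for (THD *thd= first_thd; thd; thd= thd->next_commit_ordered)
    {
      // Wrong thread for my_error(); just record the outcome in the THD.
      thd->xid_error= log_one_xid(thd);           // hypothetical helper
    }
    sync_log_file();                              // one fsync() per group
  }

  // Called later, from the transaction's own thread context:
  int xid_delayed_error(THD *thd)
  {
    my_error(ER_ERROR_DURING_COMMIT, MYF(0), thd->xid_error);
    return 1;
  }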
The three new methods prepare_ordered(), group_log_xid(), and commit_ordered()
are optional (as is xid_delayed_error). A storage engine or transaction
coordinator is free to not implement them if they are not needed. In this case
there will be no order guarantee for the corresponding stage of group commit
for that engine. For example, InnoDB needs no ordering of the prepare phase,
so can omit implementing prepare_ordered(); TC_LOG_MMAP needs no ordering at
all, so does not need to implement any of them.
Note in particular that all existing engines (/binlog implementations if they
exist) will work unmodified (and also without any change in group commit
facilities or commit order guarantees).
Using these new APIs, the work will be to
- In ha_commit_trans(), implement the correct semantics for the three new
calls.
- In XtraDB, use the new commit_ordered() call to remove the
prepare_commit_mutex (and resurrect group commit) without losing the
consistency with binlog commit order.
- In log.cc (binlog module), implement group_log_xid() to do group commit of
multiple transactions to the binlog with a single shared fsync() call.
-----------------------------------------------------------------------
Some possible alternative for this worklog:
- We could eliminate the group_log_xid() method for a simpler API, at the
cost of extra synchronisation between threads to do in-order
commit_ordered() method calls. This would also allow to call
commit_ordered() in the correct thread context.
- Alternatively, we could eliminate log_xid() and require that all
transaction coordinators implement group_log_xid() instead, again for some
moderate simplification.
- At the moment there is no plugin actually using prepare_ordered(), so it
could be removed from the design. But it fits in well, is efficient to
implement, and could be useful later (eg. for the requested feature of
releasing locks early in InnoDB).
-----------------------------------------------------------------------
Some possible follow-up projects after this is implemented:
- Add statistics about how efficient group commit is (#fsyncs/#commits in
each engine and binlog).
- Implement an XtraDB prepare_ordered() method that can release row locks
early (Mark Callaghan from Facebook advocates this, but need to determine
exactly how to do this safely).
- Implement a new crash recovery algorithm that uses the consistent commit
ordering to need only fsync() for the binlog. At crash recovery, any
missing transactions in an engine are replayed from the correct point in the
binlog (this point must be stored transactionally inside the engine, as
XtraDB already does today).
- Implement that START TRANSACTION WITH CONSISTENT SNAPSHOT 1) really gets a
consistent snapshot, with the same set of committed and not committed
transactions in all engines, 2) returns a corresponding consistent binlog
position. This should be easy by piggybacking on the synchronisation
implemented for ha_commit_trans().
- Use this in XtraBackup to get consistent binlog position without having to
block all updates with FLUSH TABLES WITH READ LOCK.
LOW-LEVEL DESIGN:
1. Changes for ha_commit_trans()
The guts of the commit code are in the function ha_commit_trans() (and in
commit_one_phase() which is called from it). This must be extended to use the
new prepare_ordered(), group_log_xid(), and commit_ordered() calls.
1.1 Atomic queue of committing transactions
To keep the right commit order among participants, we put transactions into a
queue. The operations on the queue are non-locking:
- Insert THD at the head of the queue, and return old queue.
THD *enqueue_atomic(THD *thd)
- Fetch (and delete) the whole queue.
THD *atomic_grab_reverse_queue()
These are simple to implement with atomic compare-and-set. Note that there is
no ABA problem [2], as we never delete individual elements from the queue; we
grab the whole queue and replace it with NULL.
A transaction enters the queue when it does prepare_ordered(). This way, the
scheduling order for prepare_ordered() calls is what determines the sequence
in the queue and effectively the commit order.
The queue is grabbed by the code doing group_log_xid() and commit_ordered()
calls. The queue is passed directly to group_log_xid(), and afterwards
iterated to do individual commit_ordered() calls.
Using a lock-free queue allows prepare_ordered() (for one transaction) to run
in parallel with commit_ordered (in another transaction), increasing potential
parallelism.
The queue is simply a linked list of THD objects, linked through a
THD::next_commit_ordered field. Since we add at the head of the queue, the
list is actually in reverse order, so must be reversed when we grab and delete
it.
The reason that enqueue_atomic() returns the old queue is so that we can check
if an insert goes to the head of the queue. The thread at the head of the
queue will do the sequential part of group commit for everyone.
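A sketch of the two queue operations, written here with C++11 std::atomic
for clarity (the server would use its own atomic primitives; the minimal THD
stand-in carries only the next_commit_ordered link):

  #include <atomic>

  struct THD { THD *next_commit_ordered; };

  static std::atomic<THD*> group_commit_queue{nullptr};

  // Insert thd at the head and return the old head; a NULL old head
  // means this thread became the group commit leader.
  THD *enqueue_atomic(THD *thd)
  {
    THD *old= group_commit_queue.load();
    do
      thd->next_commit_ordered= old;
    while (!group_commit_queue.compare_exchange_weak(old, thd));
    return old;
  }

  // Grab the whole queue (replacing it with NULL) and reverse it into
  // commit order. No ABA problem: elements are never removed one by one.
  THD *atomic_grab_reverse_queue()
  {
    THD *q= group_commit_queue.exchange(nullptr);
    THD *reversed= nullptr;
    while (q)
    {
      THD *next= q->next_commit_ordered;
      q->next_commit_ordered= reversed;
      reversed= q;
      q= next;
    }
    return reversed;
  }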
1.2 Locks
1.2.1 Global LOCK_prepare_ordered
This lock is taken to serialise calls to prepare_ordered(). Note that
effectively, the commit order is decided by the order in which threads obtain
this lock.
1.2.2 Global LOCK_group_commit and COND_group_commit
This lock is used to protect the serial part of group commit. It is taken
around the code where we grab the queue, call group_log_xid() on the queue,
and call commit_ordered() on each element of the queue, to make sure they
happen serialised and in consistent order. It also protects the variable
group_commit_queue_busy, which is used when not using group_log_xid() to delay
running over a new queue until the first queue is completely done.
1.2.3 Global LOCK_commit_ordered
This lock is taken around calls to commit_ordered(), to ensure they happen
serialised.
1.2.4 Per-thread thd->LOCK_commit_ordered and thd->COND_commit_ordered
This lock protects the thd->group_commit_ready variable, as well as the
condition variable used to wake up threads after log_xid() and
commit_ordered() finishes.
1.2.5 Global LOCK_group_commit_queue
This is only used on platforms with no native compare-and-set operations, to
make the queue operations atomic.
1.3 Commit algorithm.
This is the basic algorithm, simplified by
- omitting some error handling
- omitting looping over all handlers when invoking handler methods
- omitting some possible optimisations when not all calls are needed (see
next section)
- omitting the case where no group_log_xid() is used (see below).
---- BEGIN ALGORITHM ----
ht->prepare()
// Call prepare_ordered() and enqueue in correct commit order
lock(LOCK_prepare_ordered)
ht->prepare_ordered()
old_queue= enqueue_atomic(thd)
thd->group_commit_ready= FALSE
is_group_commit_leader= (old_queue == NULL)
unlock(LOCK_prepare_ordered)
if (is_group_commit_leader)
// The first in queue handles group commit for everyone
lock(LOCK_group_commit)
// Wait while queue is busy, see below for when this occurs
while (group_commit_queue_busy)
cond_wait(COND_group_commit)
// Grab and reverse the queue to get correct order of transactions
queue= atomic_grab_reverse_queue()
// This call will set individual error codes in thd->xid_error
// It also sets the cookie for unlog() in thd->xid_cookie
group_log_xid(queue)
lock(LOCK_commit_ordered)
for (other IN queue)
if (!other->xid_error)
ht->commit_ordered()
unlock(LOCK_commit_ordered)
unlock(LOCK_group_commit)
// Now we are done, so wake up all the others.
for (other IN TAIL(queue))
lock(other->LOCK_commit_ordered)
other->group_commit_ready= TRUE
cond_signal(other->COND_commit_ordered)
unlock(other->LOCK_commit_ordered)
else
// If not the leader, just wait until leader did the work for us.
lock(thd->LOCK_commit_ordered)
while (!thd->group_commit_ready)
cond_wait(thd->LOCK_commit_ordered, thd->COND_commit_ordered)
unlock(thd->LOCK_commit_ordered)
// Finally do any error reporting now that we're back in own thread.
if (thd->xid_error)
xid_delayed_error(thd)
else
ht->commit(thd)
unlog(thd->xid_cookie, thd->xid)
---- END ALGORITHM ----
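The leader/follower handshake used above can be rendered directly with POSIX
primitives. A stand-alone sketch (field names as in this worklog; in the
server these would live in THD and use the mysys wrappers):

  #include <pthread.h>

  struct THD
  {
    pthread_mutex_t LOCK_commit_ordered;
    pthread_cond_t  COND_commit_ordered;
    bool group_commit_ready;
    THD *next_commit_ordered;
  };

  // Leader: wake one queued transaction after doing its work.
  void wake_follower(THD *other)
  {
    pthread_mutex_lock(&other->LOCK_commit_ordered);
    other->group_commit_ready= true;
    pthread_cond_signal(&other->COND_commit_ordered);
    pthread_mutex_unlock(&other->LOCK_commit_ordered);
  }

  // Follower: block until the leader has committed on our behalf.
  void wait_for_leader(THD *thd)
  {
    pthread_mutex_lock(&thd->LOCK_commit_ordered);
    while (!thd->group_commit_ready)
      pthread_cond_wait(&thd->COND_commit_ordered,
                        &thd->LOCK_commit_ordered);
    pthread_mutex_unlock(&thd->LOCK_commit_ordered);
  }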
If the transaction coordinator does not support group_log_xid(), we have to do
things differently. In this case after the serialisation point at
prepare_ordered(), we have to parallelise again when running log_xid()
(otherwise we would lose group commit). But then when log_xid() is done, we
have to serialise again to check for any error and call commit_ordered() in
correct sequence for any transaction where log_xid() did not return error.
The central part of the algorithm in this case (when using log_xid()) is:
---- BEGIN ALGORITHM ----
cookie= log_xid(thd)
error= (cookie == 0)
if (is_group_commit_leader)
// The first to enqueue grabs the queue and runs first.
// But we must wait until a previous queue run is fully done.
lock(LOCK_group_commit)
while (group_commit_queue_busy)
cond_wait(COND_group_commit)
queue= atomic_grab_reverse_queue()
// The queue will be busy until last thread in it is done.
group_commit_queue_busy= TRUE
unlock(LOCK_group_commit)
else
// Not first in queue -> wait for previous one to wake us up.
lock(thd->LOCK_commit_ordered)
while (!thd->group_commit_ready)
cond_wait(thd->LOCK_commit_ordered, thd->COND_commit_ordered)
unlock(thd->LOCK_commit_ordered)
if (!error) // Only if log_xid() was successful
lock(LOCK_commit_ordered)
ht->commit_ordered()
unlock(LOCK_commit_ordered)
// Wake up the next thread, and release queue in last.
next= thd->next_commit_ordered
if (next)
lock(next->LOCK_commit_ordered)
next->group_commit_ready= TRUE
cond_signal(next->COND_commit_ordered)
unlock(next->LOCK_commit_ordered)
else
lock(LOCK_group_commit)
group_commit_queue_busy= FALSE
unlock(LOCK_group_commit)
---- END ALGORITHM ----
There are a number of locks taken in the algorithm, but in the group_log_xid()
case most of them should be uncontended most of the time. The
LOCK_group_commit of course will be contended, as new threads queue up waiting
for the previous group commit (and binlog fsync()) to finish so they can do
the next group commit. This is the whole point of implementing group commit.
The LOCK_prepare_ordered and LOCK_commit_ordered mutexes should not be much
contended as long as handlers follow the intention of having the corresponding
handler calls execute quickly.
The per-thread LOCK_commit_ordered mutexes should not be contended; they are
only used to wake up a sleeping thread.
1.4 Optimisations when not using all three new calls
The prepare_ordered(), group_log_xid(), and commit_ordered() methods are
optional, and if not implemented by a particular handler/transaction
coordinator, we can optimise the algorithm to take advantage of not having to
keep ordering for the missing parts.
If there is no prepare_ordered(), then we need not take the
LOCK_prepare_ordered mutex.
If there is no commit_ordered(), then we need not take the LOCK_commit_ordered
mutex.
If there is no group_log_xid(), then we only need the queue to ensure same
ordering of transactions for commit_ordered() as for prepare_ordered(). Thus,
if either of these (or both) are also not present, we do not need to use the
queue at all.
2. Binlog code changes (log.cc)
The bulk of the work needed for the binary log is to extend the code to allow
group commit to the log. Unlike InnoDB/XtraDB, there is no existing support
inside the binlog code for group commit.
The existing code runs most of the write + fsync to the binary log under the
global LOCK_log mutex, preventing any group commit.
To enable group commit, this code must be split into two parts:
- one part that runs per transaction, re-writing the embedded event positions
for the correct offset, and writing this into the in-memory log cache.
- another part that writes a set of transactions to the disk, and runs
fsync().
Then in group_log_xid(), we can run the first part in a loop over all the
transactions in the passed-in queue, and run the second part only once.
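In outline, the resulting binlog group_log_xid() might look like the sketch
below; both helpers are hypothetical names for the two halves of the existing
write path, not actual log.cc functions:

  struct THD { THD *next_commit_ordered; };

  void binlog_write_transaction(THD *thd);  // fast part, per transaction
  void binlog_flush_and_sync();             // slow part, once per group

  void binlog_group_log_xid(THD *first_thd)
  {
    // Fast part in a loop: fix up event positions and copy each
    // transaction's log cache into the binlog.
    for (THD *thd= first_thd; thd; thd= thd->next_commit_ordered)
      binlog_write_transaction(thd);

    // Slow part once: a single write() + fsync() makes the whole
    // group durable together.
    binlog_flush_and_sync();
  }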
The binlog code also has other code paths that write into the binlog,
eg. non-transactional statements. These have to be adapted also to work with
the new code.
In order to get some group commit facility for these also, we change that part
of the code in a similar way to ha_commit_trans. We keep another,
binlog-internal queue of such non-transactional binlog writes, and such writes
queue up here before sleeping on the LOCK_log mutex. Once a thread obtains the
LOCK_log, it loops over the queue for the fast part, and does the slow part
once, then finally wakes up the others in the queue.
In the transactional case in group_log_xid(), before we run the passed-in
queue, we add any members found in the binlog-internal queue. This allows
these non-transactional writes to share the group commit.
However, in the case where it is a non-transactional write that gets the
LOCK_log, the transactions from the ha_commit_trans() queue will not be able
to take part (they will have to wait for their turn to do another fsync). It
seems difficult to cleanly let the binlog code grab the queue from out of the
ha_commit_trans() algorithm. I think group commit is mostly useful in
transactional workloads anyway (non-transactional engines will lose data
anyway in case of a crash, so why fsync() after each transaction?)
3. XtraDB changes (ha_innodb.cc)
The changes needed in XtraDB are comparatively simple, as XtraDB already
implements group commit, it just needs to be enabled with the new
commit_ordered() call.
The existing commit() method already is logically in two parts. The first part
runs under the prepare_commit_mutex() and must be run in same order as binlog
commit. This part needs to be moved to commit_ordered(). The second part runs
after releasing prepare_commit_mutex and does transaction log write+fsync; it
can remain.
Then the prepare_commit_mutex is removed (and the enable_unsafe_group_commit
XtraDB option to disable it).
There are two asserts that check that the thread running the first part of
XtraDB commit is the same as the thread running the other operations for the
transaction. These have to be removed (as commit_ordered() can run in a
different thread). Also, error reporting with sql_print_error() has to be
delayed until commit() time.
4. Proof-of-concept implementation
There is a proof-of-concept implementation of this architecture, in the form
of a quilt patch series [3].
A quick benchmark was done, with sync_binlog=1 and
innodb_flush_log_at_trx_commit=1. 64 parallel threads doing single-row
transactions against one table.
Without the patch, we get only 25 queries per second.
With the patch, we get 650 queries per second.
5. Open issues/tasks
5.1 XA / other prepare() and commit() call sites.
Check that user-level XA is handled correctly and works, and that it is
covered sufficiently with tests. Also check that any other calls of
ha->prepare() and ha->commit() outside of ha_commit_trans() are handled
correctly.
5.2 Testing
This worklog needs additions to the test suite, including error inserts to
check error handling, and synchronisation points to check thread parallelism
correctness.
6. Alternative implementations
- The binlog code maintains its own extra atomic transaction queue so that
non-transactional commits can participate in group commit together with
transactional ones. Alternatively, we could ignore this issue and just give
up on group commit for non-transactional statements, in exchange for some
code simplification.
- The binlog code has two ways to prepare end_event and similar, one that
uses stack allocation, and another, used when stack allocation is not
possible, that uses thd->mem_root. Probably the overhead of thd->mem_root is
so small that it would make sense to use the same code for both cases.
- Instead of adding extra fields to THD, we could allocate a separate
structure on the thd->mem_root with the required extra fields (including
the THD pointer). This would seem to require initialising mutexes at every
commit, though.
- It would probably be a good idea to implement TC_LOG_MMAP::group_log_xid()
(should not be hard).
-----------------------------------------------------------------------
References:
[2] https://secure.wikimedia.org/wikipedia/en/wiki/ABA_problem
[3] https://knielsen-hq.org/maria/patches.mwl116/
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v3.5.9)

[Maria-developers] bzr commit into file:///home/tsk/mprog/src/5.3-mwl89/ branch (timour:2793)
by timour@askmonty.org 01 Jun '10
#At file:///home/tsk/mprog/src/5.3-mwl89/ based on revid:timour@askmonty.org-20100527131347-unr62oupctbp912x
2793 timour@askmonty.org 2010-06-01
MWL#89: Cost-based choice between Materialization and IN->EXISTS transformation
Phase 2: Changed the code-generation for subquery materialization to be
performed in runtime memory for each (re)execution, instead of in
statement memory (once per prepared statement).
- Item_in_subselect::setup_engine() no longer wraps materialization related
objects to be created in statement memory.
- Merged subselect_hash_sj_engine::init_permanent and
subselect_hash_sj_engine::init_runtime into subselect_hash_sj_engine::init,
which is called for each (re)execution.
- Fixed deletion of the temp table accordingly.
modified:
sql/item_subselect.cc
sql/item_subselect.h
=== modified file 'sql/item_subselect.cc'
--- a/sql/item_subselect.cc 2010-05-27 13:13:47 +0000
+++ b/sql/item_subselect.cc 2010-06-01 11:57:35 +0000
@@ -148,6 +148,7 @@ void Item_in_subselect::cleanup()
Item_subselect::~Item_subselect()
{
delete engine;
+ engine= NULL;
}
Item_subselect::trans_res
@@ -2090,82 +2091,62 @@ void Item_in_subselect::update_used_tabl
bool Item_in_subselect::setup_engine()
{
- subselect_hash_sj_engine *new_engine= NULL;
- bool res= FALSE;
+ subselect_hash_sj_engine *mat_engine= NULL;
+ subselect_single_select_engine *select_engine;
DBUG_ENTER("Item_in_subselect::setup_engine");
+ /*
+ The select (IN=>EXISTS) engine is pre-created already at parse time, and
+    is stored in statement memory (preserved across PS executions).
+ */
+ DBUG_ASSERT(engine->engine_type() == subselect_engine::SINGLE_SELECT_ENGINE);
+ select_engine= (subselect_single_select_engine*) engine;
- if (engine->engine_type() == subselect_engine::SINGLE_SELECT_ENGINE)
- {
- /* Create/initialize objects in permanent memory. */
- subselect_single_select_engine *old_engine;
- Query_arena *arena= thd->stmt_arena, backup;
-
- old_engine= (subselect_single_select_engine*) engine;
-
- if (arena->is_conventional())
- arena= 0;
- else
- thd->set_n_backup_active_arena(arena, &backup);
-
- if (!(new_engine= new subselect_hash_sj_engine(thd, this,
- old_engine)) ||
- new_engine->init_permanent(&old_engine->join->fields_list))
- {
- Item_subselect::trans_res trans_res;
- /*
- If for some reason we cannot use materialization for this IN predicate,
- delete all materialization-related objects, and apply the IN=>EXISTS
- transformation.
- */
- delete new_engine;
- new_engine= NULL;
- exec_method= NOT_TRANSFORMED;
- if (left_expr->cols() == 1)
- trans_res= single_value_in_to_exists_transformer(old_engine->join,
- &eq_creator);
- else
- trans_res= row_value_in_to_exists_transformer(old_engine->join);
- /*
- The IN=>EXISTS transformation above injects new predicates into the
- WHERE and HAVING clauses. Since the subquery was already optimized,
- below we force its reoptimization with the new injected conditions
- by the first call to subselect_single_select_engine::exec().
- This is the only case of lazy subquery optimization in the server.
- */
- DBUG_ASSERT(old_engine->join->optimized);
- old_engine->join->optimized= false;
- res= (trans_res != Item_subselect::RES_OK);
- }
- if (new_engine)
- engine= new_engine;
-
- if (arena)
- thd->restore_active_arena(arena, &backup);
- }
- else
- {
- DBUG_ASSERT(engine->engine_type() == subselect_engine::HASH_SJ_ENGINE);
- new_engine= (subselect_hash_sj_engine*) engine;
- }
+ /* Create/initialize execution objects. */
+ if (!(mat_engine= new subselect_hash_sj_engine(thd, this, select_engine)))
+ DBUG_RETURN(TRUE);
- /* Initilizations done in runtime memory, repeated for each execution. */
- if (new_engine)
+ if (mat_engine->init(&select_engine->join->fields_list))
{
+ Item_subselect::trans_res trans_res;
+ /*
+ If for some reason we cannot use materialization for this IN predicate,
+ delete all materialization-related objects, and apply the IN=>EXISTS
+ transformation.
+ */
+ delete mat_engine;
+ mat_engine= NULL;
+ exec_method= NOT_TRANSFORMED;
+
+ if (left_expr->cols() == 1)
+ trans_res= single_value_in_to_exists_transformer(select_engine->join,
+ &eq_creator);
+ else
+ trans_res= row_value_in_to_exists_transformer(select_engine->join);
/*
- Reset the LIMIT 1 set in Item_exists_subselect::fix_length_and_dec.
- TODO:
- Currently we set the subquery LIMIT to infinity, and this is correct
- because we forbid at parse time LIMIT inside IN subqueries (see
- Item_in_subselect::test_limit). However, once we allow this, here
- we should set the correct limit if given in the query.
+ The IN=>EXISTS transformation above injects new predicates into the
+ WHERE and HAVING clauses. Since the subquery was already optimized,
+ below we force its reoptimization with the new injected conditions
+ by the first call to subselect_single_select_engine::exec().
+ This is the only case of lazy subquery optimization in the server.
*/
- unit->global_parameters->select_limit= NULL;
- if ((res= new_engine->init_runtime()))
- DBUG_RETURN(res);
+ DBUG_ASSERT(select_engine->join->optimized);
+ select_engine->join->optimized= false;
+ DBUG_RETURN(trans_res != Item_subselect::RES_OK);
}
- DBUG_RETURN(res);
+ /*
+ Reset the "LIMIT 1" set in Item_exists_subselect::fix_length_and_dec.
+ TODO:
+ Currently we set the subquery LIMIT to infinity, and this is correct
+ because we forbid at parse time LIMIT inside IN subqueries (see
+ Item_in_subselect::test_limit). However, once we allow this, here
+ we should set the correct limit if given in the query.
+ */
+ unit->global_parameters->select_limit= NULL;
+
+ engine= mat_engine;
+ DBUG_RETURN(FALSE);
}
@@ -3680,14 +3661,14 @@ bitmap_init_memroot(MY_BITMAP *map, uint
@retval FALSE otherwise
*/
-bool subselect_hash_sj_engine::init_permanent(List<Item> *tmp_columns)
+bool subselect_hash_sj_engine::init(List<Item> *tmp_columns)
{
select_union *result_sink;
/* Options to create_tmp_table. */
ulonglong tmp_create_options= thd->options | TMP_TABLE_ALL_COLUMNS;
/* | TMP_TABLE_FORCE_MYISAM; TIMOUR: force MYISAM */
- DBUG_ENTER("subselect_hash_sj_engine::init_permanent");
+ DBUG_ENTER("subselect_hash_sj_engine::init");
if (bitmap_init_memroot(&non_null_key_parts, tmp_columns->elements,
thd->mem_root) ||
@@ -3762,6 +3743,17 @@ bool subselect_hash_sj_engine::init_perm
!(lookup_engine= make_unique_engine()))
DBUG_RETURN(TRUE);
+ /*
+ Repeat name resolution for 'cond' since cond is not part of any
+ clause of the query, and it is not 'fixed' during JOIN::prepare.
+ */
+ if (semi_join_conds && !semi_join_conds->fixed &&
+ semi_join_conds->fix_fields(thd, (Item**)&semi_join_conds))
+ DBUG_RETURN(TRUE);
+ /* Let our engine reuse this query plan for materialization. */
+ materialize_join= materialize_engine->join;
+ materialize_join->change_result(result);
+
DBUG_RETURN(FALSE);
}
@@ -3907,30 +3899,6 @@ subselect_hash_sj_engine::make_unique_en
}
-/**
- Initialize members of the engine that need to be re-initilized at each
- execution.
-
- @retval TRUE if a memory allocation error occurred
- @retval FALSE if success
-*/
-
-bool subselect_hash_sj_engine::init_runtime()
-{
- /*
- Repeat name resolution for 'cond' since cond is not part of any
- clause of the query, and it is not 'fixed' during JOIN::prepare.
- */
- if (semi_join_conds && !semi_join_conds->fixed &&
- semi_join_conds->fix_fields(thd, (Item**)&semi_join_conds))
- return TRUE;
- /* Let our engine reuse this query plan for materialization. */
- materialize_join= materialize_engine->join;
- materialize_join->change_result(result);
- return FALSE;
-}
-
-
subselect_hash_sj_engine::~subselect_hash_sj_engine()
{
delete lookup_engine;
@@ -3967,6 +3935,13 @@ void subselect_hash_sj_engine::cleanup()
count_null_only_columns= 0;
strategy= UNDEFINED;
materialize_engine->cleanup();
+ /*
+ Restore the original Item_in_subselect engine. This engine is created once
+ at parse time and stored across executions, while all other materialization
+ related engines are created and chosen for each execution.
+ */
+ ((Item_in_subselect *) item)->engine= materialize_engine;
+
if (lookup_engine_type == TABLE_SCAN_ENGINE ||
lookup_engine_type == ROWID_MERGE_ENGINE)
{
@@ -3983,6 +3958,9 @@ void subselect_hash_sj_engine::cleanup()
DBUG_ASSERT(lookup_engine->engine_type() == UNIQUESUBQUERY_ENGINE);
lookup_engine->cleanup();
result->cleanup(); /* Resets the temp table as well. */
+ DBUG_ASSERT(tmp_table);
+ free_tmp_table(thd, tmp_table);
+ tmp_table= NULL;
}
=== modified file 'sql/item_subselect.h'
--- a/sql/item_subselect.h 2010-05-27 13:13:47 +0000
+++ b/sql/item_subselect.h 2010-06-01 11:57:35 +0000
@@ -802,8 +802,7 @@ public:
}
~subselect_hash_sj_engine();
- bool init_permanent(List<Item> *tmp_columns);
- bool init_runtime();
+ bool init(List<Item> *tmp_columns);
void cleanup();
int prepare();
int exec();
[Maria-developers] Rev 2789: Subquery cache for pre-review (MWL#66) in file:///home/bell/maria/bzr/work-maria-5.3-scache2/
by sanja@askmonty.org 31 May '10
At file:///home/bell/maria/bzr/work-maria-5.3-scache2/
------------------------------------------------------------
revno: 2789
revision-id: sanja@askmonty.org-20100531212554-oal32d5v360l6cul
parent: sergii@pisem.net-20100510134608-oyi2vznyghgcrt0x
committer: sanja@askmonty.org
branch nick: work-maria-5.3-scache2
timestamp: Tue 2010-06-01 00:25:54 +0300
message:
Subquery cache for pre-review (MWL#66)
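
Before wading into the diff, a minimal standalone sketch of the caching idea may help: the patch caches a subquery's result keyed on the values of the outer references it depends on (the new Item_subselect::depends_on list), so re-evaluations with already-seen parameter values become cache hits. The real implementation (Subquery_cache_tmptable) keys an internal temporary table and skips subqueries marked UNCACHEABLE_RAND or UNCACHEABLE_SIDEEFFECT; everything below (the map, the key type, main()) is an illustrative assumption, not the patch's API.

// Illustrative sketch only: a subquery cache keyed on a tuple of
// outer-reference values, with hit/miss counters like the new
// Subquery_cache_hit / Subquery_cache_miss status variables.
#include <cstdio>
#include <map>
#include <optional>
#include <vector>

using Key = std::vector<long>;       // outer-reference values
using Val = std::optional<bool>;     // TRUE/FALSE/NULL, like Item_bool_cache

struct SubqueryCache
{
  std::map<Key, Val> table;
  unsigned hits= 0, misses= 0;

  // check_value(): a HIT returns the cached result without re-execution
  std::optional<Val> check_value(const Key &k)
  {
    auto it= table.find(k);
    if (it == table.end()) { ++misses; return std::nullopt; }
    ++hits;
    return it->second;
  }
  // put_value(): store the result computed by actually running the subquery
  void put_value(const Key &k, Val v) { table[k]= v; }
};

int main()
{
  SubqueryCache cache;
  // b values of t1 in the test below: 4 distinct values among 10 rows
  long b[]= {2, 4, 2, 4, 4, 5, 5, 6, 6, 5};
  for (long v : b)
  {
    if (!cache.check_value({v}))
      cache.put_value({v}, v % 2 == 0);  // pretend to execute the subquery
  }
  std::printf("hit %u miss %u\n", cache.hits, cache.misses); // hit 6 miss 4
}

Run against the b column of t1 in the test below (four distinct values in ten rows), this prints "hit 6 miss 4" — exactly the Subquery_cache_hit/Subquery_cache_miss counters the new test expects after the first SELECT.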
=== modified file 'libmysqld/Makefile.am'
--- a/libmysqld/Makefile.am 2010-03-20 12:01:47 +0000
+++ b/libmysqld/Makefile.am 2010-05-31 21:25:54 +0000
@@ -80,7 +80,8 @@
sql_tablespace.cc \
rpl_injector.cc my_user.c partition_info.cc \
sql_servers.cc event_parse_data.cc opt_table_elimination.cc \
- multi_range_read.cc opt_index_cond_pushdown.cc
+ multi_range_read.cc opt_index_cond_pushdown.cc \
+ sql_subquery_cache.cc
libmysqld_int_a_SOURCES= $(libmysqld_sources)
nodist_libmysqld_int_a_SOURCES= $(libmysqlsources) $(sqlsources)
=== modified file 'mysql-test/r/index_merge_myisam.result'
--- a/mysql-test/r/index_merge_myisam.result 2010-03-20 12:01:47 +0000
+++ b/mysql-test/r/index_merge_myisam.result 2010-05-31 21:25:54 +0000
@@ -1419,19 +1419,19 @@
#
select @@optimizer_switch;
@@optimizer_switch
-index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on
+index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
set optimizer_switch='index_merge=off,index_merge_union=off';
select @@optimizer_switch;
@@optimizer_switch
-index_merge=off,index_merge_union=off,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on
+index_merge=off,index_merge_union=off,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
set optimizer_switch='index_merge_union=on';
select @@optimizer_switch;
@@optimizer_switch
-index_merge=off,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on
+index_merge=off,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
set optimizer_switch='default,index_merge_sort_union=off';
select @@optimizer_switch;
@@optimizer_switch
-index_merge=on,index_merge_union=on,index_merge_sort_union=off,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on
+index_merge=on,index_merge_union=on,index_merge_sort_union=off,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
set optimizer_switch=4;
ERROR 42000: Variable 'optimizer_switch' can't be set to the value of '4'
set optimizer_switch=NULL;
@@ -1458,21 +1458,21 @@
set optimizer_switch='index_merge=off,index_merge_union=off,default';
select @@optimizer_switch;
@@optimizer_switch
-index_merge=off,index_merge_union=off,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on
+index_merge=off,index_merge_union=off,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
set optimizer_switch=default;
select @@global.optimizer_switch;
@@global.optimizer_switch
-index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on
+index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
set @@global.optimizer_switch=default;
select @@global.optimizer_switch;
@@global.optimizer_switch
-index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on
+index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
#
# Check index_merge's @@optimizer_switch flags
#
select @@optimizer_switch;
@@optimizer_switch
-index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on
+index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
create table t0 (a int);
insert into t0 values (0),(1),(2),(3),(4),(5),(6),(7),(8),(9);
create table t1 (a int, b int, c int, filler char(100),
@@ -1582,5 +1582,5 @@
set optimizer_switch=default;
show variables like 'optimizer_switch';
Variable_name Value
-optimizer_switch index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on
+optimizer_switch index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
drop table t0, t1;
=== modified file 'mysql-test/r/myisam_mrr.result'
--- a/mysql-test/r/myisam_mrr.result 2010-03-11 21:43:31 +0000
+++ b/mysql-test/r/myisam_mrr.result 2010-05-31 21:25:54 +0000
@@ -394,7 +394,7 @@
# - engine_condition_pushdown does not affect ICP
select @@optimizer_switch;
@@optimizer_switch
-index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on
+index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
create table t0 (a int);
insert into t0 values (0),(1),(2),(3),(4),(5),(6),(7),(8),(9);
create table t1 (a int, b int, key(a));
=== added file 'mysql-test/r/subquery_cache.result'
--- a/mysql-test/r/subquery_cache.result 1970-01-01 00:00:00 +0000
+++ b/mysql-test/r/subquery_cache.result 2010-05-31 21:25:54 +0000
@@ -0,0 +1,591 @@
+set optimizer_switch='subquery_cache=on';
+flush status;
+create table t1 (a int, b int);
+insert into t1 values (1,2),(3,4),(1,2),(3,4),(3,4),(4,5),(4,5),(5,6),(5,6),(4,5);
+create table t2 (c int, d int);
+insert into t2 values (2,3),(3,4),(5,6);
+#single value subquery test
+select a, (select d from t2 where b=c) + 1 from t1;
+a (select d from t2 where b=c) + 1
+1 4
+3 NULL
+1 4
+3 NULL
+3 NULL
+4 7
+4 7
+5 NULL
+5 NULL
+4 7
+show status like "subquery_cache%";
+Variable_name Value
+Subquery_cache_hit 6
+Subquery_cache_miss 4
+#single value subquery test (PS)
+prepare stmt1 from 'select a, (select d from t2 where b=c) + 1 from t1';
+execute stmt1;
+a (select d from t2 where b=c) + 1
+1 4
+3 NULL
+1 4
+3 NULL
+3 NULL
+4 7
+4 7
+5 NULL
+5 NULL
+4 7
+show status like "subquery_cache%";
+Variable_name Value
+Subquery_cache_hit 12
+Subquery_cache_miss 8
+execute stmt1;
+a (select d from t2 where b=c) + 1
+1 4
+3 NULL
+1 4
+3 NULL
+3 NULL
+4 7
+4 7
+5 NULL
+5 NULL
+4 7
+show status like "subquery_cache%";
+Variable_name Value
+Subquery_cache_hit 18
+Subquery_cache_miss 12
+deallocate prepare stmt1;
+#single value subquery test (SP)
+CREATE PROCEDURE p1() select a, (select d from t2 where b=c) + 1 from t1;
+call p1;
+a (select d from t2 where b=c) + 1
+1 4
+3 NULL
+1 4
+3 NULL
+3 NULL
+4 7
+4 7
+5 NULL
+5 NULL
+4 7
+call p1;
+a (select d from t2 where b=c) + 1
+1 4
+3 NULL
+1 4
+3 NULL
+3 NULL
+4 7
+4 7
+5 NULL
+5 NULL
+4 7
+drop procedure p1;
+#IN subquery test
+flush status;
+show status like "subquery_cache%";
+Variable_name Value
+Subquery_cache_hit 0
+Subquery_cache_miss 0
+select a, b , b in (select d from t2) as SUBS from t1;
+a b SUBS
+1 2 0
+3 4 1
+1 2 0
+3 4 1
+3 4 1
+4 5 0
+4 5 0
+5 6 1
+5 6 1
+4 5 0
+show status like "subquery_cache%";
+Variable_name Value
+Subquery_cache_hit 6
+Subquery_cache_miss 4
+insert into t1 values (7,8),(9,NULL);
+select a, b , b in (select d from t2) as SUBS from t1;
+a b SUBS
+1 2 0
+3 4 1
+1 2 0
+3 4 1
+3 4 1
+4 5 0
+4 5 0
+5 6 1
+5 6 1
+4 5 0
+7 8 0
+9 NULL NULL
+show status like "subquery_cache%";
+Variable_name Value
+Subquery_cache_hit 12
+Subquery_cache_miss 10
+insert into t2 values (8,NULL);
+select a, b , b in (select d from t2) as SUBS from t1;
+a b SUBS
+1 2 NULL
+3 4 1
+1 2 NULL
+3 4 1
+3 4 1
+4 5 NULL
+4 5 NULL
+5 6 1
+5 6 1
+4 5 NULL
+7 8 NULL
+9 NULL NULL
+show status like "subquery_cache%";
+Variable_name Value
+Subquery_cache_hit 18
+Subquery_cache_miss 16
+#IN subquery test (PS)
+delete from t1 where a > 6;
+delete from t2 where c > 6;
+prepare stmt1 from 'select a, b , b in (select d from t2) as SUBS from t1';
+execute stmt1;
+a b SUBS
+1 2 0
+3 4 1
+1 2 0
+3 4 1
+3 4 1
+4 5 0
+4 5 0
+5 6 1
+5 6 1
+4 5 0
+show status like "subquery_cache%";
+Variable_name Value
+Subquery_cache_hit 24
+Subquery_cache_miss 20
+execute stmt1;
+a b SUBS
+1 2 0
+3 4 1
+1 2 0
+3 4 1
+3 4 1
+4 5 0
+4 5 0
+5 6 1
+5 6 1
+4 5 0
+show status like "subquery_cache%";
+Variable_name Value
+Subquery_cache_hit 30
+Subquery_cache_miss 24
+insert into t1 values (7,8),(9,NULL);
+execute stmt1;
+a b SUBS
+1 2 0
+3 4 1
+1 2 0
+3 4 1
+3 4 1
+4 5 0
+4 5 0
+5 6 1
+5 6 1
+4 5 0
+9 NULL NULL
+7 8 0
+show status like "subquery_cache%";
+Variable_name Value
+Subquery_cache_hit 36
+Subquery_cache_miss 30
+execute stmt1;
+a b SUBS
+1 2 0
+3 4 1
+1 2 0
+3 4 1
+3 4 1
+4 5 0
+4 5 0
+5 6 1
+5 6 1
+4 5 0
+9 NULL NULL
+7 8 0
+show status like "subquery_cache%";
+Variable_name Value
+Subquery_cache_hit 42
+Subquery_cache_miss 36
+insert into t2 values (8,NULL);
+execute stmt1;
+a b SUBS
+1 2 NULL
+3 4 1
+1 2 NULL
+3 4 1
+3 4 1
+4 5 NULL
+4 5 NULL
+5 6 1
+5 6 1
+4 5 NULL
+9 NULL NULL
+7 8 NULL
+show status like "subquery_cache%";
+Variable_name Value
+Subquery_cache_hit 48
+Subquery_cache_miss 42
+execute stmt1;
+a b SUBS
+1 2 NULL
+3 4 1
+1 2 NULL
+3 4 1
+3 4 1
+4 5 NULL
+4 5 NULL
+5 6 1
+5 6 1
+4 5 NULL
+9 NULL NULL
+7 8 NULL
+show status like "subquery_cache%";
+Variable_name Value
+Subquery_cache_hit 54
+Subquery_cache_miss 48
+deallocate prepare stmt1;
+#IN subquery test (SP)
+delete from t1 where a > 6;
+delete from t2 where c > 6;
+CREATE PROCEDURE p1() select a, b , b in (select d from t2) as SUBS from t1;
+call p1();
+a b SUBS
+1 2 0
+3 4 1
+1 2 0
+3 4 1
+3 4 1
+4 5 0
+4 5 0
+5 6 1
+5 6 1
+4 5 0
+show status like "subquery_cache%";
+Variable_name Value
+Subquery_cache_hit 60
+Subquery_cache_miss 52
+call p1();
+a b SUBS
+1 2 0
+3 4 1
+1 2 0
+3 4 1
+3 4 1
+4 5 0
+4 5 0
+5 6 1
+5 6 1
+4 5 0
+show status like "subquery_cache%";
+Variable_name Value
+Subquery_cache_hit 66
+Subquery_cache_miss 56
+insert into t1 values (7,8),(9,NULL);
+call p1();
+a b SUBS
+1 2 0
+3 4 1
+1 2 0
+3 4 1
+3 4 1
+4 5 0
+4 5 0
+5 6 1
+5 6 1
+4 5 0
+9 NULL NULL
+7 8 0
+show status like "subquery_cache%";
+Variable_name Value
+Subquery_cache_hit 72
+Subquery_cache_miss 62
+call p1();
+a b SUBS
+1 2 0
+3 4 1
+1 2 0
+3 4 1
+3 4 1
+4 5 0
+4 5 0
+5 6 1
+5 6 1
+4 5 0
+9 NULL NULL
+7 8 0
+show status like "subquery_cache%";
+Variable_name Value
+Subquery_cache_hit 78
+Subquery_cache_miss 68
+insert into t2 values (8,NULL);
+call p1();
+a b SUBS
+1 2 NULL
+3 4 1
+1 2 NULL
+3 4 1
+3 4 1
+4 5 NULL
+4 5 NULL
+5 6 1
+5 6 1
+4 5 NULL
+9 NULL NULL
+7 8 NULL
+show status like "subquery_cache%";
+Variable_name Value
+Subquery_cache_hit 84
+Subquery_cache_miss 74
+call p1();
+a b SUBS
+1 2 NULL
+3 4 1
+1 2 NULL
+3 4 1
+3 4 1
+4 5 NULL
+4 5 NULL
+5 6 1
+5 6 1
+4 5 NULL
+9 NULL NULL
+7 8 NULL
+show status like "subquery_cache%";
+Variable_name Value
+Subquery_cache_hit 90
+Subquery_cache_miss 80
+drop procedure p1;
+# test of simple exists
+select a, b , exists (select * from t2 where b=d) as SUBS from t1;
+a b SUBS
+1 2 0
+3 4 1
+1 2 0
+3 4 1
+3 4 1
+4 5 0
+4 5 0
+5 6 1
+5 6 1
+4 5 0
+9 NULL 0
+7 8 0
+# test of prepared statement exists
+show status like "subquery_cache%";
+Variable_name Value
+Subquery_cache_hit 96
+Subquery_cache_miss 86
+prepare stmt1 from 'select a, b , exists (select * from t2 where b=d) as SUBS from t1';
+execute stmt1;
+a b SUBS
+1 2 0
+3 4 1
+1 2 0
+3 4 1
+3 4 1
+4 5 0
+4 5 0
+5 6 1
+5 6 1
+4 5 0
+9 NULL 0
+7 8 0
+show status like "subquery_cache%";
+Variable_name Value
+Subquery_cache_hit 102
+Subquery_cache_miss 92
+execute stmt1;
+a b SUBS
+1 2 0
+3 4 1
+1 2 0
+3 4 1
+3 4 1
+4 5 0
+4 5 0
+5 6 1
+5 6 1
+4 5 0
+9 NULL 0
+7 8 0
+show status like "subquery_cache%";
+Variable_name Value
+Subquery_cache_hit 108
+Subquery_cache_miss 98
+deallocate prepare stmt1;
+# test of stored procedure exists
+CREATE PROCEDURE p1() select a, b , exists (select * from t2 where b=d) as SUBS from t1;
+call p1;
+a b SUBS
+1 2 0
+3 4 1
+1 2 0
+3 4 1
+3 4 1
+4 5 0
+4 5 0
+5 6 1
+5 6 1
+4 5 0
+9 NULL 0
+7 8 0
+call p1;
+a b SUBS
+1 2 0
+3 4 1
+1 2 0
+3 4 1
+3 4 1
+4 5 0
+4 5 0
+5 6 1
+5 6 1
+4 5 0
+9 NULL 0
+7 8 0
+drop procedure p1;
+#clean up
+drop table t1,t2;
+test different types
+#int
+CREATE TABLE t1 ( a int, b int);
+INSERT INTO t1 VALUES(1,1),(2,2),(3,3);
+SELECT a FROM t1 WHERE NOT a IN (SELECT a FROM t1 WHERE b = 2);
+a
+1
+3
+DROP TABLE t1;
+#char
+CREATE TABLE t1 ( a char(1), b char (1));
+INSERT INTO t1 VALUES('1','1'),('2','2'),('3','3');
+SELECT a FROM t1 WHERE NOT a IN (SELECT a FROM t1 WHERE b = '2');
+a
+1
+3
+DROP TABLE t1;
+#decimal
+CREATE TABLE t1 ( a decimal(3,1), b decimal(3,1));
+INSERT INTO t1 VALUES(1,1),(2,2),(3,3);
+SELECT a FROM t1 WHERE NOT a IN (SELECT a FROM t1 WHERE b = 2);
+a
+1.0
+3.0
+DROP TABLE t1;
+#date
+CREATE TABLE t1 ( a date, b date);
+INSERT INTO t1 VALUES('1000-01-01','1000-01-01'),('2000-02-01','2000-02-01'),('3000-03-03','3000-03-03');
+SELECT a FROM t1 WHERE NOT a IN (SELECT a FROM t1 WHERE b = '2000-02-01');
+a
+1000-01-01
+3000-03-03
+DROP TABLE t1;
+#datetime
+CREATE TABLE t1 ( a datetime, b datetime);
+INSERT INTO t1 VALUES('1000-01-01 01:01:01','1000-01-01 01:01:01'),('2000-02-02 02:02:02','2000-02-02 02:02:02'),('3000-03-03 03:03:03','3000-03-03 03:03:03');
+SELECT a FROM t1 WHERE NOT a IN (SELECT a FROM t1 WHERE b = '2000-02-02 02:02:02');
+a
+1000-01-01 01:01:01
+3000-03-03 03:03:03
+DROP TABLE t1;
+#time
+CREATE TABLE t1 ( a time, b time);
+INSERT INTO t1 VALUES('01:01:01','01:01:01'),('02:02:02','02:02:02'),('03:03:03','03:03:03');
+SELECT a FROM t1 WHERE NOT a IN (SELECT a FROM t1 WHERE b = '02:02:02');
+a
+01:01:01
+03:03:03
+DROP TABLE t1;
+#timestamp
+CREATE TABLE t1 ( a timestamp, b timestamp);
+INSERT INTO t1 VALUES('2000-02-02 01:01:01','2000-02-02 01:01:01'),('2000-02-02 02:02:02','2000-02-02 02:02:02'),('2000-02-02 03:03:03','2000-02-02 03:03:03');
+SELECT a FROM t1 WHERE NOT a IN (SELECT a FROM t1 WHERE b = '2000-02-02 02:02:02');
+a
+2000-02-02 01:01:01
+2000-02-02 03:03:03
+DROP TABLE t1;
+#bit
+CREATE TABLE t1 ( a bit(20), b bit(20));
+INSERT INTO t1 VALUES(1,1),(2,2),(3,3);
+SELECT a+0 FROM t1 WHERE NOT a IN (SELECT a FROM t1 WHERE b = 2);
+a+0
+1
+3
+DROP TABLE t1;
+#enum
+CREATE TABLE t1 ( a enum('1','2','3'), b enum('1','2','3'));
+INSERT INTO t1 VALUES('1','1'),('2','2'),('3','3');
+SELECT a FROM t1 WHERE NOT a IN (SELECT a FROM t1 WHERE b = '2');
+a
+1
+3
+DROP TABLE t1;
+#set
+CREATE TABLE t1 ( a set('1','2','3'), b set('1','2','3'));
+INSERT INTO t1 VALUES('1','1'),('2','2'),('3','3');
+SELECT a FROM t1 WHERE NOT a IN (SELECT a FROM t1 WHERE b = '2');
+a
+1
+3
+DROP TABLE t1;
+#blob
+CREATE TABLE t1 ( a blob, b blob);
+INSERT INTO t1 VALUES('1','1'),('2','2'),('3','3');
+SELECT a FROM t1 WHERE NOT a IN (SELECT a FROM t1 WHERE b = '2');
+a
+1
+3
+DROP TABLE t1;
+#geometry
+CREATE TABLE t1 ( a geometry, b geometry);
+INSERT INTO t1 VALUES(POINT(1,1),POINT(1,1)),(POINT(2,2),POINT(2,2)),(POINT(3,3),POINT(3,3));
+SELECT astext(a) FROM t1 WHERE NOT a IN (SELECT a FROM t1 WHERE b = POINT(2,2));
+astext(a)
+POINT(1 1)
+POINT(3 3)
+DROP TABLE t1;
+#uncacheable queries test (random and side effect)
+flush status;
+CREATE TABLE t1 (a int);
+INSERT INTO t1 VALUES (2), (4), (1), (3);
+select a, a in (select a from t1) from t1 as ext;
+a a in (select a from t1)
+2 1
+4 1
+1 1
+3 1
+show status like "subquery_cache%";
+Variable_name Value
+Subquery_cache_hit 0
+Subquery_cache_miss 4
+select a, a in (select a from t1 where -1 < rand()) from t1 as ext;
+a a in (select a from t1 where -1 < rand())
+2 1
+4 1
+1 1
+3 1
+show status like "subquery_cache%";
+Variable_name Value
+Subquery_cache_hit 0
+Subquery_cache_miss 4
+select a, a in (select a from t1 where -1 < benchmark(a,100)) from t1 as ext;
+a a in (select a from t1 where -1 < benchmark(a,100))
+2 1
+4 1
+1 1
+3 1
+show status like "subquery_cache%";
+Variable_name Value
+Subquery_cache_hit 0
+Subquery_cache_miss 4
+drop table t1;
+set optimizer_switch='subquery_cache=default';
=== modified file 'mysql-test/r/subselect3.result'
--- a/mysql-test/r/subselect3.result 2010-03-29 14:04:35 +0000
+++ b/mysql-test/r/subselect3.result 2010-05-31 21:25:54 +0000
@@ -105,6 +105,7 @@
Handler_read_rnd_next 5
delete from t2;
insert into t2 values (NULL, 0),(NULL, 0), (NULL, 0), (NULL, 0);
+set optimizer_switch='subquery_cache=off';
flush status;
select oref, a, a in (select a from t1 where oref=t2.oref) Z from t2;
oref a Z
@@ -123,6 +124,7 @@
select 'No key lookups, seq reads: 29= 5 reads from t2 + 4 * 6 reads from t1.' Z;
Z
No key lookups, seq reads: 29= 5 reads from t2 + 4 * 6 reads from t1.
+set @@optimizer_switch=@save_optimizer_switch;
drop table t1, t2;
create table t1 (a int, b int, primary key (a));
insert into t1 values (1,1), (3,1),(100,1);
=== modified file 'mysql-test/r/subselect3_jcl6.result'
--- a/mysql-test/r/subselect3_jcl6.result 2010-03-29 14:04:35 +0000
+++ b/mysql-test/r/subselect3_jcl6.result 2010-05-31 21:25:54 +0000
@@ -109,6 +109,7 @@
Handler_read_rnd_next 5
delete from t2;
insert into t2 values (NULL, 0),(NULL, 0), (NULL, 0), (NULL, 0);
+set optimizer_switch='subquery_cache=off';
flush status;
select oref, a, a in (select a from t1 where oref=t2.oref) Z from t2;
oref a Z
@@ -127,6 +128,7 @@
select 'No key lookups, seq reads: 29= 5 reads from t2 + 4 * 6 reads from t1.' Z;
Z
No key lookups, seq reads: 29= 5 reads from t2 + 4 * 6 reads from t1.
+set @@optimizer_switch=@save_optimizer_switch;
drop table t1, t2;
create table t1 (a int, b int, primary key (a));
insert into t1 values (1,1), (3,1),(100,1);
=== modified file 'mysql-test/r/subselect_no_mat.result'
--- a/mysql-test/r/subselect_no_mat.result 2010-03-20 12:01:47 +0000
+++ b/mysql-test/r/subselect_no_mat.result 2010-05-31 21:25:54 +0000
@@ -1,6 +1,6 @@
show variables like 'optimizer_switch';
Variable_name Value
-optimizer_switch index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on
+optimizer_switch index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
set optimizer_switch='materialization=off';
drop table if exists t1,t2,t3,t4,t5,t6,t7,t8,t11,t12;
set @save_optimizer_switch=@@optimizer_switch;
@@ -4826,4 +4826,4 @@
set optimizer_switch=default;
show variables like 'optimizer_switch';
Variable_name Value
-optimizer_switch index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on
+optimizer_switch index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
=== modified file 'mysql-test/r/subselect_no_opts.result'
--- a/mysql-test/r/subselect_no_opts.result 2010-03-20 12:01:47 +0000
+++ b/mysql-test/r/subselect_no_opts.result 2010-05-31 21:25:54 +0000
@@ -1,6 +1,6 @@
show variables like 'optimizer_switch';
Variable_name Value
-optimizer_switch index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on
+optimizer_switch index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
set optimizer_switch='materialization=off,semijoin=off';
drop table if exists t1,t2,t3,t4,t5,t6,t7,t8,t11,t12;
set @save_optimizer_switch=@@optimizer_switch;
@@ -4826,4 +4826,4 @@
set optimizer_switch=default;
show variables like 'optimizer_switch';
Variable_name Value
-optimizer_switch index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on
+optimizer_switch index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
=== modified file 'mysql-test/r/subselect_no_semijoin.result'
--- a/mysql-test/r/subselect_no_semijoin.result 2010-03-20 12:01:47 +0000
+++ b/mysql-test/r/subselect_no_semijoin.result 2010-05-31 21:25:54 +0000
@@ -1,6 +1,6 @@
show variables like 'optimizer_switch';
Variable_name Value
-optimizer_switch index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on
+optimizer_switch index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
set optimizer_switch='semijoin=off';
drop table if exists t1,t2,t3,t4,t5,t6,t7,t8,t11,t12;
set @save_optimizer_switch=@@optimizer_switch;
@@ -4826,4 +4826,4 @@
set optimizer_switch=default;
show variables like 'optimizer_switch';
Variable_name Value
-optimizer_switch index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on
+optimizer_switch index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
=== modified file 'mysql-test/r/subselect_sj.result'
--- a/mysql-test/r/subselect_sj.result 2010-03-29 14:04:35 +0000
+++ b/mysql-test/r/subselect_sj.result 2010-05-31 21:25:54 +0000
@@ -202,39 +202,39 @@
select @@optimizer_switch;
@@optimizer_switch
-index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on
+index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
set optimizer_switch='default,materialization=off';
select @@optimizer_switch;
@@optimizer_switch
-index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=off,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on
+index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=off,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
set optimizer_switch='default,semijoin=off';
select @@optimizer_switch;
@@optimizer_switch
-index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=off,partial_match_rowid_merge=on,partial_match_table_scan=on
+index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=off,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
set optimizer_switch='default,loosescan=off';
select @@optimizer_switch;
@@optimizer_switch
-index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=off,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on
+index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=off,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
set optimizer_switch='default,semijoin=off,materialization=off';
select @@optimizer_switch;
@@optimizer_switch
-index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=off,semijoin=off,partial_match_rowid_merge=on,partial_match_table_scan=on
+index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=off,semijoin=off,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
set optimizer_switch='default,materialization=off,semijoin=off';
select @@optimizer_switch;
@@optimizer_switch
-index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=off,semijoin=off,partial_match_rowid_merge=on,partial_match_table_scan=on
+index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=off,semijoin=off,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
set optimizer_switch='default,semijoin=off,materialization=off,loosescan=off';
select @@optimizer_switch;
@@optimizer_switch
-index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=off,materialization=off,semijoin=off,partial_match_rowid_merge=on,partial_match_table_scan=on
+index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=off,materialization=off,semijoin=off,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
set optimizer_switch='default,semijoin=off,loosescan=off';
select @@optimizer_switch;
@@optimizer_switch
-index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=off,materialization=on,semijoin=off,partial_match_rowid_merge=on,partial_match_table_scan=on
+index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=off,materialization=on,semijoin=off,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
set optimizer_switch='default,materialization=off,loosescan=off';
select @@optimizer_switch;
@@optimizer_switch
-index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=off,materialization=off,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on
+index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=off,materialization=off,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
set optimizer_switch=default;
drop table t0, t1, t2;
drop table t10, t11, t12;
=== modified file 'mysql-test/r/subselect_sj_jcl6.result'
--- a/mysql-test/r/subselect_sj_jcl6.result 2010-03-29 14:04:35 +0000
+++ b/mysql-test/r/subselect_sj_jcl6.result 2010-05-31 21:25:54 +0000
@@ -206,39 +206,39 @@
select @@optimizer_switch;
@@optimizer_switch
-index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on
+index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
set optimizer_switch='default,materialization=off';
select @@optimizer_switch;
@@optimizer_switch
-index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=off,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on
+index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=off,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
set optimizer_switch='default,semijoin=off';
select @@optimizer_switch;
@@optimizer_switch
-index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=off,partial_match_rowid_merge=on,partial_match_table_scan=on
+index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=off,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
set optimizer_switch='default,loosescan=off';
select @@optimizer_switch;
@@optimizer_switch
-index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=off,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on
+index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=off,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
set optimizer_switch='default,semijoin=off,materialization=off';
select @@optimizer_switch;
@@optimizer_switch
-index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=off,semijoin=off,partial_match_rowid_merge=on,partial_match_table_scan=on
+index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=off,semijoin=off,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
set optimizer_switch='default,materialization=off,semijoin=off';
select @@optimizer_switch;
@@optimizer_switch
-index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=off,semijoin=off,partial_match_rowid_merge=on,partial_match_table_scan=on
+index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=off,semijoin=off,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
set optimizer_switch='default,semijoin=off,materialization=off,loosescan=off';
select @@optimizer_switch;
@@optimizer_switch
-index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=off,materialization=off,semijoin=off,partial_match_rowid_merge=on,partial_match_table_scan=on
+index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=off,materialization=off,semijoin=off,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
set optimizer_switch='default,semijoin=off,loosescan=off';
select @@optimizer_switch;
@@optimizer_switch
-index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=off,materialization=on,semijoin=off,partial_match_rowid_merge=on,partial_match_table_scan=on
+index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=off,materialization=on,semijoin=off,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
set optimizer_switch='default,materialization=off,loosescan=off';
select @@optimizer_switch;
@@optimizer_switch
-index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=off,materialization=off,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on
+index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=off,materialization=off,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
set optimizer_switch=default;
drop table t0, t1, t2;
drop table t10, t11, t12;
=== added file 'mysql-test/t/subquery_cache.test'
--- a/mysql-test/t/subquery_cache.test 1970-01-01 00:00:00 +0000
+++ b/mysql-test/t/subquery_cache.test 2010-05-31 21:25:54 +0000
@@ -0,0 +1,204 @@
+
+set optimizer_switch='subquery_cache=on';
+flush status;
+
+create table t1 (a int, b int);
+insert into t1 values (1,2),(3,4),(1,2),(3,4),(3,4),(4,5),(4,5),(5,6),(5,6),(4,5);
+create table t2 (c int, d int);
+insert into t2 values (2,3),(3,4),(5,6);
+
+--echo #single value subquery test
+select a, (select d from t2 where b=c) + 1 from t1;
+
+show status like "subquery_cache%";
+
+--echo #single value subquery test (PS)
+prepare stmt1 from 'select a, (select d from t2 where b=c) + 1 from t1';
+execute stmt1;
+show status like "subquery_cache%";
+execute stmt1;
+show status like "subquery_cache%";
+deallocate prepare stmt1;
+
+--echo #single value subquery test (SP)
+CREATE PROCEDURE p1() select a, (select d from t2 where b=c) + 1 from t1;
+
+call p1;
+call p1;
+
+drop procedure p1;
+
+--echo #IN subquery test
+flush status;
+
+show status like "subquery_cache%";
+select a, b , b in (select d from t2) as SUBS from t1;
+show status like "subquery_cache%";
+
+insert into t1 values (7,8),(9,NULL);
+select a, b , b in (select d from t2) as SUBS from t1;
+show status like "subquery_cache%";
+
+insert into t2 values (8,NULL);
+select a, b , b in (select d from t2) as SUBS from t1;
+show status like "subquery_cache%";
+
+--echo #IN subquery test (PS)
+delete from t1 where a > 6;
+delete from t2 where c > 6;
+
+prepare stmt1 from 'select a, b , b in (select d from t2) as SUBS from t1';
+execute stmt1;
+show status like "subquery_cache%";
+execute stmt1;
+show status like "subquery_cache%";
+
+insert into t1 values (7,8),(9,NULL);
+execute stmt1;
+show status like "subquery_cache%";
+execute stmt1;
+show status like "subquery_cache%";
+
+insert into t2 values (8,NULL);
+execute stmt1;
+show status like "subquery_cache%";
+execute stmt1;
+show status like "subquery_cache%";
+
+deallocate prepare stmt1;
+
+
+--echo #IN subquery test (SP)
+delete from t1 where a > 6;
+delete from t2 where c > 6;
+
+CREATE PROCEDURE p1() select a, b , b in (select d from t2) as SUBS from t1;
+
+call p1();
+show status like "subquery_cache%";
+call p1();
+show status like "subquery_cache%";
+
+insert into t1 values (7,8),(9,NULL);
+call p1();
+show status like "subquery_cache%";
+call p1();
+show status like "subquery_cache%";
+
+insert into t2 values (8,NULL);
+call p1();
+show status like "subquery_cache%";
+call p1();
+show status like "subquery_cache%";
+
+drop procedure p1;
+
+
+--echo # test of simple exists
+select a, b , exists (select * from t2 where b=d) as SUBS from t1;
+
+--echo # test of prepared statement exists
+show status like "subquery_cache%";
+prepare stmt1 from 'select a, b , exists (select * from t2 where b=d) as SUBS from t1';
+execute stmt1;
+show status like "subquery_cache%";
+execute stmt1;
+show status like "subquery_cache%";
+deallocate prepare stmt1;
+
+--echo # test of stored procedure exists
+CREATE PROCEDURE p1() select a, b , exists (select * from t2 where b=d) as SUBS from t1;
+call p1;
+call p1;
+drop procedure p1;
+
+--echo #clean up
+drop table t1,t2;
+
+--echo test different types
+--echo #int
+CREATE TABLE t1 ( a int, b int);
+INSERT INTO t1 VALUES(1,1),(2,2),(3,3);
+SELECT a FROM t1 WHERE NOT a IN (SELECT a FROM t1 WHERE b = 2);
+DROP TABLE t1;
+
+--echo #char
+CREATE TABLE t1 ( a char(1), b char (1));
+INSERT INTO t1 VALUES('1','1'),('2','2'),('3','3');
+SELECT a FROM t1 WHERE NOT a IN (SELECT a FROM t1 WHERE b = '2');
+DROP TABLE t1;
+
+--echo #decimal
+CREATE TABLE t1 ( a decimal(3,1), b decimal(3,1));
+INSERT INTO t1 VALUES(1,1),(2,2),(3,3);
+SELECT a FROM t1 WHERE NOT a IN (SELECT a FROM t1 WHERE b = 2);
+DROP TABLE t1;
+
+--echo #date
+CREATE TABLE t1 ( a date, b date);
+INSERT INTO t1 VALUES('1000-01-01','1000-01-01'),('2000-02-01','2000-02-01'),('3000-03-03','3000-03-03');
+SELECT a FROM t1 WHERE NOT a IN (SELECT a FROM t1 WHERE b = '2000-02-01');
+DROP TABLE t1;
+
+--echo #datetime
+CREATE TABLE t1 ( a datetime, b datetime);
+INSERT INTO t1 VALUES('1000-01-01 01:01:01','1000-01-01 01:01:01'),('2000-02-02 02:02:02','2000-02-02 02:02:02'),('3000-03-03 03:03:03','3000-03-03 03:03:03');
+SELECT a FROM t1 WHERE NOT a IN (SELECT a FROM t1 WHERE b = '2000-02-02 02:02:02');
+DROP TABLE t1;
+
+--echo #time
+CREATE TABLE t1 ( a time, b time);
+INSERT INTO t1 VALUES('01:01:01','01:01:01'),('02:02:02','02:02:02'),('03:03:03','03:03:03');
+SELECT a FROM t1 WHERE NOT a IN (SELECT a FROM t1 WHERE b = '02:02:02');
+DROP TABLE t1;
+
+--echo #timestamp
+CREATE TABLE t1 ( a timestamp, b timestamp);
+INSERT INTO t1 VALUES('2000-02-02 01:01:01','2000-02-02 01:01:01'),('2000-02-02 02:02:02','2000-02-02 02:02:02'),('2000-02-02 03:03:03','2000-02-02 03:03:03');
+SELECT a FROM t1 WHERE NOT a IN (SELECT a FROM t1 WHERE b = '2000-02-02 02:02:02');
+DROP TABLE t1;
+
+--echo #bit
+CREATE TABLE t1 ( a bit(20), b bit(20));
+INSERT INTO t1 VALUES(1,1),(2,2),(3,3);
+SELECT a+0 FROM t1 WHERE NOT a IN (SELECT a FROM t1 WHERE b = 2);
+DROP TABLE t1;
+
+--echo #enum
+CREATE TABLE t1 ( a enum('1','2','3'), b enum('1','2','3'));
+INSERT INTO t1 VALUES('1','1'),('2','2'),('3','3');
+SELECT a FROM t1 WHERE NOT a IN (SELECT a FROM t1 WHERE b = '2');
+DROP TABLE t1;
+
+--echo #set
+CREATE TABLE t1 ( a set('1','2','3'), b set('1','2','3'));
+INSERT INTO t1 VALUES('1','1'),('2','2'),('3','3');
+SELECT a FROM t1 WHERE NOT a IN (SELECT a FROM t1 WHERE b = '2');
+DROP TABLE t1;
+
+--echo #blob
+CREATE TABLE t1 ( a blob, b blob);
+INSERT INTO t1 VALUES('1','1'),('2','2'),('3','3');
+SELECT a FROM t1 WHERE NOT a IN (SELECT a FROM t1 WHERE b = '2');
+DROP TABLE t1;
+
+--echo #geometry
+CREATE TABLE t1 ( a geometry, b geometry);
+INSERT INTO t1 VALUES(POINT(1,1),POINT(1,1)),(POINT(2,2),POINT(2,2)),(POINT(3,3),POINT(3,3));
+SELECT astext(a) FROM t1 WHERE NOT a IN (SELECT a FROM t1 WHERE b = POINT(2,2));
+DROP TABLE t1;
+
+
+--echo #uncacheable queries test (random and side effect)
+flush status;
+CREATE TABLE t1 (a int);
+INSERT INTO t1 VALUES (2), (4), (1), (3);
+select a, a in (select a from t1) from t1 as ext;
+show status like "subquery_cache%";
+select a, a in (select a from t1 where -1 < rand()) from t1 as ext;
+show status like "subquery_cache%";
+select a, a in (select a from t1 where -1 < benchmark(a,100)) from t1 as ext;
+show status like "subquery_cache%";
+drop table t1;
+
+set optimizer_switch='subquery_cache=default';
=== modified file 'mysql-test/t/subselect3.test'
--- a/mysql-test/t/subselect3.test 2010-03-20 12:01:47 +0000
+++ b/mysql-test/t/subselect3.test 2010-05-31 21:25:54 +0000
@@ -98,10 +98,12 @@
delete from t2;
insert into t2 values (NULL, 0),(NULL, 0), (NULL, 0), (NULL, 0);
+set optimizer_switch='subquery_cache=off';
flush status;
select oref, a, a in (select a from t1 where oref=t2.oref) Z from t2;
show status like '%Handler_read%';
select 'No key lookups, seq reads: 29= 5 reads from t2 + 4 * 6 reads from t1.' Z;
+set @@optimizer_switch=@save_optimizer_switch;
drop table t1, t2;
=== modified file 'sql/CMakeLists.txt'
--- a/sql/CMakeLists.txt 2010-03-20 12:01:47 +0000
+++ b/sql/CMakeLists.txt 2010-05-31 21:25:54 +0000
@@ -78,7 +78,7 @@
rpl_rli.cc rpl_mi.cc sql_servers.cc
sql_connect.cc scheduler.cc
sql_profile.cc event_parse_data.cc opt_table_elimination.cc
- ds_mrr.cc
+ ds_mrr.cc sql_subquery_cache.cc
${PROJECT_SOURCE_DIR}/sql/sql_yacc.cc
${PROJECT_SOURCE_DIR}/sql/sql_yacc.h
${PROJECT_SOURCE_DIR}/include/mysqld_error.h
=== modified file 'sql/Makefile.am'
--- a/sql/Makefile.am 2010-03-20 12:01:47 +0000
+++ b/sql/Makefile.am 2010-05-31 21:25:54 +0000
@@ -80,7 +80,7 @@
event_data_objects.h event_scheduler.h \
sql_partition.h partition_info.h partition_element.h \
contributors.h sql_servers.h \
- multi_range_read.h
+ multi_range_read.h sql_subquery_cache.h
mysqld_SOURCES = sql_lex.cc sql_handler.cc sql_partition.cc \
item.cc item_sum.cc item_buff.cc item_func.cc \
@@ -130,7 +130,7 @@
sql_servers.cc event_parse_data.cc \
opt_table_elimination.cc \
multi_range_read.cc \
- opt_index_cond_pushdown.cc
+ opt_index_cond_pushdown.cc sql_subquery_cache.cc
nodist_mysqld_SOURCES = mini_client_errors.c pack.c client.c my_time.c my_user.c
=== modified file 'sql/item.cc'
--- a/sql/item.cc 2010-03-20 12:01:47 +0000
+++ b/sql/item.cc 2010-05-31 21:25:54 +0000
@@ -28,6 +28,9 @@
const String my_null_string("NULL", 4, default_charset_info);
+static int save_field_in_field(Field *from,my_bool * null_value,
+ Field *to, bool no_conversions);
+
/****************************************************************************/
/* Hybrid_type_traits {_real} */
@@ -2273,6 +2276,13 @@
str->append(str_value);
}
+void Item_bool_cache::print(String *str, enum_query_type query_type)
+{
+ if (null_value)
+ str->append("NULL", 4);
+ else
+ Item_int::print(str, query_type);
+}
Item_uint::Item_uint(const char *str_arg, uint length):
Item_int(str_arg, length)
@@ -3646,12 +3656,17 @@
resolved_item->db_name : "");
const char *table_name= (resolved_item->table_name ?
resolved_item->table_name : "");
+ DBUG_ENTER("mark_as_dependent");
+ DBUG_PRINT("enter", ("Field '%s.%s.%s in select %d resolved in %d",
+ db_name, table_name,
+ resolved_item->field_name, current->select_number,
+ last->select_number));
/* store pointer on SELECT_LEX from which item is dependent */
if (mark_item)
mark_item->depended_from= last;
if (current->mark_as_dependent(thd, last, /** resolved_item psergey-thu
**/mark_item))
- return TRUE;
+ DBUG_RETURN(TRUE);
if (thd->lex->describe & DESCRIBE_EXTENDED)
{
push_warning_printf(thd, MYSQL_ERROR::WARN_LEVEL_NOTE,
@@ -3661,7 +3676,7 @@
resolved_item->field_name,
current->select_number, last->select_number);
}
- return FALSE;
+ DBUG_RETURN(FALSE);
}
@@ -3698,6 +3713,7 @@
resolving)
*/
SELECT_LEX *previous_select= current_sel;
+
for (; previous_select->outer_select() != last_select;
previous_select= previous_select->outer_select())
{
@@ -3726,6 +3742,7 @@
mark_as_dependent(thd, last_select, current_sel, resolved_item,
dependent);
}
+ return;
}
@@ -4098,6 +4115,9 @@
((ref_type == REF_ITEM ||
ref_type == FIELD_ITEM) ?
(Item_ident*) (*reference) : 0));
+ context->select_lex->
+ register_dependency_item(last_checked_context->select_lex,
+ reference);
return 0;
}
}
@@ -4113,7 +4133,9 @@
((ref_type == REF_ITEM || ref_type == FIELD_ITEM) ?
(Item_ident*) (*reference) :
0));
-
+ context->select_lex->
+ register_dependency_item(last_checked_context->select_lex,
+ reference);
/*
A reference to a view field had been found and we
substituted it instead of this Item (find_field_in_tables
@@ -4215,6 +4237,10 @@
mark_as_dependent(thd, last_checked_context->select_lex,
context->select_lex, rf,
rf);
+ context->select_lex->
+ register_dependency_item(last_checked_context->select_lex,
+ reference);
+
return 0;
}
else
@@ -4222,6 +4248,9 @@
mark_as_dependent(thd, last_checked_context->select_lex,
context->select_lex,
this, (Item_ident*)*reference);
+ context->select_lex->
+ register_dependency_item(last_checked_context->select_lex,
+ reference);
if (last_checked_context->select_lex->having_fix_field)
{
Item_ref *rf;
@@ -5082,39 +5111,48 @@
/**
+ Save the value of one Field in another Field
+
+ @param from Field to copy the value from
+ @param null_value pointer to the item's null_value, set if needed
+ @param to Field to copy the value to
+ @param no_conversions how to deal with NULL value (see
+ set_field_to_null_with_conversions())
+
+ @retval FALSE OK
+ @retval TRUE Error
+*/
+
+static int save_field_in_field(Field *from, my_bool *null_value,
+ Field *to, bool no_conversions)
+{
+ int res;
+ if (from->is_null())
+ {
+ (*null_value)= 1;
+ res= set_field_to_null_with_conversions(to, no_conversions);
+ }
+ else
+ {
+ to->set_notnull();
+ res= field_conv(to, from);
+ (*null_value)= 0;
+ }
+ return res;
+}
+
+/**
Set a field's value from a item.
*/
void Item_field::save_org_in_field(Field *to)
{
- if (field->is_null())
- {
- null_value=1;
- set_field_to_null_with_conversions(to, 1);
- }
- else
- {
- to->set_notnull();
- field_conv(to,field);
- null_value=0;
- }
+ save_field_in_field(field, &null_value, to, TRUE);
}
int Item_field::save_in_field(Field *to, bool no_conversions)
{
- int res;
- if (result_field->is_null())
- {
- null_value=1;
- res= set_field_to_null_with_conversions(to, no_conversions);
- }
- else
- {
- to->set_notnull();
- res= field_conv(to,result_field);
- null_value=0;
- }
- return res;
+ return save_field_in_field(result_field, &null_value, to, no_conversions);
}
@@ -5973,6 +6011,9 @@
refer_type == FIELD_ITEM) ?
(Item_ident*) (*reference) :
0));
+ context->select_lex->
+ register_dependency_item(last_checked_context->select_lex,
+ reference);
/*
view reference found, we substituted it instead of this
Item, so can quit
@@ -6023,6 +6064,9 @@
thd->change_item_tree(reference, fld);
mark_as_dependent(thd, last_checked_context->select_lex,
thd->lex->current_select, fld, fld);
+ context->select_lex->
+ register_dependency_item(last_checked_context->select_lex,
+ reference);
/*
A reference is resolved to a nest level that's outer or the same as
the nest level of the enclosing set function : adjust the value of
@@ -6046,6 +6090,9 @@
DBUG_ASSERT(*ref && (*ref)->fixed);
mark_as_dependent(thd, last_checked_context->select_lex,
context->select_lex, this, this);
+ context->select_lex->
+ register_dependency_item(last_checked_context->select_lex,
+ ref);
/*
A reference is resolved to a nest level that's outer or the same as
the nest level of the enclosing set function : adjust the value of
@@ -6312,7 +6359,8 @@
int Item_ref::save_in_field(Field *to, bool no_conversions)
{
int res;
- DBUG_ASSERT(!result_field);
+ if (result_field)
+ return save_field_in_field(result_field, &null_value, to, no_conversions);
res= (*ref)->save_in_field(to, no_conversions);
null_value= (*ref)->null_value;
return res;
=== modified file 'sql/item.h'
--- a/sql/item.h 2010-03-20 12:01:47 +0000
+++ b/sql/item.h 2010-05-31 21:25:54 +0000
@@ -1922,8 +1922,31 @@
virtual void print(String *str, enum_query_type query_type);
Item_num *neg ();
uint decimal_precision() const { return max_length; }
- bool check_partition_func_processor(uchar *bool_arg) { return FALSE;}
- bool check_vcol_func_processor(uchar *arg) { return FALSE;}
+};
+
+
+/**
+ Item representing TRUE/FALSE/NULL for subquery values
+*/
+
+class Item_bool_cache: public Item_int
+{
+public:
+ Item_bool_cache(): Item_int(0, 1)
+ {
+ unsigned_flag= maybe_null= null_value= TRUE;
+ name= (char *)"bool chache";
+ }
+ Item_bool_cache(my_bool val, my_bool null): Item_int(val, 1)
+ {
+ unsigned_flag= maybe_null= TRUE;
+ null_value= null;
+ name= (char *)"bool chache";
+ }
+ Item *clone_item() { return new Item_bool_cache(value, null_value); }
+ uint decimal_precision() const { return 1; }
+ virtual void print(String *str, enum_query_type query_type);
+ void set(my_bool val, my_bool null) {value= test(val); null_value= null;}
};
@@ -3146,7 +3169,8 @@
example(0), used_table_map(0), cached_field(0), cached_field_type(MYSQL_TYPE_STRING),
value_cached(0)
{
- fixed= 1;
+ fixed= 1;
+ maybe_null= 1;
null_value= 1;
}
Item_cache(enum_field_types field_type_arg):
@@ -3154,6 +3178,7 @@
value_cached(0)
{
fixed= 1;
+ maybe_null= 1;
null_value= 1;
}
=== modified file 'sql/item_cmpfunc.cc'
--- a/sql/item_cmpfunc.cc 2010-03-20 12:01:47 +0000
+++ b/sql/item_cmpfunc.cc 2010-05-31 21:25:54 +0000
@@ -1736,6 +1736,15 @@
used_tables_cache|= args[1]->used_tables();
not_null_tables_cache|= args[1]->not_null_tables();
const_item_cache&= args[1]->const_item();
+ DBUG_ASSERT(scache == NULL);
+  if (args[0]->cols() == 1 &&
+ thd->variables.optimizer_switch & OPTIMIZER_SWITCH_SUBQUERY_CACHE &&
+ !(sub->engine->uncacheable() & (UNCACHEABLE_RAND |
+ UNCACHEABLE_SIDEEFFECT)))
+ {
+ sub->depends_on.push_front((Item**)&cache);
+ scache= new Subquery_cache_tmptable(thd, sub->depends_on, &result);
+ }
fixed= 1;
return FALSE;
}
@@ -1744,10 +1753,26 @@
longlong Item_in_optimizer::val_int()
{
bool tmp;
+ DBUG_ENTER("Item_in_optimizer::val_int");
+
DBUG_ASSERT(fixed == 1);
cache->store(args[0]);
cache->cache_value();
-
+
+ /* check if result is in the cache */
+ if (scache)
+ {
+ Subquery_cache_tmptable::result res;
+ Item *cached_value;
+ res= scache->check_value(&cached_value);
+ if (res == Subquery_cache_tmptable::HIT)
+ {
+ tmp= cached_value->val_int();
+ null_value= cached_value->null_value;
+ DBUG_RETURN(tmp);
+ }
+ }
+
if (cache->null_value)
{
/*
@@ -1818,11 +1843,18 @@
for (uint i= 0; i < ncols; i++)
item_subs->set_cond_guard_var(i, TRUE);
}
- return 0;
+ DBUG_RETURN(0);
}
tmp= args[1]->val_bool_result();
null_value= args[1]->null_value;
- return tmp;
+
+ /* put result in the cache */
+ if (scache)
+ {
+ result.set(tmp, null_value);
+ scache->put_value(&result);
+ }
+ DBUG_RETURN(tmp);
}
@@ -1839,6 +1871,11 @@
Item_bool_func::cleanup();
if (!save_cache)
cache= 0;
+ if (scache)
+ {
+ delete scache;
+ scache= 0;
+ }
DBUG_VOID_RETURN;
}
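
Taken together, the fix_fields(), val_int() and cleanup() changes above implement a lookup-or-compute pattern around the existing IN evaluation. A condensed sketch of the control flow (no new names, just the diff in outline):

  Item *cached;
  if (scache &&
      scache->check_value(&cached) == Subquery_cache_tmptable::HIT)
  {
    null_value= cached->null_value;        /* hit: skip the subquery */
    DBUG_RETURN(cached->val_int());
  }
  bool tmp= args[1]->val_bool_result();    /* miss: evaluate it      */
  null_value= args[1]->null_value;
  if (scache)
  {
    result.set(tmp, null_value);           /* remember for next time */
    scache->put_value(&result);
  }
  DBUG_RETURN(tmp);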
=== modified file 'sql/item_cmpfunc.h'
--- a/sql/item_cmpfunc.h 2010-03-20 12:01:47 +0000
+++ b/sql/item_cmpfunc.h 2010-05-31 21:25:54 +0000
@@ -215,6 +215,7 @@
class Item_cache;
+class Subquery_cache;
#define UNKNOWN ((my_bool)-1)
@@ -237,6 +238,10 @@
{
protected:
Item_cache *cache;
+ /* Subquery cache */
+ Subquery_cache *scache;
+ /* result representation for the subquery cache */
+ Item_bool_cache result;
bool save_cache;
/*
Stores the value of "NULL IN (SELECT ...)" for uncorrelated subqueries:
@@ -247,7 +252,7 @@
my_bool result_for_null_param;
public:
Item_in_optimizer(Item *a, Item_in_subselect *b):
- Item_bool_func(a, my_reinterpret_cast(Item *)(b)), cache(0),
+ Item_bool_func(a, my_reinterpret_cast(Item *)(b)), cache(0), scache(NULL),
save_cache(0), result_for_null_param(UNKNOWN)
{}
bool fix_fields(THD *, Item **);
=== modified file 'sql/item_subselect.cc'
--- a/sql/item_subselect.cc 2010-03-29 14:04:35 +0000
+++ b/sql/item_subselect.cc 2010-05-31 21:25:54 +0000
@@ -34,11 +34,10 @@
Item_subselect::Item_subselect():
Item_result_field(), value_assigned(0), thd(0), substitution(0),
- engine(0), old_engine(0), used_tables_cache(0), have_to_be_excluded(0),
- const_item_cache(1),
- inside_first_fix_fields(0), done_first_fix_fields(FALSE),
- eliminated(FALSE),
- engine_changed(0), changed(0), is_correlated(FALSE)
+ engine(0), old_engine(0), scache(0), used_tables_cache(0),
+ have_to_be_excluded(0), const_item_cache(1), inside_first_fix_fields(0),
+ done_first_fix_fields(FALSE), eliminated(FALSE), engine_changed(0),
+ changed(0), is_correlated(FALSE)
{
with_subselect= 1;
reset();
@@ -116,6 +115,12 @@
}
if (engine)
engine->cleanup();
+ depends_on.empty();
+ if (scache)
+ {
+ delete scache;
+ scache= 0;
+ }
reset();
value_assigned= 0;
DBUG_VOID_RETURN;
@@ -148,6 +153,8 @@
Item_subselect::~Item_subselect()
{
delete engine;
+ if (scache)
+ delete scache;
}
Item_subselect::trans_res
@@ -746,9 +753,22 @@
void Item_singlerow_subselect::fix_length_and_dec()
{
+ DBUG_ENTER("Item_singlerow_subselect::fix_length_and_dec");
if ((max_columns= engine->cols()) == 1)
{
+ DBUG_PRINT("info", ("one, elements: %u flag %u",
+ (uint)depends_on.elements,
+ (uint)test(thd->variables.optimizer_switch & OPTIMIZER_SWITCH_SUBQUERY_CACHE)));
engine->fix_length_and_dec(row= &value);
+ if (depends_on.elements &&
+ optimizer_flag(thd, OPTIMIZER_SWITCH_SUBQUERY_CACHE) &&
+ !(engine->uncacheable() & (UNCACHEABLE_RAND |
+ UNCACHEABLE_SIDEEFFECT)))
+ {
+ DBUG_ASSERT(scache == NULL);
+ scache= new Subquery_cache_tmptable(thd, depends_on, value);
+ DBUG_PRINT("info", ("cache: 0x%lx", (ulong) scache));
+ }
}
else
{
@@ -765,6 +785,7 @@
*/
if (engine->no_tables())
maybe_null= engine->may_be_null();
+ DBUG_VOID_RETURN;
}
uint Item_singlerow_subselect::cols()
@@ -797,77 +818,206 @@
exec();
}
+/**
+  Checks the subquery cache for a value
+
+  @retval NULL       nothing found
+  @retval non-NULL   reference to the item representing the value found in the cache
+*/
+
+Item *Item_subselect::check_cache()
+{
+ DBUG_ENTER("Item_subselect::check_cache");
+ if (scache)
+ {
+ Subquery_cache_tmptable::result res;
+ Item *cached_value;
+ res= scache->check_value(&cached_value);
+ if (res == Subquery_cache_tmptable::HIT)
+ DBUG_RETURN(cached_value);
+ }
+ DBUG_RETURN(NULL);
+}
+
double Item_singlerow_subselect::val_real()
{
+ Item *cached_value;
+ bool err;
+ DBUG_ENTER("Item_singlerow_subselect::val_real");
DBUG_ASSERT(fixed == 1);
- if (!exec() && !value->null_value)
+
+ if ((cached_value = check_cache()))
+ {
+ double res= cached_value->val_real();
+ if ((null_value= cached_value->null_value))
+ {
+ reset();
+ DBUG_RETURN(0);
+ }
+ else
+ DBUG_RETURN(res);
+ }
+
+ if (!(err= exec()) && !value->null_value)
{
null_value= 0;
- return value->val_real();
+ if (scache)
+ scache->put_value(value);
+ DBUG_RETURN(value->val_real());
}
else
{
reset();
- return 0;
+ DBUG_PRINT("info", ("error: %u", (uint)err));
+ if (scache && !err)
+ scache->put_value(&const_null_value);
+ DBUG_RETURN(0);
}
}
longlong Item_singlerow_subselect::val_int()
{
+ Item *cached_value;
+ bool err;
+ DBUG_ENTER("Item_singlerow_subselect::val_int");
DBUG_ASSERT(fixed == 1);
- if (!exec() && !value->null_value)
+
+ if ((cached_value = check_cache()))
+ {
+ longlong res= cached_value->val_int();
+ if ((null_value= cached_value->null_value))
+ {
+ reset();
+ DBUG_RETURN(0);
+ }
+ else
+ DBUG_RETURN(res);
+ }
+
+ if (!(err= exec()) && !value->null_value)
{
null_value= 0;
- return value->val_int();
+ if (scache)
+ scache->put_value(value);
+ DBUG_RETURN(value->val_int());
}
else
{
reset();
- return 0;
+ DBUG_PRINT("info", ("error: %u", (uint)err));
+ if (scache && !err)
+ scache->put_value(&const_null_value);
+ DBUG_RETURN(0);
}
}
String *Item_singlerow_subselect::val_str(String *str)
{
- if (!exec() && !value->null_value)
+ Item *cached_value;
+ bool err;
+ DBUG_ENTER("Item_singlerow_subselect::val_str");
+ DBUG_ASSERT(fixed == 1);
+
+ if ((cached_value = check_cache()))
+ {
+ String *res= cached_value->val_str(str);
+ if ((null_value= cached_value->null_value))
+ {
+ reset();
+ DBUG_RETURN(0);
+ }
+ else
+ DBUG_RETURN(res);
+ }
+
+ if (!(err= exec()) && !value->null_value)
{
null_value= 0;
- return value->val_str(str);
+ if (scache)
+ scache->put_value(value);
+ DBUG_RETURN(value->val_str(str));
}
else
{
reset();
- return 0;
+ DBUG_PRINT("info", ("error: %u", (uint)err));
+ if (scache && !err)
+ scache->put_value(&const_null_value);
+ DBUG_RETURN(0);
}
}
my_decimal *Item_singlerow_subselect::val_decimal(my_decimal *decimal_value)
{
- if (!exec() && !value->null_value)
+ Item *cached_value;
+ bool err;
+ DBUG_ENTER("Item_singlerow_subselect::val_decimal");
+ DBUG_ASSERT(fixed == 1);
+
+ if ((cached_value = check_cache()))
+ {
+ my_decimal *res= cached_value->val_decimal(decimal_value);
+ if ((null_value= cached_value->null_value))
+ {
+ reset();
+ DBUG_RETURN(0);
+ }
+ else
+ DBUG_RETURN(res);
+ }
+
+ if (!(err= exec()) && !value->null_value)
{
null_value= 0;
- return value->val_decimal(decimal_value);
+ if (scache)
+ scache->put_value(value);
+ DBUG_RETURN(value->val_decimal(decimal_value));
}
else
{
reset();
- return 0;
+ DBUG_PRINT("info", ("error: %u", (uint)err));
+ if (scache && !err)
+ scache->put_value(&const_null_value);
+ DBUG_RETURN(0);
}
}
bool Item_singlerow_subselect::val_bool()
{
- if (!exec() && !value->null_value)
+ Item *cached_value;
+ bool err;
+ DBUG_ENTER("Item_singlerow_subselect::val_bool");
+ DBUG_ASSERT(fixed == 1);
+
+ if ((cached_value = check_cache()))
+ {
+ bool res= cached_value->val_bool();
+ if ((null_value= cached_value->null_value))
+ {
+ reset();
+ DBUG_RETURN(0);
+ }
+ else
+ DBUG_RETURN(res);
+ }
+
+ if (!(err= exec()) && !value->null_value)
{
null_value= 0;
- return value->val_bool();
+ if (scache)
+ scache->put_value(value);
+ DBUG_RETURN(value->val_bool());
}
else
{
reset();
- return 0;
+ DBUG_PRINT("info", ("error: %u", (uint)err));
+ if (scache && !err)
+ scache->put_value(&const_null_value);
+ DBUG_RETURN(0);
}
}
@@ -952,33 +1102,79 @@
void Item_exists_subselect::fix_length_and_dec()
{
+ DBUG_ENTER("Item_exists_subselect::fix_length_and_dec");
decimals= 0;
max_length= 1;
max_columns= engine->cols();
/* We need only 1 row to determine existence */
unit->global_parameters->select_limit= new Item_int((int32) 1);
+ if (substype() == EXISTS_SUBS && depends_on.elements &&
+ optimizer_flag(thd, OPTIMIZER_SWITCH_SUBQUERY_CACHE) &&
+ !(engine->uncacheable() & (UNCACHEABLE_RAND |
+ UNCACHEABLE_SIDEEFFECT)))
+ {
+ DBUG_ASSERT(scache == NULL);
+ scache= new Subquery_cache_tmptable(thd, depends_on, &result);
+ DBUG_PRINT("info", ("cache: 0x%lx", (ulong) scache));
+ }
+ DBUG_VOID_RETURN;
}
double Item_exists_subselect::val_real()
{
+ Item *cached_value;
+ DBUG_ENTER("Item_exists_subselect::val_int");
DBUG_ASSERT(fixed == 1);
+
+ if ((cached_value = check_cache()))
+ {
+ double res= cached_value->val_real();
+ DBUG_ASSERT(!cached_value->null_value);
+ DBUG_RETURN(res);
+ }
+
if (exec())
{
reset();
- return 0;
- }
- return (double) value;
+ DBUG_RETURN(0);
+ }
+
+ if (scache)
+ {
+ result.set(value, FALSE);
+ scache->put_value(&result);
+ }
+
+ DBUG_RETURN((double) value);
}
longlong Item_exists_subselect::val_int()
{
+ Item *cached_value;
+ DBUG_ENTER("Item_exists_subselect::val_real");
+ DBUG_ASSERT(fixed == 1);
+
+ if ((cached_value = check_cache()))
+ {
+ longlong res= cached_value->val_int();
+ DBUG_ASSERT(!cached_value->null_value);
+ DBUG_RETURN(res);
+ }
+
-  DBUG_ASSERT(fixed == 1);
if (exec())
{
reset();
- return 0;
- }
- return value;
+ DBUG_RETURN(0);
+ }
+
+ if (scache)
+ {
+ result.set(value, FALSE);
+ scache->put_value(&result);
+ }
+
+ DBUG_RETURN(value);
}
@@ -997,11 +1193,32 @@
String *Item_exists_subselect::val_str(String *str)
{
+ Item *cached_value;
+ DBUG_ENTER("Item_exists_subselect::val_str");
DBUG_ASSERT(fixed == 1);
+
+ if ((cached_value = check_cache()))
+ {
+ String *res= cached_value->val_str(str);
+ DBUG_ASSERT(!cached_value->null_value);
+ DBUG_RETURN(res);
+ }
+
if (exec())
+ {
reset();
+ str->set((ulonglong)0,&my_charset_bin);
+ DBUG_RETURN(str);
+ }
+
+ if (scache)
+ {
+ result.set(value, FALSE);
+ scache->put_value(&result);
+ }
+
str->set((ulonglong)value,&my_charset_bin);
- return str;
+ DBUG_RETURN(str);
}
@@ -1020,23 +1237,61 @@
my_decimal *Item_exists_subselect::val_decimal(my_decimal *decimal_value)
{
+ Item *cached_value;
+ DBUG_ENTER("Item_exists_subselect::val_decvimal");
DBUG_ASSERT(fixed == 1);
+
+ if ((cached_value = check_cache()))
+ {
+ my_decimal *res= cached_value->val_decimal(decimal_value);
+ DBUG_ASSERT(!cached_value->null_value);
+ DBUG_RETURN(res);
+ }
+
if (exec())
+ {
reset();
+ int2my_decimal(E_DEC_FATAL_ERROR, 0, 0, decimal_value);
+ DBUG_RETURN(decimal_value);
+ }
+
+ if (scache)
+ {
+ result.set(value, FALSE);
+ scache->put_value(&result);
+ }
+
int2my_decimal(E_DEC_FATAL_ERROR, value, 0, decimal_value);
- return decimal_value;
+ DBUG_RETURN(decimal_value);
}
bool Item_exists_subselect::val_bool()
{
+ Item *cached_value;
+ DBUG_ENTER("Item_exists_subselect::val_real");
DBUG_ASSERT(fixed == 1);
+
+ if ((cached_value = check_cache()))
+ {
+ my_bool res= cached_value->val_bool();
+ DBUG_ASSERT(!cached_value->null_value);
+ DBUG_RETURN(res);
+ }
+
if (exec())
{
reset();
- return 0;
- }
- return value != 0;
+ DBUG_RETURN(0);
+ }
+
+ if (scache)
+ {
+ result.set(value, FALSE);
+ scache->put_value(&result);
+ }
+
+ DBUG_RETURN(value != 0);
}
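
All five Item_singlerow_subselect::val_*() methods above follow one template; spelling it out once makes the repetition easier to review (value_from() below stands for the type-specific val_real()/val_int()/val_str()/val_decimal()/val_bool() call and is not a real function):

  if ((cached_value= check_cache()))        /* 1. try the cache          */
  {
    if ((null_value= cached_value->null_value))
    { reset(); return 0; }
    return value_from(cached_value);
  }
  if (!(err= exec()) && !value->null_value) /* 2. evaluate the subquery  */
  {
    null_value= 0;
    if (scache)
      scache->put_value(value);             /* 3a. cache the result      */
    return value_from(value);
  }
  reset();
  if (scache && !err)                       /* 3b. cache NULL, but never */
    scache->put_value(&const_null_value);   /*     cache after an error  */
  return 0;

The Item_exists_subselect variants differ only in that EXISTS can never be NULL, so they store the boolean outcome via the Item_bool_cache result member instead of const_null_value.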
=== modified file 'sql/item_subselect.h'
--- a/sql/item_subselect.h 2010-03-29 14:04:35 +0000
+++ b/sql/item_subselect.h 2010-05-31 21:25:54 +0000
@@ -27,6 +27,7 @@
class subselect_hash_sj_engine;
class Item_bool_func2;
class Cached_item;
+class Subquery_cache;
/* base class for subselects */
@@ -57,6 +58,10 @@
subselect_engine *engine;
/* old engine if engine was changed */
subselect_engine *old_engine;
+ /* subquery cache */
+ Subquery_cache *scache;
+  /* null constant for caching */
+ Item_null const_null_value;
/* cache of used external tables */
table_map used_tables_cache;
/* allowed number of columns (1 for single value subqueries) */
@@ -67,7 +72,7 @@
bool have_to_be_excluded;
/* cache of constant state */
bool const_item_cache;
-
+
bool inside_first_fix_fields;
bool done_first_fix_fields;
public:
@@ -88,13 +93,21 @@
*/
List<Ref_to_outside> upper_refs;
st_select_lex *parent_select;
-
- /*
+
+ /**
+    List of references to items this subquery depends on (externally resolved)
+
+    @note We can't store direct pointers to the Items because an item may
+    be substituted with another one (for example, for grouping).
+ */
+ List<Item*> depends_on;
+
+ /*
TRUE<=>Table Elimination has made it redundant to evaluate this select
(and so it is not part of QEP, etc)
- */
+ */
bool eliminated;
-
+
/* changed engine indicator */
bool engine_changed;
/* subquery is transformed */
@@ -178,6 +191,8 @@
return trace_unsupported_by_check_vcol_func_processor("subselect");
}
+ Item *check_cache();
+
/**
Get the SELECT_LEX structure associated with this Item.
@return the SELECT_LEX structure associated with this Item
@@ -202,6 +217,7 @@
{
protected:
Item_cache *value, **row;
+
public:
Item_singlerow_subselect(st_select_lex *select_lex);
Item_singlerow_subselect() :Item_subselect(), value(0), row (0) {}
@@ -268,6 +284,8 @@
{
protected:
bool value; /* value of this item (boolean: exists/not-exists) */
+ /* result representation for the subquery cache */
+ Item_bool_cache result;
public:
Item_exists_subselect(st_select_lex *select_lex);
=== modified file 'sql/item_sum.cc'
--- a/sql/item_sum.cc 2010-03-20 12:01:47 +0000
+++ b/sql/item_sum.cc 2010-05-31 21:25:54 +0000
@@ -319,6 +319,7 @@
if (aggr_level >= 0)
{
ref_by= ref;
+ thd->lex->current_select->register_dependency_item(aggr_sel, ref);
/* Add the object to the list of registered objects assigned to aggr_sel */
if (!aggr_sel->inner_sum_func_list)
next= this;
=== modified file 'sql/mysql_priv.h'
--- a/sql/mysql_priv.h 2010-03-20 12:01:47 +0000
+++ b/sql/mysql_priv.h 2010-05-31 21:25:54 +0000
@@ -568,12 +568,13 @@
#define OPTIMIZER_SWITCH_SEMIJOIN 256
#define OPTIMIZER_SWITCH_PARTIAL_MATCH_ROWID_MERGE 512
#define OPTIMIZER_SWITCH_PARTIAL_MATCH_TABLE_SCAN 1024
+#define OPTIMIZER_SWITCH_SUBQUERY_CACHE (1<<11)
#ifdef DBUG_OFF
-# define OPTIMIZER_SWITCH_LAST 2048
+# define OPTIMIZER_SWITCH_LAST (1<<12)
#else
-# define OPTIMIZER_SWITCH_TABLE_ELIMINATION 2048
-# define OPTIMIZER_SWITCH_LAST 4096
+# define OPTIMIZER_SWITCH_TABLE_ELIMINATION (1<<12)
+# define OPTIMIZER_SWITCH_LAST (1<<13)
#endif
#ifdef DBUG_OFF
@@ -588,7 +589,8 @@
OPTIMIZER_SWITCH_MATERIALIZATION | \
OPTIMIZER_SWITCH_SEMIJOIN | \
OPTIMIZER_SWITCH_PARTIAL_MATCH_ROWID_MERGE|\
- OPTIMIZER_SWITCH_PARTIAL_MATCH_TABLE_SCAN)
+ OPTIMIZER_SWITCH_PARTIAL_MATCH_TABLE_SCAN|\
+ OPTIMIZER_SWITCH_SUBQUERY_CACHE)
#else
# define OPTIMIZER_SWITCH_DEFAULT (OPTIMIZER_SWITCH_INDEX_MERGE | \
OPTIMIZER_SWITCH_INDEX_MERGE_UNION | \
@@ -601,7 +603,8 @@
OPTIMIZER_SWITCH_MATERIALIZATION | \
OPTIMIZER_SWITCH_SEMIJOIN | \
OPTIMIZER_SWITCH_PARTIAL_MATCH_ROWID_MERGE|\
- OPTIMIZER_SWITCH_PARTIAL_MATCH_TABLE_SCAN)
+ OPTIMIZER_SWITCH_PARTIAL_MATCH_TABLE_SCAN|\
+ OPTIMIZER_SWITCH_SUBQUERY_CACHE)
#endif
/*
@@ -936,6 +939,7 @@
#ifdef MYSQL_SERVER
#include "sql_servers.h"
#include "opt_range.h"
+#include "sql_subquery_cache.h"
#ifdef HAVE_QUERY_CACHE
struct Query_cache_query_flags
@@ -1269,6 +1273,10 @@
Item *having, ORDER *proc_param, ulonglong select_type,
select_result *result, SELECT_LEX_UNIT *unit,
SELECT_LEX *select_lex);
+
+struct st_join_table *create_index_lookup_join_tab(TABLE *table);
+int join_read_key2(THD *thd, struct st_join_table *tab, TABLE *table,
+ struct st_table_ref *table_ref);
void free_underlaid_joins(THD *thd, SELECT_LEX *select);
bool mysql_explain_union(THD *thd, SELECT_LEX_UNIT *unit,
select_result *result);
@@ -1288,6 +1296,7 @@
bool table_cant_handle_bit_fields,
bool make_copy_field,
uint convert_blob_length);
+bool open_tmp_table(TABLE *table);
void sp_prepare_create_field(THD *thd, Create_field *sql_field);
int prepare_create_field(Create_field *sql_field,
uint *blob_columns,
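
The renumbering above just makes room for the new bit; optimizer_switch stays a plain bitmask. For reference (the values follow from the defines above):

  /* OPTIMIZER_SWITCH_SUBQUERY_CACHE    == 1<<11 == 2048
     OPTIMIZER_SWITCH_TABLE_ELIMINATION == 1<<12 == 4096 (debug builds) */
  if (optimizer_flag(thd, OPTIMIZER_SWITCH_SUBQUERY_CACHE))
  {
    /* subquery_cache=on for this session */
  }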
=== modified file 'sql/mysqld.cc'
--- a/sql/mysqld.cc 2010-03-20 12:01:47 +0000
+++ b/sql/mysqld.cc 2010-05-31 21:25:54 +0000
@@ -305,6 +305,7 @@
"firstmatch","loosescan","materialization", "semijoin",
"partial_match_rowid_merge",
"partial_match_table_scan",
+ "subquery_cache",
#ifndef DBUG_OFF
"table_elimination",
#endif
@@ -325,6 +326,7 @@
sizeof("semijoin") - 1,
sizeof("partial_match_rowid_merge") - 1,
sizeof("partial_match_table_scan") - 1,
+ sizeof("subquery_cache") - 1,
#ifndef DBUG_OFF
sizeof("table_elimination") - 1,
#endif
@@ -404,8 +406,9 @@
static const char *optimizer_switch_str="index_merge=on,index_merge_union=on,"
"index_merge_sort_union=on,"
"index_merge_intersection=on,"
- "index_condition_pushdown=on"
-#ifndef DBUG_OFF
+ "index_condition_pushdown=on,"
+ "subquery_cache=on"
+#ifndef DBUG_OFF
",table_elimination=on";
#else
;
@@ -5872,7 +5875,9 @@
OPT_RECORD_RND_BUFFER, OPT_DIV_PRECINCREMENT, OPT_RELAY_LOG_SPACE_LIMIT,
OPT_RELAY_LOG_PURGE,
OPT_SLAVE_NET_TIMEOUT, OPT_SLAVE_COMPRESSED_PROTOCOL, OPT_SLOW_LAUNCH_TIME,
- OPT_SLAVE_TRANS_RETRIES, OPT_READONLY, OPT_ROWID_MERGE_BUFF_SIZE,
+ OPT_SLAVE_TRANS_RETRIES,
+ OPT_SUBQUERY_CACHE,
+ OPT_READONLY, OPT_ROWID_MERGE_BUFF_SIZE,
OPT_DEBUGGING, OPT_DEBUG_FLUSH,
OPT_SORT_BUFFER, OPT_TABLE_OPEN_CACHE, OPT_TABLE_DEF_CACHE,
OPT_THREAD_CONCURRENCY, OPT_THREAD_CACHE_SIZE,
@@ -7164,7 +7169,7 @@
{"optimizer_switch", OPT_OPTIMIZER_SWITCH,
"optimizer_switch=option=val[,option=val...], where option={index_merge, "
"index_merge_union, index_merge_sort_union, index_merge_intersection, "
- "index_condition_pushdown"
+ "index_condition_pushdown, subquery_cache"
#ifndef DBUG_OFF
", table_elimination"
#endif
@@ -7868,6 +7873,8 @@
{"Ssl_version", (char*) &show_ssl_get_version, SHOW_FUNC},
#endif /* HAVE_OPENSSL */
{"Syncs", (char*) &my_sync_count, SHOW_LONG_NOFLUSH},
+ {"Subquery_cache_hit", (char*) &subquery_cache_hit, SHOW_LONG},
+ {"Subquery_cache_miss", (char*) &subquery_cache_miss, SHOW_LONG},
{"Table_locks_immediate", (char*) &locks_immediate, SHOW_LONG},
{"Table_locks_waited", (char*) &locks_waited, SHOW_LONG},
#ifdef HAVE_MMAP
@@ -8006,6 +8013,7 @@
abort_loop= select_thread_in_use= signal_thread_in_use= 0;
ready_to_exit= shutdown_in_progress= grant_option= 0;
aborted_threads= aborted_connects= 0;
+ subquery_cache_miss= subquery_cache_hit= 0;
delayed_insert_threads= delayed_insert_writes= delayed_rows_in_use= 0;
delayed_insert_errors= thread_created= 0;
specialflag= 0;
=== modified file 'sql/sql_base.cc'
--- a/sql/sql_base.cc 2010-03-20 12:01:47 +0000
+++ b/sql/sql_base.cc 2010-05-31 21:25:54 +0000
@@ -8062,6 +8062,10 @@
if (*conds)
{
thd->where="where clause";
+ DBUG_EXECUTE("where",
+ print_where(*conds,
+ "WHERE in setup_conds",
+ QT_ORDINARY););
if ((!(*conds)->fixed && (*conds)->fix_fields(thd, conds)) ||
(*conds)->check_cols(1))
goto err_no_arena;
=== modified file 'sql/sql_class.cc'
--- a/sql/sql_class.cc 2010-03-20 12:01:47 +0000
+++ b/sql/sql_class.cc 2010-05-31 21:25:54 +0000
@@ -3020,6 +3020,7 @@
table_charset= 0;
precomputed_group_by= 0;
bit_fields_as_long= 0;
+ skip_create_table= 0;
DBUG_VOID_RETURN;
}
=== modified file 'sql/sql_class.h'
--- a/sql/sql_class.h 2010-03-20 12:01:47 +0000
+++ b/sql/sql_class.h 2010-05-31 21:25:54 +0000
@@ -2786,12 +2786,17 @@
that MEMORY tables cannot index BIT columns.
*/
bool bit_fields_as_long;
+ /*
+ Whether to create or postpone actual creation of this temporary table.
+ TRUE <=> create_tmp_table will create only the TABLE structure.
+ */
+ bool skip_create_table;
TMP_TABLE_PARAM()
:copy_field(0), group_parts(0),
group_length(0), group_null_parts(0), convert_blob_length(0),
schema_table(0), precomputed_group_by(0), force_copy_fields(0),
- bit_fields_as_long(0)
+ bit_fields_as_long(0), skip_create_table(0)
{}
~TMP_TABLE_PARAM()
{
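
skip_create_table splits temporary-table setup into two phases: create_tmp_table() builds only the in-memory TABLE structure, and the handler-level table is created later by open_tmp_table() once the keys have been described. A condensed sketch of the pattern, as used by Subquery_cache_tmptable::init() later in this patch (the table name is a placeholder):

  TMP_TABLE_PARAM param;
  param.init();
  param.skip_create_table= 1;             /* phase 1: TABLE object only */
  TABLE *t= create_tmp_table(thd, &param, items, (ORDER*) NULL,
                             FALSE, FALSE,
                             (thd->options | TMP_TABLE_ALL_COLUMNS),
                             HA_POS_ERROR, (char *) "sketch-table");
  if (t)
  {
    /* ... describe keys on t here (see the table.cc additions) ... */
    if (open_tmp_table(t))                /* phase 2: materialize       */
      ; /* error */
  }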
=== modified file 'sql/sql_lex.cc'
--- a/sql/sql_lex.cc 2010-03-20 12:01:47 +0000
+++ b/sql/sql_lex.cc 2010-05-31 21:25:54 +0000
@@ -1829,6 +1829,52 @@
}
+/**
+  Registers a reference to an item on which the subqueries depend
+
+  @param last            pointer to the last st_select_lex struct, up to
+                         which all st_select_lex have to be marked as
+                         dependent
+  @param dependency      reference to the item on which all these
+                         subqueries depend
+*/
+
+void st_select_lex::register_dependency_item(st_select_lex *last,
+ Item **dependency)
+{
+ SELECT_LEX *s= this;
+ DBUG_ENTER("st_select_lex::register_dependency_item");
+ DBUG_ASSERT(this != last);
+ DBUG_ASSERT(*dependency);
+ do
+ {
+ /* check duplicates */
+ List_iterator_fast<Item*> li(s->master_unit()->item->depends_on);
+ Item **dep;
+ while ((dep= li++))
+ {
+ if ((*dep)->eq(*dependency, FALSE))
+ {
+ DBUG_PRINT("info", ("dependency %s already present",
+ ((*dependency)->name ?
+ (*dependency)->name :
+ "<no name>")));
+ DBUG_VOID_RETURN;
+ }
+ }
+
+ s->master_unit()->item->depends_on.push_back(dependency);
+ DBUG_PRINT("info", ("depends_on: Select: %d added: %s",
+ s->select_number,
+ ((*dependency)->name ?
+ (*dependency)->name :
+ "<no name>")));
+ } while ((s= s->outer_select()) != last && s != 0);
+ DBUG_VOID_RETURN;
+}
+
+
/*
st_select_lex_node::mark_as_dependent mark all st_select_lex struct from
this to 'last' as dependent
@@ -1843,7 +1889,7 @@
bool st_select_lex::mark_as_dependent(THD *thd, st_select_lex *last, Item *dependency)
{
-
+ DBUG_ENTER("st_select_lex::mark_as_dependent");
DBUG_ASSERT(this != last);
/*
@@ -1872,11 +1918,11 @@
Item_subselect *subquery_expr= s->master_unit()->item;
if (subquery_expr && subquery_expr->mark_as_dependent(thd, last,
dependency))
- return TRUE;
+ DBUG_RETURN(TRUE);
} while ((s= s->outer_select()) != last && s != 0);
is_correlated= TRUE;
this->master_unit()->item->is_correlated= TRUE;
- return FALSE;
+ DBUG_RETURN(FALSE);
}
bool st_select_lex_node::set_braces(bool value) { return 1; }
=== modified file 'sql/sql_lex.h'
--- a/sql/sql_lex.h 2010-03-20 12:01:47 +0000
+++ b/sql/sql_lex.h 2010-05-31 21:25:54 +0000
@@ -748,6 +748,7 @@
}
bool mark_as_dependent(THD *thd, st_select_lex *last, Item *dependency);
+ void register_dependency_item(st_select_lex *last, Item **dependency);
bool set_braces(bool value);
bool inc_in_sum_expr();
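
register_dependency_item() stores Item** rather than Item* on purpose: if the optimizer later substitutes an item (for example for grouping), the stored slot is updated in place and the cache key automatically follows the live item. A sketch of why the double indirection matters:

  Item **slot= reference;                  /* what depends_on keeps    */
  thd->change_item_tree(slot, new_item);   /* substitution elsewhere   */
  Item *current= *slot;                    /* readers see the new item */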
=== modified file 'sql/sql_select.cc'
--- a/sql/sql_select.cc 2010-05-10 13:46:08 +0000
+++ b/sql/sql_select.cc 2010-05-31 21:25:54 +0000
@@ -151,7 +151,6 @@
static int join_read_system(JOIN_TAB *tab);
static int join_read_const(JOIN_TAB *tab);
static int join_read_key(JOIN_TAB *tab);
-static int join_read_key2(JOIN_TAB *tab, TABLE *table, TABLE_REF *table_ref);
static void join_read_key_unlock_row(st_join_table *tab);
static int join_read_always_key(JOIN_TAB *tab);
static int join_read_last_key(JOIN_TAB *tab);
@@ -5209,7 +5208,7 @@
'join->best_positions' contains a complete optimal extension of the
current partial QEP.
*/
- DBUG_EXECUTE("opt", print_plan(join, join->tables,
+ DBUG_EXECUTE("opt", print_plan(join, n_tables,
record_count, read_time, read_time,
"optimal"););
DBUG_RETURN(FALSE);
@@ -7625,6 +7624,40 @@
/**
+ Creates and fills JOIN_TAB for index look up in temporary table
+
+ @param table The table where to look up
+
+ @return JOIN_TAB object or NULL in case of error
+*/
+
+JOIN_TAB *create_index_lookup_join_tab(TABLE *table)
+{
+ JOIN_TAB *tab;
+ DBUG_ENTER("create_index_lookup_join_tab");
+
+ if (!((tab= new JOIN_TAB)))
+ DBUG_RETURN(NULL);
+ tab->read_record.table= table;
+ tab->read_record.file=table->file;
+ /*tab->read_record.unlock_row= rr_unlock_row;*/
+ tab->next_select=0;
+ tab->sorted= 1;
+
+ table->status= STATUS_NO_RECORD;
+ tab->read_first_record= join_read_key;
+ /*tab->read_record.unlock_row= join_read_key_unlock_row;*/
+ tab->read_record.read_record= join_no_more_records;
+ if (table->covering_keys.is_set(tab->ref.key) &&
+ !table->no_keyread)
+ {
+ table->key_read=1;
+ table->file->extra(HA_EXTRA_KEYREAD);
+ }
+ DBUG_RETURN(tab);
+}
+
+/**
Give error if we some tables are done with a full join.
This is used by multi_table_update and multi_table_delete when running
@@ -10778,6 +10811,7 @@
case Item::REF_ITEM:
case Item::NULL_ITEM:
case Item::VARBIN_ITEM:
+ case Item::CACHE_ITEM:
if (make_copy_field)
{
DBUG_ASSERT(((Item_result_field*)item)->result_field);
@@ -11552,7 +11586,8 @@
¶m->recinfo, select_options))
goto err;
}
- if (open_tmp_table(table))
+ DBUG_PRINT("info", ("skip_create_table: %d", (int)param->skip_create_table));
+ if (!param->skip_create_table && open_tmp_table(table))
goto err;
thd->mem_root= mem_root_save;
@@ -11700,16 +11735,17 @@
bool open_tmp_table(TABLE *table)
{
int error;
+ DBUG_ENTER("open_tmp_table");
if ((error= table->file->ha_open(table, table->s->table_name.str, O_RDWR,
HA_OPEN_TMP_TABLE |
HA_OPEN_INTERNAL_TABLE)))
{
table->file->print_error(error,MYF(0)); /* purecov: inspected */
table->db_stat=0;
- return(1);
+ DBUG_RETURN(1);
}
(void) table->file->extra(HA_EXTRA_QUICK); /* Faster */
- return(0);
+ DBUG_RETURN(0);
}
@@ -12540,7 +12576,8 @@
else
{
/* Do index lookup in the materialized table */
- if ((res= join_read_key2(join_tab, sjm->table, sjm->tab_ref)) == 1)
+ if ((res= join_read_key2(join_tab->join->thd, join_tab,
+ sjm->table, sjm->tab_ref)) == 1)
DBUG_RETURN(NESTED_LOOP_ERROR); /* purecov: inspected */
if (res || !sjm->in_equality->val_int())
DBUG_RETURN(NESTED_LOOP_NO_MORE_ROWS);
@@ -13323,61 +13360,62 @@
static int
join_read_key(JOIN_TAB *tab)
{
- return join_read_key2(tab, tab->table, &tab->ref);
+ return join_read_key2(tab->join->thd, tab, tab->table, &tab->ref);
}
-/*
+/*
eq_ref access handler but generalized a bit to support TABLE and TABLE_REF
not from the join_tab. See join_read_key for detailed synopsis.
*/
-static int
-join_read_key2(JOIN_TAB *tab, TABLE *table, TABLE_REF *table_ref)
+int join_read_key2(THD *thd, JOIN_TAB *tab, TABLE *table, TABLE_REF *table_ref)
{
int error;
+ DBUG_ENTER("join_read_key2");
if (!table->file->inited)
{
table->file->ha_index_init(table_ref->key, tab->sorted);
}
/* TODO: Why don't we do "Late NULLs Filtering" here? */
- if (cmp_buffer_with_ref(tab->join->thd, table, table_ref) ||
+ if (cmp_buffer_with_ref(thd, table, table_ref) ||
(table->status & (STATUS_GARBAGE | STATUS_NO_PARENT | STATUS_NULL_ROW)))
{
if (table_ref->key_err)
{
table->status=STATUS_NOT_FOUND;
- return -1;
+ DBUG_RETURN(-1);
}
/*
Moving away from the current record. Unlock the row
in the handler if it did not match the partial WHERE.
*/
- if (tab->ref.has_record && tab->ref.use_count == 0)
+    if (table_ref->has_record && table_ref->use_count == 0)
{
tab->read_record.file->unlock_row();
- tab->ref.has_record= FALSE;
+ table_ref->has_record= FALSE;
}
error=table->file->ha_index_read_map(table->record[0],
table_ref->key_buff,
make_prev_keypart_map(table_ref->key_parts),
HA_READ_KEY_EXACT);
if (error && error != HA_ERR_KEY_NOT_FOUND && error != HA_ERR_END_OF_FILE)
- return report_error(table, error);
+ DBUG_RETURN(report_error(table, error));
if (! error)
{
- tab->ref.has_record= TRUE;
- tab->ref.use_count= 1;
+ table_ref->has_record= TRUE;
+ table_ref->use_count= 1;
}
}
else if (table->status == 0)
{
- DBUG_ASSERT(tab->ref.has_record);
- tab->ref.use_count++;
+ DBUG_ASSERT(table_ref->has_record);
+ table_ref->use_count++;
}
table->null_row=0;
- return table->status ? -1 : 0;
+ DBUG_RETURN(table->status ? -1 : 0);
}
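
Threading thd through join_read_key2() is what lets the subquery cache call the eq_ref machinery without a surrounding JOIN. A sketch of the cache-side call, as used by Subquery_cache_tmptable::check_value() below:

  int res= join_read_key2(table_thd, tab, cache_table, tab_ref);
  if (res == 1)
    ; /* storage engine error */
  else if (res)
    ; /* key not found => cache MISS */
  else
    ; /* row found => candidate HIT (equalities re-checked afterwards) */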
=== added file 'sql/sql_subquery_cache.cc'
--- a/sql/sql_subquery_cache.cc 1970-01-01 00:00:00 +0000
+++ b/sql/sql_subquery_cache.cc 2010-05-31 21:25:54 +0000
@@ -0,0 +1,360 @@
+
+#include "mysql_priv.h"
+#include "sql_select.h"
+
+ulonglong subquery_cache_miss, subquery_cache_hit;
+
+/**
+  Creates the structures we need for index lookups
+
+ @retval FALSE OK
+ @retval TRUE Error
+*/
+
+static my_bool create_tmp_table_search_structures(THD *thd,
+                                                  TABLE *table,
+                                                  List_iterator_fast<Item> &li,
+                                                  TABLE_REF **ref)
+{
+ /*
+    Create/initialize everything we will need for index lookups into the
+ temptable.
+ */
+ TABLE_REF *tab_ref;
+ KEY *tmp_key; /* The only index on the temporary table. */
+ Item *item;
+ uint tmp_key_parts; /* Number of keyparts in tmp_key. */
+ uint i;
+
+ DBUG_ENTER("createtmp_table_search_structures");
+
+ tmp_key= table->key_info;
+ tmp_key_parts= tmp_key->key_parts;
+
+ if (!(tab_ref= (TABLE_REF*) thd->alloc(sizeof(TABLE_REF))))
+ DBUG_RETURN(TRUE);
+
+ tab_ref->key= 0; /* The only temp table index. */
+ tab_ref->key_length= tmp_key->key_length;
+ if (!(tab_ref->key_buff=
+ (uchar*) thd->calloc(ALIGN_SIZE(tmp_key->key_length) * 2)) ||
+ !(tab_ref->key_copy=
+ (store_key**) thd->alloc((sizeof(store_key*) *
+ (tmp_key_parts + 1)))) ||
+ !(tab_ref->items=
+ (Item**) thd->alloc(sizeof(Item*) * tmp_key_parts)))
+ DBUG_RETURN(TRUE); /* purecov: inspected */
+
+ tab_ref->key_buff2=tab_ref->key_buff+ALIGN_SIZE(tmp_key->key_length);
+ tab_ref->key_err=1;
+ tab_ref->null_rejecting= 1;
+ tab_ref->disable_cache= FALSE;
+ tab_ref->has_record= 0;
+
+ KEY_PART_INFO *cur_key_part= tmp_key->key_part;
+ store_key **ref_key= tab_ref->key_copy;
+ uchar *cur_ref_buff= tab_ref->key_buff;
+
+ for (i= 0; i < tmp_key_parts; i++, cur_key_part++, ref_key++)
+ {
+ item= li++;
+ DBUG_ASSERT(item);
+ tab_ref->items[i]= item;
+ int null_count= test(cur_key_part->field->real_maybe_null());
+ *ref_key= new store_key_item(thd, cur_key_part->field,
+ /* TODO:
+ the NULL byte is taken into account in
+ cur_key_part->store_length, so instead of
+ cur_ref_buff + test(maybe_null), we could
+ use that information instead.
+ */
+ cur_ref_buff + null_count,
+ null_count ? tab_ref->key_buff : 0,
+ cur_key_part->length, tab_ref->items[i]);
+ cur_ref_buff+= cur_key_part->store_length;
+ }
+ *ref_key= NULL; /* End marker. */
+ tab_ref->key_err= 1;
+ tab_ref->key_parts= tmp_key_parts;
+ *ref= tab_ref;
+
+ DBUG_RETURN(FALSE);
+}
+
+
+Subquery_cache_tmptable::Subquery_cache_tmptable(THD *thd,
+                                                 List<Item*> &dependence,
+                                                 Item *value)
+  :cache_table(NULL), table_thd(thd), list(&dependence), val(value),
+   equalities(NULL), inited(0)
+{
+  DBUG_ENTER("Subquery_cache_tmptable::Subquery_cache_tmptable");
+  DBUG_VOID_RETURN;
+}
+
+
+/**
+ Creates equalities expression.
+
+ @note For some type of fields index lookup do not return failure but set
+ pointer on the next record. To check exact match we use expression like:
+ field1=value1 and field2=value2 ...
+
+ @retval FALSE OK
+ @retval TRUE Error
+*/
+
+bool Subquery_cache_tmptable::make_equalities()
+{
+ List<Item> args;
+ List_iterator_fast<Item*> li(*list);
+ Item **ref;
+ Name_resolution_context *cn= NULL;
+ DBUG_ENTER("Subquery_cache_tmptable::make_equalities");
+
+  for (uint i= 1 /* skip result field */; (ref= li++); i++)
+ {
+ Field *fld= cache_table->field[i];
+ /* Only some field types should be checked after lookup */
+ if (fld->type() == MYSQL_TYPE_VARCHAR ||
+ fld->type() == MYSQL_TYPE_TINY_BLOB ||
+ fld->type() == MYSQL_TYPE_MEDIUM_BLOB ||
+ fld->type() == MYSQL_TYPE_LONG_BLOB ||
+ fld->type() == MYSQL_TYPE_BLOB ||
+ fld->type() == MYSQL_TYPE_VAR_STRING ||
+ fld->type() == MYSQL_TYPE_STRING ||
+ fld->type() == MYSQL_TYPE_NEWDECIMAL ||
+ fld->type() == MYSQL_TYPE_DECIMAL)
+ {
+ if (!cn)
+ {
+ // dummy resolution context
+ cn= new Name_resolution_context();
+ cn->init();
+ }
+ args.push_front(new Item_func_eq(new Item_ref(cn, ref, "", "", FALSE),
+ new Item_field(fld)));
+ }
+ }
+ if (args.elements == 1)
+ equalities= args.head();
+ else
+ equalities= new Item_cond_and(args);
+
+ DBUG_RETURN(equalities->fix_fields(table_thd, &equalities));
+}
+
+
+/**
+ Enumerates all fields in field number order.
+
+ @param arg reference on current field number
+
+ @return field number
+*/
+
+static uint field_enumerator(uchar *arg)
+{
+ return ((uint*)arg)[0]++;
+}
+
+/**
+ Initializes temporary table and index for this cache
+*/
+
+void Subquery_cache_tmptable::init()
+{
+ List_iterator_fast<Item*> li(*list);
+ List_iterator_fast<Item> li_items(items);
+ Item **item;
+ uint field_counter;
+ DBUG_ENTER("Subquery_cache_tmptable::init");
+ DBUG_ASSERT(!inited);
+ inited= TRUE;
+
+ if (!(ULONGLONG_MAX >> (list->elements + 1)))
+ {
+ DBUG_PRINT("info", ("Too many dependencies"));
+ DBUG_VOID_RETURN;
+ }
+
+ cache_table= NULL;
+ while ((item= li++))
+ {
+ DBUG_ASSERT(item);
+ DBUG_ASSERT(*item);
+ DBUG_ASSERT((*item)->fixed);
+ items.push_back((*item));
+ }
+
+ cache_table_param.init();
+  /* dependency items and result */
+ cache_table_param.field_count= list->elements + 1;
+  /* postpone table creation until the index has been described */
+  cache_table_param.skip_create_table= 1;
+
+ items.push_front(val);
+ if (!(cache_table= create_tmp_table(table_thd, &cache_table_param,
+ items, (ORDER*) NULL,
+ FALSE, FALSE,
+ ((table_thd->options |
+ TMP_TABLE_ALL_COLUMNS) &
+ ~(OPTION_BIG_TABLES |
+ TMP_TABLE_FORCE_MYISAM)),
+ HA_POS_ERROR,
+ (char *)"subquery-cache-table")))
+ {
+ DBUG_PRINT("error", ("create_tmp_table failed, caching switched off"));
+ DBUG_VOID_RETURN;
+ }
+
+ if (cache_table->s->db_type() != heap_hton)
+ {
+ DBUG_PRINT("error", ("we need only heap table"));
+ goto error;
+ }
+
+ /* first field in the table is result value, so we skip it */
+ li_items++;
+  field_counter= 1;
+
+  if (cache_table->alloc_keys(1) ||
+      cache_table->add_tmp_key(0, items.elements - 1,
+                               &field_enumerator,
+                               (uchar*)&field_counter) ||
+      create_tmp_table_search_structures(table_thd, cache_table, li_items,
+                                         &tab_ref) ||
+      !(tab= create_index_lookup_join_tab(cache_table)))
+ {
+ DBUG_PRINT("error", ("creating index failed"));
+ goto error;
+ }
+ cache_table->s->keys= 1;
+ cache_table->s->uniques= 1;
+
+ if (open_tmp_table(cache_table))
+ {
+ DBUG_PRINT("error", ("Opening (creating) temporary table failed"));
+ goto error;
+ }
+
+  if (!(cached_result= new Item_field(cache_table->field[0])))
+ {
+ DBUG_PRINT("error", ("Creating Item_field failed"));
+ goto error;
+ }
+
+ if (make_equalities())
+ {
+ DBUG_PRINT("error", ("Creating equalities failed"));
+ goto error;
+ }
+
+ DBUG_VOID_RETURN;
+
+error:
+ /* switch off cache */
+ free_tmp_table(table_thd, cache_table);
+ cache_table= NULL;
+ DBUG_VOID_RETURN;
+}
+
+
+Subquery_cache_tmptable::~Subquery_cache_tmptable()
+{
+ if (cache_table)
+ free_tmp_table(table_thd, cache_table);
+}
+
+
+/**
+  Checks if the current key is present in the cache and, if so, returns the value
+
+  @param value assigned the Item with the value from the cache if the key
+               is found
+ @return result of the key lookup
+*/
+
+Subquery_cache::result Subquery_cache_tmptable::check_value(Item **value)
+{
+ int res;
+ DBUG_ENTER("Subquery_cache_tmptable::check_value");
+
+ /*
+    We delay cache initialization to get the item references that are
+    valid at the moment of query execution. I.e. at class creation time we
+    store only references to the item references, but for table creation and
+    the index support structures (join_tab) we need the real Items used at
+    execution time, so we can resolve the references only at this point.
+ */
+ if (!inited)
+ init();
+
+ if (cache_table)
+ {
+ DBUG_PRINT("info", ("status: %u has_record %u",
+ (uint)cache_table->status, (uint)tab_ref->has_record));
+ if ((res= join_read_key2(table_thd, tab, cache_table, tab_ref)) == 1)
+ DBUG_RETURN(ERROR);
+ if (res || (equalities && !equalities->val_int()))
+ {
+ subquery_cache_miss++;
+ DBUG_RETURN(MISS);
+ }
+
+ subquery_cache_hit++;
+    *value= cached_result;
+ DBUG_RETURN(Subquery_cache::HIT);
+ }
+ DBUG_RETURN(Subquery_cache::MISS);
+}
+
+
+/**
+ Puts given value in the cache
+
+ @param value Value to put in the cache
+
+ @retval FALSE OK
+ @retval TRUE Error
+*/
+
+my_bool Subquery_cache_tmptable::put_value(Item *value)
+{
+ int error;
+ DBUG_ENTER("Subquery_cache_tmptable::put_value");
+ DBUG_ASSERT(inited);
+
+ if (!cache_table)
+ {
+ DBUG_PRINT("info", ("No table so behave as we successfully put value"));
+ DBUG_RETURN(FALSE);
+ }
+
+ *(items.head_ref())= value;
+ fill_record(table_thd, cache_table->field, items, 1);
+ if (table_thd->is_error())
+    goto err;
+
+ if ((error= cache_table->file->ha_write_row(cache_table->record[0])))
+ {
+ /* create_myisam_from_heap will generate error if needed */
+ if (cache_table->file->is_fatal_error(error, HA_CHECK_DUP) &&
+ create_internal_tmp_table_from_heap(table_thd, cache_table,
+ cache_table_param.start_recinfo,
+ &cache_table_param.recinfo,
+ error, 1))
+ goto err;
+ }
+  cache_table->status= 0; /* cache_table->record contains an existing record */
+ tab_ref->has_record= TRUE; /* the same as above */
+ DBUG_PRINT("info", ("has_record: TRUE status: 0"));
+
+ DBUG_RETURN(FALSE);
+
+err:
+ free_tmp_table(table_thd, cache_table);
+ cache_table= NULL;
+ DBUG_RETURN(TRUE);
+}
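
Lifecycle of one cache instance, using only the API defined in this file (the owner -- Item_subselect or Item_in_optimizer -- supplies the depends_on list and the result item):

  Subquery_cache *scache=
    new Subquery_cache_tmptable(thd, depends_on, value);

  Item *cached;
  switch (scache->check_value(&cached))  /* lazily builds the tmp table */
  {
  case Subquery_cache::HIT:   /* use cached->val_*()        */  break;
  case Subquery_cache::MISS:  /* run the subquery, then ... */
                              scache->put_value(value);        break;
  case Subquery_cache::ERROR: /* abort the statement        */  break;
  }
  delete scache;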
=== added file 'sql/sql_subquery_cache.h'
--- a/sql/sql_subquery_cache.h 1970-01-01 00:00:00 +0000
+++ b/sql/sql_subquery_cache.h 2010-05-31 21:25:54 +0000
@@ -0,0 +1,74 @@
+#ifndef _SQL_SUBQUERY_CACHE_H_
+#define _SQL_SUBQUERY_CACHE_H_
+
+/**
+ Interface for subquery cache
+*/
+
+extern ulonglong subquery_cache_miss, subquery_cache_hit;
+
+class Subquery_cache :public Sql_alloc
+{
+public:
+ enum result {ERROR, HIT, MISS};
+
+ Subquery_cache(){};
+ virtual ~Subquery_cache() {};
+ /**
+ Checks presence of the key (taken from cache owner) and if found return
+ it via value parameter
+ */
+ virtual result check_value(Item **value)= 0;
+ /**
+ Puts value into this cache (key should be taken from cache owner)
+ */
+ virtual my_bool put_value(Item *value)= 0;
+};
+
+struct st_table_ref;
+struct st_join_table;
+class Item_field;
+
+/**
+ Implementation of subquery cache over temporary table
+*/
+
+class Subquery_cache_tmptable :public Subquery_cache
+{
+public:
+ Subquery_cache_tmptable(THD *thd, List<Item*> &dependance, Item *value);
+ virtual ~Subquery_cache_tmptable();
+ virtual result check_value(Item **value);
+ virtual my_bool put_value(Item *value);
+
+private:
+ void init();
+ bool make_equalities();
+
+ /* tmp table parameters */
+ TMP_TABLE_PARAM cache_table_param;
+ /* temporary table to store this cache */
+ TABLE *cache_table;
+ /* Thread handler for the temporary table */
+ THD *table_thd;
+ /* tab_ref for index search */
+ struct st_table_ref *tab_ref;
+ /* JOIN_TAB for index lookup */
+ st_join_table *tab;
+  /* Cached result */
+  Item_field *cached_result;
+ /* List of references to items */
+ List<Item*> *list;
+ /* List of items */
+ List<Item> items;
+ /* Value Item example */
+ Item *val;
+ /* Expression to check after index lookup */
+ Item *equalities;
+ /* set if structures are inited */
+ bool inited;
+};
+#endif
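
The interface is intentionally small, so alternative cache implementations stay cheap to write. A hypothetical no-op implementation (illustration only, not part of the patch) shows the minimal contract:

  class Subquery_cache_noop :public Subquery_cache
  {
  public:
    virtual result check_value(Item **value) { return MISS; }
    virtual my_bool put_value(Item *value)   { return FALSE; }
  };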
=== modified file 'sql/table.cc'
--- a/sql/table.cc 2010-03-20 12:01:47 +0000
+++ b/sql/table.cc 2010-05-31 21:25:54 +0000
@@ -20,6 +20,7 @@
#include "sql_trigger.h"
#include <m_ctype.h>
#include "my_md5.h"
+#include "my_bit.h"
/* INFORMATION_SCHEMA name */
LEX_STRING INFORMATION_SCHEMA_NAME= {C_STRING_WITH_LEN("information_schema")};
@@ -5096,6 +5097,115 @@
file->column_bitmaps_signal();
}
+
+/**
+ @brief
+ Allocate space for keys
+
+ @param key_count number of keys to allocate.
+
+ @details
+  Allocate enough space to fit 'key_count' keys for this table.
+
+  @return FALSE space was successfully allocated.
+  @return TRUE  an error occurred.
+*/
+
+bool TABLE::alloc_keys(uint key_count)
+{
+ DBUG_ASSERT(!s->keys);
+ key_info= s->key_info= (KEY*) alloc_root(&mem_root, sizeof(KEY)*key_count);
+ max_keys= key_count;
+ return !(key_info);
+}
+
+
+/**
+ @brief Adds one key to a temporary table.
+
+ @param key key number.
+ @param key_parts number of fields in the key
+ @param next_field_no function which returns field numbers which
+ should be included in the key
+  @param arg            argument for the above function
+
+  @return FALSE the key was successfully added.
+  @return TRUE  an error occurred.
+*/
+
+bool TABLE::add_tmp_key(uint key, uint key_parts,
+ uint (*next_field_no) (uchar *), uchar *arg)
+{
+ DBUG_ASSERT(key < max_keys);
+
+ char buf[NAME_CHAR_LEN];
+ KEY* keyinfo;
+ Field **reg_field;
+ uint i;
+ bool key_start= TRUE;
+ KEY_PART_INFO* key_part_info=
+ (KEY_PART_INFO*) alloc_root(&mem_root, sizeof(KEY_PART_INFO)*key_parts);
+ if (!key_part_info)
+ return TRUE;
+ keyinfo= key_info + key;
+ keyinfo->key_part= key_part_info;
+ keyinfo->usable_key_parts= keyinfo->key_parts = key_parts;
+ keyinfo->key_length=0;
+ keyinfo->algorithm= HA_KEY_ALG_UNDEF;
+ keyinfo->flags= HA_GENERATED_KEY;
+ sprintf(buf, "key%i", key);
+ if (!(keyinfo->name= strdup_root(&mem_root, buf)))
+ return TRUE;
+ keyinfo->rec_per_key= (ulong*) alloc_root(&mem_root,
+ sizeof(ulong)*key_parts);
+ if (!keyinfo->rec_per_key)
+ return TRUE;
+ bzero(keyinfo->rec_per_key, sizeof(ulong)*key_parts);
+ for (i= 0; i < key_parts; i++)
+ {
+ reg_field= field + next_field_no(arg);
+ if (key_start)
+ (*reg_field)->key_start.set_bit(key);
+ key_start= FALSE;
+ (*reg_field)->part_of_key.set_bit(key);
+ (*reg_field)->flags|= PART_KEY_FLAG;
+ key_part_info->null_bit= (*reg_field)->null_bit;
+ key_part_info->null_offset= (uint) ((*reg_field)->null_ptr -
+ (uchar*) record[0]);
+ key_part_info->field= *reg_field;
+ key_part_info->offset= (*reg_field)->offset(record[0]);
+ key_part_info->length= (uint16) (*reg_field)->pack_length();
+ keyinfo->key_length+= key_part_info->length;
+ /* TODO:
+ The below method of computing the key format length of the
+ key part is a copy/paste from opt_range.cc, and table.cc.
+ This should be factored out, e.g. as a method of Field.
+ In addition it is not clear if any of the Field::*_length
+ methods is supposed to compute the same length. If so, it
+ might be reused.
+ */
+ key_part_info->store_length= key_part_info->length;
+
+ if ((*reg_field)->real_maybe_null())
+ key_part_info->store_length+= HA_KEY_NULL_LENGTH;
+ if ((*reg_field)->type() == MYSQL_TYPE_BLOB ||
+ (*reg_field)->real_type() == MYSQL_TYPE_VARCHAR)
+ key_part_info->store_length+= HA_KEY_BLOB_LENGTH;
+
+ key_part_info->type= (uint8) (*reg_field)->key_type();
+ key_part_info->key_type =
+ ((ha_base_keytype) key_part_info->type == HA_KEYTYPE_TEXT ||
+ (ha_base_keytype) key_part_info->type == HA_KEYTYPE_VARTEXT1 ||
+ (ha_base_keytype) key_part_info->type == HA_KEYTYPE_VARTEXT2) ?
+ 0 : FIELDFLAG_BINARY;
+ key_part_info++;
+ }
+ set_if_bigger(s->max_key_length, keyinfo->key_length);
+ s->keys++;
+ return FALSE;
+}
+
+
/**
@brief Check if this is part of a MERGE table with attached children.
=== modified file 'sql/table.h'
--- a/sql/table.h 2010-03-20 12:01:47 +0000
+++ b/sql/table.h 2010-05-31 21:25:54 +0000
@@ -781,6 +781,7 @@
uint temp_pool_slot; /* Used by intern temp tables */
uint status; /* What's in record[0] */
uint db_stat; /* mode of file as in handler.h */
+ uint max_keys; /* Size of allocated key_info array. */
/* number of select if it is derived table */
uint derived_select_number;
int current_lock; /* Type of lock on table */
@@ -913,6 +914,9 @@
*/
inline bool needs_reopen_or_name_lock()
{ return s->version != refresh_version; }
+ bool alloc_keys(uint key_count);
+ bool add_tmp_key(uint key, uint key_parts,
+ uint (*next_field_no) (uchar *), uchar *arg);
bool is_children_attached(void);
};
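
How the two new TABLE methods are meant to be driven, condensed from Subquery_cache_tmptable::init(): alloc_keys() reserves the KEY array, add_tmp_key() describes one generated key over the fields that the callback enumerates (field_enumerator is the callback from sql_subquery_cache.cc; n_key_parts here is just an illustrative count):

  uint field_counter= 1;                   /* skip the result field[0]   */
  uint n_key_parts= table->s->fields - 1;  /* one key part per parameter */
  if (table->alloc_keys(1) ||
      table->add_tmp_key(0, n_key_parts,
                         &field_enumerator, (uchar*) &field_counter))
    ; /* error */
  table->s->keys= 1;
  table->s->uniques= 1;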
=== modified file 'storage/maria/ha_maria.cc'
--- a/storage/maria/ha_maria.cc 2010-03-20 12:01:47 +0000
+++ b/storage/maria/ha_maria.cc 2010-05-31 21:25:54 +0000
@@ -995,6 +995,8 @@
{
MARIA_HA *tmp= file;
file= 0;
+ if (!tmp)
+ return 0;
return maria_close(tmp);
}
[Maria-developers] Rev 2789: Subquery cache for pre-review (MWL#66) in file:///home/bell/maria/bzr/work-maria-5.3-scache2/
by sanja@askmonty.org 31 May '10
31 May '10
At file:///home/bell/maria/bzr/work-maria-5.3-scache2/
------------------------------------------------------------
revno: 2789
revision-id: sanja(a)askmonty.org-20100531212240-qwphnvofu9f0l06l
parent: sergii(a)pisem.net-20100510134608-oyi2vznyghgcrt0x
committer: sanja(a)askmonty.org
branch nick: work-maria-5.3-scache2
timestamp: Tue 2010-06-01 00:22:40 +0300
message:
Subquery cache for pre-review (MWL#66)
=== modified file 'libmysqld/Makefile.am'
--- a/libmysqld/Makefile.am 2010-03-20 12:01:47 +0000
+++ b/libmysqld/Makefile.am 2010-05-31 21:22:40 +0000
@@ -80,7 +80,8 @@
sql_tablespace.cc \
rpl_injector.cc my_user.c partition_info.cc \
sql_servers.cc event_parse_data.cc opt_table_elimination.cc \
- multi_range_read.cc opt_index_cond_pushdown.cc
+ multi_range_read.cc opt_index_cond_pushdown.cc \
+ sql_subquery_cache.cc
libmysqld_int_a_SOURCES= $(libmysqld_sources)
nodist_libmysqld_int_a_SOURCES= $(libmysqlsources) $(sqlsources)
=== modified file 'mysql-test/r/index_merge_myisam.result'
--- a/mysql-test/r/index_merge_myisam.result 2010-03-20 12:01:47 +0000
+++ b/mysql-test/r/index_merge_myisam.result 2010-05-31 21:22:40 +0000
@@ -1419,19 +1419,19 @@
#
select @@optimizer_switch;
@@optimizer_switch
-index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on
+index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
set optimizer_switch='index_merge=off,index_merge_union=off';
select @@optimizer_switch;
@@optimizer_switch
-index_merge=off,index_merge_union=off,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on
+index_merge=off,index_merge_union=off,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
set optimizer_switch='index_merge_union=on';
select @@optimizer_switch;
@@optimizer_switch
-index_merge=off,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on
+index_merge=off,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
set optimizer_switch='default,index_merge_sort_union=off';
select @@optimizer_switch;
@@optimizer_switch
-index_merge=on,index_merge_union=on,index_merge_sort_union=off,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on
+index_merge=on,index_merge_union=on,index_merge_sort_union=off,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
set optimizer_switch=4;
ERROR 42000: Variable 'optimizer_switch' can't be set to the value of '4'
set optimizer_switch=NULL;
@@ -1458,21 +1458,21 @@
set optimizer_switch='index_merge=off,index_merge_union=off,default';
select @@optimizer_switch;
@@optimizer_switch
-index_merge=off,index_merge_union=off,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on
+index_merge=off,index_merge_union=off,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
set optimizer_switch=default;
select @@global.optimizer_switch;
@@global.optimizer_switch
-index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on
+index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
set @@global.optimizer_switch=default;
select @@global.optimizer_switch;
@@global.optimizer_switch
-index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on
+index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
#
# Check index_merge's @@optimizer_switch flags
#
select @@optimizer_switch;
@@optimizer_switch
-index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on
+index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
create table t0 (a int);
insert into t0 values (0),(1),(2),(3),(4),(5),(6),(7),(8),(9);
create table t1 (a int, b int, c int, filler char(100),
@@ -1582,5 +1582,5 @@
set optimizer_switch=default;
show variables like 'optimizer_switch';
Variable_name Value
-optimizer_switch index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on
+optimizer_switch index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
drop table t0, t1;
=== modified file 'mysql-test/r/myisam_mrr.result'
--- a/mysql-test/r/myisam_mrr.result 2010-03-11 21:43:31 +0000
+++ b/mysql-test/r/myisam_mrr.result 2010-05-31 21:22:40 +0000
@@ -394,7 +394,7 @@
# - engine_condition_pushdown does not affect ICP
select @@optimizer_switch;
@@optimizer_switch
-index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on
+index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
create table t0 (a int);
insert into t0 values (0),(1),(2),(3),(4),(5),(6),(7),(8),(9);
create table t1 (a int, b int, key(a));
=== modified file 'mysql-test/r/subselect3.result'
--- a/mysql-test/r/subselect3.result 2010-03-29 14:04:35 +0000
+++ b/mysql-test/r/subselect3.result 2010-05-31 21:22:40 +0000
@@ -105,6 +105,7 @@
Handler_read_rnd_next 5
delete from t2;
insert into t2 values (NULL, 0),(NULL, 0), (NULL, 0), (NULL, 0);
+set optimizer_switch='subquery_cache=off';
flush status;
select oref, a, a in (select a from t1 where oref=t2.oref) Z from t2;
oref a Z
@@ -123,6 +124,7 @@
select 'No key lookups, seq reads: 29= 5 reads from t2 + 4 * 6 reads from t1.' Z;
Z
No key lookups, seq reads: 29= 5 reads from t2 + 4 * 6 reads from t1.
+set @@optimizer_switch=@save_optimizer_switch;
drop table t1, t2;
create table t1 (a int, b int, primary key (a));
insert into t1 values (1,1), (3,1),(100,1);
=== modified file 'mysql-test/r/subselect3_jcl6.result'
--- a/mysql-test/r/subselect3_jcl6.result 2010-03-29 14:04:35 +0000
+++ b/mysql-test/r/subselect3_jcl6.result 2010-05-31 21:22:40 +0000
@@ -109,6 +109,7 @@
Handler_read_rnd_next 5
delete from t2;
insert into t2 values (NULL, 0),(NULL, 0), (NULL, 0), (NULL, 0);
+set optimizer_switch='subquery_cache=off';
flush status;
select oref, a, a in (select a from t1 where oref=t2.oref) Z from t2;
oref a Z
@@ -127,6 +128,7 @@
select 'No key lookups, seq reads: 29= 5 reads from t2 + 4 * 6 reads from t1.' Z;
Z
No key lookups, seq reads: 29= 5 reads from t2 + 4 * 6 reads from t1.
+set @@optimizer_switch=@save_optimizer_switch;
drop table t1, t2;
create table t1 (a int, b int, primary key (a));
insert into t1 values (1,1), (3,1),(100,1);
=== modified file 'mysql-test/r/subselect_no_mat.result'
--- a/mysql-test/r/subselect_no_mat.result 2010-03-20 12:01:47 +0000
+++ b/mysql-test/r/subselect_no_mat.result 2010-05-31 21:22:40 +0000
@@ -1,6 +1,6 @@
show variables like 'optimizer_switch';
Variable_name Value
-optimizer_switch index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on
+optimizer_switch index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
set optimizer_switch='materialization=off';
drop table if exists t1,t2,t3,t4,t5,t6,t7,t8,t11,t12;
set @save_optimizer_switch=@@optimizer_switch;
@@ -4826,4 +4826,4 @@
set optimizer_switch=default;
show variables like 'optimizer_switch';
Variable_name Value
-optimizer_switch index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on
+optimizer_switch index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
=== modified file 'mysql-test/r/subselect_no_opts.result'
--- a/mysql-test/r/subselect_no_opts.result 2010-03-20 12:01:47 +0000
+++ b/mysql-test/r/subselect_no_opts.result 2010-05-31 21:22:40 +0000
@@ -1,6 +1,6 @@
show variables like 'optimizer_switch';
Variable_name Value
-optimizer_switch index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on
+optimizer_switch index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
set optimizer_switch='materialization=off,semijoin=off';
drop table if exists t1,t2,t3,t4,t5,t6,t7,t8,t11,t12;
set @save_optimizer_switch=@@optimizer_switch;
@@ -4826,4 +4826,4 @@
set optimizer_switch=default;
show variables like 'optimizer_switch';
Variable_name Value
-optimizer_switch index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on
+optimizer_switch index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
=== modified file 'mysql-test/r/subselect_no_semijoin.result'
--- a/mysql-test/r/subselect_no_semijoin.result 2010-03-20 12:01:47 +0000
+++ b/mysql-test/r/subselect_no_semijoin.result 2010-05-31 21:22:40 +0000
@@ -1,6 +1,6 @@
show variables like 'optimizer_switch';
Variable_name Value
-optimizer_switch index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on
+optimizer_switch index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
set optimizer_switch='semijoin=off';
drop table if exists t1,t2,t3,t4,t5,t6,t7,t8,t11,t12;
set @save_optimizer_switch=@@optimizer_switch;
@@ -4826,4 +4826,4 @@
set optimizer_switch=default;
show variables like 'optimizer_switch';
Variable_name Value
-optimizer_switch index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on
+optimizer_switch index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
=== modified file 'mysql-test/r/subselect_sj.result'
--- a/mysql-test/r/subselect_sj.result 2010-03-29 14:04:35 +0000
+++ b/mysql-test/r/subselect_sj.result 2010-05-31 21:22:40 +0000
@@ -202,39 +202,39 @@
select @@optimizer_switch;
@@optimizer_switch
-index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on
+index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
set optimizer_switch='default,materialization=off';
select @@optimizer_switch;
@@optimizer_switch
-index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=off,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on
+index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=off,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
set optimizer_switch='default,semijoin=off';
select @@optimizer_switch;
@@optimizer_switch
-index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=off,partial_match_rowid_merge=on,partial_match_table_scan=on
+index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=off,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
set optimizer_switch='default,loosescan=off';
select @@optimizer_switch;
@@optimizer_switch
-index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=off,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on
+index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=off,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
set optimizer_switch='default,semijoin=off,materialization=off';
select @@optimizer_switch;
@@optimizer_switch
-index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=off,semijoin=off,partial_match_rowid_merge=on,partial_match_table_scan=on
+index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=off,semijoin=off,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
set optimizer_switch='default,materialization=off,semijoin=off';
select @@optimizer_switch;
@@optimizer_switch
-index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=off,semijoin=off,partial_match_rowid_merge=on,partial_match_table_scan=on
+index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=off,semijoin=off,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
set optimizer_switch='default,semijoin=off,materialization=off,loosescan=off';
select @@optimizer_switch;
@@optimizer_switch
-index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=off,materialization=off,semijoin=off,partial_match_rowid_merge=on,partial_match_table_scan=on
+index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=off,materialization=off,semijoin=off,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
set optimizer_switch='default,semijoin=off,loosescan=off';
select @@optimizer_switch;
@@optimizer_switch
-index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=off,materialization=on,semijoin=off,partial_match_rowid_merge=on,partial_match_table_scan=on
+index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=off,materialization=on,semijoin=off,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
set optimizer_switch='default,materialization=off,loosescan=off';
select @@optimizer_switch;
@@optimizer_switch
-index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=off,materialization=off,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on
+index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=off,materialization=off,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
set optimizer_switch=default;
drop table t0, t1, t2;
drop table t10, t11, t12;
=== modified file 'mysql-test/r/subselect_sj_jcl6.result'
--- a/mysql-test/r/subselect_sj_jcl6.result 2010-03-29 14:04:35 +0000
+++ b/mysql-test/r/subselect_sj_jcl6.result 2010-05-31 21:22:40 +0000
@@ -206,39 +206,39 @@
select @@optimizer_switch;
@@optimizer_switch
-index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on
+index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
set optimizer_switch='default,materialization=off';
select @@optimizer_switch;
@@optimizer_switch
-index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=off,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on
+index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=off,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
set optimizer_switch='default,semijoin=off';
select @@optimizer_switch;
@@optimizer_switch
-index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=off,partial_match_rowid_merge=on,partial_match_table_scan=on
+index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=off,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
set optimizer_switch='default,loosescan=off';
select @@optimizer_switch;
@@optimizer_switch
-index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=off,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on
+index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=off,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
set optimizer_switch='default,semijoin=off,materialization=off';
select @@optimizer_switch;
@@optimizer_switch
-index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=off,semijoin=off,partial_match_rowid_merge=on,partial_match_table_scan=on
+index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=off,semijoin=off,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
set optimizer_switch='default,materialization=off,semijoin=off';
select @@optimizer_switch;
@@optimizer_switch
-index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=off,semijoin=off,partial_match_rowid_merge=on,partial_match_table_scan=on
+index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=off,semijoin=off,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
set optimizer_switch='default,semijoin=off,materialization=off,loosescan=off';
select @@optimizer_switch;
@@optimizer_switch
-index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=off,materialization=off,semijoin=off,partial_match_rowid_merge=on,partial_match_table_scan=on
+index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=off,materialization=off,semijoin=off,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
set optimizer_switch='default,semijoin=off,loosescan=off';
select @@optimizer_switch;
@@optimizer_switch
-index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=off,materialization=on,semijoin=off,partial_match_rowid_merge=on,partial_match_table_scan=on
+index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=off,materialization=on,semijoin=off,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
set optimizer_switch='default,materialization=off,loosescan=off';
select @@optimizer_switch;
@@optimizer_switch
-index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=off,materialization=off,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on
+index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=off,materialization=off,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
set optimizer_switch=default;
drop table t0, t1, t2;
drop table t10, t11, t12;
=== modified file 'mysql-test/t/subselect3.test'
--- a/mysql-test/t/subselect3.test 2010-03-20 12:01:47 +0000
+++ b/mysql-test/t/subselect3.test 2010-05-31 21:22:40 +0000
@@ -98,10 +98,12 @@
delete from t2;
insert into t2 values (NULL, 0),(NULL, 0), (NULL, 0), (NULL, 0);
+set optimizer_switch='subquery_cache=off';
flush status;
select oref, a, a in (select a from t1 where oref=t2.oref) Z from t2;
show status like '%Handler_read%';
select 'No key lookups, seq reads: 29= 5 reads from t2 + 4 * 6 reads from t1.' Z;
+set @@optimizer_switch=@save_optimizer_switch;
drop table t1, t2;
=== modified file 'sql/CMakeLists.txt'
--- a/sql/CMakeLists.txt 2010-03-20 12:01:47 +0000
+++ b/sql/CMakeLists.txt 2010-05-31 21:22:40 +0000
@@ -78,7 +78,7 @@
rpl_rli.cc rpl_mi.cc sql_servers.cc
sql_connect.cc scheduler.cc
sql_profile.cc event_parse_data.cc opt_table_elimination.cc
- ds_mrr.cc
+ ds_mrr.cc sql_subquery_cache.cc
${PROJECT_SOURCE_DIR}/sql/sql_yacc.cc
${PROJECT_SOURCE_DIR}/sql/sql_yacc.h
${PROJECT_SOURCE_DIR}/include/mysqld_error.h
=== modified file 'sql/Makefile.am'
--- a/sql/Makefile.am 2010-03-20 12:01:47 +0000
+++ b/sql/Makefile.am 2010-05-31 21:22:40 +0000
@@ -80,7 +80,7 @@
event_data_objects.h event_scheduler.h \
sql_partition.h partition_info.h partition_element.h \
contributors.h sql_servers.h \
- multi_range_read.h
+ multi_range_read.h sql_subquery_cache.h
mysqld_SOURCES = sql_lex.cc sql_handler.cc sql_partition.cc \
item.cc item_sum.cc item_buff.cc item_func.cc \
@@ -130,7 +130,7 @@
sql_servers.cc event_parse_data.cc \
opt_table_elimination.cc \
multi_range_read.cc \
- opt_index_cond_pushdown.cc
+ opt_index_cond_pushdown.cc sql_subquery_cache.cc
nodist_mysqld_SOURCES = mini_client_errors.c pack.c client.c my_time.c my_user.c
=== modified file 'sql/item.cc'
--- a/sql/item.cc 2010-03-20 12:01:47 +0000
+++ b/sql/item.cc 2010-05-31 21:22:40 +0000
@@ -28,6 +28,9 @@
const String my_null_string("NULL", 4, default_charset_info);
+static int save_field_in_field(Field *from, my_bool *null_value,
+ Field *to, bool no_conversions);
+
/****************************************************************************/
/* Hybrid_type_traits {_real} */
@@ -2273,6 +2276,13 @@
str->append(str_value);
}
+void Item_bool_cache::print(String *str, enum_query_type query_type)
+{
+ if (null_value)
+ str->append("NULL", 4);
+ else
+ Item_int::print(str, query_type);
+}
Item_uint::Item_uint(const char *str_arg, uint length):
Item_int(str_arg, length)
@@ -3646,12 +3656,17 @@
resolved_item->db_name : "");
const char *table_name= (resolved_item->table_name ?
resolved_item->table_name : "");
+ DBUG_ENTER("mark_as_dependent");
+ DBUG_PRINT("enter", ("Field '%s.%s.%s in select %d resolved in %d",
+ db_name, table_name,
+ resolved_item->field_name, current->select_number,
+ last->select_number));
/* store pointer on SELECT_LEX from which item is dependent */
if (mark_item)
mark_item->depended_from= last;
if (current->mark_as_dependent(thd, last, /** resolved_item psergey-thu
**/mark_item))
- return TRUE;
+ DBUG_RETURN(TRUE);
if (thd->lex->describe & DESCRIBE_EXTENDED)
{
push_warning_printf(thd, MYSQL_ERROR::WARN_LEVEL_NOTE,
@@ -3661,7 +3676,7 @@
resolved_item->field_name,
current->select_number, last->select_number);
}
- return FALSE;
+ DBUG_RETURN(FALSE);
}
@@ -3698,6 +3713,7 @@
resolving)
*/
SELECT_LEX *previous_select= current_sel;
+
for (; previous_select->outer_select() != last_select;
previous_select= previous_select->outer_select())
{
@@ -3726,6 +3742,7 @@
mark_as_dependent(thd, last_select, current_sel, resolved_item,
dependent);
}
+ return;
}
@@ -4098,6 +4115,9 @@
((ref_type == REF_ITEM ||
ref_type == FIELD_ITEM) ?
(Item_ident*) (*reference) : 0));
+ context->select_lex->
+ register_dependency_item(last_checked_context->select_lex,
+ reference);
return 0;
}
}
@@ -4113,7 +4133,9 @@
((ref_type == REF_ITEM || ref_type == FIELD_ITEM) ?
(Item_ident*) (*reference) :
0));
-
+ context->select_lex->
+ register_dependency_item(last_checked_context->select_lex,
+ reference);
/*
A reference to a view field had been found and we
substituted it instead of this Item (find_field_in_tables
@@ -4215,6 +4237,10 @@
mark_as_dependent(thd, last_checked_context->select_lex,
context->select_lex, rf,
rf);
+ context->select_lex->
+ register_dependency_item(last_checked_context->select_lex,
+ reference);
+
return 0;
}
else
@@ -4222,6 +4248,9 @@
mark_as_dependent(thd, last_checked_context->select_lex,
context->select_lex,
this, (Item_ident*)*reference);
+ context->select_lex->
+ register_dependency_item(last_checked_context->select_lex,
+ reference);
if (last_checked_context->select_lex->having_fix_field)
{
Item_ref *rf;
@@ -5082,39 +5111,48 @@
/**
+  Saves the value of one Field in another Field
+
+  @param from            Field to copy the value from
+  @param null_value      pointer to the item's null_value; set it if needed
+  @param to              Field to copy the value to
+ @param no_conversions how to deal with NULL value (see
+ set_field_to_null_with_conversions())
+
+ @retval FALSE OK
+ @retval TRUE Error
+*/
+
+static int save_field_in_field(Field *from, my_bool *null_value,
+ Field *to, bool no_conversions)
+{
+ int res;
+ if (from->is_null())
+ {
+ (*null_value)= 1;
+ res= set_field_to_null_with_conversions(to, no_conversions);
+ }
+ else
+ {
+ to->set_notnull();
+ res= field_conv(to, from);
+ (*null_value)= 0;
+ }
+ return res;
+}
+
+/**
Set a field's value from a item.
*/
void Item_field::save_org_in_field(Field *to)
{
- if (field->is_null())
- {
- null_value=1;
- set_field_to_null_with_conversions(to, 1);
- }
- else
- {
- to->set_notnull();
- field_conv(to,field);
- null_value=0;
- }
+ save_field_in_field(field, &null_value, to, TRUE);
}
int Item_field::save_in_field(Field *to, bool no_conversions)
{
- int res;
- if (result_field->is_null())
- {
- null_value=1;
- res= set_field_to_null_with_conversions(to, no_conversions);
- }
- else
- {
- to->set_notnull();
- res= field_conv(to,result_field);
- null_value=0;
- }
- return res;
+ return save_field_in_field(result_field, &null_value, to, no_conversions);
}
@@ -5973,6 +6011,9 @@
refer_type == FIELD_ITEM) ?
(Item_ident*) (*reference) :
0));
+ context->select_lex->
+ register_dependency_item(last_checked_context->select_lex,
+ reference);
/*
view reference found, we substituted it instead of this
Item, so can quit
@@ -6023,6 +6064,9 @@
thd->change_item_tree(reference, fld);
mark_as_dependent(thd, last_checked_context->select_lex,
thd->lex->current_select, fld, fld);
+ context->select_lex->
+ register_dependency_item(last_checked_context->select_lex,
+ reference);
/*
A reference is resolved to a nest level that's outer or the same as
the nest level of the enclosing set function : adjust the value of
@@ -6046,6 +6090,9 @@
DBUG_ASSERT(*ref && (*ref)->fixed);
mark_as_dependent(thd, last_checked_context->select_lex,
context->select_lex, this, this);
+ context->select_lex->
+ register_dependency_item(last_checked_context->select_lex,
+ ref);
/*
A reference is resolved to a nest level that's outer or the same as
the nest level of the enclosing set function : adjust the value of
@@ -6312,7 +6359,8 @@
int Item_ref::save_in_field(Field *to, bool no_conversions)
{
int res;
- DBUG_ASSERT(!result_field);
+ if (result_field)
+ return save_field_in_field(result_field, &null_value, to, no_conversions);
res= (*ref)->save_in_field(to, no_conversions);
null_value= (*ref)->null_value;
return res;
=== modified file 'sql/item.h'
--- a/sql/item.h 2010-03-20 12:01:47 +0000
+++ b/sql/item.h 2010-05-31 21:22:40 +0000
@@ -1922,8 +1922,31 @@
virtual void print(String *str, enum_query_type query_type);
Item_num *neg ();
uint decimal_precision() const { return max_length; }
- bool check_partition_func_processor(uchar *bool_arg) { return FALSE;}
- bool check_vcol_func_processor(uchar *arg) { return FALSE;}
+};
+
+
+/**
+  Item representing TRUE/FALSE/NULL for subquery values
+*/
+
+class Item_bool_cache: public Item_int
+{
+public:
+ Item_bool_cache(): Item_int(0, 1)
+ {
+ unsigned_flag= maybe_null= null_value= TRUE;
+ name= (char *)"bool chache";
+ }
+ Item_bool_cache(my_bool val, my_bool null): Item_int(val, 1)
+ {
+ unsigned_flag= maybe_null= TRUE;
+ null_value= null;
+ name= (char *)"bool chache";
+ }
+ Item *clone_item() { return new Item_bool_cache(value, null_value); }
+ uint decimal_precision() const { return 1; }
+ virtual void print(String *str, enum_query_type query_type);
+ void set(my_bool val, my_bool null) {value= test(val); null_value= null;}
};
@@ -3146,7 +3169,8 @@
example(0), used_table_map(0), cached_field(0), cached_field_type(MYSQL_TYPE_STRING),
value_cached(0)
{
- fixed= 1;
+ fixed= 1;
+ maybe_null= 1;
null_value= 1;
}
Item_cache(enum_field_types field_type_arg):
@@ -3154,6 +3178,7 @@
value_cached(0)
{
fixed= 1;
+ maybe_null= 1;
null_value= 1;
}
=== modified file 'sql/item_cmpfunc.cc'
--- a/sql/item_cmpfunc.cc 2010-03-20 12:01:47 +0000
+++ b/sql/item_cmpfunc.cc 2010-05-31 21:22:40 +0000
@@ -1736,6 +1736,15 @@
used_tables_cache|= args[1]->used_tables();
not_null_tables_cache|= args[1]->not_null_tables();
const_item_cache&= args[1]->const_item();
+ DBUG_ASSERT(scache == NULL);
+ if (args[0]->cols() ==1 &&
+ thd->variables.optimizer_switch & OPTIMIZER_SWITCH_SUBQUERY_CACHE &&
+ !(sub->engine->uncacheable() & (UNCACHEABLE_RAND |
+ UNCACHEABLE_SIDEEFFECT)))
+ {
+ sub->depends_on.push_front((Item**)&cache);
+ scache= new Subquery_cache_tmptable(thd, sub->depends_on, &result);
+ }
fixed= 1;
return FALSE;
}
@@ -1744,10 +1753,26 @@
longlong Item_in_optimizer::val_int()
{
bool tmp;
+ DBUG_ENTER("Item_in_optimizer::val_int");
+
DBUG_ASSERT(fixed == 1);
cache->store(args[0]);
cache->cache_value();
-
+
+ /* check if result is in the cache */
+ if (scache)
+ {
+ Subquery_cache_tmptable::result res;
+ Item *cached_value;
+ res= scache->check_value(&cached_value);
+ if (res == Subquery_cache_tmptable::HIT)
+ {
+ tmp= cached_value->val_int();
+ null_value= cached_value->null_value;
+ DBUG_RETURN(tmp);
+ }
+ }
+
if (cache->null_value)
{
/*
@@ -1818,11 +1843,18 @@
for (uint i= 0; i < ncols; i++)
item_subs->set_cond_guard_var(i, TRUE);
}
- return 0;
+ DBUG_RETURN(0);
}
tmp= args[1]->val_bool_result();
null_value= args[1]->null_value;
- return tmp;
+
+ /* put result in the cache */
+ if (scache)
+ {
+ result.set(tmp, null_value);
+ scache->put_value(&result);
+ }
+ DBUG_RETURN(tmp);
}
@@ -1839,6 +1871,11 @@
Item_bool_func::cleanup();
if (!save_cache)
cache= 0;
+ if (scache)
+ {
+ delete scache;
+ scache= 0;
+ }
DBUG_VOID_RETURN;
}
=== modified file 'sql/item_cmpfunc.h'
--- a/sql/item_cmpfunc.h 2010-03-20 12:01:47 +0000
+++ b/sql/item_cmpfunc.h 2010-05-31 21:22:40 +0000
@@ -215,6 +215,7 @@
class Item_cache;
+class Subquery_cache;
#define UNKNOWN ((my_bool)-1)
@@ -237,6 +238,10 @@
{
protected:
Item_cache *cache;
+ /* Subquery cache */
+ Subquery_cache *scache;
+ /* result representation for the subquery cache */
+ Item_bool_cache result;
bool save_cache;
/*
Stores the value of "NULL IN (SELECT ...)" for uncorrelated subqueries:
@@ -247,7 +252,7 @@
my_bool result_for_null_param;
public:
Item_in_optimizer(Item *a, Item_in_subselect *b):
- Item_bool_func(a, my_reinterpret_cast(Item *)(b)), cache(0),
+ Item_bool_func(a, my_reinterpret_cast(Item *)(b)), cache(0), scache(NULL),
save_cache(0), result_for_null_param(UNKNOWN)
{}
bool fix_fields(THD *, Item **);
=== modified file 'sql/item_subselect.cc'
--- a/sql/item_subselect.cc 2010-03-29 14:04:35 +0000
+++ b/sql/item_subselect.cc 2010-05-31 21:22:40 +0000
@@ -34,11 +34,10 @@
Item_subselect::Item_subselect():
Item_result_field(), value_assigned(0), thd(0), substitution(0),
- engine(0), old_engine(0), used_tables_cache(0), have_to_be_excluded(0),
- const_item_cache(1),
- inside_first_fix_fields(0), done_first_fix_fields(FALSE),
- eliminated(FALSE),
- engine_changed(0), changed(0), is_correlated(FALSE)
+ engine(0), old_engine(0), scache(0), used_tables_cache(0),
+ have_to_be_excluded(0), const_item_cache(1), inside_first_fix_fields(0),
+ done_first_fix_fields(FALSE), eliminated(FALSE), engine_changed(0),
+ changed(0), is_correlated(FALSE)
{
with_subselect= 1;
reset();
@@ -116,6 +115,12 @@
}
if (engine)
engine->cleanup();
+ depends_on.empty();
+ if (scache)
+ {
+ delete scache;
+ scache= 0;
+ }
reset();
value_assigned= 0;
DBUG_VOID_RETURN;
@@ -148,6 +153,8 @@
Item_subselect::~Item_subselect()
{
delete engine;
+ if (scache)
+ delete scache;
}
Item_subselect::trans_res
@@ -746,9 +753,22 @@
void Item_singlerow_subselect::fix_length_and_dec()
{
+ DBUG_ENTER("Item_singlerow_subselect::fix_length_and_dec");
if ((max_columns= engine->cols()) == 1)
{
+ DBUG_PRINT("info", ("one, elements: %u flag %u",
+ (uint)depends_on.elements,
+ (uint)test(thd->variables.optimizer_switch & OPTIMIZER_SWITCH_SUBQUERY_CACHE)));
engine->fix_length_and_dec(row= &value);
+ if (depends_on.elements &&
+ optimizer_flag(thd, OPTIMIZER_SWITCH_SUBQUERY_CACHE) &&
+ !(engine->uncacheable() & (UNCACHEABLE_RAND |
+ UNCACHEABLE_SIDEEFFECT)))
+ {
+ DBUG_ASSERT(scache == NULL);
+ scache= new Subquery_cache_tmptable(thd, depends_on, value);
+ DBUG_PRINT("info", ("cache: 0x%lx", (ulong) scache));
+ }
}
else
{
@@ -765,6 +785,7 @@
*/
if (engine->no_tables())
maybe_null= engine->may_be_null();
+ DBUG_VOID_RETURN;
}
uint Item_singlerow_subselect::cols()
@@ -797,77 +818,206 @@
exec();
}
+/**
+  Checks the subquery cache for a value
+
+  @retval NULL      nothing found
+  @retval non-NULL  reference to the item representing the value found in the cache
+*/
+
+Item *Item_subselect::check_cache()
+{
+ DBUG_ENTER("Item_subselect::check_cache");
+ if (scache)
+ {
+ Subquery_cache_tmptable::result res;
+ Item *cached_value;
+ res= scache->check_value(&cached_value);
+ if (res == Subquery_cache_tmptable::HIT)
+ DBUG_RETURN(cached_value);
+ }
+ DBUG_RETURN(NULL);
+}
+
double Item_singlerow_subselect::val_real()
{
+ Item *cached_value;
+ bool err;
+ DBUG_ENTER("Item_singlerow_subselect::val_real");
DBUG_ASSERT(fixed == 1);
- if (!exec() && !value->null_value)
+
+ if ((cached_value = check_cache()))
+ {
+ double res= cached_value->val_real();
+ if ((null_value= cached_value->null_value))
+ {
+ reset();
+ DBUG_RETURN(0);
+ }
+ else
+ DBUG_RETURN(res);
+ }
+
+ if (!(err= exec()) && !value->null_value)
{
null_value= 0;
- return value->val_real();
+ if (scache)
+ scache->put_value(value);
+ DBUG_RETURN(value->val_real());
}
else
{
reset();
- return 0;
+ DBUG_PRINT("info", ("error: %u", (uint)err));
+ if (scache && !err)
+ scache->put_value(&const_null_value);
+ DBUG_RETURN(0);
}
}
longlong Item_singlerow_subselect::val_int()
{
+ Item *cached_value;
+ bool err;
+ DBUG_ENTER("Item_singlerow_subselect::val_int");
DBUG_ASSERT(fixed == 1);
- if (!exec() && !value->null_value)
+
+ if ((cached_value = check_cache()))
+ {
+ longlong res= cached_value->val_int();
+ if ((null_value= cached_value->null_value))
+ {
+ reset();
+ DBUG_RETURN(0);
+ }
+ else
+ DBUG_RETURN(res);
+ }
+
+ if (!(err= exec()) && !value->null_value)
{
null_value= 0;
- return value->val_int();
+ if (scache)
+ scache->put_value(value);
+ DBUG_RETURN(value->val_int());
}
else
{
reset();
- return 0;
+ DBUG_PRINT("info", ("error: %u", (uint)err));
+ if (scache && !err)
+ scache->put_value(&const_null_value);
+ DBUG_RETURN(0);
}
}
String *Item_singlerow_subselect::val_str(String *str)
{
- if (!exec() && !value->null_value)
+ Item *cached_value;
+ bool err;
+ DBUG_ENTER("Item_singlerow_subselect::val_str");
+ DBUG_ASSERT(fixed == 1);
+
+ if ((cached_value = check_cache()))
+ {
+ String *res= cached_value->val_str(str);
+ if ((null_value= cached_value->null_value))
+ {
+ reset();
+ DBUG_RETURN(0);
+ }
+ else
+ DBUG_RETURN(res);
+ }
+
+ if (!(err= exec()) && !value->null_value)
{
null_value= 0;
- return value->val_str(str);
+ if (scache)
+ scache->put_value(value);
+ DBUG_RETURN(value->val_str(str));
}
else
{
reset();
- return 0;
+ DBUG_PRINT("info", ("error: %u", (uint)err));
+ if (scache && !err)
+ scache->put_value(&const_null_value);
+ DBUG_RETURN(0);
}
}
my_decimal *Item_singlerow_subselect::val_decimal(my_decimal *decimal_value)
{
- if (!exec() && !value->null_value)
+ Item *cached_value;
+ bool err;
+ DBUG_ENTER("Item_singlerow_subselect::val_decimal");
+ DBUG_ASSERT(fixed == 1);
+
+ if ((cached_value = check_cache()))
+ {
+ my_decimal *res= cached_value->val_decimal(decimal_value);
+ if ((null_value= cached_value->null_value))
+ {
+ reset();
+ DBUG_RETURN(0);
+ }
+ else
+ DBUG_RETURN(res);
+ }
+
+ if (!(err= exec()) && !value->null_value)
{
null_value= 0;
- return value->val_decimal(decimal_value);
+ if (scache)
+ scache->put_value(value);
+ DBUG_RETURN(value->val_decimal(decimal_value));
}
else
{
reset();
- return 0;
+ DBUG_PRINT("info", ("error: %u", (uint)err));
+ if (scache && !err)
+ scache->put_value(&const_null_value);
+ DBUG_RETURN(0);
}
}
bool Item_singlerow_subselect::val_bool()
{
- if (!exec() && !value->null_value)
+ Item *cached_value;
+ bool err;
+ DBUG_ENTER("Item_singlerow_subselect::val_bool");
+ DBUG_ASSERT(fixed == 1);
+
+ if ((cached_value = check_cache()))
+ {
+ bool res= cached_value->val_bool();
+ if ((null_value= cached_value->null_value))
+ {
+ reset();
+ DBUG_RETURN(0);
+ }
+ else
+ DBUG_RETURN(res);
+ }
+
+ if (!(err= exec()) && !value->null_value)
{
null_value= 0;
- return value->val_bool();
+ if (scache)
+ scache->put_value(value);
+ DBUG_RETURN(value->val_bool());
}
else
{
reset();
- return 0;
+ DBUG_PRINT("info", ("error: %u", (uint)err));
+ if (scache && !err)
+ scache->put_value(&const_null_value);
+ DBUG_RETURN(0);
}
}
@@ -952,33 +1102,79 @@
void Item_exists_subselect::fix_length_and_dec()
{
+ DBUG_ENTER("Item_exists_subselect::fix_length_and_dec");
decimals= 0;
max_length= 1;
max_columns= engine->cols();
/* We need only 1 row to determine existence */
unit->global_parameters->select_limit= new Item_int((int32) 1);
+ if (substype() == EXISTS_SUBS && depends_on.elements &&
+ optimizer_flag(thd, OPTIMIZER_SWITCH_SUBQUERY_CACHE) &&
+ !(engine->uncacheable() & (UNCACHEABLE_RAND |
+ UNCACHEABLE_SIDEEFFECT)))
+ {
+ DBUG_ASSERT(scache == NULL);
+ scache= new Subquery_cache_tmptable(thd, depends_on, &result);
+ DBUG_PRINT("info", ("cache: 0x%lx", (ulong) scache));
+ }
+ DBUG_VOID_RETURN;
}
double Item_exists_subselect::val_real()
{
+ Item *cached_value;
+ DBUG_ENTER("Item_exists_subselect::val_int");
DBUG_ASSERT(fixed == 1);
+
+ if ((cached_value = check_cache()))
+ {
+ double res= cached_value->val_real();
+ DBUG_ASSERT(!cached_value->null_value);
+ DBUG_RETURN(res);
+ }
+
if (exec())
{
reset();
- return 0;
- }
- return (double) value;
+ DBUG_RETURN(0);
+ }
+
+ if (scache)
+ {
+ result.set(value, FALSE);
+ scache->put_value(&result);
+ }
+
+ DBUG_RETURN((double) value);
}
longlong Item_exists_subselect::val_int()
{
+ Item *cached_value;
+ DBUG_ENTER("Item_exists_subselect::val_real");
+ DBUG_ASSERT(fixed == 1);
+
+ if ((cached_value = check_cache()))
+ {
+ longlong res= cached_value->val_int();
+ DBUG_ASSERT(!cached_value->null_value);
+ DBUG_RETURN(res);
+ }
+
DBUG_ASSERT(fixed == 1);
if (exec())
{
reset();
- return 0;
- }
- return value;
+ DBUG_RETURN(0);
+ }
+
+ if (scache)
+ {
+ result.set(value, FALSE);
+ scache->put_value(&result);
+ }
+
+ DBUG_RETURN(value);
}
@@ -997,11 +1193,32 @@
String *Item_exists_subselect::val_str(String *str)
{
+ Item *cached_value;
+ DBUG_ENTER("Item_exists_subselect::val_str");
DBUG_ASSERT(fixed == 1);
+
+ if ((cached_value = check_cache()))
+ {
+ String *res= cached_value->val_str(str);
+ DBUG_ASSERT(!cached_value->null_value);
+ DBUG_RETURN(res);
+ }
+
if (exec())
+ {
reset();
+ str->set((ulonglong)0,&my_charset_bin);
+ DBUG_RETURN(str);
+ }
+
+ if (scache)
+ {
+ result.set(value, FALSE);
+ scache->put_value(&result);
+ }
+
str->set((ulonglong)value,&my_charset_bin);
- return str;
+ DBUG_RETURN(str);
}
@@ -1020,23 +1237,61 @@
my_decimal *Item_exists_subselect::val_decimal(my_decimal *decimal_value)
{
+ Item *cached_value;
+ DBUG_ENTER("Item_exists_subselect::val_decvimal");
DBUG_ASSERT(fixed == 1);
+
+ if ((cached_value = check_cache()))
+ {
+ my_decimal *res= cached_value->val_decimal(decimal_value);
+ DBUG_ASSERT(!cached_value->null_value);
+ DBUG_RETURN(res);
+ }
+
if (exec())
+ {
reset();
+ int2my_decimal(E_DEC_FATAL_ERROR, 0, 0, decimal_value);
+ DBUG_RETURN(decimal_value);
+ }
+
+ if (scache)
+ {
+ result.set(value, FALSE);
+ scache->put_value(&result);
+ }
+
int2my_decimal(E_DEC_FATAL_ERROR, value, 0, decimal_value);
- return decimal_value;
+ DBUG_RETURN(decimal_value);
}
bool Item_exists_subselect::val_bool()
{
+ Item *cached_value;
+ DBUG_ENTER("Item_exists_subselect::val_real");
DBUG_ASSERT(fixed == 1);
+
+ if ((cached_value = check_cache()))
+ {
+ my_bool res= cached_value->val_bool();
+ DBUG_ASSERT(!cached_value->null_value);
+ DBUG_RETURN(res);
+ }
+
if (exec())
{
reset();
- return 0;
- }
- return value != 0;
+ DBUG_RETURN(0);
+ }
+
+ if (scache)
+ {
+ result.set(value, FALSE);
+ scache->put_value(&result);
+ }
+
+ DBUG_RETURN(value != 0);
}
=== modified file 'sql/item_subselect.h'
--- a/sql/item_subselect.h 2010-03-29 14:04:35 +0000
+++ b/sql/item_subselect.h 2010-05-31 21:22:40 +0000
@@ -27,6 +27,7 @@
class subselect_hash_sj_engine;
class Item_bool_func2;
class Cached_item;
+class Subquery_cache;
/* base class for subselects */
@@ -57,6 +58,10 @@
subselect_engine *engine;
/* old engine if engine was changed */
subselect_engine *old_engine;
+ /* subquery cache */
+ Subquery_cache *scache;
+  /* NULL constant for caching */
+ Item_null const_null_value;
/* cache of used external tables */
table_map used_tables_cache;
/* allowed number of columns (1 for single value subqueries) */
@@ -67,7 +72,7 @@
bool have_to_be_excluded;
/* cache of constant state */
bool const_item_cache;
-
+
bool inside_first_fix_fields;
bool done_first_fix_fields;
public:
@@ -88,13 +93,21 @@
*/
List<Ref_to_outside> upper_refs;
st_select_lex *parent_select;
-
- /*
+
+ /**
+    List of references to items the subquery depends on (externally resolved);
+
+    @note We can't store direct pointers to Items because an item could be
+    substituted with another item (for example, for grouping).
+ */
+ List<Item*> depends_on;
+
+ /*
TRUE<=>Table Elimination has made it redundant to evaluate this select
(and so it is not part of QEP, etc)
- */
+ */
bool eliminated;
-
+
/* changed engine indicator */
bool engine_changed;
/* subquery is transformed */
@@ -178,6 +191,8 @@
return trace_unsupported_by_check_vcol_func_processor("subselect");
}
+ Item *check_cache();
+
/**
Get the SELECT_LEX structure associated with this Item.
@return the SELECT_LEX structure associated with this Item
@@ -202,6 +217,7 @@
{
protected:
Item_cache *value, **row;
+
public:
Item_singlerow_subselect(st_select_lex *select_lex);
Item_singlerow_subselect() :Item_subselect(), value(0), row (0) {}
@@ -268,6 +284,8 @@
{
protected:
bool value; /* value of this item (boolean: exists/not-exists) */
+ /* result representation for the subquery cache */
+ Item_bool_cache result;
public:
Item_exists_subselect(st_select_lex *select_lex);
=== modified file 'sql/item_sum.cc'
--- a/sql/item_sum.cc 2010-03-20 12:01:47 +0000
+++ b/sql/item_sum.cc 2010-05-31 21:22:40 +0000
@@ -319,6 +319,7 @@
if (aggr_level >= 0)
{
ref_by= ref;
+ thd->lex->current_select->register_dependency_item(aggr_sel, ref);
/* Add the object to the list of registered objects assigned to aggr_sel */
if (!aggr_sel->inner_sum_func_list)
next= this;
=== modified file 'sql/mysql_priv.h'
--- a/sql/mysql_priv.h 2010-03-20 12:01:47 +0000
+++ b/sql/mysql_priv.h 2010-05-31 21:22:40 +0000
@@ -568,12 +568,13 @@
#define OPTIMIZER_SWITCH_SEMIJOIN 256
#define OPTIMIZER_SWITCH_PARTIAL_MATCH_ROWID_MERGE 512
#define OPTIMIZER_SWITCH_PARTIAL_MATCH_TABLE_SCAN 1024
+#define OPTIMIZER_SWITCH_SUBQUERY_CACHE (1<<11)
#ifdef DBUG_OFF
-# define OPTIMIZER_SWITCH_LAST 2048
+# define OPTIMIZER_SWITCH_LAST (1<<12)
#else
-# define OPTIMIZER_SWITCH_TABLE_ELIMINATION 2048
-# define OPTIMIZER_SWITCH_LAST 4096
+# define OPTIMIZER_SWITCH_TABLE_ELIMINATION (1<<12)
+# define OPTIMIZER_SWITCH_LAST (1<<13)
#endif
#ifdef DBUG_OFF
@@ -588,7 +589,8 @@
OPTIMIZER_SWITCH_MATERIALIZATION | \
OPTIMIZER_SWITCH_SEMIJOIN | \
OPTIMIZER_SWITCH_PARTIAL_MATCH_ROWID_MERGE|\
- OPTIMIZER_SWITCH_PARTIAL_MATCH_TABLE_SCAN)
+ OPTIMIZER_SWITCH_PARTIAL_MATCH_TABLE_SCAN|\
+ OPTIMIZER_SWITCH_SUBQUERY_CACHE)
#else
# define OPTIMIZER_SWITCH_DEFAULT (OPTIMIZER_SWITCH_INDEX_MERGE | \
OPTIMIZER_SWITCH_INDEX_MERGE_UNION | \
@@ -601,7 +603,8 @@
OPTIMIZER_SWITCH_MATERIALIZATION | \
OPTIMIZER_SWITCH_SEMIJOIN | \
OPTIMIZER_SWITCH_PARTIAL_MATCH_ROWID_MERGE|\
- OPTIMIZER_SWITCH_PARTIAL_MATCH_TABLE_SCAN)
+ OPTIMIZER_SWITCH_PARTIAL_MATCH_TABLE_SCAN|\
+ OPTIMIZER_SWITCH_SUBQUERY_CACHE)
#endif
/*
@@ -936,6 +939,7 @@
#ifdef MYSQL_SERVER
#include "sql_servers.h"
#include "opt_range.h"
+#include "sql_subquery_cache.h"
#ifdef HAVE_QUERY_CACHE
struct Query_cache_query_flags
@@ -1269,6 +1273,10 @@
Item *having, ORDER *proc_param, ulonglong select_type,
select_result *result, SELECT_LEX_UNIT *unit,
SELECT_LEX *select_lex);
+
+struct st_join_table *create_index_lookup_join_tab(TABLE *table);
+int join_read_key2(THD *thd, struct st_join_table *tab, TABLE *table,
+ struct st_table_ref *table_ref);
void free_underlaid_joins(THD *thd, SELECT_LEX *select);
bool mysql_explain_union(THD *thd, SELECT_LEX_UNIT *unit,
select_result *result);
@@ -1288,6 +1296,7 @@
bool table_cant_handle_bit_fields,
bool make_copy_field,
uint convert_blob_length);
+bool open_tmp_table(TABLE *table);
void sp_prepare_create_field(THD *thd, Create_field *sql_field);
int prepare_create_field(Create_field *sql_field,
uint *blob_columns,
=== modified file 'sql/mysqld.cc'
--- a/sql/mysqld.cc 2010-03-20 12:01:47 +0000
+++ b/sql/mysqld.cc 2010-05-31 21:22:40 +0000
@@ -305,6 +305,7 @@
"firstmatch","loosescan","materialization", "semijoin",
"partial_match_rowid_merge",
"partial_match_table_scan",
+ "subquery_cache",
#ifndef DBUG_OFF
"table_elimination",
#endif
@@ -325,6 +326,7 @@
sizeof("semijoin") - 1,
sizeof("partial_match_rowid_merge") - 1,
sizeof("partial_match_table_scan") - 1,
+ sizeof("subquery_cache") - 1,
#ifndef DBUG_OFF
sizeof("table_elimination") - 1,
#endif
@@ -404,8 +406,9 @@
static const char *optimizer_switch_str="index_merge=on,index_merge_union=on,"
"index_merge_sort_union=on,"
"index_merge_intersection=on,"
- "index_condition_pushdown=on"
-#ifndef DBUG_OFF
+ "index_condition_pushdown=on,"
+ "subquery_cache=on"
+#ifndef DBUG_OFF
",table_elimination=on";
#else
;
@@ -5872,7 +5875,9 @@
OPT_RECORD_RND_BUFFER, OPT_DIV_PRECINCREMENT, OPT_RELAY_LOG_SPACE_LIMIT,
OPT_RELAY_LOG_PURGE,
OPT_SLAVE_NET_TIMEOUT, OPT_SLAVE_COMPRESSED_PROTOCOL, OPT_SLOW_LAUNCH_TIME,
- OPT_SLAVE_TRANS_RETRIES, OPT_READONLY, OPT_ROWID_MERGE_BUFF_SIZE,
+ OPT_SLAVE_TRANS_RETRIES,
+ OPT_SUBQUERY_CACHE,
+ OPT_READONLY, OPT_ROWID_MERGE_BUFF_SIZE,
OPT_DEBUGGING, OPT_DEBUG_FLUSH,
OPT_SORT_BUFFER, OPT_TABLE_OPEN_CACHE, OPT_TABLE_DEF_CACHE,
OPT_THREAD_CONCURRENCY, OPT_THREAD_CACHE_SIZE,
@@ -7164,7 +7169,7 @@
{"optimizer_switch", OPT_OPTIMIZER_SWITCH,
"optimizer_switch=option=val[,option=val...], where option={index_merge, "
"index_merge_union, index_merge_sort_union, index_merge_intersection, "
- "index_condition_pushdown"
+ "index_condition_pushdown, subquery_cache"
#ifndef DBUG_OFF
", table_elimination"
#endif
@@ -7868,6 +7873,8 @@
{"Ssl_version", (char*) &show_ssl_get_version, SHOW_FUNC},
#endif /* HAVE_OPENSSL */
{"Syncs", (char*) &my_sync_count, SHOW_LONG_NOFLUSH},
+ {"Subquery_cache_hit", (char*) &subquery_cache_hit, SHOW_LONG},
+ {"Subquery_cache_miss", (char*) &subquery_cache_miss, SHOW_LONG},
{"Table_locks_immediate", (char*) &locks_immediate, SHOW_LONG},
{"Table_locks_waited", (char*) &locks_waited, SHOW_LONG},
#ifdef HAVE_MMAP
@@ -8006,6 +8013,7 @@
abort_loop= select_thread_in_use= signal_thread_in_use= 0;
ready_to_exit= shutdown_in_progress= grant_option= 0;
aborted_threads= aborted_connects= 0;
+ subquery_cache_miss= subquery_cache_hit= 0;
delayed_insert_threads= delayed_insert_writes= delayed_rows_in_use= 0;
delayed_insert_errors= thread_created= 0;
specialflag= 0;
=== modified file 'sql/sql_base.cc'
--- a/sql/sql_base.cc 2010-03-20 12:01:47 +0000
+++ b/sql/sql_base.cc 2010-05-31 21:22:40 +0000
@@ -8062,6 +8062,10 @@
if (*conds)
{
thd->where="where clause";
+ DBUG_EXECUTE("where",
+ print_where(*conds,
+ "WHERE in setup_conds",
+ QT_ORDINARY););
if ((!(*conds)->fixed && (*conds)->fix_fields(thd, conds)) ||
(*conds)->check_cols(1))
goto err_no_arena;
=== modified file 'sql/sql_class.cc'
--- a/sql/sql_class.cc 2010-03-20 12:01:47 +0000
+++ b/sql/sql_class.cc 2010-05-31 21:22:40 +0000
@@ -3020,6 +3020,7 @@
table_charset= 0;
precomputed_group_by= 0;
bit_fields_as_long= 0;
+ skip_create_table= 0;
DBUG_VOID_RETURN;
}
=== modified file 'sql/sql_class.h'
--- a/sql/sql_class.h 2010-03-20 12:01:47 +0000
+++ b/sql/sql_class.h 2010-05-31 21:22:40 +0000
@@ -2786,12 +2786,17 @@
that MEMORY tables cannot index BIT columns.
*/
bool bit_fields_as_long;
+ /*
+ Whether to create or postpone actual creation of this temporary table.
+ TRUE <=> create_tmp_table will create only the TABLE structure.
+ */
+ bool skip_create_table;
TMP_TABLE_PARAM()
:copy_field(0), group_parts(0),
group_length(0), group_null_parts(0), convert_blob_length(0),
schema_table(0), precomputed_group_by(0), force_copy_fields(0),
- bit_fields_as_long(0)
+ bit_fields_as_long(0), skip_create_table(0)
{}
~TMP_TABLE_PARAM()
{
=== modified file 'sql/sql_lex.cc'
--- a/sql/sql_lex.cc 2010-03-20 12:01:47 +0000
+++ b/sql/sql_lex.cc 2010-05-31 21:22:40 +0000
@@ -1829,6 +1829,52 @@
}
+/**
+  Registers a reference to an item on which the subqueries depend
+
+  @param last            pointer to the last st_select_lex struct, before
+                         which all st_select_lex have to be marked as
+                         dependent
+  @param dependency      reference to the item on which all these
+                         subqueries depend
+
+*/
+
+void st_select_lex::register_dependency_item(st_select_lex *last,
+ Item **dependency)
+{
+ SELECT_LEX *s= this;
+ DBUG_ENTER("st_select_lex::register_dependency_item");
+ DBUG_ASSERT(this != last);
+ DBUG_ASSERT(*dependency);
+ do
+ {
+ /* check duplicates */
+ List_iterator_fast<Item*> li(s->master_unit()->item->depends_on);
+ Item **dep;
+ while ((dep= li++))
+ {
+ if ((*dep)->eq(*dependency, FALSE))
+ {
+ DBUG_PRINT("info", ("dependency %s already present",
+ ((*dependency)->name ?
+ (*dependency)->name :
+ "<no name>")));
+ DBUG_VOID_RETURN;
+ }
+ }
+
+ s->master_unit()->item->depends_on.push_back(dependency);
+ DBUG_PRINT("info", ("depends_on: Select: %d added: %s",
+ s->select_number,
+ ((*dependency)->name ?
+ (*dependency)->name :
+ "<no name>")));
+ } while ((s= s->outer_select()) != last && s != 0);
+ DBUG_VOID_RETURN;
+}
+
+
/*
st_select_lex_node::mark_as_dependent mark all st_select_lex struct from
this to 'last' as dependent
@@ -1843,7 +1889,7 @@
bool st_select_lex::mark_as_dependent(THD *thd, st_select_lex *last, Item *dependency)
{
-
+ DBUG_ENTER("st_select_lex::mark_as_dependent");
DBUG_ASSERT(this != last);
/*
@@ -1872,11 +1918,11 @@
Item_subselect *subquery_expr= s->master_unit()->item;
if (subquery_expr && subquery_expr->mark_as_dependent(thd, last,
dependency))
- return TRUE;
+ DBUG_RETURN(TRUE);
} while ((s= s->outer_select()) != last && s != 0);
is_correlated= TRUE;
this->master_unit()->item->is_correlated= TRUE;
- return FALSE;
+ DBUG_RETURN(FALSE);
}
bool st_select_lex_node::set_braces(bool value) { return 1; }
=== modified file 'sql/sql_lex.h'
--- a/sql/sql_lex.h 2010-03-20 12:01:47 +0000
+++ b/sql/sql_lex.h 2010-05-31 21:22:40 +0000
@@ -748,6 +748,7 @@
}
bool mark_as_dependent(THD *thd, st_select_lex *last, Item *dependency);
+ void register_dependency_item(st_select_lex *last, Item **dependency);
bool set_braces(bool value);
bool inc_in_sum_expr();
=== modified file 'sql/sql_select.cc'
--- a/sql/sql_select.cc 2010-05-10 13:46:08 +0000
+++ b/sql/sql_select.cc 2010-05-31 21:22:40 +0000
@@ -151,7 +151,6 @@
static int join_read_system(JOIN_TAB *tab);
static int join_read_const(JOIN_TAB *tab);
static int join_read_key(JOIN_TAB *tab);
-static int join_read_key2(JOIN_TAB *tab, TABLE *table, TABLE_REF *table_ref);
static void join_read_key_unlock_row(st_join_table *tab);
static int join_read_always_key(JOIN_TAB *tab);
static int join_read_last_key(JOIN_TAB *tab);
@@ -5209,7 +5208,7 @@
'join->best_positions' contains a complete optimal extension of the
current partial QEP.
*/
- DBUG_EXECUTE("opt", print_plan(join, join->tables,
+ DBUG_EXECUTE("opt", print_plan(join, n_tables,
record_count, read_time, read_time,
"optimal"););
DBUG_RETURN(FALSE);
@@ -7625,6 +7624,40 @@
/**
+  Creates and fills a JOIN_TAB for an index lookup in a temporary table
+
+  @param table      The table to look up in
+
+ @return JOIN_TAB object or NULL in case of error
+*/
+
+JOIN_TAB *create_index_lookup_join_tab(TABLE *table)
+{
+ JOIN_TAB *tab;
+ DBUG_ENTER("create_index_lookup_join_tab");
+
+ if (!((tab= new JOIN_TAB)))
+ DBUG_RETURN(NULL);
+ tab->read_record.table= table;
+ tab->read_record.file=table->file;
+ /*tab->read_record.unlock_row= rr_unlock_row;*/
+ tab->next_select=0;
+ tab->sorted= 1;
+
+ table->status= STATUS_NO_RECORD;
+ tab->read_first_record= join_read_key;
+ /*tab->read_record.unlock_row= join_read_key_unlock_row;*/
+ tab->read_record.read_record= join_no_more_records;
+ if (table->covering_keys.is_set(tab->ref.key) &&
+ !table->no_keyread)
+ {
+ table->key_read=1;
+ table->file->extra(HA_EXTRA_KEYREAD);
+ }
+ DBUG_RETURN(tab);
+}
+
+/**
Give error if we some tables are done with a full join.
This is used by multi_table_update and multi_table_delete when running
@@ -10778,6 +10811,7 @@
case Item::REF_ITEM:
case Item::NULL_ITEM:
case Item::VARBIN_ITEM:
+ case Item::CACHE_ITEM:
if (make_copy_field)
{
DBUG_ASSERT(((Item_result_field*)item)->result_field);
@@ -11552,7 +11586,8 @@
¶m->recinfo, select_options))
goto err;
}
- if (open_tmp_table(table))
+ DBUG_PRINT("info", ("skip_create_table: %d", (int)param->skip_create_table));
+ if (!param->skip_create_table && open_tmp_table(table))
goto err;
thd->mem_root= mem_root_save;
@@ -11700,16 +11735,17 @@
bool open_tmp_table(TABLE *table)
{
int error;
+ DBUG_ENTER("open_tmp_table");
if ((error= table->file->ha_open(table, table->s->table_name.str, O_RDWR,
HA_OPEN_TMP_TABLE |
HA_OPEN_INTERNAL_TABLE)))
{
table->file->print_error(error,MYF(0)); /* purecov: inspected */
table->db_stat=0;
- return(1);
+ DBUG_RETURN(1);
}
(void) table->file->extra(HA_EXTRA_QUICK); /* Faster */
- return(0);
+ DBUG_RETURN(0);
}
@@ -12540,7 +12576,8 @@
else
{
/* Do index lookup in the materialized table */
- if ((res= join_read_key2(join_tab, sjm->table, sjm->tab_ref)) == 1)
+ if ((res= join_read_key2(join_tab->join->thd, join_tab,
+ sjm->table, sjm->tab_ref)) == 1)
DBUG_RETURN(NESTED_LOOP_ERROR); /* purecov: inspected */
if (res || !sjm->in_equality->val_int())
DBUG_RETURN(NESTED_LOOP_NO_MORE_ROWS);
@@ -13323,61 +13360,62 @@
static int
join_read_key(JOIN_TAB *tab)
{
- return join_read_key2(tab, tab->table, &tab->ref);
+ return join_read_key2(tab->join->thd, tab, tab->table, &tab->ref);
}
-/*
+/*
eq_ref access handler but generalized a bit to support TABLE and TABLE_REF
not from the join_tab. See join_read_key for detailed synopsis.
*/
-static int
-join_read_key2(JOIN_TAB *tab, TABLE *table, TABLE_REF *table_ref)
+int join_read_key2(THD *thd, JOIN_TAB *tab, TABLE *table, TABLE_REF *table_ref)
{
int error;
+ DBUG_ENTER("join_read_key2");
if (!table->file->inited)
{
table->file->ha_index_init(table_ref->key, tab->sorted);
}
/* TODO: Why don't we do "Late NULLs Filtering" here? */
- if (cmp_buffer_with_ref(tab->join->thd, table, table_ref) ||
+ if (cmp_buffer_with_ref(thd, table, table_ref) ||
(table->status & (STATUS_GARBAGE | STATUS_NO_PARENT | STATUS_NULL_ROW)))
{
if (table_ref->key_err)
{
table->status=STATUS_NOT_FOUND;
- return -1;
+ DBUG_RETURN(-1);
}
/*
Moving away from the current record. Unlock the row
in the handler if it did not match the partial WHERE.
*/
- if (tab->ref.has_record && tab->ref.use_count == 0)
+  if (table_ref->has_record && table_ref->use_count == 0)
{
tab->read_record.file->unlock_row();
- tab->ref.has_record= FALSE;
+ table_ref->has_record= FALSE;
}
error=table->file->ha_index_read_map(table->record[0],
table_ref->key_buff,
make_prev_keypart_map(table_ref->key_parts),
HA_READ_KEY_EXACT);
if (error && error != HA_ERR_KEY_NOT_FOUND && error != HA_ERR_END_OF_FILE)
- return report_error(table, error);
+ DBUG_RETURN(report_error(table, error));
if (! error)
{
- tab->ref.has_record= TRUE;
- tab->ref.use_count= 1;
+ table_ref->has_record= TRUE;
+ table_ref->use_count= 1;
}
}
else if (table->status == 0)
{
- DBUG_ASSERT(tab->ref.has_record);
- tab->ref.use_count++;
+ DBUG_ASSERT(table_ref->has_record);
+ table_ref->use_count++;
}
table->null_row=0;
- return table->status ? -1 : 0;
+ DBUG_RETURN(table->status ? -1 : 0);
}
=== modified file 'sql/table.cc'
--- a/sql/table.cc 2010-03-20 12:01:47 +0000
+++ b/sql/table.cc 2010-05-31 21:22:40 +0000
@@ -20,6 +20,7 @@
#include "sql_trigger.h"
#include <m_ctype.h>
#include "my_md5.h"
+#include "my_bit.h"
/* INFORMATION_SCHEMA name */
LEX_STRING INFORMATION_SCHEMA_NAME= {C_STRING_WITH_LEN("information_schema")};
@@ -5096,6 +5097,115 @@
file->column_bitmaps_signal();
}
+
+/**
+ @brief
+ Allocate space for keys
+
+ @param key_count number of keys to allocate.
+
+ @details
+    Allocate enough space to fit 'key_count' keys for this table.
+
+  @return FALSE space was successfully allocated.
+  @return TRUE  an error occurred.
+*/
+
+bool TABLE::alloc_keys(uint key_count)
+{
+ DBUG_ASSERT(!s->keys);
+ key_info= s->key_info= (KEY*) alloc_root(&mem_root, sizeof(KEY)*key_count);
+ max_keys= key_count;
+ return !(key_info);
+}
+
+
+/**
+ @brief Adds one key to a temporary table.
+
+ @param key key number.
+ @param key_parts number of fields in the key
+ @param next_field_no function which returns field numbers which
+ should be included in the key
+  @param arg            argument for the above function
+
+  @return FALSE the key was successfully added.
+  @return TRUE  an error occurred.
+*/
+
+bool TABLE::add_tmp_key(uint key, uint key_parts,
+ uint (*next_field_no) (uchar *), uchar *arg)
+{
+ DBUG_ASSERT(key < max_keys);
+
+ char buf[NAME_CHAR_LEN];
+ KEY* keyinfo;
+ Field **reg_field;
+ uint i;
+ bool key_start= TRUE;
+ KEY_PART_INFO* key_part_info=
+ (KEY_PART_INFO*) alloc_root(&mem_root, sizeof(KEY_PART_INFO)*key_parts);
+ if (!key_part_info)
+ return TRUE;
+ keyinfo= key_info + key;
+ keyinfo->key_part= key_part_info;
+ keyinfo->usable_key_parts= keyinfo->key_parts = key_parts;
+ keyinfo->key_length=0;
+ keyinfo->algorithm= HA_KEY_ALG_UNDEF;
+ keyinfo->flags= HA_GENERATED_KEY;
+ sprintf(buf, "key%i", key);
+ if (!(keyinfo->name= strdup_root(&mem_root, buf)))
+ return TRUE;
+ keyinfo->rec_per_key= (ulong*) alloc_root(&mem_root,
+ sizeof(ulong)*key_parts);
+ if (!keyinfo->rec_per_key)
+ return TRUE;
+ bzero(keyinfo->rec_per_key, sizeof(ulong)*key_parts);
+ for (i= 0; i < key_parts; i++)
+ {
+ reg_field= field + next_field_no(arg);
+ if (key_start)
+ (*reg_field)->key_start.set_bit(key);
+ key_start= FALSE;
+ (*reg_field)->part_of_key.set_bit(key);
+ (*reg_field)->flags|= PART_KEY_FLAG;
+ key_part_info->null_bit= (*reg_field)->null_bit;
+ key_part_info->null_offset= (uint) ((*reg_field)->null_ptr -
+ (uchar*) record[0]);
+ key_part_info->field= *reg_field;
+ key_part_info->offset= (*reg_field)->offset(record[0]);
+ key_part_info->length= (uint16) (*reg_field)->pack_length();
+ keyinfo->key_length+= key_part_info->length;
+ /* TODO:
+ The below method of computing the key format length of the
+ key part is a copy/paste from opt_range.cc, and table.cc.
+ This should be factored out, e.g. as a method of Field.
+ In addition it is not clear if any of the Field::*_length
+ methods is supposed to compute the same length. If so, it
+ might be reused.
+ */
+ key_part_info->store_length= key_part_info->length;
+
+ if ((*reg_field)->real_maybe_null())
+ key_part_info->store_length+= HA_KEY_NULL_LENGTH;
+ if ((*reg_field)->type() == MYSQL_TYPE_BLOB ||
+ (*reg_field)->real_type() == MYSQL_TYPE_VARCHAR)
+ key_part_info->store_length+= HA_KEY_BLOB_LENGTH;
+
+ key_part_info->type= (uint8) (*reg_field)->key_type();
+ key_part_info->key_type =
+ ((ha_base_keytype) key_part_info->type == HA_KEYTYPE_TEXT ||
+ (ha_base_keytype) key_part_info->type == HA_KEYTYPE_VARTEXT1 ||
+ (ha_base_keytype) key_part_info->type == HA_KEYTYPE_VARTEXT2) ?
+ 0 : FIELDFLAG_BINARY;
+ key_part_info++;
+ }
+ set_if_bigger(s->max_key_length, keyinfo->key_length);
+ s->keys++;
+ return FALSE;
+}
+
+
/**
@brief Check if this is part of a MERGE table with attached children.
=== modified file 'sql/table.h'
--- a/sql/table.h 2010-03-20 12:01:47 +0000
+++ b/sql/table.h 2010-05-31 21:22:40 +0000
@@ -781,6 +781,7 @@
uint temp_pool_slot; /* Used by intern temp tables */
uint status; /* What's in record[0] */
uint db_stat; /* mode of file as in handler.h */
+ uint max_keys; /* Size of allocated key_info array. */
/* number of select if it is derived table */
uint derived_select_number;
int current_lock; /* Type of lock on table */
@@ -913,6 +914,9 @@
*/
inline bool needs_reopen_or_name_lock()
{ return s->version != refresh_version; }
+ bool alloc_keys(uint key_count);
+ bool add_tmp_key(uint key, uint key_parts,
+ uint (*next_field_no) (uchar *), uchar *arg);
bool is_children_attached(void);
};
=== modified file 'storage/maria/ha_maria.cc'
--- a/storage/maria/ha_maria.cc 2010-03-20 12:01:47 +0000
+++ b/storage/maria/ha_maria.cc 2010-05-31 21:22:40 +0000
@@ -995,6 +995,8 @@
{
MARIA_HA *tmp= file;
file= 0;
+ if (!tmp)
+ return 0;
return maria_close(tmp);
}
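
As a quick way to exercise the feature added by this patch, the following
smoke test reuses the query from the subselect3.test hunk above. It is an
untested sketch: tables t1/t2 and their columns are assumed to exist as in
that test, while the optimizer_switch flag and the Subquery_cache_hit/miss
status counters are the ones introduced by this patch.

set optimizer_switch='subquery_cache=on';
flush status;
select oref, a, a in (select a from t1 where oref=t2.oref) Z from t2;
show status like 'Subquery_cache%';

Repeated oref values in t2 should show up as Subquery_cache_hit, while each
new parameter combination is counted as Subquery_cache_miss.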
[Maria-developers] IRC log of hingo and cafuego about the deb packages still named "mysql"
by Henrik Ingo 31 May '10
---------- Forwarded Message ----------
Subject: IRC log of hingo and cafuego about the deb packages still named "mysql"
Date: Friday 28 May 2010
From: Henrik Ingo <hingo(a)askmonty.org>
To: maria-developers(a)lists.launchpad.net
Archiving this here for later reference.
The background is that current MariaDB packaging (which is based on ourdelta)
still has a few packages called mysql-something instead of mariadb-something.
This works for ourdelta, but since you obviously cannot have 2 identically
named packages in the same repository, this is a showstopper for getting
MariaDB into Debian.
My gut feeling about the log below is that this is a bug in apt. If there is
one package called mysql-common (which is one of the problematic packages) and
one called mariadb-common that Provides: mysql-common, then if the user
chooses to install mariadb-* and uninstall mysql-*, apt should be happy and
let the user do that.
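
As an illustration only (a hypothetical debian/control stanza, not the actual
ourdelta packaging), the renamed package would carry something like:

  Package: mariadb-common
  Provides: mysql-common
  Replaces: mysql-common
  Conflicts: mysql-common

The catch, as the log below explains, is that a Provides cannot be versioned,
so a versioned dependency (like the libmysqlclient15off (>= 5.0.27-1) one
cafuego mentions) can only be satisfied by a real package of that name.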
[12:54:43] <hingo> cafuego?
[12:55:24] <evil_steve> is the six foot something dutch guy in the corner
nursing the fruitiest, girliest drink in the place.
[12:56:20] <capitol> ^^
[13:03:57] <cafuego> hingo: yes?
[13:04:16] <hingo> cafuego: You are the one doing the ourdelta deb packages?
[13:04:30] <cafuego> Yup
[13:08:30] <-- monk-eeee (~monk-
eeee(a)c220-237-92-67.kelvn3.qld.optusnet.com.au) has quit (Quit: Computer has
gone to sleep)
[13:13:06] --> monk-eeee (~monk-
eeee(a)c220-237-92-67.kelvn3.qld.optusnet.com.au) has joined #ourdelta
[13:14:42] <hingo> Oh sorry, I drifted off...
[13:15:27] <hingo> So, I was looking into some old emails and returned to the
fact we ship a "mysql-common" package with MariaDB.
[13:16:00] <hingo> Arjen says this is because some other debian package is
"hard-coded" to depend on that, but he is never able to remember more details.
Do you?
[13:17:01] <hingo> Details such as 1) do you remember which packages in debian
break if we rename it to mariadb-common and 2) since packages do "depends" and
"provides", how is it even possible to depend on the package name so that it
cannot be solved with a provides:?
[13:17:11] <hingo> cafuego ^
[13:37:32] <-- monk-eeee (~monk-
eeee(a)c220-237-92-67.kelvn3.qld.optusnet.com.au) has quit (Quit: Computer has
gone to sleep)
[13:43:18] <cafuego> hingo: ummm... i think it was perl-dbi or somesuch
[13:44:16] <cafuego> hingo: The problem was that Provides can't be versioned,
so the distro pkg always wins.
[13:44:43] <hingo> cafuego: Ok, that makes more sense.
[13:45:01] <cafuego> hingo: A friend suggested sticking an empty mysql-common
package in and upping the epoch on that so the distro always loses :-)
[13:45:38] <hingo> cafuego: So perl-dbi depends always on a specific version.
[13:45:55] <cafuego> hingo: No, but it always grabs the newest version.
[13:46:09] <cafuego> I think, let me check
[13:46:36] <cafuego> wrong pkg
[13:47:38] <cafuego> libdbd-mysql-perl
[13:48:13] <cafuego> On my box that has a versioned depend on
libmysqlclient15off (>= 5.0.27-1)
[13:48:25] <hingo> cafuego: yes, but that is perl-dbi in commonspeak :-)
[13:48:34] <cafuego> ;-)
[13:48:53] <cafuego> So unless I stick libmysqlclient15off in my pkg it'll
always keep the distro version.
[13:49:01] <cafuego> a provides won't do it :-(
[13:49:16] <hingo> cafuego: So not actually mysql-common as such, just that
file?
[13:49:41] <cafuego> Yeah I think so. It's been a while since I worked in it.
[13:49:45] <cafuego> s/in/on/
[13:50:15] <cafuego> The client won't install without libdbd-mysql-perl and
libdbd-mysql-perl won't install without libmysqlclient15off
[13:50:43] --> monk-eeee (~monk-
eeee(a)c220-237-92-67.kelvn3.qld.optusnet.com.au) has joined #ourdelta
[13:51:03] <hingo> cafuego: So why is the package name relevant at all then?
it depends on a file name of a library. Can't the package name be called
anything?
[13:51:15] <cafuego> hingo: sorry?
[13:51:27] <cafuego> hingo: I'm only talking pkg names here.
[13:51:41] <hingo> the package name is mysql-common
[13:51:54] <cafuego> So with my maria 5.1.42-mariadb68 package
[13:53:01] <hingo> mysql-common_5.1.42-mariadb68_all.deb
[13:53:22] <cafuego> mariadb-client-5.1 depends on libdbd-mysql-perl (which is
provided by the distro, and thus I can't edit its depends) depends on
libmysqlclient16 (>= 5.1.21-1)
[13:54:02] <cafuego> if I create libmariadbclient16 with a Provides:
libmysqlclient16 the distro will NOT install that in preference
[13:54:12] <hingo> Ok, I get the libmysqlclient packages. I was speaking about
mysql-common in http://mirror.ourdelta.org/deb/dists/lenny/mariadb-ourdelta/
[13:55:52] <cafuego> AH yep. So something depends on libmysqlclient15off
[13:57:00] <cafuego> I don't think I have a lenny box handy :-/
[13:57:18] <hingo> But libmysqlclient15off is a separate package?
[13:57:22] <cafuego> yes
[13:57:43] <cafuego> that's the name of the pkg provided by the distro, that
would override a Provides in the maria packages
[13:59:38] <hingo> cafuego: I'm confused. mariadb-common does not contain
libmysqlclient15off or any other libmysqlclient.
[13:59:39] <cafuego> There's 134 packages in Lenny that depend on
libmysqlclient15off.
[14:00:02] <hingo> I mean of course mysql-common from the mariadb repo.
[14:00:47] <cafuego> I swear I had a good reason at the time ;-)
[14:01:38] <cafuego> Oh that's right.
[14:01:40] <hingo> Ok. I can see how there could be a similar reason as for
libmysqlclient* problems. You kind of answered my second question anyway.
[14:01:56] <cafuego> Packages depend on libmysqlclient15off and
libmysqlclient15off depends on mysql-common.
[14:02:09] <hingo> ok.
[14:02:24] <hingo> Yes, of course.
[14:02:44] <cafuego> libmysqlclient15off has a versioned depend, so a provides
line in mariadb-common doesn't override that
[14:03:21] <cafuego> I think I got stuck in circular dependency land and
yelled at my machine a lot. Then I decided to just not rename everything :-)
[14:03:36] <hingo> Hmm... I bet that versioned depend isn't really necessary.
It's just a .cnf file there...
[14:04:08] <cafuego> hingo: probably, but it's a distro pkg so I can't change
it without actually providing a package of that name anyway.
[14:04:25] <hingo> cafuego: No, of course.
[14:04:40] <cafuego> So it went into the "currently unfixable" basket
[14:04:45] <cafuego> Well
[14:05:25] <cafuego> It's easily fixable as long as I don't expect users to
want to simply 'aptitude upgrade', but instead download the dependencies and
manually install them with dpkg.
[14:05:50] <hingo> cafuego: Btw, do you have an idea why RPM based systems
avoid this same problem? For them it works with a Provides?
[14:06:11] <hingo> cafuego: No we of course want to support apt-get/aptitude.
[14:06:33] <hingo> cafuego: We are looking into getting MariaDB into Debian
itself, but then we cannot have 2 packages with the same name.
[14:07:18] <cafuego> Ah yes.
[14:07:46] <cafuego> Well, in *theory* they shouldn't have that awful depend
in squeeze
[14:07:47] <hingo> This smells like an apt bug to me actually.
--
Henrik Ingo
Project Manager and COO, Monty Program Ab
hingo(a)askmonty.org, skype:henrik.ingo, +358405697354
http://askmonty.org/wiki/index.php/About_Us
What's up with MariaDB?
http://askmonty.org/wiki/index.php/MariaDB
-------------------------------------------------------
--
email: henrik.ingo(a)avoinelama.fi
tel: +358-40-5697354
www: www.avoinelama.fi/~hingo
book: www.openlife.cc

[Maria-developers] Progress (by Knielsen): Store in binlog text of statements that caused RBR events (47)
by worklog-noreply@askmonty.org 31 May '10
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: Store in binlog text of statements that caused RBR events
CREATION DATE..: Sat, 15 Aug 2009, 23:48
SUPERVISOR.....: Monty
IMPLEMENTOR....:
COPIES TO......: Knielsen, Serg
CATEGORY.......: Server-Sprint
TASK ID........: 47 (http://askmonty.org/worklog/?tid=47)
VERSION........: Server-9.x
STATUS.........: Code-Review
PRIORITY.......: 60
WORKED HOURS...: 35
ESTIMATE.......: 0 (hours remain)
ORIG. ESTIMATE.: 35
PROGRESS NOTES:
-=-=(Knielsen - Mon, 31 May 2010, 06:49)=-=-
Help Alexi debug+fix some test problems in the patch.
Worked 4 hours and estimate 0 hours remain (original estimate unchanged).
-=-=(Knielsen - Tue, 25 May 2010, 08:29)=-=-
Help debug strange problem in mysqlbinlog.test.
Worked 1 hour and estimate 4 hours remain (original estimate unchanged).
-=-=(Knielsen - Mon, 17 May 2010, 08:45)=-=-
Merge with latest trunk and run Buildbot tests.
Worked 1 hour and estimate 5 hours remain (original estimate unchanged).
-=-=(Knielsen - Wed, 05 May 2010, 13:53)=-=-
Review of fixes to first review done. No new issues found.
Worked 2 hours and estimate 6 hours remain (original estimate unchanged).
-=-=(Knielsen - Fri, 23 Apr 2010, 12:51)=-=-
Status updated.
--- /tmp/wklog.47.old.28747 2010-04-23 12:51:36.000000000 +0000
+++ /tmp/wklog.47.new.28747 2010-04-23 12:51:36.000000000 +0000
@@ -1 +1 @@
-In-Progress
+Code-Review
-=-=(Knielsen - Tue, 06 Apr 2010, 15:26)=-=-
Code review (mailed to maria-developers@).
Worked 7 hours and estimate 8 hours remain (original estimate unchanged).
-=-=(Knielsen - Tue, 06 Apr 2010, 15:25)=-=-
Status updated.
--- /tmp/wklog.47.old.12734 2010-04-06 15:25:54.000000000 +0000
+++ /tmp/wklog.47.new.12734 2010-04-06 15:25:54.000000000 +0000
@@ -1 +1 @@
-Code-Review
+In-Progress
-=-=(Knielsen - Mon, 29 Mar 2010, 10:59)=-=-
Status updated.
--- /tmp/wklog.47.old.27790 2010-03-29 10:59:53.000000000 +0000
+++ /tmp/wklog.47.new.27790 2010-03-29 10:59:53.000000000 +0000
@@ -1 +1 @@
-In-Progress
+Code-Review
-=-=(Alexi - Thu, 18 Feb 2010, 19:29)=-=-
Worked 20 hours (alexi)
Worked 20 hours and estimate 15 hours remain (original estimate unchanged).
-=-=(Serg - Fri, 05 Feb 2010, 14:04)=-=-
Observers changed: Knielsen,Serg
------------------------------------------------------------
-=-=(View All Progress Notes, 32 total)=-=-
http://askmonty.org/worklog/index.pl?tid=47&nolimit=1
DESCRIPTION:
Store in binlog (and show in mysqlbinlog output) texts of statements that
caused RBR events
This is needed for (list from Monty):
- Easier to understand why updates happened
- Would make it easier to find out where in application things went
wrong (as you can search for exact strings)
- Allow one to filter things based on comments in the statement.
The cost of this can be that the binlog will be approximately 2x in size
(especially insert of big blob's would be a bit painful), so this should
be an optional feature.
HIGH-LEVEL SPECIFICATION:
Content
~~~~~~~
1. Annotate_rows_log_event
2. Server option: --binlog-annotate-rows-events
3. Server option: --replicate-annotate-rows-events
4. mysqlbinlog option: --print-annotate-rows-events
5. mysqlbinlog output
1. Annotate_rows_log_event [ ANNOTATE_ROWS_EVENT ]
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Describes the query which caused the corresponding rows events. Has empty
post-header and contains the query text in its data part. Example:
************************
ANNOTATE_ROWS_EVENT
************************
00000220 | B6 A0 2C 4B | time_when = 1261215926
00000224 | 33 | event_type = 51
00000225 | 64 00 00 00 | server_id = 100
00000229 | 36 00 00 00 | event_len = 54
0000022D | 56 02 00 00 | log_pos = 00000256
00000231 | 00 00 | flags = <none>
------------------------
00000233 | 49 4E 53 45 | query = "INSERT INTO t1 VALUES (1), (2), (3)"
00000237 | 52 54 20 49 |
0000023B | 4E 54 4F 20 |
0000023F | 74 31 20 56 |
00000243 | 41 4C 55 45 |
00000247 | 53 20 28 31 |
0000024B | 29 2C 20 28 |
0000024F | 32 29 2C 20 |
00000253 | 28 33 29 |
************************
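(A quick consistency check of the sizes above: event_len = 54 is
common_header_len (19, see the Format description event in the LLD part)
plus the 35 bytes of the query text; the post-header is empty.)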
In the binary log, the Annotate_rows event follows the (possible) 'BEGIN' Query
event and precedes the first of the Table map events which accompany the
corresponding rows events. (See the example in the "mysqlbinlog output" section
below.)
2. Server option: --binlog-annotate-rows-events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Tells the master to write Annotate_rows events to the binary log.
* Variable Name: binlog_annotate_rows_events
* Scope: Global & Session
* Access Type: Dynamic
* Data Type: bool
* Default Value: OFF
NOTE. The session value makes it possible to annotate only selected statements:
...
SET SESSION binlog_annotate_rows_events=ON;
... statements to be annotated ...
SET SESSION binlog_annotate_rows_events=OFF;
... statements not to be annotated ...
3. Server option: --replicate-annotate-rows-events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Tells the slave to reproduce Annotate_rows events received from the master
in its own binary log (sensible only in combination with the log-slave-updates
option).
* Variable Name: replicate_annotate_rows_events
* Scope: Global
* Access Type: Read only
* Data Type: bool
* Default Value: OFF
NOTE. Why do we additionally need this 'replicate' option? Why not make
the slave reproduce these events whenever its binlog-annotate-rows-events
global value is ON? Because, for example, we may want to configure a
slave which should reproduce Annotate_rows events but has global
binlog-annotate-rows-events = OFF, meaning this to be the default value for
the client threads (see also "How slave treats replicate-annotate-rows-events
option" in the LLD part).
4. mysqlbinlog option: --print-annotate-rows-events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
With this option, mysqlbinlog prints the content of Annotate_rows events (if
the binary log contains them). Without this option (i.e. by default),
mysqlbinlog skips Annotate_rows events.
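For example, to dump a binary log including the annotations (the file name is
illustrative):

mysqlbinlog --print-annotate-rows-events master-bin.000001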
5. mysqlbinlog output
~~~~~~~~~~~~~~~~~~~~~
With --print-annotate-rows-events, mysqlbinlog outputs Annotate_rows events
in a form like this:
...
# at 1646
#091219 12:45:26 server id 100 end_log_pos 1714 Query thread_id=1
exec_time=0 error_code=0
SET TIMESTAMP=1261215926/*!*/;
BEGIN
/*!*/;
# at 1714
# at 1812
# at 1853
# at 1894
# at 1938
#091219 12:45:26 server id 100 end_log_pos 1812 Query: `DELETE t1, t2 FROM
t1 INNER JOIN t2 INNER JOIN t3 WHERE t1.a=t2.a AND t2.a=t3.a`
#091219 12:45:26 server id 100 end_log_pos 1853 Table_map: `test`.`t1`
mapped to number 16
#091219 12:45:26 server id 100 end_log_pos 1894 Table_map: `test`.`t2`
mapped to number 17
#091219 12:45:26 server id 100 end_log_pos 1938 Delete_rows: table id 16
#091219 12:45:26 server id 100 end_log_pos 1982 Delete_rows: table id 17
flags: STMT_END_F
...
LOW-LEVEL DESIGN:
Content
~~~~~~~
1. Annotate_rows event number
2. Outline of Annotate_rows event behavior
3. How Master writes Annotate_rows events to the binary log
4. How slave treats replicate-annotate-rows-events option
5. How slave IO thread requests Annotate_rows events
6. How master executes the request
7. How slave SQL thread processes Annotate_rows events
8. General remarks
1. Annotate_rows event number
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
To avoid possible event number conflicts with MySQL/Sun, we leave a gap
between the last MySQL event number and the Annotate_rows event number:
enum Log_event_type
{ ...
INCIDENT_EVENT= 26,
// New MySQL event numbers are to be added here
MYSQL_EVENTS_END,
MARIA_EVENTS_BEGIN= 51,
// New Maria event numbers start from here
ANNOTATE_ROWS_EVENT= 51,
ENUM_END_EVENT
};
together with the corresponding extension of the 'post_header_len' array in the
Format description event. (This extension does not affect the compatibility
of the binary log.) Here is how the Format description event looks with
this extension:
************************
FORMAT_DESCRIPTION_EVENT
************************
00000004 | A1 A0 2C 4B | time_when = 1261215905
00000008 | 0F | event_type = 15
00000009 | 64 00 00 00 | server_id = 100
0000000D | 7F 00 00 00 | event_len = 127
00000011 | 83 00 00 00 | log_pos = 00000083
00000015 | 01 00 | flags = LOG_EVENT_BINLOG_IN_USE_F
------------------------
00000017 | 04 00 | binlog_ver = 4
00000019 | 35 2E 32 2E | server_ver = 5.2.0-MariaDB-alpha-debug-log
..... ...
0000004B | A1 A0 2C 4B | time_created = 1261215905
0000004F | 13 | common_header_len = 19
------------------------
post_header_len
------------------------
00000050 | 38 | 56 - START_EVENT_V3 [1]
..... ...
00000069 | 02 | 2 - INCIDENT_EVENT [26]
0000006A | 00 | 0 - RESERVED [27]
..... ...
00000081 | 00 | 0 - RESERVED [50]
00000082 | 00 | 0 - ANNOTATE_ROWS_EVENT [51]
************************
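The corresponding constructor change could look as follows (a sketch, assuming
the usual convention that post_header_len is indexed by event_type - 1):

Format_description_log_event::Format_description_log_event(...)
{ ...
  /* Entries [27..50] stay 0 (reserved); Annotate_rows has an empty
     post-header, so its entry is 0 as well */
  post_header_len[ANNOTATE_ROWS_EVENT-1]= 0;
  ...
}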
2. Outline of Annotate_rows event behavior
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Each Annotate_rows_log_event object has two private members describing the
corresponding query:
char *m_query_txt;
uint m_query_len;
When the object is created for writing to a binary log, this query is taken
from 'thd' (for short, below we omit the 'Annotate_rows_log_event::' prefix
as well as other implementation details):
Annotate_rows_log_event(THD *thd)
{
m_query_txt = thd->query();
m_query_len = thd->query_length();
}
When the object is read from a binary log, the query is taken from the buffer
containing the binary log representation of the event (this buffer is allocated
in the Log_event object from which all log events are derived):
Annotate_rows_log_event(char *buf, uint event_len,
Format_description_log_event *desc)
{
m_query_len = event_len - desc->common_header_len;
m_query_txt = buf + desc->common_header_len;
}
The events are written to the binary log by the Log_event::write() member
which calls the virtual write_data_header() and write_data_body() members
("data header" and "post header" are synonyms in replication terminology).
In our case, the data header is empty and the data body is just the query:
bool write_data_body(IO_CACHE *file)
{
return my_b_safe_write(file, (uchar*) m_query_txt, m_query_len);
}
Printing the event is just printing the query:
void Annotate_rows_log_event::print(FILE *file, PRINT_EVENT_INFO *pinfo)
{
my_b_printf(&pinfo->head_cache, "\tQuery: `%s`\n", m_query_txt);
}
3. How Master writes Annotate_rows events to the binary log
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The event is written to the binary log just before the group of Table_map
events which precede the corresponding Rows events (one query may generate
several Table map events in the binary log, but the corresponding
Annotate_rows event must be written only once, before the first Table map
event; hence the boolean variable 'with_annotate' below):
int write_locked_table_maps(THD *thd)
{ ...
bool with_annotate= thd->variables.binlog_annotate_rows_events;
...
for (uint i= 0; i < ... <number of tables> ...; ++i)
{ ...
thd->binlog_write_table_map(table, ..., with_annotate);
with_annotate= 0; // write Annotate_event not more than once
...
}
...
}
int THD::binlog_write_table_map(TABLE *table, ..., bool with_annotate)
{ ...
Table_map_log_event the_event(...);
...
if (with_annotate)
{
Annotate_rows_log_event anno(this);
mysql_bin_log.write(&anno);
}
mysql_bin_log.write(&the_event);
...
}
4. How slave treats replicate-annotate-rows-events option
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The replicate-annotate-rows-events option is treated just as the session
value of the binlog_annotate_rows_events variable for the slave IO and
SQL threads. This setting is done during initialization of these threads:
pthread_handler_t handle_slave_io(void *arg)
{
THD *thd= new THD;
...
init_slave_thread(thd, SLAVE_THD_IO);
...
}
pthread_handler_t handle_slave_sql(void *arg)
{
THD *thd= new THD;
...
init_slave_thread(thd, SLAVE_THD_SQL);
...
}
int init_slave_thread(THD* thd, SLAVE_THD_TYPE thd_type)
{ ...
thd->variables.binlog_annotate_rows_events=
opt_replicate_annotate_rows_events;
...
}
5. How slave IO thread requests Annotate_rows events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
If the replicate-annotate-rows-events option is not set on a slave, there
is no need for the master to send Annotate_rows events to this slave. The slave
(or mysqlbinlog in the remote case), before requesting a binlog dump via the
COM_BINLOG_DUMP command, informs the master whether it should send these
events by executing the newly added COM_BINLOG_DUMP_OPTIONS_EXT server
command:
case COM_BINLOG_DUMP_OPTIONS_EXT:
thd->binlog_dump_flags_ext= packet[0];
my_ok(thd);
break;
Note. We add this new command and don't use COM_BINLOG_DUMP to avoid possible
conflicts with MySQL/Sun.
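How the slave IO thread issues this command before COM_BINLOG_DUMP is not
spelled out above; a minimal sketch (the helper name and error handling are
illustrative):

static int request_dump_options_ext(MYSQL *mysql)
{
  uchar buf[1];
  buf[0]= opt_replicate_annotate_rows_events ?
          BINLOG_SEND_ANNOTATE_ROWS_EVENT : 0;
  /* simple_command() sends the COM_* packet and reads the OK reply */
  return simple_command(mysql, COM_BINLOG_DUMP_OPTIONS_EXT, buf, 1, 0);
}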
6. How master executes the request
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
case COM_BINLOG_DUMP:
{ ...
flags= uint2korr(packet + 4);
...
mysql_binlog_send(thd, ..., flags);
...
}
void mysql_binlog_send(THD* thd, ..., ushort flags)
{ ...
Log_event::read_log_event(&log, packet, ...);
...
if ((*packet)[EVENT_TYPE_OFFSET + 1] != ANNOTATE_ROWS_EVENT ||
flags & BINLOG_SEND_ANNOTATE_ROWS_EVENT)
{
my_net_write(net, packet->ptr(), packet->length());
}
...
}
7. How slave SQL thread processes Annotate_rows events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The slave processes each received event by "applying" it, i.e. by
calling the Log_event::apply_event() function which in turn calls
the virtual do_apply_event() member specific for each type of the
event.
int exec_relay_log_event(THD* thd, Relay_log_info* rli)
{ ...
Log_event *ev = next_event(rli);
...
apply_event_and_update_pos(ev, ...);
if (ev->get_type_code() != FORMAT_DESCRIPTION_EVENT)
delete ev;
...
}
int apply_event_and_update_pos(Log_event *ev, ...)
{ ...
ev->apply_event(...);
...
}
int Log_event::apply_event(...)
{
return do_apply_event(...);
}
What does it mean to "apply" an Annotate_rows event? It means setting the
current thd query to the one described by the event, i.e. to the query which
caused the subsequent Rows events (see "How Master writes Annotate_rows
events to the binary log" to follow what happens further when the subsequent
Rows events are applied):
int Annotate_rows_log_event::do_apply_event(...)
{
thd->set_query(m_query_txt, m_query_len);
}
NOTE. I am not sure, but possibly the current values of thd->query and
thd->query_length should be saved before calling set_query() and restored
when the Annotate_rows_log_event object is deleted.
Is it really needed?
After calling this do_apply_event() function we may not delete the
Annotate_rows_log_event object immediately (see exec_relay_log_event()
above) because thd->query now points to the string inside this object.
We may keep the pointer to this object in the Relay_log_info:
class Relay_log_info
{
public:
...
void set_annotate_event(Annotate_rows_log_event*);
Annotate_rows_log_event* get_annotate_event();
void free_annotate_event();
...
private:
Annotate_rows_log_event* m_annotate_event;
};
The saved Annotate_rows object should be deleted when all corresponding
Rows events have been processed:
int exec_relay_log_event(THD* thd, Relay_log_info* rli)
{ ...
Log_event *ev= next_event(rli);
...
apply_event_and_update_pos(ev, ...);
if (rli->get_annotate_event() && is_last_rows_event(ev))
rli->free_annotate_event();
else if (ev->get_type_code() == ANNOTATE_ROWS_EVENT)
rli->set_annotate_event((Annotate_rows_log_event*) ev);
else if (ev->get_type_code() != FORMAT_DESCRIPTION_EVENT)
delete ev;
...
}
where
bool is_last_rows_event(Log_event* ev)
{
Log_event_type type= ev->get_type_code();
if (IS_ROWS_EVENT_TYPE(type))
{
Rows_log_event* rows= (Rows_log_event*)ev;
return rows->get_flags(Rows_log_event::STMT_END_F);
}
return 0;
}
#define IS_ROWS_EVENT_TYPE(type) ((type) == WRITE_ROWS_EVENT || \
(type) == UPDATE_ROWS_EVENT || \
(type) == DELETE_ROWS_EVENT)
8. General remarks
~~~~~~~~~~~~~~~~~~
Kristian noticed that introducing a new log event type should be coordinated
somehow with MySQL/Sun:
Kristian: The numeric code for this event must be assigned carefully.
It should be coordinated with MySQL/Sun, otherwise we can get into a
situation where MySQL uses the same numeric code for one event that
MariaDB uses for ANNOTATE_ROWS_EVENT, which would make merging the two
impossible.
Alex: I reserved about 20 numbers to avoid possible conflicts
with MySQL.
Kristian: Still, I think it would be appropriate to send a polite email
to internals(a)lists.mysql.com about this, suggesting to reserve the
event number.
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v3.5.9)

[Maria-developers] Progress (by Knielsen): Store in binlog text of statements that caused RBR events (47)
by worklog-noreply@askmonty.org 31 May '10
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: Store in binlog text of statements that caused RBR events
CREATION DATE..: Sat, 15 Aug 2009, 23:48
SUPERVISOR.....: Monty
IMPLEMENTOR....:
COPIES TO......: Knielsen, Serg
CATEGORY.......: Server-Sprint
TASK ID........: 47 (http://askmonty.org/worklog/?tid=47)
VERSION........: Server-9.x
STATUS.........: Code-Review
PRIORITY.......: 60
WORKED HOURS...: 35
ESTIMATE.......: 0 (hours remain)
ORIG. ESTIMATE.: 35
PROGRESS NOTES:
-=-=(Knielsen - Mon, 31 May 2010, 06:49)=-=-
Help Alexi debug+fix some test problems in the patch.
Worked 4 hours and estimate 0 hours remain (original estimate unchanged).
-=-=(Knielsen - Tue, 25 May 2010, 08:29)=-=-
Help debug strange problem in mysqlbinlog.test.
Worked 1 hour and estimate 4 hours remain (original estimate unchanged).
-=-=(Knielsen - Mon, 17 May 2010, 08:45)=-=-
Merge with latest trunk and run Buildbot tests.
Worked 1 hour and estimate 5 hours remain (original estimate unchanged).
-=-=(Knielsen - Wed, 05 May 2010, 13:53)=-=-
Review of fixes to first review done. No new issues found.
Worked 2 hours and estimate 6 hours remain (original estimate unchanged).
-=-=(Knielsen - Fri, 23 Apr 2010, 12:51)=-=-
Status updated.
--- /tmp/wklog.47.old.28747 2010-04-23 12:51:36.000000000 +0000
+++ /tmp/wklog.47.new.28747 2010-04-23 12:51:36.000000000 +0000
@@ -1 +1 @@
-In-Progress
+Code-Review
-=-=(Knielsen - Tue, 06 Apr 2010, 15:26)=-=-
Code review (mailed to maria-developers@).
Worked 7 hours and estimate 8 hours remain (original estimate unchanged).
-=-=(Knielsen - Tue, 06 Apr 2010, 15:25)=-=-
Status updated.
--- /tmp/wklog.47.old.12734 2010-04-06 15:25:54.000000000 +0000
+++ /tmp/wklog.47.new.12734 2010-04-06 15:25:54.000000000 +0000
@@ -1 +1 @@
-Code-Review
+In-Progress
-=-=(Knielsen - Mon, 29 Mar 2010, 10:59)=-=-
Status updated.
--- /tmp/wklog.47.old.27790 2010-03-29 10:59:53.000000000 +0000
+++ /tmp/wklog.47.new.27790 2010-03-29 10:59:53.000000000 +0000
@@ -1 +1 @@
-In-Progress
+Code-Review
-=-=(Alexi - Thu, 18 Feb 2010, 19:29)=-=-
Worked 20 hours (alexi)
Worked 20 hours and estimate 15 hours remain (original estimate unchanged).
-=-=(Serg - Fri, 05 Feb 2010, 14:04)=-=-
Observers changed: Knielsen,Serg
------------------------------------------------------------
-=-=(View All Progress Notes, 32 total)=-=-
http://askmonty.org/worklog/index.pl?tid=47&nolimit=1
DESCRIPTION:
Store in binlog (and show in mysqlbinlog output) texts of statements that
caused RBR events
This is needed for (list from Monty):
- Easier to understand why updates happened
- Would make it easier to find out where in the application things went
wrong (as you can search for exact strings)
- Allow one to filter things based on comments in the statement.
The cost of this can be that the binlog will be approximately 2x in size
(especially inserts of big BLOBs would be a bit painful), so this should
be an optional feature.
HIGH-LEVEL SPECIFICATION:
Content
~~~~~~~
1. Annotate_rows_log_event
2. Server option: --binlog-annotate-rows-events
3. Server option: --replicate-annotate-rows-events
4. mysqlbinlog option: --print-annotate-rows-events
5. mysqlbinlog output
1. Annotate_rows_log_event [ ANNOTATE_ROWS_EVENT ]
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Describes the query which caused the corresponding rows events. Has empty
post-header and contains the query text in its data part. Example:
************************
ANNOTATE_ROWS_EVENT
************************
00000220 | B6 A0 2C 4B | time_when = 1261215926
00000224 | 33 | event_type = 51
00000225 | 64 00 00 00 | server_id = 100
00000229 | 36 00 00 00 | event_len = 54
0000022D | 56 02 00 00 | log_pos = 00000256
00000231 | 00 00 | flags = <none>
------------------------
00000233 | 49 4E 53 45 | query = "INSERT INTO t1 VALUES (1), (2), (3)"
00000237 | 52 54 20 49 |
0000023B | 4E 54 4F 20 |
0000023F | 74 31 20 56 |
00000243 | 41 4C 55 45 |
00000247 | 53 20 28 31 |
0000024B | 29 2C 20 28 |
0000024F | 32 29 2C 20 |
00000253 | 28 33 29 |
************************
In the binary log, the Annotate_rows event follows the (possible) 'BEGIN'
Query event and precedes the first of the Table map events which accompany
the corresponding rows events. (See example in the "mysqlbinlog output"
section below.)
2. Server option: --binlog-annotate-rows-events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Tells the master to write Annotate_rows events to the binary log.
* Variable Name: binlog_annotate_rows_events
* Scope: Global & Session
* Access Type: Dynamic
* Data Type: bool
* Default Value: OFF
NOTE. The session value allows annotating only selected statements:
...
SET SESSION binlog_annotate_rows_events=ON;
... statements to be annotated ...
SET SESSION binlog_annotate_rows_events=OFF;
... statements not to be annotated ...
3. Server option: --replicate-annotate-rows-events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Tells the slave to reproduce Annotate_rows events received from the master
in its own binary log (sensible only in combination with the
log-slave-updates option).
* Variable Name: replicate_annotate_rows_events
* Scope: Global
* Access Type: Read only
* Data Type: bool
* Default Value: OFF
NOTE. Why do we additionally need this 'replicate' option? Why not make
the slave reproduce these events whenever its global
binlog-annotate-rows-events value is ON? Because we may want to configure
a slave that reproduces Annotate_rows events while keeping the global
binlog-annotate-rows-events = OFF as the default value for the client
threads (see also "How slave treats replicate-annotate-rows-events
option" in the LLD part).
4. mysqlbinlog option: --print-annotate-rows-events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
With this option, mysqlbinlog prints the content of Annotate_rows events (if
the binary log contains them). Without this option (i.e. by default),
mysqlbinlog skips Annotate_rows events.
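For example (the binlog file name here is just illustrative):
mysqlbinlog --print-annotate-rows-events master-bin.000001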
5. mysqlbinlog output
~~~~~~~~~~~~~~~~~~~~~
With --print-annotate-rows-events, mysqlbinlog outputs Annotate_rows events
in a form like this:
...
# at 1646
#091219 12:45:26 server id 100 end_log_pos 1714 Query thread_id=1
exec_time=0 error_code=0
SET TIMESTAMP=1261215926/*!*/;
BEGIN
/*!*/;
# at 1714
# at 1812
# at 1853
# at 1894
# at 1938
#091219 12:45:26 server id 100 end_log_pos 1812 Query: `DELETE t1, t2 FROM
t1 INNER JOIN t2 INNER JOIN t3 WHERE t1.a=t2.a AND t2.a=t3.a`
#091219 12:45:26 server id 100 end_log_pos 1853 Table_map: `test`.`t1`
mapped to number 16
#091219 12:45:26 server id 100 end_log_pos 1894 Table_map: `test`.`t2`
mapped to number 17
#091219 12:45:26 server id 100 end_log_pos 1938 Delete_rows: table id 16
#091219 12:45:26 server id 100 end_log_pos 1982 Delete_rows: table id 17
flags: STMT_END_F
...
LOW-LEVEL DESIGN:
Content
~~~~~~~
1. Annotate_rows event number
2. Outline of Annotate_rows event behavior
3. How Master writes Annotate_rows events to the binary log
4. How slave treats replicate-annotate-rows-events option
5. How slave IO thread requests Annotate_rows events
6. How master executes the request
7. How slave SQL thread processes Annotate_rows events
8. General remarks
1. Annotate_rows event number
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
To avoid possible event number conflicts with MySQL/Sun, we leave a gap
between the last MySQL event number and the Annotate_rows event number:
enum Log_event_type
{ ...
INCIDENT_EVENT= 26,
// New MySQL event numbers are to be added here
MYSQL_EVENTS_END,
MARIA_EVENTS_BEGIN= 51,
// New Maria event numbers start from here
ANNOTATE_ROWS_EVENT= 51,
ENUM_END_EVENT
};
together with the corresponding extension of the 'post_header_len' array in
the Format description event. (This extension does not affect binary log
compatibility.) Here is how the Format description event looks with this
extension:
************************
FORMAT_DESCRIPTION_EVENT
************************
00000004 | A1 A0 2C 4B | time_when = 1261215905
00000008 | 0F | event_type = 15
00000009 | 64 00 00 00 | server_id = 100
0000000D | 7F 00 00 00 | event_len = 127
00000011 | 83 00 00 00 | log_pos = 00000083
00000015 | 01 00 | flags = LOG_EVENT_BINLOG_IN_USE_F
------------------------
00000017 | 04 00 | binlog_ver = 4
00000019 | 35 2E 32 2E | server_ver = 5.2.0-MariaDB-alpha-debug-log
..... ...
0000004B | A1 A0 2C 4B | time_created = 1261215905
0000004F | 13 | common_header_len = 19
------------------------
post_header_len
------------------------
00000050 | 38 | 56 - START_EVENT_V3 [1]
..... ...
00000069 | 02 | 2 - INCIDENT_EVENT [26]
0000006A | 00 | 0 - RESERVED [27]
..... ...
00000081 | 00 | 0 - RESERVED [50]
00000082 | 00 | 0 - ANNOTATE_ROWS_EVENT [51]
************************
2. Outline of Annotate_rows event behavior
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Each Annotate_rows_log_event object has two private members describing the
corresponding query:
char *m_query_txt;
uint m_query_len;
When the object is created for writing to a binary log, this query is taken
from 'thd' (for short, below we omit the 'Annotate_rows_log_event::' prefix
as well as other implementation details):
Annotate_rows_log_event(THD *thd)
{
m_query_txt = thd->query();
m_query_len = thd->query_length();
}
When the object is read from a binary log, the query is taken from the buffer
containing the binary log representation of the event (this buffer is
allocated in the Log_event object from which all log events are derived):
Annotate_rows_log_event(char *buf, uint event_len,
Format_description_log_event *desc)
{
m_query_len = event_len - desc->common_header_len;
m_query_txt = buf + desc->common_header_len;
}
The events are written to the binary log by the Log_event::write() member
which calls the virtual write_data_header() and write_data_body() members
("data header" and "post header" are synonyms in replication terminology).
In our case, the data header is empty and the data body is just the query:
bool write_data_body(IO_CACHE *file)
{
return my_b_safe_write(file, (uchar*) m_query_txt, m_query_len);
}
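Since the post-header is empty, the corresponding write_data_header() has
nothing to write. A minimal sketch, assuming the usual bool-returning
signature of the base class:
bool write_data_header(IO_CACHE *file)
{
return 0; // empty post-header: nothing to write, no error
}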
Printing the event is just printing the query:
void Annotate_rows_log_event::print(FILE *file, PRINT_EVENT_INFO *pinfo)
{
my_b_printf(&pinfo->head_cache, "\tQuery: `%s`\n", m_query_txt);
}
3. How Master writes Annotate_rows events to the binary log
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The event is written to the binary log just before the group of Table_map
events which precede the corresponding Rows events (one query may generate
several Table map events in the binary log, but the corresponding
Annotate_rows event must be written only once before the first Table map
event; hence the boolean variable 'with_annotate' below):
int write_locked_table_maps(THD *thd)
{ ...
bool with_annotate= thd->variables.binlog_annotate_rows_events;
...
for (uint i= 0; i < ... <number of tables> ...; ++i)
{ ...
thd->binlog_write_table_map(table, ..., with_annotate);
with_annotate= 0; // write the Annotate_rows event at most once
...
}
...
}
int THD::binlog_write_table_map(TABLE *table, ..., bool with_annotate)
{ ...
Table_map_log_event the_event(...);
...
if (with_annotate)
{
Annotate_rows_log_event anno(this);
mysql_bin_log.write(&anno);
}
mysql_bin_log.write(&the_event);
...
}
4. How slave treats replicate-annotate-rows-events option
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The replicate-annotate-rows-events option is treated just as the session
value of the binlog_annotate_rows_events variable for the slave IO and
SQL threads. This setting is done during initialization of these threads:
pthread_handler_t handle_slave_io(void *arg)
{
THD *thd= new THD;
...
init_slave_thread(thd, SLAVE_THD_IO);
...
}
pthread_handler_t handle_slave_sql(void *arg)
{
THD *thd= new THD;
...
init_slave_thread(thd, SLAVE_THD_SQL);
...
}
int init_slave_thread(THD* thd, SLAVE_THD_TYPE thd_type)
{ ...
thd->variables.binlog_annotate_rows_events=
opt_replicate_annotate_rows_events;
...
}
5. How slave IO thread requests Annotate_rows events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
If the replicate-annotate-rows-events option is not set on a slave, there
is no need for the master to send Annotate_rows events to this slave. The
slave (or mysqlbinlog in the remote case), before requesting a binlog dump
via the COM_BINLOG_DUMP command, informs the master whether it should send
these events by executing the newly added COM_BINLOG_DUMP_OPTIONS_EXT
server command:
case COM_BINLOG_DUMP_OPTIONS_EXT:
thd->binlog_dump_flags_ext= packet[0];
my_ok(thd);
break;
Note. We add this new command instead of reusing COM_BINLOG_DUMP to avoid
possible conflicts with MySQL/Sun.
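On the slave side, the IO thread would issue this command before requesting
the dump, roughly along these lines (a hedged sketch; the helper function
name is hypothetical, and BINLOG_SEND_ANNOTATE_ROWS_EVENT is the flag the
master tests in the sections below):
static int request_dump_options_ext(MYSQL *mysql)
{
uchar buf[1];
/* send the annotate flag only if this slave wants the events */
buf[0]= opt_replicate_annotate_rows_events ?
BINLOG_SEND_ANNOTATE_ROWS_EVENT : 0;
return simple_command(mysql, COM_BINLOG_DUMP_OPTIONS_EXT, buf, 1, 0);
}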
6. How master executes the request
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
case COM_BINLOG_DUMP:
{ ...
flags= uint2korr(packet + 4);
...
mysql_binlog_send(thd, ..., flags);
...
}
void mysql_binlog_send(THD* thd, ..., ushort flags)
{ ...
Log_event::read_log_event(&log, packet, ...);
...
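/* the packet data is prefixed with a one-byte status, hence the +1 offset */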
if ((*packet)[EVENT_TYPE_OFFSET + 1] != ANNOTATE_ROWS_EVENT ||
flags & BINLOG_SEND_ANNOTATE_ROWS_EVENT)
{
my_net_write(net, packet->ptr(), packet->length());
}
...
}
7. How slave SQL thread processes Annotate_rows events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The slave processes each received event by "applying" it, i.e. by
calling the Log_event::apply_event() function which in turn calls
the virtual do_apply_event() member specific to each type of
event.
int exec_relay_log_event(THD* thd, Relay_log_info* rli)
{ ...
Log_event *ev = next_event(rli);
...
apply_event_and_update_pos(ev, ...);
if (ev->get_type_code() != FORMAT_DESCRIPTION_EVENT)
delete ev;
...
}
int apply_event_and_update_pos(Log_event *ev, ...)
{ ...
ev->apply_event(...);
...
}
int Log_event::apply_event(...)
{
return do_apply_event(...);
}
What does it mean to "apply" an Annotate_rows event? It means setting the
current thd query to the one described by the event, i.e. to the query which
caused the subsequent Rows events (see "How Master writes Annotate_rows
events to the binary log" to follow what happens further when the subsequent
Rows events are applied):
int Annotate_rows_log_event::do_apply_event(...)
{
thd->set_query(m_query_txt, m_query_len);
}
NOTE. I am not sure, but possibly the current values of thd->query and
thd->query_length should be saved before calling set_query() and restored
when the Annotate_rows_log_event object is deleted.
Is this really needed?
After calling this do_apply_event() function we may not delete the
Annotate_rows_log_event object immediately (see exec_relay_log_event()
above) because thd->query now points to the string inside this object.
We may keep the pointer to this object in the Relay_log_info:
class Relay_log_info
{
public:
...
void set_annotate_event(Annotate_rows_log_event*);
Annotate_rows_log_event* get_annotate_event();
void free_annotate_event();
...
private:
Annotate_rows_log_event* m_annotate_event;
};
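A straightforward sketch of these members (assuming m_annotate_event starts
out as NULL):
void Relay_log_info::set_annotate_event(Annotate_rows_log_event *ev)
{
free_annotate_event(); // drop a stale event, if any
m_annotate_event= ev;
}
Annotate_rows_log_event* Relay_log_info::get_annotate_event()
{
return m_annotate_event;
}
void Relay_log_info::free_annotate_event()
{
delete m_annotate_event; // thd->query must no longer point into it
m_annotate_event= 0;
}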
The saved Annotate_rows object should be deleted when all corresponding
Rows events have been processed:
int exec_relay_log_event(THD* thd, Relay_log_info* rli)
{ ...
Log_event *ev= next_event(rli);
...
apply_event_and_update_pos(ev, ...);
if (rli->get_annotate_event() && is_last_rows_event(ev))
rli->free_annotate_event();
else if (ev->get_type_code() == ANNOTATE_ROWS_EVENT)
rli->set_annotate_event((Annotate_rows_log_event*) ev);
else if (ev->get_type_code() != FORMAT_DESCRIPTION_EVENT)
delete ev;
...
}
where
bool is_last_rows_event(Log_event* ev)
{
Log_event_type type= ev->get_type_code();
if (IS_ROWS_EVENT_TYPE(type))
{
Rows_log_event* rows= (Rows_log_event*)ev;
return rows->get_flags(Rows_log_event::STMT_END_F);
}
return 0;
}
#define IS_ROWS_EVENT_TYPE(type) ((type) == WRITE_ROWS_EVENT || \
(type) == UPDATE_ROWS_EVENT || \
(type) == DELETE_ROWS_EVENT)
8. General remarks
~~~~~~~~~~~~~~~~~~
Kristian noticed that introducing a new log event type should be coordinated
somehow with MySQL/Sun:
Kristian: The numeric code for this event must be assigned carefully.
It should be coordinated with MySQL/Sun, otherwise we can get into a
situation where MySQL uses the same numeric code for one event that
MariaDB uses for ANNOTATE_ROWS_EVENT, which would make merging the two
impossible.
Alex: I reserved about 20 numbers to avoid possible conflicts
with MySQL.
Kristian: Still, I think it would be appropriate to send a polite email
to internals(a)lists.mysql.com about this, suggesting that the event
numbers be reserved.
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v3.5.9)

[Maria-developers] Progress (by Knielsen): Add Sphinx storage engine to MariaDB (42)
by worklog-noreply@askmonty.org 31 May '10
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: Add Sphinx storage engine to MariaDB
CREATION DATE..: Mon, 10 Aug 2009, 23:57
SUPERVISOR.....: Monty
IMPLEMENTOR....: Knielsen
COPIES TO......:
CATEGORY.......: Server-Sprint
TASK ID........: 42 (http://askmonty.org/worklog/?tid=42)
VERSION........: Server-5.2
STATUS.........: Assigned
PRIORITY.......: 60
WORKED HOURS...: 7
ESTIMATE.......: 9 (hours remain)
ORIG. ESTIMATE.: 16
PROGRESS NOTES:
-=-=(Knielsen - Mon, 31 May 2010, 06:49)=-=-
Wrote a patch that allows testing SphinxSE in mysql-test-run, using an external Sphinx daemon.
Worked 7 hours and estimate 9 hours remain (original estimate unchanged).
-=-=(Knielsen - Fri, 28 May 2010, 07:49)=-=-
High-Level Specification modified.
--- /tmp/wklog.42.old.4369 2010-05-28 07:49:13.000000000 +0000
+++ /tmp/wklog.42.new.4369 2010-05-28 07:49:13.000000000 +0000
@@ -49,6 +49,10 @@
might be possible to pre-generate the necessary data/index files and store
them in the source tree.
+I pushed a proof-of-concept patch for this here:
+
+ lp:~knielsen/maria/5.2-sphinxse
+
Here is a sample test case using this:
--source include/have_sphinx.inc
-=-=(Knielsen - Fri, 28 May 2010, 06:31)=-=-
High-Level Specification modified.
--- /tmp/wklog.42.old.32746 2010-05-28 06:31:24.000000000 +0000
+++ /tmp/wklog.42.new.32746 2010-05-28 06:31:24.000000000 +0000
@@ -1 +1,63 @@
+Code
+----
+
+Andrew Aksyonoff from Sphinx is helping to integrate the SphinxSE plugin into
+the MariaDB tree.
+
+It is a plugin, so it can be added to the tree just by including the
+sub-directory storage/sphinx/.
+
+The Sphinx plugin is already of some maturity, having been used with MySQL for
+some time.
+
+
+Testing
+-------
+
+To get testing in the mysql-test-run framework, some extensions are needed.
+
+To use the Sphinx storage engine, the external Sphinx search daemon needs to
+be running with some data directory containing indexed data. It also needs to
+be allocated a port.
+
+This is the indended approach:
+
+1. Testing will use an external Sphinx setup installed on the machine. Sphinx
+binaries will be searched in typical locations (eg. /usr/bin, /usr/local/bin),
+or can be specified explicitly in the environment with SPHINXSEARCH_INDEXER
+and SPHINXSEARCH_SEARCHD for the two required binaries. If the external Sphinx
+binaries can not be found, then Sphinx tests will be disabled (using some
+--source include/have_sphinx.inc in the test cases).
+
+2. The mysql-test-run framework will install Sphinx search data and start/stop
+the Sphinx search daemon for the test cases, similarly how it is done for the
+other servers mysqld, ndbd, etc. We will run the Sphinx search daemon with
+options --console, --config, and --pidfile.
+
+3. The mysql-test-run framework will generate a Sphinx config file from a
+template in mysql-test/suite/sphinx/my.cnf. This config file will allocate
+ports and data directories appropriate for avoiding conflicts between multiple
+simultaneous mysql-test-run executions. The Sphinx config file is sufficiently
+similar to MySQL my.cnf that we can use the existing framework for generating
+config file, with just a slightly modified variant of the code writing the
+file to disk.
+
+4. The mysql-test-run framework will pre-load the mysql database with tables
+and data for Sphinx to index. It will then run the `indexer` program to
+generate the indexes, and then start the `searchd` daemon. These three steps
+must be done in order, as each step depends on the previous. ALTERNATIVE: it
+might be possible to pre-generate the necessary data/index files and store
+them in the source tree.
+
+Here is a sample test case using this:
+
+--source include/have_sphinx.inc
+--source include/have_sphinxse.inc
+
+--replace_result $SPHINXSEARCH_PORT SPHINXSEARCH_PORT
+eval create table ts ( id int unsigned not null, w int not null, q varchar(255)
+not null, index(q) ) engine=sphinx
+connection="sphinx://127.0.0.1:$SPHINXSEARCH_PORT/*";
+select * from ts where q='test';
+drop table ts;
-=-=(Knielsen - Fri, 28 May 2010, 06:07)=-=-
Version updated.
--- /tmp/wklog.42.old.32184 2010-05-28 06:07:00.000000000 +0000
+++ /tmp/wklog.42.new.32184 2010-05-28 06:07:00.000000000 +0000
@@ -1 +1 @@
-9.x
+Server-5.2
-=-=(Knielsen - Fri, 28 May 2010, 06:06)=-=-
Category updated.
--- /tmp/wklog.42.old.32171 2010-05-28 06:06:23.000000000 +0000
+++ /tmp/wklog.42.new.32171 2010-05-28 06:06:23.000000000 +0000
@@ -1 +1 @@
-Maria-BackLog
+Server-Sprint
-=-=(Knielsen - Fri, 28 May 2010, 06:06)=-=-
Version updated.
--- /tmp/wklog.42.old.32171 2010-05-28 06:06:23.000000000 +0000
+++ /tmp/wklog.42.new.32171 2010-05-28 06:06:23.000000000 +0000
@@ -1 +1 @@
-Maria-2.0
+9.x
-=-=(Knielsen - Fri, 28 May 2010, 06:06)=-=-
Status updated.
--- /tmp/wklog.42.old.32171 2010-05-28 06:06:23.000000000 +0000
+++ /tmp/wklog.42.new.32171 2010-05-28 06:06:23.000000000 +0000
@@ -1 +1 @@
-Un-Assigned
+Assigned
-=-=(Guest - Tue, 15 Sep 2009, 02:25)=-=-
no
Reported zero hours worked. Estimate unchanged.
-=-=(Guest - Tue, 15 Sep 2009, 02:24)=-=-
Version updated.
--- /tmp/wklog.42.old.13241 2009-09-15 02:24:07.000000000 +0300
+++ /tmp/wklog.42.new.13241 2009-09-15 02:24:07.000000000 +0300
@@ -1 +1 @@
-Connector/.NET-5.1
+Maria-2.0
DESCRIPTION:
Add the Sphinx storage engine to the MariaDB tree
HIGH-LEVEL SPECIFICATION:
Code
----
Andrew Aksyonoff from Sphinx is helping to integrate the SphinxSE plugin into
the MariaDB tree.
It is a plugin, so it can be added to the tree just by including the
sub-directory storage/sphinx/.
The Sphinx plugin is already of some maturity, having been used with MySQL for
some time.
Testing
-------
To get testing in the mysql-test-run framework, some extensions are needed.
To use the Sphinx storage engine, the external Sphinx search daemon needs to
be running with some data directory containing indexed data. It also needs to
be allocated a port.
This is the intended approach:
1. Testing will use an external Sphinx setup installed on the machine. Sphinx
binaries will be searched for in typical locations (e.g. /usr/bin,
/usr/local/bin), or can be specified explicitly in the environment with
SPHINXSEARCH_INDEXER and SPHINXSEARCH_SEARCHD for the two required binaries.
If the external Sphinx binaries cannot be found, Sphinx tests will be
disabled (using --source include/have_sphinx.inc in the test cases).
2. The mysql-test-run framework will install Sphinx search data and start/stop
the Sphinx search daemon for the test cases, similarly to how it is done for
the other servers (mysqld, ndbd, etc.). We will run the Sphinx search daemon
with the options --console, --config, and --pidfile.
3. The mysql-test-run framework will generate a Sphinx config file from a
template in mysql-test/suite/sphinx/my.cnf. This config file will allocate
ports and data directories so as to avoid conflicts between multiple
simultaneous mysql-test-run executions. The Sphinx config file is sufficiently
similar to MySQL my.cnf that we can use the existing framework for generating
the config file, with just a slightly modified variant of the code that
writes the file to disk.
4. The mysql-test-run framework will pre-load the mysql database with tables
and data for Sphinx to index. It will then run the `indexer` program to
generate the indexes, and then start the `searchd` daemon. These three steps
must be done in order, as each step depends on the previous. ALTERNATIVE: it
might be possible to pre-generate the necessary data/index files and store
them in the source tree.
I pushed a proof-of-concept patch for this here:
lp:~knielsen/maria/5.2-sphinxse
Here is a sample test case using this:
--source include/have_sphinx.inc
--source include/have_sphinxse.inc
--replace_result $SPHINXSEARCH_PORT SPHINXSEARCH_PORT
eval create table ts ( id int unsigned not null, w int not null, q varchar(255)
not null, index(q) ) engine=sphinx
connection="sphinx://127.0.0.1:$SPHINXSEARCH_PORT/*";
select * from ts where q='test';
drop table ts;
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v3.5.9)

[Maria-developers] Progress (by Knielsen): Efficient group commit for binary log (116)
by worklog-noreply@askmonty.org 31 May '10
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: Efficient group commit for binary log
CREATION DATE..: Mon, 26 Apr 2010, 13:28
SUPERVISOR.....: Knielsen
IMPLEMENTOR....:
COPIES TO......: Serg
CATEGORY.......: Server-RawIdeaBin
TASK ID........: 116 (http://askmonty.org/worklog/?tid=116)
VERSION........: Server-9.x
STATUS.........: Un-Assigned
PRIORITY.......: 60
WORKED HOURS...: 60
ESTIMATE.......: 0 (hours remain)
ORIG. ESTIMATE.: 0
PROGRESS NOTES:
-=-=(Knielsen - Mon, 31 May 2010, 06:48)=-=-
Finish first architecture draft (changed my mind a number of times before I was satisfied).
Write up architecture in worklog.
Fix remaining test failures in proof-of-concept patch + implement xtradb part.
Run some benchmarks on proof-of-concept implementation.
Worked 11 hours and estimate 0 hours remain (original estimate increased by 11 hours).
-=-=(Knielsen - Tue, 25 May 2010, 13:19)=-=-
Low Level Design modified.
--- /tmp/wklog.116.old.14255 2010-05-25 13:19:00.000000000 +0000
+++ /tmp/wklog.116.new.14255 2010-05-25 13:19:00.000000000 +0000
@@ -1 +1,363 @@
+1. Changes for ha_commit_trans()
+
+The guts of the code for commit are in the function ha_commit_trans() (and in
+commit_one_phase() which is called from it). This must be extended to use the
+new prepare_ordered(), group_log_xid(), and commit_ordered() calls.
+
+1.1 Atomic queue of committing transactions
+
+To keep the right commit order among participants, we put transactions into a
+queue. The operations on the queue are non-locking:
+
+ - Insert THD at the head of the queue, and return old queue.
+
+ THD *enqueue_atomic(THD *thd)
+
+ - Fetch (and delete) the whole queue.
+
+ THD *atomic_grab_reverse_queue()
+
+These are simple to implement with atomic compare-and-set. Note that there is
+no ABA problem [2], as we do not delete individual elements from the queue, we
+grab the whole queue and replace it with NULL.
+
+A transaction enters the queue when it does prepare_ordered(). This way, the
+scheduling order for prepare_ordered() calls is what determines the sequence
+in the queue and effectively the commit order.
+
+The queue is grabbed by the code doing group_log_xid() and commit_ordered()
+calls. The queue is passed directly to group_log_xid(), and afterwards
+iterated to do individual commit_ordered() calls.
+
+Using a lock-free queue allows prepare_ordered() (for one transaction) to run
+in parallel with commit_ordered (in another transaction), increasing potential
+parallelism.
+
+The queue is simply a linked list of THD objects, linked through a
+THD::next_commit_ordered field. Since we add at the head of the queue, the
+list is actually in reverse order, so must be reversed when we grab and delete
+it.
+
+The reason that enqueue_atomic() returns the old queue is so that we can check
+if an insert goes to the head of the queue. The thread at the head of the
+queue will do the sequential part of group commit for everyone.
+
+
+1.2 Locks
+
+1.2.1 Global LOCK_prepare_ordered
+
+This lock is taken to serialise calls to prepare_ordered(). Note that
+effectively, the commit order is decided by the order in which threads obtain
+this lock.
+
+
+1.2.2 Global LOCK_group_commit and COND_group_commit
+
+This lock is used to protect the serial part of group commit. It is taken
+around the code where we grab the queue, call group_log_xid() on the queue,
+and call commit_ordered() on each element of the queue, to make sure they
+happen serialised and in consistent order. It also protects the variable
+group_commit_queue_busy, which is used when not using group_log_xid() to delay
+running over a new queue until the first queue is completely done.
+
+
+1.2.3 Global LOCK_commit_ordered
+
+This lock is taken around calls to commit_ordered(), to ensure they happen
+serialised.
+
+
+1.2.4 Per-thread thd->LOCK_commit_ordered and thd->COND_commit_ordered
+
+This lock protects the thd->group_commit_ready variable, as well as the
+condition variable used to wake up threads after log_xid() and
+commit_ordered() finish.
+
+
+1.2.5 Global LOCK_group_commit_queue
+
+This is only used on platforms with no native compare-and-set operations, to
+make the queue operations atomic.
+
+
+1.3 Commit algorithm.
+
+This is the basic algorithm, simplified by
+
+ - omitting some error handling
+
+ - omitting looping over all handlers when invoking handler methods
+
+ - omitting some possible optimisations when not all calls are needed (see
+ next section).
+
+ - omitting the case where no group_log_xid() is used, see below.
+
+---- BEGIN ALGORITHM ----
+ ht->prepare()
+
+ // Call prepare_ordered() and enqueue in correct commit order
+ lock(LOCK_prepare_ordered)
+ ht->prepare_ordered()
+ old_queue= enqueue_atomic(thd)
+ thd->group_commit_ready= FALSE
+ is_group_commit_leader= (old_queue == NULL)
+ unlock(LOCK_prepare_ordered)
+
+ if (is_group_commit_leader)
+
+ // The first in queue handles group commit for everyone
+
+ lock(LOCK_group_commit)
+ // Wait while queue is busy, see below for when this occurs
+ while (group_commit_queue_busy)
+ cond_wait(COND_group_commit)
+
+ // Grab and reverse the queue to get correct order of transactions
+ queue= atomic_grab_reverse_queue()
+
+ // This call will set individual error codes in thd->xid_error
+ // It also sets the cookie for unlog() in thd->xid_cookie
+ group_log_xid(queue)
+
+ lock(LOCK_commit_ordered)
+ for (other IN queue)
+ if (!other->xid_error)
+ ht->commit_ordered()
+ unlock(LOCK_commit_ordered)
+
+ unlock(LOCK_group_commit)
+
+ // Now we are done, so wake up all the others.
+ for (other IN TAIL(queue))
+ lock(other->LOCK_commit_ordered)
+ other->group_commit_ready= TRUE
+ cond_signal(other->COND_commit_ordered)
+ unlock(other->LOCK_commit_ordered)
+ else
+ // If not the leader, just wait until leader did the work for us.
+ lock(thd->LOCK_commit_ordered)
+ while (!thd->group_commit_ready)
+ cond_wait(thd->LOCK_commit_ordered, thd->COND_commit_ordered)
+ unlock(thd->LOCK_commit_ordered)
+
+ // Finally do any error reporting now that we're back in own thread.
+ if (thd->xid_error)
+ xid_delayed_error(thd)
+ else
+ ht->commit(thd)
+ unlog(thd->xid_cookie, thd->xid)
+---- END ALGORITHM ----
+
+If the transaction coordinator does not support group_log_xid(), we have to do
+things differently. In this case after the serialisation point at
+prepare_ordered(), we have to parallelise again when running log_xid()
+(otherwise we would lose group commit). But then when log_xid() is done, we
+have to serialise again to check for any error and call commit_ordered() in
+correct sequence for any transaction where log_xid() did not return error.
+
+The central part of the algorithm in this case (when using log_xid()) is:
+
+---- BEGIN ALGORITHM ----
+ cookie= log_xid(thd)
+ error= (cookie == 0)
+
+ if (is_group_commit_leader)
+
+ // The first to enqueue grabs the queue and runs first.
+ // But we must wait until a previous queue run is fully done.
+
+ lock(LOCK_group_commit)
+ while (group_commit_queue_busy)
+ cond_wait(COND_group_commit)
+ queue= atomic_grab_reverse_queue()
+ // The queue will be busy until last thread in it is done.
+ group_commit_queue_busy= TRUE
+ unlock(LOCK_group_commit)
+ else
+ // Not first in queue -> wait for previous one to wake us up.
+ lock(thd->LOCK_commit_ordered)
+ while (!thd->group_commit_ready)
+ cond_wait(thd->LOCK_commit_ordered, thd->COND_commit_ordered)
+ unlock(thd->LOCK_commit_ordered)
+
+ if (!error) // Only if log_xid() was successful
+ lock(LOCK_commit_ordered)
+ ht->commit_ordered()
+ unlock(LOCK_commit_ordered)
+
+ // Wake up the next thread, and release queue in last.
+ next= thd->next_commit_ordered
+
+ if (next)
+ lock(next->LOCK_commit_ordered)
+ next->group_commit_ready= TRUE
+ cond_signal(next->COND_commit_ordered)
+ unlock(next->LOCK_commit_ordered)
+ else
+ lock(LOCK_group_commit)
+ group_commit_queue_busy= FALSE
+ unlock(LOCK_group_commit)
+---- END ALGORITHM ----
+
+There are a number of locks taken in the algorithm, but in the group_log_xid()
+case most of them should be uncontended most of the time. The
+LOCK_group_commit of course will be contended, as new threads queue up waiting
+for the previous group commit (and binlog fsync()) to finish so they can do
+the next group commit. This is the whole point of implementing group commit.
+
+The LOCK_prepare_ordered and LOCK_commit_ordered mutexes should not be much
+contended as long as handlers follow the intention of having the corresponding
+handler calls execute quickly.
+
+The per-thread LOCK_commit_ordered mutexes should not be contended; they are
+only used to wake up a sleeping thread.
+
+
+1.4 Optimisations when not using all three new calls
+
+
+The prepare_ordered(), group_log_xid(), and commit_ordered() methods are
+optional, and if not implemented by a particular handler/transaction
+coordinator, we can optimise the algorithm to take advantage of not having to
+keep ordering for the missing parts.
+
+If there is no prepare_ordered(), then we need not take the
+LOCK_prepare_ordered mutex.
+
+If there is no commit_ordered(), then we need not take the LOCK_commit_ordered
+mutex.
+
+If there is no group_log_xid(), then we only need the queue to ensure same
+ordering of transactions for commit_ordered() as for prepare_ordered(). Thus,
+if either of these (or both) are also not present, we do not need to use the
+queue at all.
+
+
+2. Binlog code changes (log.cc)
+
+
+The bulk of the work needed for the binary log is to extend the code to allow
+group commit to the log. Unlike InnoDB/XtraDB, there is no existing support
+inside the binlog code for group commit.
+
+The existing code runs most of the write + fsync to the binary log under the
+global LOCK_log mutex, preventing any group commit.
+
+To enable group commit, this code must be split into two parts:
+
+ - one part that runs per transaction, re-writing the embedded event positions
+ for the correct offset, and writing this into the in-memory log cache.
+
+ - another part that writes a set of transactions to the disk, and runs
+ fsync().
+
+Then in group_log_xid(), we can run the first part in a loop over all the
+transactions in the passed-in queue, and run the second part only once.
+
+The binlog code also has other code paths that write into the binlog,
+eg. non-transactional statements. These also have to be adapted to work with
+the new code.
+
+In order to get some group commit facility for these also, we change that part
+of the code in a similar way to ha_commit_trans. We keep another,
+binlog-internal queue of such non-transactional binlog writes, and such writes
+queue up here before sleeping on the LOCK_log mutex. Once a thread obtains the
+LOCK_log, it loops over the queue for the fast part, and does the slow part
+once, then finally wakes up the others in the queue.
+
+In the transactional case in group_log_xid(), before we run the passed-in
+queue, we add any members found in the binlog-internal queue. This allows
+these non-transactional writes to share the group commit.
+
+However, in the case where it is a non-transactional write that gets the
+LOCK_log, the transactions from the ha_commit_trans() queue will
+not be able to take part (they will have to wait for their turn to do another
+fsync). It seems difficult to cleanly let the binlog code grab the queue from
+out of the ha_commit_trans() algorithm. I think the group commit is mostly
+useful in transactional workloads anyway (non-transactional engines will lose
+data anyway in case of crash, so why fsync() after each transaction?)
+
+
+3. XtraDB changes (ha_innodb.cc)
+
+The changes needed in XtraDB are comparatively simple: XtraDB already
+implements group commit, it just needs to be enabled with the new
+commit_ordered() call.
+
+The existing commit() method is already logically in two parts. The first part
+runs under the prepare_commit_mutex and must run in the same order as binlog
+commit. This part needs to be moved to commit_ordered(). The second part runs
+after releasing prepare_commit_mutex and does transaction log write+fsync; it
+can remain.
+
+Then the prepare_commit_mutex is removed (and the enable_unsafe_group_commit
+XtraDB option to disable it).
+
+There are two asserts that check that the thread running the first part of
+XtraDB commit is the same as the thread running the other operations for the
+transaction. These have to be removed (as commit_ordered() can run in a
+different thread). Also, error reporting with sql_print_error() has to be
+delayed until commit() time.
+
+
+4. Proof-of-concept implementation
+
+There is a proof-of-concept implementation of this architecture, in the form
+of a quilt patch series [3].
+
+A quick benchmark was done, with sync_binlog=1 and
+innodb_flush_log_at_trx_commit=1. 64 parallel threads doing single-row
+transactions against one table.
+
+Without the patch, we get only 25 queries per second.
+
+With the patch, we get 650 queries per second.
+
+
+5. Open issues/tasks
+
+5.1 XA / other prepare() and commit() call sites.
+
+Check that user-level XA is handled correctly and working. And covered
+sufficiently with tests. Also check that any other calls of ha->prepare() and
+ha->commit() outside of ha_commit_trans() are handled correctly.
+
+5.2 Testing
+
+This worklog needs additions to the test suite, including error inserts to
+check error handling, and synchronisation points to check thread parallelism
+correctness.
+
+
+6. Alternative implementations
+
+ - The binlog code maintains its own extra atomic transaction queue to handle
+ non-transactional commits in a good way together with transactional (with
+ respect to group commit). Alternatively, we could ignore this issue and
+ just give up on group commit for non-transactional statements, for some
+ code simplifications.
+
+ - The binlog code has two ways to prepare end_event and similar, one that
+ uses stack-allocation, and another for when stack allocation is not
+ possible that uses thd->mem_root. Probably the overhead of thd->mem_root is
+ so small that it would make sense to use the same code for both cases.
+
+ - Instead of adding extra fields to THD, we could allocate a separate
+ structure on the thd->mem_root() with the required extra fields (including
+ the THD pointer). Would seem to require initialising mutexes at every
+ commit though.
+
+ - It would probably be a good idea to implement TC_LOG_MMAP::group_log_xid()
+ (should not be hard).
+
+
+-----------------------------------------------------------------------
+
+References:
+
+[2] https://secure.wikimedia.org/wikipedia/en/wiki/ABA_problem
+
+[3] https://knielsen-hq.org/maria/patches.mwl116/
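As a concrete illustration of the queue operations described in section 1.1
above, they could be implemented roughly as follows. This is a minimal sketch
using GCC atomic builtins; the real tree would presumably go through the
server's my_atomic layer, and THD here is just a stand-in struct carrying the
next_commit_ordered link mentioned in the text:
/* Sketch only: stand-in for the real THD class from sql_class.h. */
struct THD
{
  THD *next_commit_ordered;
};
static THD *commit_queue= NULL;
/* Insert thd at the head of the queue and return the old head.
   A NULL return means thd is the group commit leader. */
THD *enqueue_atomic(THD *thd)
{
  THD *old;
  do
  {
    old= commit_queue;
    thd->next_commit_ordered= old;
  } while (!__sync_bool_compare_and_swap(&commit_queue, old, thd));
  return old;
}
/* Atomically grab the whole queue (leaving NULL behind) and reverse
   it, since insertion at the head stored it in reverse commit order. */
THD *atomic_grab_reverse_queue()
{
  THD *q= commit_queue;
  while (!__sync_bool_compare_and_swap(&commit_queue, q, NULL))
    q= commit_queue;
  THD *reversed= NULL;
  while (q)
  {
    THD *next= q->next_commit_ordered;
    q->next_commit_ordered= reversed;
    reversed= q;
    q= next;
  }
  return reversed;
}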
-=-=(Knielsen - Tue, 25 May 2010, 13:18)=-=-
High-Level Specification modified.
--- /tmp/wklog.116.old.14249 2010-05-25 13:18:34.000000000 +0000
+++ /tmp/wklog.116.new.14249 2010-05-25 13:18:34.000000000 +0000
@@ -1 +1,157 @@
+The basic idea in group commit is that multiple threads, each handling one
+transaction, prepare for commit and then queue up together waiting to do an
+fsync() on the transaction log. Then once the log is available, a single
+thread does the fsync() + other necessary book-keeping for all of the threads
+at once. After this, the single thread signals the other threads that it's
+done and they can finish up and return success (or failure) from the commit
+operation.
+
+So group commit has a parallel part and a sequential part, and we need a
+facility for engines/binlog to participate in both the parallel and the
+sequential part.
+
+To do this, we add two new handlerton methods:
+
+ int (*prepare_ordered)(handlerton *hton, THD *thd, bool all);
+ void (*commit_ordered)(handlerton *hton, THD *thd, bool all);
+
+The idea is that the existing prepare() and commit() methods run in the
+parallel part of group commit, and the new prepare_ordered() and
+commit_ordered() run in the sequential part.
+
+The prepare_ordered() method is called after prepare(). The order of
+transactions that call into prepare_ordered() is guaranteed to be the same among
+all storage engines and binlog, and it is serialised so no two calls can be
+running inside the same engine at the same time.
+
+The commit_ordered() method is called before commit(), and similarly is
+guaranteed to have the same transaction order in all participants, and to be
+serialised within one engine.
+
+As the prepare_ordered() and commit_ordered() calls are serialised, the idea
+is that handlers should do the minimum amount of work needed in these calls,
+deferring most of the work (eg. fsync() ...) to prepare() and commit().
+
+As a concrete example, for InnoDB the commit_ordered() method will do the
+first part of commit that fixes the commit order in the transaction log
+buffer, and the commit() method will write the log to disk and fsync()
+it. This split already exists inside the InnoDB code, running before
+respectively after releasing the prepare_commit_mutex.
+
+In addition, the XA transaction coordinator (TC_LOG) is special, since it is
+the one responsible for deciding whether to commit or rollback the
+transaction. For this we need an extra method, since this decision can be done
+only after we know that all prepare() and prepare_ordered() calls succeed, and
+must be done to know whether to call commit_ordered()/commit(), or do rollback.
+
+The existing method for this is TC_LOG::log_xid(). To make group commit
+simpler and more efficient to implement in a transaction coordinator,
+we introduce a new method:
+
+ void group_log_xid(THD *first_thd);
+
+This method runs in the sequential part of group commit. It receives a list of
+transactions to perform log_xid() on, in the correct commit order. (Note that
+TC_LOG can do parallel parts of group commit in its own prepare() and commit()
+methods).
+
+This method can make it easier to implement the group commit in TC_LOG, as it
+gets directly the list of transactions in the right order. Without it, it
+might need to compute such order anyway in a prepare_ordered() method, and the
+server has to create this ordered list anyway to implement the order guarantee
+for prepare_ordered() and commit_ordered().
+
+This group_log_xid() method also is more efficient, as it avoids some
+inter-thread synchronisation. Since group_log_xid() is serialised, we can run
+it together with all the commit_ordered() method calls and need only a single
+sequential code section. With the log_xid() methods, we would need first a
+sequential part for the prepare_ordered() calls, then a parallel part with
+log_xid() calls (to not lose group commit ability for log_xid()), then again
+a sequential part for the commit_ordered() method calls.
+
+The extra synchronisation is needed, as each commit_ordered() call will have
+to wait for log_xid() in one thread (if log_xid() fails then commit_ordered()
+should not be called), and also wait for commit_ordered() to finish in all
+threads handling earlier commits. In effect we will need to bounce the
+execution from one thread to the other among all participants in the group
+commit.
+
+As a consequence of the group_log_xid() optimisation, handlers must be aware
+that the commit_ordered() call can happen in another thread than the one
+running commit() (so thread local storage is not available). This should not
+be a big issue as the THD is available for storing any needed information.
+
+Since group_log_xid() runs for multiple transactions in a single thread, it
+can not do error reporting (my_error()) as that relies on thread local
+storage. Instead it sets an error code in THD::xid_error, and if there is an
+error then later another method will be called (in correct thread context) to
+actually report the error:
+
+ int xid_delayed_error(THD *thd)
+
+The three new methods prepare_ordered(), group_log_xid(), and commit_ordered()
+are optional (as is xid_delayed_error). A storage engine or transaction
+coordinator is free to not implement them if they are not needed. In this case
+there will be no order guarantee for the corresponding stage of group commit
+for that engine. For example, InnoDB needs no ordering of the prepare phase,
+so can omit implementing prepare_ordered(); TC_LOG_MMAP needs no ordering at
+all, so does not need to implement any of them.
+
+Note in particular that all existing engines (/binlog implementations if they
+exist) will work unmodified (and also without any change in group commit
+facilities or commit order guarantees).
+
+Using these new APIs, the work will be to
+
+ - In ha_commit_trans(), implement the correct semantics for the three new
+ calls.
+
+ - In XtraDB, use the new commit_ordered() call to remove the
+ prepare_commit_mutex (and resurrect group commit) without losing the
+ consistency with binlog commit order.
+
+ - In log.cc (binlog module), implement group_log_xid() to do group commit of
+ multiple transactions to the binlog with a single shared fsync() call.
+
+-----------------------------------------------------------------------
+Some possible alternative for this worklog:
+
+ - We could eliminate the group_log_xid() method for a simpler API, at the
+ cost of extra synchronisation between threads to do in-order
+ commit_ordered() method calls. This would also allow calling
+ commit_ordered() in the correct thread context.
+
+ - Alternatively, we could eliminate log_xid() and require that all
+ transaction coordinators implement group_log_xid() instead, again for some
+ moderate simplification.
+
+ - At the moment there is no plugin actually using prepare_ordered(), so it
+ could be removed from the design. But it fits in well, is efficient to
+ implement, and could be useful later (eg. for the requested feature of
+ releasing locks early in InnoDB).
+
+-----------------------------------------------------------------------
+Some possible follow-up projects after this is implemented:
+
+ - Add statistics about how efficient group commit is (#fsyncs/#commits in
+ each engine and binlog).
+
+ - Implement an XtraDB prepare_ordered() method that can release row locks
+ early (Mark Callaghan from Facebook advocates this, but need to determine
+ exactly how to do this safely).
+
+ - Implement a new crash recovery algorithm that uses the consistent commit
+ ordering to need only fsync() for the binlog. At crash recovery, any
+ missing transactions in an engine is replayed from the correct point in the
+ binlog (this point must be stored transactionally inside the engine, as
+ XtraDB already does today).
+
+ - Implement that START TRANSACTION WITH CONSISTENT SNAPSHOT 1) really gets a
+ consistent snapshot, with the same set of committed and uncommitted
+ transactions in all engines, 2) returns a corresponding consistent binlog
+ position. This should be easy by piggybacking on the synchronisation
+ implemented for ha_commit_trans().
+
+ - Use this in XtraBackup to get consistent binlog position without having to
+ block all updates with FLUSH TABLES WITH READ LOCK.
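To make the proposed API concrete, here is a sketch of how a storage engine
could opt in to the ordered phases. The handlerton method signatures follow
the specification above; everything else (the engine function names, the init
context) is hypothetical:
/* Runs serialised, in commit order: keep this minimal and fast. */
static int my_engine_prepare_ordered(handlerton *hton, THD *thd, bool all)
{
  return 0;
}
/* Fix the commit order in the engine's log buffer here; leave the
   expensive write + fsync() to the ordinary commit() method. */
static void my_engine_commit_ordered(handlerton *hton, THD *thd, bool all)
{
}
/* In the engine's initialisation function: */
hton->prepare_ordered= my_engine_prepare_ordered;
hton->commit_ordered= my_engine_commit_ordered;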
-=-=(Knielsen - Tue, 25 May 2010, 13:18)=-=-
High Level Description modified.
--- /tmp/wklog.116.old.14234 2010-05-25 13:18:07.000000000 +0000
+++ /tmp/wklog.116.new.14234 2010-05-25 13:18:07.000000000 +0000
@@ -21,3 +21,69 @@
http://kristiannielsen.livejournal.com/12408.html
http://kristiannielsen.livejournal.com/12553.html
+----
+
+Implementing group commit in MySQL faces some challenges from the handler
+plugin architecture:
+
+1. Because storage engine handlers have a separate transaction log from the
+mysql binlog (and from each other), there are multiple fsync() calls per
+commit that need the group commit optimisation (2 per participating storage
+engine + 1 for binlog).
+
+2. The code handling commit is split in several places, in main server code
+and in storage engine code. With pluggable binlog it will be split even
+more. This requires a good abstract yet powerful API to be able to implement
+group commit simply and efficiently in plugins without the different parts
+having to rely on internals of the others.
+
+3. We want the order of commits to be the same in all engines participating in
+multiple transactions. This requirement is the reason that InnoDB currently
+breaks group commit with the infamous prepare_commit_mutex.
+
+While currently there is no server guarantee to get the same commit order in
+engines and binlog (except for the InnoDB prepare_commit_mutex hack), there are
+several reasons why this could be desirable:
+
+ - InnoDB hot backup needs to be able to extract a binlog position that is
+ consistent with the hot backup to be able to provision a new slave, and
+ this is impossible without imposing at least partial consistent ordering
+ between InnoDB and binlog.
+
+ - Other backup methods could have similar needs, eg. XtraBackup or
+ `mysqldump --single-transaction`, to have consistent commit order between
+ binlog and storage engines without having to do FLUSH TABLES WITH READ LOCK
+ or similar expensive blocking operation. (other backup methods, like LVM
+ snapshot, don't need consistent commit order, as they can restore
+ out-of-order commits during crash recovery using XA).
+
+ - If we have consistent commit order, we can think about optimising commit to
+ need only one fsync (for binlog); lost commits in storage engines can then
+ be recovered from the binlog at crash recovery by re-playing against the
+ engine from a particular point in the binlog.
+
+ - With consistent commit order, we can get better semantics for START
+ TRANSACTION WITH CONSISTENT SNAPSHOT with multi-engine transactions (and we
+ could even get it to return also a matching binlog position). Currently,
+ this "CONSISTENT SNAPSHOT" can be inconsistent among multiple storage
+ engines.
+
+ - In InnoDB, the performance in the presence of hotspots can be improved if
+ we can release row locks early in the commit phase, but this requires that
+ we release them in the same order as commits in the binlog to ensure
+ consistency between master and slaves.
+
+ - There were some discussions around Galera [1] synchronous replication and
+ global transaction ID, suggesting that it needs consistent commit order
+ among participating engines.
+
+ - I believe there could be other applications for guaranteed consistent
+ commit order, and that the architecture described in this worklog can
+ implement such guarantee with reasonable overhead.
+
+
+References:
+
+[1] Galera: http://www.codership.com/products/galera_replication
+
-=-=(Knielsen - Tue, 25 May 2010, 08:28)=-=-
More thoughts on and changes to the architecture. Got to something now that I am satisfied with and
that seems to be able to handle all issues.
Implement new prepare_ordered and commit_ordered handler methods and the logic in ha_commit_trans().
Implement TC_LOG::group_log_xid() method and logic in ha_commit_trans().
Implement XtraDB part, using commit_ordered() rather than prepare_commit_mutex.
Fix test suite failures.
Proof-of-concept patch series complete now.
Do initial benchmark, getting good results. With 64 threads, see 26x improvement in queries-per-sec.
Next step: write up the architecture description.
Worked 21 hours and estimate 0 hours remain (original estimate increased by 21 hours).
-=-=(Knielsen - Wed, 12 May 2010, 06:41)=-=-
Started work on a Quilt patch series, refactoring the binlog code to prepare for implementing the
group commit, and working on the design of group commit in parallel.
Found and fixed several problems in error handling when writing to binlog.
Removed redundant table map version locking.
Split binlog writing into two parts in preparations for group commit. When ready to write to the
binlog, threads enter a queue, and the first thread in the queue handles the binlog writing for
everyone. When it obtains the LOCK_log, it first loops over all threads, executing the first part of
binlog writing (the write(2) syscall essentially). It then runs the second part (fsync(2)
essentially) only once, and then wakes up the remaining threads in the queue.
Still to be done:
Finish the proof-of-concept group commit patch, by 1) implementing the prepare_fast() and
commit_fast() callbacks in handler.cc, 2) moving the binlog thread enqueue from log_xid() to
binlog_prepare_fast(), 3) moving the fast part of InnoDB commit to innobase_commit_fast(), removing
the prepare_commit_mutex().
Write up the final design in this worklog.
Evaluate the design to see if we can do better/different.
Think about possible next steps, such as releasing innodb row locks early (in
innobase_prepare_fast), and doing crash recovery by replaying transactions from the binlog (removing
the need for engine durability and 2 of 3 fsync() in commit).
Worked 28 hours and estimate 0 hours remain (original estimate increased by 28 hours).
-=-=(Serg - Mon, 26 Apr 2010, 14:10)=-=-
Observers changed: Serg
DESCRIPTION:
Currently, in order to ensure that the server can recover after a crash to a
state in which storage engines and binary log are consistent with each other,
it is necessary to use XA with durable commits for both storage engines
(innodb_flush_log_at_trx_commit=1) and binary log (sync_binlog=1).
This is _very_ expensive, since the server needs to do three fsync() operations
for every commit, as there is no working group commit when the binary log is
enabled.
The idea is to
- Implement/fix group commit to work properly with the binary log enabled.
- (Optionally) avoid the need to fsync() in the engine, and instead rely on
replaying any lost transactions from the binary log against the engine
during crash recovery.
For background see these articles:
http://kristiannielsen.livejournal.com/12254.html
http://kristiannielsen.livejournal.com/12408.html
http://kristiannielsen.livejournal.com/12553.html
----
Implementing group commit in MySQL faces some challenges from the handler
plugin architecture:
1. Because storage engine handlers have transaction logs separate from the
mysql binlog (and from each other), there are multiple fsync() calls per
commit that need the group commit optimisation (2 per participating storage
engine + 1 for binlog).
2. The code handling commit is split in several places, in main server code
and in storage engine code. With pluggable binlog it will be split even
more. This requires a good abstract yet powerful API to be able to implement
group commit simply and efficiently in plugins without the different parts
having to rely on internals of the others.
3. We want the order of commits to be the same in all engines participating in
multiple transactions. This requirement is the reason that InnoDB currently
breaks group commit with the infamous prepare_commit_mutex.
While currently there is no server guarantee to get same commit order in
engines and binlog (except for the InnoDB prepare_commit_mutex hack), there are
several reasons why this could be desirable:
- InnoDB hot backup needs to be able to extract a binlog position that is
consistent with the hot backup to be able to provision a new slave, and
this is impossible without imposing at least partial consistent ordering
between InnoDB and binlog.
- Other backup methods could have similar needs, eg. XtraBackup or
`mysqldump --single-transaction`, to have consistent commit order between
binlog and storage engines without having to do FLUSH TABLES WITH READ LOCK
or similar expensive blocking operation. (other backup methods, like LVM
snapshot, don't need consistent commit order, as they can restore
out-of-order commits during crash recovery using XA).
- If we have consistent commit order, we can think about optimising commit to
need only one fsync (for binlog); lost commits in storage engines can then
be recovered from the binlog at crash recovery by re-playing against the
engine from a particular point in the binlog.
- With consistent commit order, we can get better semantics for START
TRANSACTION WITH CONSISTENT SNAPSHOT with multi-engine transactions (and we
could even get it to return also a matching binlog position). Currently,
this "CONSISTENT SNAPSHOT" can be inconsistent among multiple storage
engines.
- In InnoDB, the performance in the presence of hotspots can be improved if
we can release row locks early in the commit phase, but this requires that
we release them in the same order as commits in the binlog to ensure
consistency between master and slaves.
- There were some discussions around Galera [1] synchronous replication and
global transaction ID, suggesting that it needs consistent commit order
among participating engines.
- I believe there could be other applications for guaranteed consistent
commit order, and that the architecture described in this worklog can
implement such guarantee with reasonable overhead.
References:
[1] Galera: http://www.codership.com/products/galera_replication
HIGH-LEVEL SPECIFICATION:
The basic idea in group commit is that multiple threads, each handling one
transaction, prepare for commit and then queue up together waiting to do an
fsync() on the transaction log. Then once the log is available, a single
thread does the fsync() + other necessary book-keeping for all of the threads
at once. After this, the single thread signals the other threads that it's
done and they can finish up and return success (or failure) from the commit
operation.
So group commit has a parallel part, and a sequential part. So we need a
facility for engines/binlog to participate in both the parallel and the
sequential part.
To do this, we add two new handlerton methods:
int (*prepare_ordered)(handlerton *hton, THD *thd, bool all);
void (*commit_ordered)(handlerton *hton, THD *thd, bool all);
The idea is that the existing prepare() and commit() methods run in the
parallel part of group commit, and the new prepare_ordered() and
commit_ordered() run in the sequential part.
The prepare_ordered() method is called after prepare(). The order of
transactions that call into prepare_ordered() is guaranteed to be the same among
all storage engines and binlog, and it is serialised so no two calls can be
running inside the same engine at the same time.
The commit_ordered() method is called before commit(), and similarly is
guaranteed to have same transaction order in all participants, and to be
serialised within one engine.
As the prepare_ordered() and commit_ordered() calls are serialised, the idea
is that handlers should do the minimum amount of work needed in these calls,
delegating most of the work (eg. fsync() ...) to prepare() and commit().
As a concrete example, for InnoDB the commit_ordered() method will do the
first part of commit that fixes the commit order in the transaction log
buffer, and the commit() method will write the log to disk and fsync()
it. This split already exists inside the InnoDB code, running before
respectively after releasing the prepare_commit_mutex.
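To make the shape of the API concrete, here is a minimal sketch of how an
engine could fill in the two new handlerton slots (illustrative only: the
my_engine_* names are invented, and handlerton is reduced to just the two new
members):

struct THD;

struct handlerton
{
  /* ... existing members elided ... */
  int  (*prepare_ordered)(handlerton *hton, THD *thd, bool all);
  void (*commit_ordered)(handlerton *hton, THD *thd, bool all);
};

static int my_engine_prepare_ordered(handlerton *hton, THD *thd, bool all)
{
  /* Serialised, same transaction order in all engines and binlog.
     Do only cheap, order-dependent work here. */
  return 0;
}

static void my_engine_commit_ordered(handlerton *hton, THD *thd, bool all)
{
  /* Fast part of commit that fixes the commit order; the expensive
     log write + fsync() stays in the ordinary commit() method. */
}

/* At plugin initialisation (both methods are optional):
     hton->prepare_ordered= my_engine_prepare_ordered;
     hton->commit_ordered=  my_engine_commit_ordered;       */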
In addition, the XA transaction coordinator (TC_LOG) is special, since it is
the one responsible for deciding whether to commit or rollback the
transaction. For this we need an extra method, since this decision can be done
only after we know that all prepare() and prepare_ordered() calls succeed, and
must be done to know whether to call commit_ordered()/commit(), or do rollback.
The existing method for this is TC_LOG::log_xid(). To make group commit
simpler and more efficient to implement in a transaction coordinator,
we introduce a new method:
void group_log_xid(THD *first_thd);
This method runs in the sequential part of group commit. It receives a list of
transactions to perform log_xid() on, in the correct commit order. (Note that
TC_LOG can do parallel parts of group commit in its own prepare() and commit()
methods).
This method can make it easier to implement the group commit in TC_LOG, as it
gets directly the list of transactions in the right order. Without it, it
might need to compute such order anyway in a prepare_ordered() method, and the
server has to create this ordered list anyway to implement the order guarantee
for prepare_ordered() and commit_ordered().
This group_log_xid() method is also more efficient, as it avoids some
inter-thread synchronisation. Since group_log_xid() is serialised, we can run
it together with all the commit_ordered() method calls and need only a single
sequential code section. With the log_xid() methods, we would need first a
sequential part for the prepare_ordered() calls, then a parallel part with
log_xid() calls (to not lose group commit ability for log_xid()), then again
a sequential part for the commit_ordered() method calls.
The extra synchronisation is needed, as each commit_ordered() call will have
to wait for log_xid() in one thread (if log_xid() fails then commit_ordered()
should not be called), and also wait for commit_ordered() to finish in all
threads handling earlier commits. In effect we will need to bounce the
execution from one thread to the other among all participants in the group
commit.
As a consequence of the group_log_xid() optimisation, handlers must be aware
that the commit_ordered() call can happen in another thread than the one
running commit() (so thread local storage is not available). This should not
be a big issue as the THD is available for storing any needed information.
Since group_log_xid() runs for multiple transactions in a single thread, it
can not do error reporting (my_error()) as that relies on thread local
storage. Instead it sets an error code in THD::xid_error, and if there is an
error then later another method will be called (in correct thread context) to
actually report the error:
int xid_delayed_error(THD *thd)
The three new methods prepare_ordered(), group_log_xid(), and commit_ordered()
are optional (as is xid_delayed_error). A storage engine or transaction
coordinator is free to not implement them if they are not needed. In this case
there will be no order guarantee for the corresponding stage of group commit
for that engine. For example, InnoDB needs no ordering of the prepare phase,
so can omit implementing prepare_ordered(); TC_LOG_MMAP needs no ordering at
all, so does not need to implement any of them.
Note in particular that all existing engines (/binlog implementations if they
exist) will work unmodified (though also without any group commit
facilities or commit order guarantees).
Using these new APIs, the work will be to
- In ha_commit_trans(), implement the correct semantics for the three new
calls.
- In XtraDB, use the new commit_ordered() call to remove the
prepare_commit_mutex (and resurrect group commit) without losing the
consistency with binlog commit order.
- In log.cc (binlog module), implement group_log_xid() to do group commit of
multiple transactions to the binlog with a single shared fsync() call.
-----------------------------------------------------------------------
Some possible alternatives for this worklog:
- We could eliminate the group_log_xid() method for a simpler API, at the
cost of extra synchronisation between threads to do in-order
commit_ordered() method calls. This would also allow calling
commit_ordered() in the correct thread context.
- Alternatively, we could eliminate log_xid() and require that all
transaction coordinators implement group_log_xid() instead, again for some
moderate simplification.
- At the moment there is no plugin actually using prepare_ordered(), so, it
could be removed from the design. But it fits in well, is efficient to
implement, and could be useful later (eg. for the requested feature of
releasing locks early in InnoDB).
-----------------------------------------------------------------------
Some possible follow-up projects after this is implemented:
- Add statistics about how efficient group commit is (#fsyncs/#commits in
each engine and binlog).
- Implement an XtraDB prepare_ordered() method that can release row locks
early (Mark Callaghan from Facebook advocates this, but need to determine
exactly how to do this safely).
- Implement a new crash recovery algorithm that uses the consistent commit
ordering to need fsync() only for the binlog. At crash recovery, any
missing transactions in an engine are replayed from the correct point in the
binlog (this point must be stored transactionally inside the engine, as
XtraDB already does today).
- Implement that START TRANSACTION WITH CONSISTENT SNAPSHOT 1) really gets a
consistent snapshot, with the same set of committed and not committed
transactions in all engines, 2) returns a corresponding consistent binlog
position. This should be easy to implement by piggybacking on the
synchronisation implemented for ha_commit_trans().
- Use this in XtraBackup to get consistent binlog position without having to
block all updates with FLUSH TABLES WITH READ LOCK.
LOW-LEVEL DESIGN:
1. Changes for ha_commit_trans()
The core of the code for commit is in the function ha_commit_trans() (and in
commit_one_phase() which is called from it). This must be extended to use the
new prepare_ordered(), group_log_xid(), and commit_ordered() calls.
1.1 Atomic queue of committing transactions
To keep the right commit order among participants, we put transactions into a
queue. The operations on the queue are non-locking:
- Insert THD at the head of the queue, and return old queue.
THD *enqueue_atomic(THD *thd)
- Fetch (and delete) the whole queue.
THD *atomic_grab_reverse_queue()
These are simple to implement with atomic compare-and-set. Note that there is
no ABA problem [2], as we do not delete individual elements from the queue;
we grab the whole queue and replace it with NULL.
A transaction enters the queue when it does prepare_ordered(). This way, the
scheduling order for prepare_ordered() calls is what determines the sequence
in the queue and effectively the commit order.
The queue is grabbed by the code doing group_log_xid() and commit_ordered()
calls. The queue is passed directly to group_log_xid(), and afterwards
iterated to do individual commit_ordered() calls.
Using a lock-free queue allows prepare_ordered() (for one transaction) to run
in parallel with commit_ordered (in another transaction), increasing potential
parallelism.
The queue is simply a linked list of THD objects, linked through a
THD::next_commit_ordered field. Since we add at the head of the queue, the
list is actually in reverse order, so must be reversed when we grab and delete
it.
The reason that enqueue_atomic() returns the old queue is so that we can check
if an insert goes to the head of the queue. The thread at the head of the
queue will do the sequential part of group commit for everyone.
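As a rough illustration, the two queue operations could be implemented with
C++11 atomics along the following lines (a sketch only; the server uses its
own atomic primitives, and everything here beyond enqueue_atomic(),
atomic_grab_reverse_queue() and THD::next_commit_ordered is an assumption):

#include <atomic>

struct THD
{
  THD *next_commit_ordered= nullptr;   /* links the commit queue */
  /* ... rest of THD elided ... */
};

/* The shared queue head; NULL means the queue is empty. */
static std::atomic<THD*> group_commit_queue{nullptr};

/* Insert thd at the head of the queue and return the old head, so the
   caller can detect whether it became group commit leader (old == NULL). */
THD *enqueue_atomic(THD *thd)
{
  THD *old_head= group_commit_queue.load();
  do
  {
    thd->next_commit_ordered= old_head;
  } while (!group_commit_queue.compare_exchange_weak(old_head, thd));
  return old_head;
}

/* Atomically grab the whole queue (replacing it with NULL) and reverse
   it, since head-insertion left it in reverse commit order. */
THD *atomic_grab_reverse_queue()
{
  THD *queue= group_commit_queue.exchange(nullptr);
  THD *reversed= nullptr;
  while (queue)
  {
    THD *next= queue->next_commit_ordered;
    queue->next_commit_ordered= reversed;
    reversed= queue;
    queue= next;
  }
  return reversed;
}

Note how exchange(nullptr) swaps out the entire list in one step, which is
exactly why no ABA problem can arise.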
1.2 Locks
1.2.1 Global LOCK_prepare_ordered
This lock is taken to serialise calls to prepare_ordered(). Note that
effectively, the commit order is decided by the order in which threads obtain
this lock.
1.2.2 Global LOCK_group_commit and COND_group_commit
This lock is used to protect the serial part of group commit. It is taken
around the code where we grab the queue, call group_log_xid() on the queue,
and call commit_ordered() on each element of the queue, to make sure they
happen serialised and in consistent order. It also protects the variable
group_commit_queue_busy, which is used when not using group_log_xid() to delay
running over a new queue until the first queue is completely done.
1.2.3 Global LOCK_commit_ordered
This lock is taken around calls to commit_ordered(), to ensure they happen
serialised.
1.2.4 Per-thread thd->LOCK_commit_ordered and thd->COND_commit_ordered
This lock protects the thd->group_commit_ready variable, as well as the
condition variable used to wake up threads after log_xid() and
commit_ordered() finishes.
1.2.5 Global LOCK_group_commit_queue
This is only used on platforms with no native compare-and-set operations, to
make the queue operations atomic.
1.3 Commit algorithm.
This is the basic algorithm, simplified by
- omitting some error handling
- omitting looping over all handlers when invoking handler methods
- omitting some possible optimisations when not all calls are needed (see
next section).
- omitting the case where no group_log_xid() is used, see below.
---- BEGIN ALGORITHM ----
ht->prepare()
// Call prepare_ordered() and enqueue in correct commit order
lock(LOCK_prepare_ordered)
ht->prepare_ordered()
old_queue= enqueue_atomic(thd)
thd->group_commit_ready= FALSE
is_group_commit_leader= (old_queue == NULL)
unlock(LOCK_prepare_ordered)
if (is_group_commit_leader)
// The first in queue handles group commit for everyone
lock(LOCK_group_commit)
// Wait while queue is busy, see below for when this occurs
while (group_commit_queue_busy)
cond_wait(COND_group_commit)
// Grab and reverse the queue to get correct order of transactions
queue= atomic_grab_reverse_queue()
// This call will set individual error codes in thd->xid_error
// It also sets the cookie for unlog() in thd->xid_cookie
group_log_xid(queue)
lock(LOCK_commit_ordered)
for (other IN queue)
if (!other->xid_error)
ht->commit_ordered()
unlock(LOCK_commit_ordered)
unlock(LOCK_group_commit)
// Now we are done, so wake up all the others.
for (other IN TAIL(queue))
lock(other->LOCK_commit_ordered)
other->group_commit_ready= TRUE
cond_signal(other->COND_commit_ordered)
unlock(other->LOCK_commit_ordered)
else
// If not the leader, just wait until leader did the work for us.
lock(thd->LOCK_commit_ordered)
while (!thd->group_commit_ready)
cond_wait(thd->LOCK_commit_ordered, thd->COND_commit_ordered)
unlock(thd->LOCK_commit_ordered)
// Finally do any error reporting now that we're back in own thread.
if (thd->xid_error)
xid_delayed_error(thd)
else
ht->commit(thd)
unlog(thd->xid_cookie, thd->xid)
---- END ALGORITHM ----
If the transaction coordinator does not support group_log_xid(), we have to do
things differently. In this case after the serialisation point at
prepare_ordered(), we have to parallelise again when running log_xid()
(otherwise we would lose group commit). But then when log_xid() is done, we
have to serialise again to check for any error and call commit_ordered() in
correct sequence for any transaction where log_xid() did not return error.
The central part of the algorithm in this case (when using log_xid()) is:
---- BEGIN ALGORITHM ----
cookie= log_xid(thd)
error= (cookie == 0)
if (is_group_commit_leader)
// The first to enqueue grabs the queue and runs first.
// But we must wait until a previous queue run is fully done.
lock(LOCK_group_commit)
while (group_commit_queue_busy)
cond_wait(COND_group_commit)
queue= atomic_grab_reverse_queue()
// The queue will be busy until last thread in it is done.
group_commit_queue_busy= TRUE
unlock(LOCK_group_commit)
else
// Not first in queue -> wait for previous one to wake us up.
lock(thd->LOCK_commit_ordered)
while (!thd->group_commit_ready)
cond_wait(thd->LOCK_commit_ordered, thd->COND_commit_ordered)
unlock(thd->LOCK_commit_ordered)
if (!error) // Only if log_xid() was successful
lock(LOCK_commit_ordered)
ht->commit_ordered()
unlock(LOCK_commit_ordered)
// Wake up the next thread; the last one releases the queue.
next= thd->next_commit_ordered
if (next)
lock(next->LOCK_commit_ordered)
next->group_commit_ready= TRUE
cond_signal(next->COND_commit_ordered)
unlock(next->LOCK_commit_ordered)
else
lock(LOCK_group_commit)
group_commit_queue_busy= FALSE
unlock(LOCK_group_commit)
---- END ALGORITHM ----
There are a number of locks taken in the algorithm, but in the group_log_xid()
case most of them should be uncontended most of the time. The
LOCK_group_commit of course will be contended, as new threads queue up waiting
for the previous group commit (and binlog fsync()) to finish so they can do
the next group commit. This is the whole point of implementing group commit.
The LOCK_prepare_ordered and LOCK_commit_ordered mutexes should not be much
contended as long as handlers follow the intention of having the corresponding
handler calls execute quickly.
The per-thread LOCK_commit_ordered mutexes should not be contended; they are
only used to wake up a sleeping thread.
1.4 Optimisations when not using all three new calls
The prepare_ordered(), group_log_xid(), and commit_ordered() methods are
optional, and if not implemented by a particular handler/transaction
coordinator, we can optimise the algorithm to take advantage of not having to
keep ordering for the missing parts.
If there is no prepare_ordered(), then we need not take the
LOCK_prepare_ordered mutex.
If there is no commit_ordered(), then we need not take the LOCK_commit_ordered
mutex.
If there is no group_log_xid(), then we only need the queue to ensure same
ordering of transactions for commit_ordered() as for prepare_ordered(). Thus,
if either of these (or both) are also not present, we do not need to use the
queue at all.
2. Binlog code changes (log.cc)
The bulk of the work needed for the binary log is to extend the code to allow
group commit to the log. Unlike InnoDB/XtraDB, there is no existing support
inside the binlog code for group commit.
The existing code runs most of the write + fsync to the binary log under the
global LOCK_log mutex, preventing any group commit.
To enable group commit, this code must be split into two parts:
- one part that runs per transaction, re-writing the embedded event positions
for the correct offset, and writing this into the in-memory log cache.
- another part that writes a set of transactions to the disk, and runs
fsync().
Then in group_log_xid(), we can run the first part in a loop over all the
transactions in the passed-in queue, and run the second part only once.
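In outline, the binlog's group_log_xid() might then be structured as below (a
sketch under the assumptions of this design; the xid_error and xid_cookie
fields are as specified earlier, while the binlog_* helper names are invented
stand-ins for the two halves of the split):

struct THD
{
  THD *next_commit_ordered;   /* commit-order queue link */
  int  xid_error;             /* set per transaction on failure */
  int  xid_cookie;            /* cookie for the later unlog() call */
};

/* Hypothetical stand-ins for the two halves of the binlog write path: */
static int  binlog_write_transaction_cache(THD *thd) { return 0; }
static void binlog_flush_and_sync() { /* flush + fsync() the binlog */ }
static int  binlog_cookie_for_unlog(THD *thd) { return 0; }

void group_log_xid(THD *first_thd)
{
  /* Part 1, per transaction: fix up the embedded event positions and
     write the transaction's in-memory log cache into the binlog. */
  for (THD *thd= first_thd; thd; thd= thd->next_commit_ordered)
  {
    thd->xid_error= binlog_write_transaction_cache(thd);
    if (!thd->xid_error)
      thd->xid_cookie= binlog_cookie_for_unlog(thd);
  }
  /* Part 2, once for the whole group: write to disk and fsync(). */
  binlog_flush_and_sync();
}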
The binlog code also has other code paths that write into the binlog,
eg. non-transactional statements. These have to be adapted also to work with
the new code.
In order to get some group commit facility for these also, we change that part
of the code in a similar way to ha_commit_trans. We keep another,
binlog-internal queue of such non-transactional binlog writes, and such writes
queue up here before sleeping on the LOCK_log mutex. Once a thread obtains the
LOCK_log, it loops over the queue for the fast part, and does the slow part
once, then finally wakes up the others in the queue.
In the transactional case in group_log_xid(), before we run the passed-in
queue, we add any members found in the binlog-internal queue. This allows
these non-transactional writes to share the group commit.
However, in the case where it is a non-transactional write that gets the
LOCK_log, the transactions from the ha_commit_trans() queue will
not be able to take part (they will have to wait for their turn to do another
fsync). It seems difficult to cleanly let the binlog code grab the queue
out of the ha_commit_trans() algorithm. I think the group commit is mostly
useful in transactional workloads anyway (non-transactional engines will lose
data anyway in case of crash, so why fsync() after each transaction?)
3. XtraDB changes (ha_innodb.cc)
The changes needed in XtraDB are comparatively simple, as XtraDB already
implements group commit; it just needs to be enabled with the new
commit_ordered() call.
The existing commit() method already is logically in two parts. The first part
runs under the prepare_commit_mutex and must be run in the same order as binlog
commit. This part needs to be moved to commit_ordered(). The second part runs
after releasing prepare_commit_mutex and does transaction log write+fsync; it
can remain.
Then the prepare_commit_mutex is removed (and the enable_unsafe_group_commit
XtraDB option to disable it).
There are two asserts that check that the thread running the first part of
XtraDB commit is the same as the thread running the other operations for the
transaction. These have to be removed (as commit_ordered() can run in a
different thread). Also, error reporting with sql_print_error() has to be
delayed until commit() time.
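The resulting shape on the engine side might be sketched as follows
(illustrative pseudo-code only, not actual XtraDB source; the innobase_*
names follow the existing naming convention, and the bodies are reduced to
comments):

struct THD;
struct handlerton;

static void innobase_commit_ordered(handlerton *hton, THD *thd, bool all)
{
  /* The former first part, previously run under prepare_commit_mutex:
     fix the transaction's commit order in the log buffer. May run in a
     different thread than commit(), so only state reachable through thd
     may be used (no thread-local storage), and any sql_print_error()
     reporting must be saved here and delayed to commit(). */
}

static int innobase_commit(handlerton *hton, THD *thd, bool all)
{
  /* The former second part: write the transaction log to disk and
     fsync() it. This runs in the parallel part of group commit. */
  return 0;
}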
4. Proof-of-concept implementation
There is a proof-of-concept implementation of this architecture, in the form
of a quilt patch series [3].
A quick benchmark was done with sync_binlog=1 and
innodb_flush_log_at_trx_commit=1, using 64 parallel threads doing single-row
transactions against one table.
Without the patch, we get only 25 queries per second.
With the patch, we get 650 queries per second.
5. Open issues/tasks
5.1 XA / other prepare() and commit() call sites.
Check that user-level XA is handled correctly and working, and is covered
sufficiently with tests. Also check that any other calls of ha->prepare() and
ha->commit() outside of ha_commit_trans() are handled correctly.
5.2 Testing
This worklog needs additions to the test suite, including error inserts to
check error handling, and synchronisation points to check thread parallelism
correctness.
6. Alternative implementations
- The binlog code maintains its own extra atomic transaction queue to handle
non-transactional commits together with transactional ones (with
respect to group commit). Alternatively, we could ignore this issue and
just give up on group commit for non-transactional statements, for some
code simplifications.
- The binlog code has two ways to prepare end_event and similar, one that
uses stack-allocation, and another for when stack allocation is not
possible that uses thd->mem_root. Probably the overhead of thd->mem_root is
so small that it would make sense to use the same code for both cases.
- Instead of adding extra fields to THD, we could allocate a separate
structure on the thd->mem_root with the required extra fields (including
the THD pointer). Would seem to require initialising mutexes at every
commit though.
- It would probably be a good idea to implement TC_LOG_MMAP::group_log_xid()
(should not be hard).
-----------------------------------------------------------------------
References:
[2] https://secure.wikimedia.org/wikipedia/en/wiki/ABA_problem
[3] https://knielsen-hq.org/maria/patches.mwl116/
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v3.5.9)

[Maria-developers] Progress (by Knielsen): Efficient group commit for binary log (116)
by worklog-noreply@askmonty.org 31 May '10
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: Efficient group commit for binary log
CREATION DATE..: Mon, 26 Apr 2010, 13:28
SUPERVISOR.....: Knielsen
IMPLEMENTOR....:
COPIES TO......: Serg
CATEGORY.......: Server-RawIdeaBin
TASK ID........: 116 (http://askmonty.org/worklog/?tid=116)
VERSION........: Server-9.x
STATUS.........: Un-Assigned
PRIORITY.......: 60
WORKED HOURS...: 60
ESTIMATE.......: 0 (hours remain)
ORIG. ESTIMATE.: 0
PROGRESS NOTES:
-=-=(Knielsen - Mon, 31 May 2010, 06:48)=-=-
Finish first architecture draft (changed my mind a number of times before I was satisfied).
Write up architecture in worklog.
Fix remaining test failures in proof-of-concept patch + implement xtradb part.
Run some benchmarks on proof-of-concept implementation.
Worked 11 hours and estimate 0 hours remain (original estimate increased by 11 hours).
-=-=(Knielsen - Tue, 25 May 2010, 13:19)=-=-
Low Level Design modified.
--- /tmp/wklog.116.old.14255 2010-05-25 13:19:00.000000000 +0000
+++ /tmp/wklog.116.new.14255 2010-05-25 13:19:00.000000000 +0000
@@ -1 +1,363 @@
+1. Changes for ha_commit_trans()
+
+The core of the code for commit is in the function ha_commit_trans() (and in
+commit_one_phase() which is called from it). This must be extended to use the
+new prepare_ordered(), group_log_xid(), and commit_ordered() calls.
+
+1.1 Atomic queue of committing transactions
+
+To keep the right commit order among participants, we put transactions into a
+queue. The operations on the queue are non-locking:
+
+ - Insert THD at the head of the queue, and return old queue.
+
+ THD *enqueue_atomic(THD *thd)
+
+ - Fetch (and delete) the whole queue.
+
+ THD *atomic_grab_reverse_queue()
+
+These are simple to implement with atomic compare-and-set. Note that there is
+no ABA problem [2], as we do not delete individual elements from the queue;
+we grab the whole queue and replace it with NULL.
+
+A transaction enters the queue when it does prepare_ordered(). This way, the
+scheduling order for prepare_ordered() calls is what determines the sequence
+in the queue and effectively the commit order.
+
+The queue is grabbed by the code doing group_log_xid() and commit_ordered()
+calls. The queue is passed directly to group_log_xid(), and afterwards
+iterated to do individual commit_ordered() calls.
+
+Using a lock-free queue allows prepare_ordered() (for one transaction) to run
+in parallel with commit_ordered (in another transaction), increasing potential
+parallelism.
+
+The queue is simply a linked list of THD objects, linked through a
+THD::next_commit_ordered field. Since we add at the head of the queue, the
+list is actually in reverse order, so must be reversed when we grab and delete
+it.
+
+The reason that enqueue_atomic() returns the old queue is so that we can check
+if an insert goes to the head of the queue. The thread at the head of the
+queue will do the sequential part of group commit for everyone.
+
+
+1.2 Locks
+
+1.2.1 Global LOCK_prepare_ordered
+
+This lock is taken to serialise calls to prepare_ordered(). Note that
+effectively, the commit order is decided by the order in which threads obtain
+this lock.
+
+
+1.2.2 Global LOCK_group_commit and COND_group_commit
+
+This lock is used to protect the serial part of group commit. It is taken
+around the code where we grab the queue, call group_log_xid() on the queue,
+and call commit_ordered() on each element of the queue, to make sure they
+happen serialised and in consistent order. It also protects the variable
+group_commit_queue_busy, which is used when not using group_log_xid() to delay
+running over a new queue until the first queue is completely done.
+
+
+1.2.3 Global LOCK_commit_ordered
+
+This lock is taken around calls to commit_ordered(), to ensure they happen
+serialised.
+
+
+1.2.4 Per-thread thd->LOCK_commit_ordered and thd->COND_commit_ordered
+
+This lock protects the thd->group_commit_ready variable, as well as the
+condition variable used to wake up threads after log_xid() and
+commit_ordered() finishes.
+
+
+1.2.5 Global LOCK_group_commit_queue
+
+This is only used on platforms with no native compare-and-set operations, to
+make the queue operations atomic.
+
+
+1.3 Commit algorithm.
+
+This is the basic algorithm, simplified by
+
+ - omitting some error handling
+
+ - omitting looping over all handlers when invoking handler methods
+
+ - omitting some possible optimisations when not all calls are needed (see
+ next section).
+
+ - omitting the case where no group_log_xid() is used, see below.
+
+---- BEGIN ALGORITHM ----
+ ht->prepare()
+
+ // Call prepare_ordered() and enqueue in correct commit order
+ lock(LOCK_prepare_ordered)
+ ht->prepare_ordered()
+ old_queue= enqueue_atomic(thd)
+ thd->group_commit_ready= FALSE
+ is_group_commit_leader= (old_queue == NULL)
+ unlock(LOCK_prepare_ordered)
+
+ if (is_group_commit_leader)
+
+ // The first in queue handles group commit for everyone
+
+ lock(LOCK_group_commit)
+ // Wait while queue is busy, see below for when this occurs
+ while (group_commit_queue_busy)
+ cond_wait(COND_group_commit)
+
+ // Grab and reverse the queue to get correct order of transactions
+ queue= atomic_grab_reverse_queue()
+
+ // This call will set individual error codes in thd->xid_error
+ // It also sets the cookie for unlog() in thd->xid_cookie
+ group_log_xid(queue)
+
+ lock(LOCK_commit_ordered)
+ for (other IN queue)
+ if (!other->xid_error)
+ ht->commit_ordered()
+ unlock(LOCK_commit_ordered)
+
+ unlock(LOCK_group_commit)
+
+ // Now we are done, so wake up all the others.
+ for (other IN TAIL(queue))
+ lock(other->LOCK_commit_ordered)
+ other->group_commit_ready= TRUE
+ cond_signal(other->COND_commit_ordered)
+ unlock(other->LOCK_commit_ordered)
+ else
+ // If not the leader, just wait until leader did the work for us.
+ lock(thd->LOCK_commit_ordered)
+ while (!thd->group_commit_ready)
+ cond_wait(thd->LOCK_commit_ordered, thd->COND_commit_ordered)
+ unlock(thd->LOCK_commit_ordered)
+
+ // Finally do any error reporting now that we're back in own thread.
+ if (thd->xid_error)
+ xid_delayed_error(thd)
+ else
+ ht->commit(thd)
+ unlog(thd->xid_cookie, thd->xid)
+---- END ALGORITHM ----
+
+If the transaction coordinator does not support group_log_xid(), we have to do
+things differently. In this case after the serialisation point at
+prepare_ordered(), we have to parallelise again when running log_xid()
+(otherwise we would lose group commit). But then when log_xid() is done, we
+have to serialise again to check for any error and call commit_ordered() in
+correct sequence for any transaction where log_xid() did not return error.
+
+The central part of the algorithm in this case (when using log_xid()) is:
+
+---- BEGIN ALGORITHM ----
+ cookie= log_xid(thd)
+ error= (cookie == 0)
+
+ if (is_group_commit_leader)
+
+ // The first to enqueue grabs the queue and runs first.
+ // But we must wait until a previous queue run is fully done.
+
+ lock(LOCK_group_commit)
+ while (group_commit_queue_busy)
+ cond_wait(COND_group_commit)
+ queue= atomic_grab_reverse_queue()
+ // The queue will be busy until last thread in it is done.
+ group_commit_queue_busy= TRUE
+ unlock(LOCK_group_commit)
+ else
+ // Not first in queue -> wait for previous one to wake us up.
+ lock(thd->LOCK_commit_ordered)
+ while (!thd->group_commit_ready)
+ cond_wait(thd->LOCK_commit_ordered, thd->COND_commit_ordered)
+ unlock(thd->LOCK_commit_ordered)
+
+ if (!error) // Only if log_xid() was successful
+ lock(LOCK_commit_ordered)
+ ht->commit_ordered()
+ unlock(LOCK_commit_ordered)
+
+ // Wake up the next thread; the last one releases the queue.
+ next= thd->next_commit_ordered
+
+ if (next)
+ lock(next->LOCK_commit_ordered)
+ next->group_commit_ready= TRUE
+ cond_signal(next->COND_commit_ordered)
+ unlock(next->LOCK_commit_ordered)
+ else
+ lock(LOCK_group_commit)
+ group_commit_queue_busy= FALSE
+ unlock(LOCK_group_commit)
+---- END ALGORITHM ----
+
+There are a number of locks taken in the algorithm, but in the group_log_xid()
+case most of them should be uncontended most of the time. The
+LOCK_group_commit of course will be contended, as new threads queue up waiting
+for the previous group commit (and binlog fsync()) to finish so they can do
+the next group commit. This is the whole point of implementing group commit.
+
+The LOCK_prepare_ordered and LOCK_commit_ordered mutexes should not be much
+contended as long as handlers follow the intention of having the corresponding
+handler calls execute quickly.
+
+The per-thread LOCK_commit_ordered mutexes should not be contended; they are
+only used to wake up a sleeping thread.
+
+
+1.4 Optimisations when not using all three new calls
+
+
+The prepare_ordered(), group_log_xid(), and commit_ordered() methods are
+optional, and if not implemented by a particular handler/transaction
+coordinator, we can optimise the algorithm to take advantage of not having to
+keep ordering for the missing parts.
+
+If there is no prepare_ordered(), then we need not take the
+LOCK_prepare_ordered mutex.
+
+If there is no commit_ordered(), then we need not take the LOCK_commit_ordered
+mutex.
+
+If there is no group_log_xid(), then we only need the queue to ensure same
+ordering of transactions for commit_ordered() as for prepare_ordered(). Thus,
+if either of these (or both) are also not present, we do not need to use the
+queue at all.
+
+
+2. Binlog code changes (log.cc)
+
+
+The bulk of the work needed for the binary log is to extend the code to allow
+group commit to the log. Unlike InnoDB/XtraDB, there is no existing support
+inside the binlog code for group commit.
+
+The existing code runs most of the write + fsync to the binary log under the
+global LOCK_log mutex, preventing any group commit.
+
+To enable group commit, this code must be split into two parts:
+
+ - one part that runs per transaction, re-writing the embedded event positions
+ for the correct offset, and writing this into the in-memory log cache.
+
+ - another part that writes a set of transactions to the disk, and runs
+ fsync().
+
+Then in group_log_xid(), we can run the first part in a loop over all the
+transactions in the passed-in queue, and run the second part only once.
+
+The binlog code also has other code paths that write into the binlog,
+eg. non-transactional statements. These have to be adapted also to work with
+the new code.
+
+In order to get some group commit facility for these also, we change that part
+of the code in a similar way to ha_commit_trans. We keep another,
+binlog-internal queue of such non-transactional binlog writes, and such writes
+queue up here before sleeping on the LOCK_log mutex. Once a thread obtains the
+LOCK_log, it loops over the queue for the fast part, and does the slow part
+once, then finally wakes up the others in the queue.
+
+In the transactional case in group_log_xid(), before we run the passed-in
+queue, we add any members found in the binlog-internal queue. This allows
+these non-transactional writes to share the group commit.
+
+However, in the case where it is a non-transactional write that gets the
+LOCK_log, the transactions from the ha_commit_trans() queue will
+not be able to take part (they will have to wait for their turn to do another
+fsync). It seems difficult to cleanly let the binlog code grab the queue
+out of the ha_commit_trans() algorithm. I think the group commit is mostly
+useful in transactional workloads anyway (non-transactional engines will lose
+data anyway in case of crash, so why fsync() after each transaction?)
+
+
+3. XtraDB changes (ha_innodb.cc)
+
+The changes needed in XtraDB are comparatively simple, as XtraDB already
+implements group commit; it just needs to be enabled with the new
+commit_ordered() call.
+
+The existing commit() method already is logically in two parts. The first part
+runs under the prepare_commit_mutex and must be run in the same order as binlog
+commit. This part needs to be moved to commit_ordered(). The second part runs
+after releasing prepare_commit_mutex and does transaction log write+fsync; it
+can remain.
+
+Then the prepare_commit_mutex is removed (and the enable_unsafe_group_commit
+XtraDB option to disable it).
+
+There are two asserts that check that the thread running the first part of
+XtraDB commit is the same as the thread running the other operations for the
+transaction. These have to be removed (as commit_ordered() can run in a
+different thread). Also, error reporting with sql_print_error() has to be
+delayed until commit() time.
+
+
+4. Proof-of-concept implementation
+
+There is a proof-of-concept implementation of this architecture, in the form
+of a quilt patch series [3].
+
+A quick benchmark was done with sync_binlog=1 and
+innodb_flush_log_at_trx_commit=1, using 64 parallel threads doing single-row
+transactions against one table.
+
+Without the patch, we get only 25 queries per second.
+
+With the patch, we get 650 queries per second.
+
+
+5. Open issues/tasks
+
+5.1 XA / other prepare() and commit() call sites.
+
+Check that user-level XA is handled correctly and working, and is covered
+sufficiently with tests. Also check that any other calls of ha->prepare() and
+ha->commit() outside of ha_commit_trans() are handled correctly.
+
+5.2 Testing
+
+This worklog needs additions to the test suite, including error inserts to
+check error handling, and synchronisation points to check thread parallelism
+correctness.
+
+
+6. Alternative implementations
+
+ - The binlog code maintains its own extra atomic transaction queue to handle
+ non-transactional commits together with transactional ones (with
+ respect to group commit). Alternatively, we could ignore this issue and
+ just give up on group commit for non-transactional statements, for some
+ code simplifications.
+
+ - The binlog code has two ways to prepare end_event and similar, one that
+ uses stack-allocation, and another for when stack allocation is not
+ possible that uses thd->mem_root. Probably the overhead of thd->mem_root is
+ so small that it would make sense to use the same code for both cases.
+
+ - Instead of adding extra fields to THD, we could allocate a separate
+ structure on the thd->mem_root with the required extra fields (including
+ the THD pointer). Would seem to require initialising mutexes at every
+ commit though.
+
+ - It would probably be a good idea to implement TC_LOG_MMAP::group_log_xid()
+ (should not be hard).
+
+
+-----------------------------------------------------------------------
+
+References:
+
+[2] https://secure.wikimedia.org/wikipedia/en/wiki/ABA_problem
+
+[3] https://knielsen-hq.org/maria/patches.mwl116/
-=-=(Knielsen - Tue, 25 May 2010, 13:18)=-=-
High-Level Specification modified.
--- /tmp/wklog.116.old.14249 2010-05-25 13:18:34.000000000 +0000
+++ /tmp/wklog.116.new.14249 2010-05-25 13:18:34.000000000 +0000
@@ -1 +1,157 @@
+The basic idea in group commit is that multiple threads, each handling one
+transaction, prepare for commit and then queue up together waiting to do an
+fsync() on the transaction log. Then once the log is available, a single
+thread does the fsync() + other necessary book-keeping for all of the threads
+at once. After this, the single thread signals the other threads that it's
+done and they can finish up and return success (or failure) from the commit
+operation.
+
+So group commit has a parallel part, and a sequential part. So we need a
+facility for engines/binlog to participate in both the parallel and the
+sequential part.
+
+To do this, we add two new handlerton methods:
+
+ int (*prepare_ordered)(handlerton *hton, THD *thd, bool all);
+ void (*commit_ordered)(handlerton *hton, THD *thd, bool all);
+
+The idea is that the existing prepare() and commit() methods run in the
+parallel part of group commit, and the new prepare_ordered() and
+commit_ordered() run in the sequential part.
+
+The prepare_ordered() method is called after prepare(). The order of
+transactions that call into prepare_ordered() is guaranteed to be the same among
+all storage engines and binlog, and it is serialised so no two calls can be
+running inside the same engine at the same time.
+
+The commit_ordered() method is called before commit(), and similarly is
+guaranteed to have same transaction order in all participants, and to be
+serialised within one engine.
+
+As the prepare_ordered() and commit_ordered() calls are serialised, the idea
+is that handlers should do the minimum amount of work needed in these calls,
+delegating most of the work (eg. fsync() ...) to prepare() and commit().
+
+As a concrete example, for InnoDB the commit_ordered() method will do the
+first part of commit that fixes the commit order in the transaction log
+buffer, and the commit() method will write the log to disk and fsync()
+it. This split already exists inside the InnoDB code, running before
+respectively after releasing the prepare_commit_mutex.
+
+In addition, the XA transaction coordinator (TC_LOG) is special, since it is
+the one responsible for deciding whether to commit or rollback the
+transaction. For this we need an extra method, since this decision can be done
+only after we know that all prepare() and prepare_ordered() calls succeed, and
+must be done to know whether to call commit_ordered()/commit(), or do rollback.
+
+The existing method for this is TC_LOG::log_xid(). To make group commit
+simpler and more efficient to implement in a transaction coordinator,
+we introduce a new method:
+
+ void group_log_xid(THD *first_thd);
+
+This method runs in the sequential part of group commit. It receives a list of
+transactions to perform log_xid() on, in the correct commit order. (Note that
+TC_LOG can do parallel parts of group commit in its own prepare() and commit()
+methods).
+
+This method can make it easier to implement the group commit in TC_LOG, as it
+gets directly the list of transactions in the right order. Without it, it
+might need to compute such order anyway in a prepare_ordered() method, and the
+server has to create this ordered list anyway to implement the order guarantee
+for prepare_ordered() and commit_ordered().
+
+This group_log_xid() method is also more efficient, as it avoids some
+inter-thread synchronisation. Since group_log_xid() is serialised, we can run
+it together with all the commit_ordered() method calls and need only a single
+sequential code section. With the log_xid() methods, we would need first a
+sequential part for the prepare_ordered() calls, then a parallel part with
+log_xid() calls (to not lose group commit ability for log_xid()), then again
+a sequential part for the commit_ordered() method calls.
+
+The extra synchronisation is needed, as each commit_ordered() call will have
+to wait for log_xid() in one thread (if log_xid() fails then commit_ordered()
+should not be called), and also wait for commit_ordered() to finish in all
+threads handling earlier commits. In effect we will need to bounce the
+execution from one thread to the other among all participants in the group
+commit.
+
+As a consequence of the group_log_xid() optimisation, handlers must be aware
+that the commit_ordered() call can happen in another thread than the one
+running commit() (so thread local storage is not available). This should not
+be a big issue as the THD is available for storing any needed information.
+
+Since group_log_xid() runs for multiple transactions in a single thread, it
+can not do error reporting (my_error()) as that relies on thread local
+storage. Instead it sets an error code in THD::xid_error, and if there is an
+error then later another method will be called (in correct thread context) to
+actually report the error:
+
+ int xid_delayed_error(THD *thd)
+
+The three new methods prepare_ordered(), group_log_xid(), and commit_ordered()
+are optional (as is xid_delayed_error). A storage engine or transaction
+coordinator is free to not implement them if they are not needed. In this case
+there will be no order guarantee for the corresponding stage of group commit
+for that engine. For example, InnoDB needs no ordering of the prepare phase,
+so can omit implementing prepare_ordered(); TC_LOG_MMAP needs no ordering at
+all, so does not need to implement any of them.
+
+Note in particular that all existing engines (/binlog implementations if they
+exist) will work unmodified (though also without any group commit
+facilities or commit order guarantees).
+
+Using these new APIs, the work will be to
+
+ - In ha_commit_trans(), implement the correct semantics for the three new
+ calls.
+
+ - In XtraDB, use the new commit_ordered() call to remove the
+ prepare_commit_mutex (and resurrect group commit) without losing the
+ consistency with binlog commit order.
+
+ - In log.cc (binlog module), implement group_log_xid() to do group commit of
+ multiple transactions to the binlog with a single shared fsync() call.
+
+-----------------------------------------------------------------------
+Some possible alternatives for this worklog:
+
+ - We could eliminate the group_log_xid() method for a simpler API, at the
+ cost of extra synchronisation between threads to do in-order
+ commit_ordered() method calls. This would also allow calling
+ commit_ordered() in the correct thread context.
+
+ - Alternatively, we could eliminate log_xid() and require that all
+ transaction coordinators implement group_log_xid() instead, again for some
+ moderate simplification.
+
+ - At the moment there is no plugin actually using prepare_ordered(), so, it
+ could be removed from the design. But it fits in well, is efficient to
+ implement, and could be useful later (eg. for the requested feature of
+ releasing locks early in InnoDB).
+
+-----------------------------------------------------------------------
+Some possible follow-up projects after this is implemented:
+
+ - Add statistics about how efficient group commit is (#fsyncs/#commits in
+ each engine and binlog).
+
+ - Implement an XtraDB prepare_ordered() method that can release row locks
+ early (Mark Callaghan from Facebook advocates this, but need to determine
+ exactly how to do this safely).
+
+ - Implement a new crash recovery algorithm that uses the consistent commit
+ ordering to need fsync() only for the binlog. At crash recovery, any
+ missing transactions in an engine are replayed from the correct point in the
+ binlog (this point must be stored transactionally inside the engine, as
+ XtraDB already does today).
+
+ - Implement that START TRANSACTION WITH CONSISTENT SNAPSHOT 1) really gets a
+ consistent snapshot, with the same set of committed and not committed
+ transactions in all engines, 2) returns a corresponding consistent binlog
+ position. This should be easy to implement by piggybacking on the
+ synchronisation implemented for ha_commit_trans().
+
+ - Use this in XtraBackup to get consistent binlog position without having to
+ block all updates with FLUSH TABLES WITH READ LOCK.
-=-=(Knielsen - Tue, 25 May 2010, 13:18)=-=-
High Level Description modified.
--- /tmp/wklog.116.old.14234 2010-05-25 13:18:07.000000000 +0000
+++ /tmp/wklog.116.new.14234 2010-05-25 13:18:07.000000000 +0000
@@ -21,3 +21,69 @@
http://kristiannielsen.livejournal.com/12408.html
http://kristiannielsen.livejournal.com/12553.html
+----
+
+Implementing group commit in MySQL faces some challenges from the handler
+plugin architecture:
+
+1. Because storage engine handlers have transaction logs separate from the
+mysql binlog (and from each other), there are multiple fsync() calls per
+commit that need the group commit optimisation (2 per participating storage
+engine + 1 for binlog).
+
+2. The code handling commit is split in several places, in main server code
+and in storage engine code. With pluggable binlog it will be split even
+more. This requires a good abstract yet powerful API to be able to implement
+group commit simply and efficiently in plugins without the different parts
+having to rely on internals of the others.
+
+3. We want the order of commits to be the same in all engines participating in
+multiple transactions. This requirement is the reason that InnoDB currently
+breaks group commit with the infamous prepare_commit_mutex.
+
+While currently there is no server guarantee to get same commit order in
+engines and binlog (except for the InnoDB prepare_commit_mutex hack), there are
+several reasons why this could be desirable:
+
+ - InnoDB hot backup needs to be able to extract a binlog position that is
+ consistent with the hot backup to be able to provision a new slave, and
+ this is impossible without imposing at least partial consistent ordering
+ between InnoDB and binlog.
+
+ - Other backup methods could have similar needs, eg. XtraBackup or
+ `mysqldump --single-transaction`, to have consistent commit order between
+ binlog and storage engines without having to do FLUSH TABLES WITH READ LOCK
+ or similar expensive blocking operation. (other backup methods, like LVM
+ snapshot, don't need consistent commit order, as they can restore
+ out-of-order commits during crash recovery using XA).
+
+ - If we have consistent commit order, we can think about optimising commit to
+ need only one fsync (for binlog); lost commits in storage engines can then
+ be recovered from the binlog at crash recovery by re-playing against the
+ engine from a particular point in the binlog.
+
+ - With consistent commit order, we can get better semantics for START
+ TRANSACTION WITH CONSISTENT SNAPSHOT with multi-engine transactions (and we
+ could even get it to return also a matching binlog position). Currently,
+ this "CONSISTENT SNAPSHOT" can be inconsistent among multiple storage
+ engines.
+
+ - In InnoDB, the performance in the presence of hotspots can be improved if
+ we can release row locks early in the commit phase, but this requires that
+ we release them in the same order as commits in the binlog to ensure
+ consistency between master and slaves.
+
+ - There were some discussions around Galera [1] synchronous replication and
+ global transaction ID, suggesting that it needs consistent commit order
+ among participating engines.
+
+ - I believe there could be other applications for guaranteed consistent
+ commit order, and that the architecture described in this worklog can
+ implement such guarantee with reasonable overhead.
+
+
+References:
+
+[1] Galera: http://www.codership.com/products/galera_replication
+
-=-=(Knielsen - Tue, 25 May 2010, 08:28)=-=-
More thoughts on and changes to the architecture. Got to something now that I am satisfied with and
that seems to be able to handle all issues.
Implement new prepare_ordered and commit_ordered handler methods and the logic in ha_commit_trans().
Implement TC_LOG::group_log_xid() method and logic in ha_commit_trans().
Implement XtraDB part, using commit_ordered() rather than prepare_commit_mutex.
Fix test suite failures.
Proof-of-concept patch series complete now.
Do initial benchmark, getting good results. With 64 threads, see 26x improvement in queries-per-sec.
Next step: write up the architecture description.
Worked 21 hours and estimate 0 hours remain (original estimate increased by 21 hours).
-=-=(Knielsen - Wed, 12 May 2010, 06:41)=-=-
Started work on a Quilt patch series, refactoring the binlog code to prepare for implementing the
group commit, and working on the design of group commit in parallel.
Found and fixed several problems in error handling when writing to binlog.
Removed redundant table map version locking.
Split binlog writing into two parts in preparations for group commit. When ready to write to the
binlog, threads enter a queue, and the first thread in the queue handles the binlog writing for
everyone. When it obtains the LOCK_log, it first loops over all threads, executing the first part of
binlog writing (the write(2) syscall essentially). It then runs the second part (fsync(2)
essentially) only once, and then wakes up the remaining threads in the queue.
Still to be done:
Finish the proof-of-concept group commit patch, by 1) implementing the prepare_fast() and
commit_fast() callbacks in handler.cc, 2) moving the binlog thread enqueue from log_xid() to
binlog_prepare_fast(), 3) moving the fast part of InnoDB commit to innobase_commit_fast(), removing
the prepare_commit_mutex().
Write up the final design in this worklog.
Evaluate the design to see if we can do better/different.
Think about possible next steps, such as releasing innodb row locks early (in
innobase_prepare_fast), and doing crash recovery by replaying transactions from the binlog (removing
the need for engine durability and 2 of 3 fsync() in commit).
Worked 28 hours and estimate 0 hours remain (original estimate increased by 28 hours).
-=-=(Serg - Mon, 26 Apr 2010, 14:10)=-=-
Observers changed: Serg
DESCRIPTION:
Currently, in order to ensure that the server can recover after a crash to a
state in which storage engines and binary log are consistent with each other,
it is necessary to use XA with durable commits for both storage engines
(innodb_flush_log_at_trx_commit=1) and binary log (sync_binlog=1).
This is _very_ expensive, since the server needs to do three fsync() operations
for every commit, as there is no working group commit when the binary log is
enabled.
The idea is to
- Implement/fix group commit to work properly with the binary log enabled.
- (Optionally) avoid the need to fsync() in the engine, and instead rely on
replaying any lost transactions from the binary log against the engine
during crash recovery.
For background see these articles:
http://kristiannielsen.livejournal.com/12254.html
http://kristiannielsen.livejournal.com/12408.html
http://kristiannielsen.livejournal.com/12553.html
----
Implementing group commit in MySQL faces some challenges from the handler
plugin architecture:
1. Because storage engine handlers have separate transaction logs from the
MySQL binlog (and from each other), there are multiple fsync() calls per
commit that need the group commit optimisation (2 per participating storage
engine + 1 for binlog).
2. The code handling commit is split in several places, in main server code
and in storage engine code. With pluggable binlog it will be split even
more. This requires a good abstract yet powerful API to be able to implement
group commit simply and efficiently in plugins without the different parts
having to rely on internals of the others.
3. We want the order of commits to be the same in all engines participating in
multi-engine transactions. This requirement is the reason that InnoDB currently
breaks group commit with the infamous prepare_commit_mutex.
While currently there is no server guarantee to get the same commit order in
engines and binlog (except for the InnoDB prepare_commit_mutex hack), there are
several reasons why this could be desirable:
- InnoDB hot backup needs to be able to extract a binlog position that is
consistent with the hot backup to be able to provision a new slave, and
this is impossible without imposing at least partial consistent ordering
between InnoDB and binlog.
- Other backup methods could have similar needs, eg. XtraBackup or
`mysqldump --single-transaction`, to have consistent commit order between
binlog and storage engines without having to do FLUSH TABLES WITH READ LOCK
or similar expensive blocking operation. (other backup methods, like LVM
snapshot, don't need consistent commit order, as they can restore
out-of-order commits during crash recovery using XA).
- If we have consistent commit order, we can think about optimising commit to
need only one fsync (for binlog); lost commits in storage engines can then
be recovered from the binlog at crash recovery by re-playing against the
engine from a particular point in the binlog.
- With consistent commit order, we can get better semantics for START
TRANSACTION WITH CONSISTENT SNAPSHOT with multi-engine transactions (and we
could even get it to return also a matching binlog position). Currently,
this "CONSISTENT SNAPSHOT" can be inconsistent among multiple storage
engines.
- In InnoDB, the performance in the presence of hotspots can be improved if
we can release row locks early in the commit phase, but this requires that we
release them in the same order as commits in the binlog to ensure consistency
between master and slaves.
- There have been some discussions around Galera [1] synchronous replication
and global transaction ID, suggesting that it needs consistent commit order
among participating engines.
- I believe there could be other applications for guaranteed consistent
commit order, and that the architecture described in this worklog can
implement such guarantee with reasonable overhead.
References:
[1] Galera: http://www.codership.com/products/galera_replication
HIGH-LEVEL SPECIFICATION:
The basic idea in group commit is that multiple threads, each handling one
transaction, prepare for commit and then queue up together waiting to do an
fsync() on the transaction log. Then once the log is available, a single
thread does the fsync() + other necessary book-keeping for all of the threads
at once. After this, the single thread signals the other threads that it's
done and they can finish up and return success (or failure) from the commit
operation.
So group commit has a parallel part, and a sequential part. Thus we need a
facility for engines/binlog to participate in both the parallel and the
sequential part.
To do this, we add two new handlerton methods:
int (*prepare_ordered)(handlerton *hton, THD *thd, bool all);
void (*commit_ordered)(handlerton *hton, THD *thd, bool all);
The idea is that the existing prepare() and commit() methods run in the
parallel part of group commit, and the new prepare_ordered() and
commit_ordered() run in the sequential part.
The prepare_ordered() method is called after prepare(). The order of
transactions that call into prepare_ordered() is guaranteed to be the same among
all storage engines and binlog, and it is serialised so no two calls can be
running inside the same engine at the same time.
The commit_ordered() method is called before commit(), and similarly is
guaranteed to have same transaction order in all participants, and to be
serialised within one engine.
As the prepare_ordered() and commit_ordered() calls are serialised, the idea
is that handlers should do the minimum amount of work needed in these calls,
delegating most of the work (eg. fsync() ...) to prepare() and commit().
As a concrete example, for InnoDB the commit_ordered() method will do the
first part of commit that fixes the commit order in the transaction log
buffer, and the commit() method will write the log to disk and fsync()
it. This split already exists inside the InnoDB code, running before and
after releasing the prepare_commit_mutex, respectively.
In addition, the XA transaction coordinator (TC_LOG) is special, since it is
the one responsible for deciding whether to commit or roll back the
transaction. For this we need an extra method, since this decision can be
made only after we know that all prepare() and prepare_ordered() calls
succeeded, and must be made to know whether to call
commit_ordered()/commit(), or do rollback.
The existing method for this is TC_LOG::log_xid(). To make group commit
simpler and more efficient to implement in a transaction coordinator,
we introduce a new method:
void group_log_xid(THD *first_thd);
This method runs in the sequential part of group commit. It receives a list of
transactions to perform log_xid() on, in the correct commit order. (Note that
TC_LOG can do parallel parts of group commit in its own prepare() and commit()
methods).
This method can make it easier to implement group commit in TC_LOG, as it
directly gets the list of transactions in the right order. Without it, it
might need to compute such an order anyway in a prepare_ordered() method, and
the server has to create this ordered list anyway to implement the order
guarantee for prepare_ordered() and commit_ordered().
The group_log_xid() method is also more efficient, as it avoids some
inter-thread synchronisation. Since group_log_xid() is serialised, we can run
it together with all the commit_ordered() method calls and need only a single
sequential code section. With the log_xid() methods, we would need first a
sequential part for the prepare_ordered() calls, then a parallel part with
log_xid() calls (so as not to lose group commit ability for log_xid()), then again
a sequential part for the commit_ordered() method calls.
The extra synchronisation is needed, as each commit_ordered() call will have
to wait for log_xid() in one thread (if log_xid() fails then commit_ordered()
should not be called), and also wait for commit_ordered() to finish in all
threads handling earlier commits. In effect we will need to bounce the
execution from one thread to the other among all participants in the group
commit.
As a consequence of the group_log_xid() optimisation, handlers must be aware
that the commit_ordered() call can happen in another thread than the one
running commit() (so thread local storage is not available). This should not
be a big issue as the THD is available for storing any needed information.
Since group_log_xid() runs for multiple transactions in a single thread, it
can not do error reporting (my_error()) as that relies on thread local
storage. Instead it sets an error code in THD::xid_error, and if there is an
error then later another method will be called (in correct thread context) to
actually report the error:
int xid_delayed_error(THD *thd)
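A sketch of this error protocol follows. The THD member names are the ones
introduced above; log_one_xid() and report_error_in_own_thread() are
hypothetical stand-ins (the latter for my_error() and friends, which are
only safe in the transaction's own thread):

/* Simplified stand-in for the server's THD, with the fields this
   worklog adds. */
struct THD
{
  THD *next_commit_ordered;
  int  xid_error;     /* set by group_log_xid(), read later in own thread */
  int  xid_cookie;    /* cookie to pass to unlog() on success */
};

static int log_one_xid(THD *) { return 0; }     /* stand-in for real work */
static void report_error_in_own_thread(int) {}  /* stand-in for my_error() */

/* Runs once, in the leader thread, over the whole queue in commit order.
   It may only record errors, not report them. */
void group_log_xid(THD *first_thd)
{
  for (THD *thd= first_thd; thd; thd= thd->next_commit_ordered)
  {
    thd->xid_error= log_one_xid(thd);
    if (!thd->xid_error)
      thd->xid_cookie= 1;   /* whatever unlog() later needs */
  }
}

/* Called afterwards in each transaction's own thread, where thread local
   error reporting works again. */
int xid_delayed_error(THD *thd)
{
  report_error_in_own_thread(thd->xid_error);
  return thd->xid_error;
}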
The three new methods prepare_ordered(), group_log_xid(), and commit_ordered()
are optional (as is xid_delayed_error). A storage engine or transaction
coordinator is free to not implement them if they are not needed. In this case
there will be no order guarantee for the corresponding stage of group commit
for that engine. For example, InnoDB needs no ordering of the prepare phase,
so can omit implementing prepare_ordered(); TC_LOG_MMAP needs no ordering at
all, so does not need to implement any of them.
Note in particular that all existing engines (/binlog implementations if they
exist) will work unmodified (and also without any change in group commit
facilities or commit order guarantees).
Using these new APIs, the work will be to
- In ha_commit_trans(), implement the correct semantics for the three new
calls.
- In XtraDB, use the new commit_ordered() call to remove the
prepare_commit_mutex (and resurrect group commit) without losing the
consistency with binlog commit order.
- In log.cc (binlog module), implement group_log_xid() to do group commit of
multiple transactions to the binlog with a single shared fsync() call.
-----------------------------------------------------------------------
Some possible alternatives for this worklog:
- We could eliminate the group_log_xid() method for a simpler API, at the
cost of extra synchronisation between threads to do in-order
commit_ordered() method calls. This would also allow calling
commit_ordered() in the correct thread context.
- Alternatively, we could eliminate log_xid() and require that all
transaction coordinators implement group_log_xid() instead, again for some
moderate simplification.
- At the moment there is no plugin actually using prepare_ordered(), so it
could be removed from the design. But it fits in well, is efficient to
implement, and could be useful later (eg. for the requested feature of
releasing locks early in InnoDB).
-----------------------------------------------------------------------
Some possible follow-up projects after this is implemented:
- Add statistics about how efficient group commit is (#fsyncs/#commits in
each engine and binlog).
- Implement an XtraDB prepare_ordered() method that can release row locks
early (Mark Callaghan from Facebook advocates this, but need to determine
exactly how to do this safely).
- Implement a new crash recovery algorithm that uses the consistent commit
ordering to need only fsync() for the binlog. At crash recovery, any
missing transactions in an engine are replayed from the correct point in the
binlog (this point must be stored transactionally inside the engine, as
XtraDB already does today).
- Implement that START TRANSACTION WITH CONSISTENT SNAPSHOT 1) really gets a
consistent snapshot, with the same set of committed and not committed
transactions in all engines, 2) returns a corresponding consistent binlog
position. This should be easy by piggybacking on the synchronisation
implemented for ha_commit_trans().
- Use this in XtraBackup to get consistent binlog position without having to
block all updates with FLUSH TABLES WITH READ LOCK.
LOW-LEVEL DESIGN:
1. Changes for ha_commit_trans()
The guts of the code for commit are in the function ha_commit_trans() (and in
commit_one_phase() which is called from it). This must be extended to use the
new prepare_ordered(), group_log_xid(), and commit_ordered() calls.
1.1 Atomic queue of committing transactions
To keep the right commit order among participants, we put transactions into a
queue. The operations on the queue are non-locking:
- Insert THD at the head of the queue, and return old queue.
THD *enqueue_atomic(THD *thd)
- Fetch (and delete) the whole queue.
THD *atomic_grab_reverse_queue()
These are simple to implement with atomic compare-and-set. Note that there is
no ABA problem [2], as we do not delete individual elements from the queue;
we grab the whole queue and replace it with NULL.
A transaction enters the queue when it does prepare_ordered(). This way, the
scheduling order for prepare_ordered() calls is what determines the sequence
in the queue and effectively the commit order.
The queue is grabbed by the code doing group_log_xid() and commit_ordered()
calls. The queue is passed directly to group_log_xid(), and afterwards
iterated to do individual commit_ordered() calls.
Using a lock-free queue allows prepare_ordered() (for one transaction) to run
in parallel with commit_ordered() (for another transaction), increasing potential
parallelism.
The queue is simply a linked list of THD objects, linked through a
THD::next_commit_ordered field. Since we add at the head of the queue, the
list is actually in reverse order, so must be reversed when we grab and delete
it.
The reason that enqueue_atomic() returns the old queue is so that we can check
if an insert goes to the head of the queue. The thread at the head of the
queue will do the sequential part of group commit for everyone.
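A minimal sketch of these two queue operations, with std::atomic standing in
for the server's own compare-and-set primitives and a stripped-down THD:

#include <atomic>

struct THD { THD *next_commit_ordered; /* other members elided */ };

static std::atomic<THD*> group_commit_queue{nullptr};

/* Insert thd at the head of the queue and return the old head; a NULL
   return means thd is first and becomes the group commit leader. */
THD *enqueue_atomic(THD *thd)
{
  THD *old_head= group_commit_queue.load();
  do
    thd->next_commit_ordered= old_head;
  while (!group_commit_queue.compare_exchange_weak(old_head, thd));
  return old_head;
}

/* Grab the whole queue, replacing it with NULL (hence no ABA problem),
   and reverse the list so it ends up in commit order. */
THD *atomic_grab_reverse_queue()
{
  THD *queue= group_commit_queue.exchange(nullptr);
  THD *reversed= nullptr;
  while (queue)
  {
    THD *next= queue->next_commit_ordered;
    queue->next_commit_ordered= reversed;
    reversed= queue;
    queue= next;
  }
  return reversed;
}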
1.2 Locks
1.2.1 Global LOCK_prepare_ordered
This lock is taken to serialise calls to prepare_ordered(). Note that
effectively, the commit order is decided by the order in which threads obtain
this lock.
1.2.2 Global LOCK_group_commit and COND_group_commit
This lock is used to protect the serial part of group commit. It is taken
around the code where we grab the queue, call group_log_xid() on the queue,
and call commit_ordered() on each element of the queue, to make sure they
happen serialised and in consistent order. It also protects the variable
group_commit_queue_busy, which is used when not using group_log_xid() to delay
running over a new queue until the first queue is completely done.
1.2.3 Global LOCK_commit_ordered
This lock is taken around calls to commit_ordered(), to ensure they happen
serialised.
1.2.4 Per-thread thd->LOCK_commit_ordered and thd->COND_commit_ordered
This lock protects the thd->group_commit_ready variable, as well as the
condition variable used to wake up threads after log_xid() and
commit_ordered() finish (see the sketch below, after 1.2.5).
1.2.5 Global LOCK_group_commit_queue
This is only used on platforms with no native compare-and-set operations, to
make the queue operations atomic.
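To illustrate 1.2.4, here is a sketch of the follower wait and leader wakeup,
using std::mutex and std::condition_variable as stand-ins for the server's
mysys synchronisation primitives:

#include <mutex>
#include <condition_variable>

struct THD    /* stripped-down stand-in */
{
  std::mutex LOCK_commit_ordered;
  std::condition_variable COND_commit_ordered;
  bool group_commit_ready= false;
};

/* Follower side: sleep until the leader has run group_log_xid() /
   commit_ordered() on our behalf. */
void wait_for_group_commit(THD *thd)
{
  std::unique_lock<std::mutex> lk(thd->LOCK_commit_ordered);
  while (!thd->group_commit_ready)
    thd->COND_commit_ordered.wait(lk);
}

/* Leader side: wake up one participant whose work is done. */
void wake_participant(THD *other)
{
  {
    std::lock_guard<std::mutex> lk(other->LOCK_commit_ordered);
    other->group_commit_ready= true;
  }
  other->COND_commit_ordered.notify_one();
}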
1.3 Commit algorithm.
This is the basic algorithm, simplified by
- omitting some error handling
- omitting looping over all handlers when invoking handler methods
- omitting some possible optimisations when not all calls are needed (see
next section)
- omitting the case where no group_log_xid() is used (see below)
---- BEGIN ALGORITHM ----
ht->prepare()
// Call prepare_ordered() and enqueue in correct commit order
lock(LOCK_prepare_ordered)
ht->prepare_ordered()
old_queue= enqueue_atomic(thd)
thd->group_commit_ready= FALSE
is_group_commit_leader= (old_queue == NULL)
unlock(LOCK_prepare_ordered)
if (is_group_commit_leader)
// The first in queue handles group commit for everyone
lock(LOCK_group_commit)
// Wait while queue is busy, see below for when this occurs
while (group_commit_queue_busy)
cond_wait(COND_group_commit)
// Grab and reverse the queue to get correct order of transactions
queue= atomic_grab_reverse_queue()
// This call will set individual error codes in thd->xid_error
// It also sets the cookie for unlog() in thd->xid_cookie
group_log_xid(queue)
lock(LOCK_commit_ordered)
for (other IN queue)
if (!other->xid_error)
ht->commit_ordered()
unlock(LOCK_commit_ordered)
unlock(LOCK_group_commit)
// Now we are done, so wake up all the others.
for (other IN TAIL(queue))
lock(other->LOCK_commit_ordered)
other->group_commit_ready= TRUE
cond_signal(other->COND_commit_ordered)
unlock(other->LOCK_commit_ordered)
else
// If not the leader, just wait until leader did the work for us.
lock(thd->LOCK_commit_ordered)
while (!thd->group_commit_ready)
cond_wait(thd->LOCK_commit_ordered, thd->COND_commit_ordered)
unlock(thd->LOCK_commit_ordered)
// Finally do any error reporting now that we're back in own thread.
if (thd->xid_error)
xid_delayed_error(thd)
else
ht->commit(thd)
unlog(thd->xid_cookie, thd->xid)
---- END ALGORITHM ----
If the transaction coordinator does not support group_log_xid(), we have to do
things differently. In this case after the serialisation point at
prepare_ordered(), we have to parallelise again when running log_xid()
(otherwise we would lose group commit). But then when log_xid() is done, we
have to serialise again to check for any error and call commit_ordered() in
correct sequence for any transaction where log_xid() did not return error.
The central part of the algorithm in this case (when using log_xid()) is:
---- BEGIN ALGORITHM ----
cookie= log_xid(thd)
error= (cookie == 0)
if (is_group_commit_leader)
// The first to enqueue grabs the queue and runs first.
// But we must wait until a previous queue run is fully done.
lock(LOCK_group_commit)
while (group_commit_queue_busy)
cond_wait(COND_group_commit)
queue= atomic_grab_reverse_queue()
// The queue will be busy until last thread in it is done.
group_commit_queue_busy= TRUE
unlock(LOCK_group_commit)
else
// Not first in queue -> wait for previous one to wake us up.
lock(thd->LOCK_commit_ordered)
while (!thd->group_commit_ready)
cond_wait(thd->LOCK_commit_ordered, thd->COND_commit_ordered)
unlock(thd->LOCK_commit_ordered)
if (!error) // Only if log_xid() was successful
lock(LOCK_commit_ordered)
ht->commit_ordered()
unlock(LOCK_commit_ordered)
// Wake up the next thread, and release the queue busy flag if we are last.
next= thd->next_commit_ordered
if (next)
lock(next->LOCK_commit_ordered)
next->group_commit_ready= TRUE
cond_signal(next->COND_commit_ordered)
unlock(next->LOCK_commit_ordered)
else
lock(LOCK_group_commit)
group_commit_queue_busy= FALSE
unlock(LOCK_group_commit)
---- END ALGORITHM ----
There are a number of locks taken in the algorithm, but in the group_log_xid()
case most of them should be uncontended most of the time. The
LOCK_group_commit of course will be contended, as new threads queue up waiting
for the previous group commit (and binlog fsync()) to finish so they can do
the next group commit. This is the whole point of implementing group commit.
The LOCK_prepare_ordered and LOCK_commit_ordered mutexes should not be much
contended as long as handlers follow the intention that the corresponding
handler calls execute quickly.
The per-thread LOCK_commit_ordered mutexes should not be contended; they are
only used to wake up a sleeping thread.
1.4 Optimisations when not using all three new calls
The prepare_ordered(), group_log_xid(), and commit_ordered() methods are
optional, and if not implemented by a particular handler/transaction
coordinator, we can optimise the algorithm to take advantage of not having to
keep ordering for the missing parts.
If there is no prepare_ordered(), then we need not take the
LOCK_prepare_ordered mutex.
If there is no commit_ordered(), then we need not take the LOCK_commit_ordered
mutex.
If there is no group_log_xid(), then we only need the queue to ensure same
ordering of transactions for commit_ordered() as for prepare_ordered(). Thus,
if either of these (or both) are also not present, we do not need to use the
queue at all.
2. Binlog code changes (log.cc)
The bulk of the work needed for the binary log is to extend the code to allow
group commit to the log. Unlike InnoDB/XtraDB, there is no existing support
inside the binlog code for group commit.
The existing code runs most of the write + fsync to the binary log under the
global LOCK_log mutex, preventing any group commit.
To enable group commit, this code must be split into two parts:
- one part that runs per transaction, re-writing the embedded event positions
for the correct offset, and writing this into the in-memory log cache.
- another part that writes a set of transactions to the disk, and runs
fsync().
Then in group_log_xid(), we can run the first part in a loop over all the
transactions in the passed-in queue, and run the second part only once.
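A sketch of this split follows; std::vector stands in for the binlog's
in-memory IO_CACHE, the THD members are simplified stand-ins, and the event
position fix-ups are omitted:

#include <unistd.h>
#include <vector>

struct THD    /* stripped-down stand-in */
{
  THD *next_commit_ordered;
  const char *trx_cache;     /* the transaction's binlog cache */
  size_t trx_cache_len;
};

static std::vector<char> log_cache;   /* in-memory cache of the log file */
static int binlog_fd= -1;             /* the open binary log (stand-in) */

/* Part 1, per transaction: append the transaction's events to the
   in-memory log cache (position rewriting omitted). */
static void binlog_write_one(THD *thd)
{
  log_cache.insert(log_cache.end(), thd->trx_cache,
                   thd->trx_cache + thd->trx_cache_len);
}

/* Part 2, once per group: flush the cache to disk and fsync() it. */
static void binlog_flush_and_sync()
{
  write(binlog_fd, log_cache.data(), log_cache.size());
  log_cache.clear();
  fsync(binlog_fd);
}

/* group_log_xid() then runs part 1 in a loop, in commit order, and
   part 2 only once for the whole group. */
void group_log_xid(THD *first_thd)
{
  for (THD *thd= first_thd; thd; thd= thd->next_commit_ordered)
    binlog_write_one(thd);
  binlog_flush_and_sync();
}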
The binlog code also has other code paths that write into the binlog,
eg. non-transactional statements. These also have to be adapted to work with
the new code.
In order to get some group commit facility for these also, we change that part
of the code in a similar way to ha_commit_trans. We keep another,
binlog-internal queue of such non-transactional binlog writes, and such writes
queue up here before sleeping on the LOCK_log mutex. Once a thread obtains the
LOCK_log, it loops over the queue for the fast part, and does the slow part
once, then finally wakes up the others in the queue.
In the transactional case in group_log_xid(), before we run the passed-in
queue, we add any members found in the binlog-internal queue. This allows
these non-transactional writes to share the group commit.
However, in the case where it is a non-transactional write that gets the
LOCK_log, the transactions from the ha_commit_trans() queue will not be able
to take part (they will have to wait for their turn to do another fsync). It
seems difficult to cleanly let the binlog code grab the queue from out of the
ha_commit_trans() algorithm. I think group commit is mostly useful in
transactional workloads anyway (non-transactional engines will lose data
anyway in case of a crash, so why fsync() after each transaction?).
3. XtraDB changes (ha_innodb.cc)
The changes needed in XtraDB are comparatively simple, as XtraDB already
implements group commit; it just needs to be enabled with the new
commit_ordered() call.
The existing commit() method is already logically in two parts. The first part
runs under the prepare_commit_mutex and must run in the same order as binlog
commit. This part needs to be moved to commit_ordered(). The second part runs
after releasing the prepare_commit_mutex and does the transaction log
write+fsync; it can remain in commit().
Then the prepare_commit_mutex is removed (together with the
enable_unsafe_group_commit XtraDB option used to disable it).
There are two asserts that check that the thread running the first part of
XtraDB commit is the same as the thread running the other operations for the
transaction. These have to be removed (as commit_ordered() can run in a
different thread). Also, an error report done with sql_print_error() has to
be delayed until commit() time.
4. Proof-of-concept implementation
There is a proof-of-concept implementation of this architecture, in the form
of a quilt patch series [3].
A quick benchmark was done, with sync_binlog=1 and
innodb_flush_log_at_trx_commit=1. 64 parallel threads doing single-row
transactions against one table.
Without the patch, we get only 25 queries per second.
With the patch, we get 650 queries per second.
5. Open issues/tasks
5.1 XA / other prepare() and commit() call sites.
Check that user-level XA is handled correctly and working, and covered
sufficiently with tests. Also check that any other calls of ha->prepare() and
ha->commit() outside of ha_commit_trans() are handled correctly.
5.2 Testing
This worklog needs additions to the test suite, including error inserts to
check error handling, and synchronisation points to check thread parallelism
correctness.
6. Alternative implementations
- The binlog code maintains its own extra atomic transaction queue to handle
non-transactional commits in a good way together with transactional ones
(with respect to group commit). Alternatively, we could ignore this issue and
just give up on group commit for non-transactional statements, for some
code simplifications.
- The binlog code has two ways to prepare end_event and similar, one that
uses stack-allocation, and another for when stack allocation is not
possible that uses thd->mem_root. Probably the overhead of thd->mem_root is
so small that it would make sense to use the same code for both cases.
- Instead of adding extra fields to THD, we could allocate a separate
structure on the thd->mem_root with the required extra fields (including
the THD pointer). Would seem to require initialising mutexes at every
commit though.
- It would probably be a good idea to implement TC_LOG_MMAP::group_log_xid()
(should not be hard).
-----------------------------------------------------------------------
References:
[2] https://secure.wikimedia.org/wikipedia/en/wiki/ABA_problem
[3] https://knielsen-hq.org/maria/patches.mwl116/
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v3.5.9)
Name: 5.1 => 5.1-converting
lp:maria/5.1
https://code.launchpad.net/~maria-captains/maria/5.1-converting

[Maria-developers] Updated (by Knielsen): Add Sphinx storage engine to MariaDB (42)
by worklog-noreply@askmonty.org 28 May '10
by worklog-noreply@askmonty.org 28 May '10
28 May '10
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: Add Sphinx storage engine to MariaDB
CREATION DATE..: Mon, 10 Aug 2009, 23:57
SUPERVISOR.....: Monty
IMPLEMENTOR....: Knielsen
COPIES TO......:
CATEGORY.......: Server-Sprint
TASK ID........: 42 (http://askmonty.org/worklog/?tid=42)
VERSION........: Server-5.2
STATUS.........: Assigned
PRIORITY.......: 60
WORKED HOURS...: 0
ESTIMATE.......: 16 (hours remain)
ORIG. ESTIMATE.: 16
PROGRESS NOTES:
-=-=(Knielsen - Fri, 28 May 2010, 07:49)=-=-
High-Level Specification modified.
--- /tmp/wklog.42.old.4369 2010-05-28 07:49:13.000000000 +0000
+++ /tmp/wklog.42.new.4369 2010-05-28 07:49:13.000000000 +0000
@@ -49,6 +49,10 @@
might be possible to pre-generate the necessary data/index files and store
them in the source tree.
+I pushed a proof-of-concept patch for this here:
+
+ lp:~knielsen/maria/5.2-sphinxse
+
Here is a sample test case using this:
--source include/have_sphinx.inc
-=-=(Knielsen - Fri, 28 May 2010, 06:31)=-=-
High-Level Specification modified.
--- /tmp/wklog.42.old.32746 2010-05-28 06:31:24.000000000 +0000
+++ /tmp/wklog.42.new.32746 2010-05-28 06:31:24.000000000 +0000
@@ -1 +1,63 @@
+Code
+----
+
+Andrew Aksyonoff from Sphinx is helping to integrate the SphinxSE plugin into
+the MariaDB tree.
+
+It is a plugin, so it can be added to the tree just by including the
+sub-directory storage/sphinx/.
+
+The Sphinx plugin is already of some maturity, having been used with MySQL for
+some time.
+
+
+Testing
+-------
+
+To get testing in the mysql-test-run framework, some extensions are needed.
+
+To use the Sphinx storage engine, the external Sphinx search daemon needs to
+be running with some data directory containing indexed data. It also needs to
+be allocated a port.
+
+This is the intended approach:
+
+1. Testing will use an external Sphinx setup installed on the machine. Sphinx
+binaries will be searched for in typical locations (eg. /usr/bin, /usr/local/bin),
+or can be specified explicitly in the environment with SPHINXSEARCH_INDEXER
+and SPHINXSEARCH_SEARCHD for the two required binaries. If the external Sphinx
+binaries can not be found, then Sphinx tests will be disabled (using some
+--source include/have_sphinx.inc in the test cases).
+
+2. The mysql-test-run framework will install Sphinx search data and start/stop
+the Sphinx search daemon for the test cases, similarly to how it is done for
+the other servers mysqld, ndbd, etc. We will run the Sphinx search daemon with
+options --console, --config, and --pidfile.
+
+3. The mysql-test-run framework will generate a Sphinx config file from a
+template in mysql-test/suite/sphinx/my.cnf. This config file will allocate
+ports and data directories appropriate for avoiding conflicts between multiple
+simultaneous mysql-test-run executions. The Sphinx config file is sufficiently
+similar to MySQL my.cnf that we can use the existing framework for generating
+the config file, with just a slightly modified variant of the code writing the
+file to disk.
+
+4. The mysql-test-run framework will pre-load the mysql database with tables
+and data for Sphinx to index. It will then run the `indexer` program to
+generate the indexes, and then start the `searchd` daemon. These three steps
+must be done in order, as each step depends on the previous. ALTERNATIVE: it
+might be possible to pre-generate the necessary data/index files and store
+them in the source tree.
+
+Here is a sample test case using this:
+
+--source include/have_sphinx.inc
+--source include/have_sphinxse.inc
+
+--replace_result $SPHINXSEARCH_PORT SPHINXSEARCH_PORT
+eval create table ts ( id int unsigned not null, w int not null, q varchar(255)
+not null, index(q) ) engine=sphinx
+connection="sphinx://127.0.0.1:$SPHINXSEARCH_PORT/*";
+select * from ts where q='test';
+drop table ts;
-=-=(Knielsen - Fri, 28 May 2010, 06:07)=-=-
Version updated.
--- /tmp/wklog.42.old.32184 2010-05-28 06:07:00.000000000 +0000
+++ /tmp/wklog.42.new.32184 2010-05-28 06:07:00.000000000 +0000
@@ -1 +1 @@
-9.x
+Server-5.2
-=-=(Knielsen - Fri, 28 May 2010, 06:06)=-=-
Category updated.
--- /tmp/wklog.42.old.32171 2010-05-28 06:06:23.000000000 +0000
+++ /tmp/wklog.42.new.32171 2010-05-28 06:06:23.000000000 +0000
@@ -1 +1 @@
-Maria-BackLog
+Server-Sprint
-=-=(Knielsen - Fri, 28 May 2010, 06:06)=-=-
Version updated.
--- /tmp/wklog.42.old.32171 2010-05-28 06:06:23.000000000 +0000
+++ /tmp/wklog.42.new.32171 2010-05-28 06:06:23.000000000 +0000
@@ -1 +1 @@
-Maria-2.0
+9.x
-=-=(Knielsen - Fri, 28 May 2010, 06:06)=-=-
Status updated.
--- /tmp/wklog.42.old.32171 2010-05-28 06:06:23.000000000 +0000
+++ /tmp/wklog.42.new.32171 2010-05-28 06:06:23.000000000 +0000
@@ -1 +1 @@
-Un-Assigned
+Assigned
-=-=(Guest - Tue, 15 Sep 2009, 02:25)=-=-
no
Reported zero hours worked. Estimate unchanged.
-=-=(Guest - Tue, 15 Sep 2009, 02:24)=-=-
Version updated.
--- /tmp/wklog.42.old.13241 2009-09-15 02:24:07.000000000 +0300
+++ /tmp/wklog.42.new.13241 2009-09-15 02:24:07.000000000 +0300
@@ -1 +1 @@
-Connector/.NET-5.1
+Maria-2.0
DESCRIPTION:
Add the Sphinx storage engine to the MariaDB tree
HIGH-LEVEL SPECIFICATION:
Code
----
Andrew Aksyonoff from Sphinx is helping to integrate the SphinxSE plugin into
the MariaDB tree.
It is a plugin, so it can be added to the tree just by including the
sub-directory storage/sphinx/.
The Sphinx plugin is already of some maturity, having been used with MySQL for
some time.
Testing
-------
To get testing in the mysql-test-run framework, some extensions are needed.
To use the Sphinx storage engine, the external Sphinx search daemon needs to
be running with some data directory containing indexed data. It also needs to
be allocated a port.
This is the intended approach:
1. Testing will use an external Sphinx setup installed on the machine. Sphinx
binaries will be searched for in typical locations (eg. /usr/bin, /usr/local/bin),
or can be specified explicitly in the environment with SPHINXSEARCH_INDEXER
and SPHINXSEARCH_SEARCHD for the two required binaries. If the external Sphinx
binaries can not be found, then Sphinx tests will be disabled (using some
--source include/have_sphinx.inc in the test cases).
2. The mysql-test-run framework will install Sphinx search data and start/stop
the Sphinx search daemon for the test cases, similarly to how it is done for
the other servers mysqld, ndbd, etc. We will run the Sphinx search daemon with
options --console, --config, and --pidfile.
3. The mysql-test-run framework will generate a Sphinx config file from a
template in mysql-test/suite/sphinx/my.cnf. This config file will allocate
ports and data directories appropriate for avoiding conflicts between multiple
simultaneous mysql-test-run executions. The Sphinx config file is sufficiently
similar to MySQL my.cnf that we can use the existing framework for generating
the config file, with just a slightly modified variant of the code writing the
file to disk.
4. The mysql-test-run framework will pre-load the mysql database with tables
and data for Sphinx to index. It will then run the `indexer` program to
generate the indexes, and then start the `searchd` daemon. These three steps
must be done in order, as each step depends on the previous. ALTERNATIVE: it
might be possible to pre-generate the necessary data/index files and store
them in the source tree.
I pushed a proof-of-concept patch for this here:
lp:~knielsen/maria/5.2-sphinxse
Here is a sample test case using this:
--source include/have_sphinx.inc
--source include/have_sphinxse.inc
--replace_result $SPHINXSEARCH_PORT SPHINXSEARCH_PORT
eval create table ts ( id int unsigned not null, w int not null, q varchar(255)
not null, index(q) ) engine=sphinx
connection="sphinx://127.0.0.1:$SPHINXSEARCH_PORT/*";
select * from ts where q='test';
drop table ts;
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v3.5.9)

[Maria-developers] Updated (by Knielsen): Add Sphinx storage engine to MariaDB (42)
by worklog-noreply@askmonty.org 28 May '10
by worklog-noreply@askmonty.org 28 May '10
28 May '10
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: Add Sphinx storage engine to MariaDB
CREATION DATE..: Mon, 10 Aug 2009, 23:57
SUPERVISOR.....: Monty
IMPLEMENTOR....: Knielsen
COPIES TO......:
CATEGORY.......: Server-Sprint
TASK ID........: 42 (http://askmonty.org/worklog/?tid=42)
VERSION........: Server-5.2
STATUS.........: Assigned
PRIORITY.......: 60
WORKED HOURS...: 0
ESTIMATE.......: 16 (hours remain)
ORIG. ESTIMATE.: 16
PROGRESS NOTES:
-=-=(Knielsen - Fri, 28 May 2010, 06:31)=-=-
High-Level Specification modified.
--- /tmp/wklog.42.old.32746 2010-05-28 06:31:24.000000000 +0000
+++ /tmp/wklog.42.new.32746 2010-05-28 06:31:24.000000000 +0000
@@ -1 +1,63 @@
+Code
+----
+
+Andrew Aksyonoff from Sphinx is helping to integrate the SphinxSE plugin into
+the MariaDB tree.
+
+It is a plugin, so it can be added to the tree just by including the
+sub-directory storage/sphinx/.
+
+The Sphinx plugin is already of some maturity, having been used with MySQL for
+some time.
+
+
+Testing
+-------
+
+To get testing in the mysql-test-run framework, some extensions are needed.
+
+To use the Sphinx storage engine, the external Sphinx search daemon needs to
+be running with some data directory containing indexed data. It also needs to
+be allocated a port.
+
+This is the intended approach:
+
+1. Testing will use an external Sphinx setup installed on the machine. Sphinx
+binaries will be searched for in typical locations (eg. /usr/bin, /usr/local/bin),
+or can be specified explicitly in the environment with SPHINXSEARCH_INDEXER
+and SPHINXSEARCH_SEARCHD for the two required binaries. If the external Sphinx
+binaries can not be found, then Sphinx tests will be disabled (using some
+--source include/have_sphinx.inc in the test cases).
+
+2. The mysql-test-run framework will install Sphinx search data and start/stop
+the Sphinx search daemon for the test cases, similarly to how it is done for
+the other servers mysqld, ndbd, etc. We will run the Sphinx search daemon with
+options --console, --config, and --pidfile.
+
+3. The mysql-test-run framework will generate a Sphinx config file from a
+template in mysql-test/suite/sphinx/my.cnf. This config file will allocate
+ports and data directories appropriate for avoiding conflicts between multiple
+simultaneous mysql-test-run executions. The Sphinx config file is sufficiently
+similar to MySQL my.cnf that we can use the existing framework for generating
+the config file, with just a slightly modified variant of the code writing the
+file to disk.
+
+4. The mysql-test-run framework will pre-load the mysql database with tables
+and data for Sphinx to index. It will then run the `indexer` program to
+generate the indexes, and then start the `searchd` daemon. These three steps
+must be done in order, as each step depends on the previous. ALTERNATIVE: it
+might be possible to pre-generate the necessary data/index files and store
+them in the source tree.
+
+Here is a sample test case using this:
+
+--source include/have_sphinx.inc
+--source include/have_sphinxse.inc
+
+--replace_result $SPHINXSEARCH_PORT SPHINXSEARCH_PORT
+eval create table ts ( id int unsigned not null, w int not null, q varchar(255)
+not null, index(q) ) engine=sphinx
+connection="sphinx://127.0.0.1:$SPHINXSEARCH_PORT/*";
+select * from ts where q='test';
+drop table ts;
-=-=(Knielsen - Fri, 28 May 2010, 06:07)=-=-
Version updated.
--- /tmp/wklog.42.old.32184 2010-05-28 06:07:00.000000000 +0000
+++ /tmp/wklog.42.new.32184 2010-05-28 06:07:00.000000000 +0000
@@ -1 +1 @@
-9.x
+Server-5.2
-=-=(Knielsen - Fri, 28 May 2010, 06:06)=-=-
Category updated.
--- /tmp/wklog.42.old.32171 2010-05-28 06:06:23.000000000 +0000
+++ /tmp/wklog.42.new.32171 2010-05-28 06:06:23.000000000 +0000
@@ -1 +1 @@
-Maria-BackLog
+Server-Sprint
-=-=(Knielsen - Fri, 28 May 2010, 06:06)=-=-
Version updated.
--- /tmp/wklog.42.old.32171 2010-05-28 06:06:23.000000000 +0000
+++ /tmp/wklog.42.new.32171 2010-05-28 06:06:23.000000000 +0000
@@ -1 +1 @@
-Maria-2.0
+9.x
-=-=(Knielsen - Fri, 28 May 2010, 06:06)=-=-
Status updated.
--- /tmp/wklog.42.old.32171 2010-05-28 06:06:23.000000000 +0000
+++ /tmp/wklog.42.new.32171 2010-05-28 06:06:23.000000000 +0000
@@ -1 +1 @@
-Un-Assigned
+Assigned
-=-=(Guest - Tue, 15 Sep 2009, 02:25)=-=-
no
Reported zero hours worked. Estimate unchanged.
-=-=(Guest - Tue, 15 Sep 2009, 02:24)=-=-
Version updated.
--- /tmp/wklog.42.old.13241 2009-09-15 02:24:07.000000000 +0300
+++ /tmp/wklog.42.new.13241 2009-09-15 02:24:07.000000000 +0300
@@ -1 +1 @@
-Connector/.NET-5.1
+Maria-2.0
DESCRIPTION:
Add the Sphinx storage engine to the MariaDB tree
HIGH-LEVEL SPECIFICATION:
Code
----
Andrew Aksyonoff from Sphinx is helping to integrate the SphinxSE plugin into
the MariaDB tree.
It is a plugin, so it can be added to the tree just by including the
sub-directory storage/sphinx/.
The Sphinx plugin is already of some maturity, having been used with MySQL for
some time.
Testing
-------
To get testing in the mysql-test-run framework, some extensions are needed.
To use the Sphinx storage engine, the external Sphinx search daemon needs to
be running with some data directory containing indexed data. It also needs to
be allocated a port.
This is the intended approach:
1. Testing will use an external Sphinx setup installed on the machine. Sphinx
binaries will be searched for in typical locations (eg. /usr/bin, /usr/local/bin),
or can be specified explicitly in the environment with SPHINXSEARCH_INDEXER
and SPHINXSEARCH_SEARCHD for the two required binaries. If the external Sphinx
binaries can not be found, then Sphinx tests will be disabled (using some
--source include/have_sphinx.inc in the test cases).
2. The mysql-test-run framework will install Sphinx search data and start/stop
the Sphinx search daemon for the test cases, similarly to how it is done for
the other servers mysqld, ndbd, etc. We will run the Sphinx search daemon with
options --console, --config, and --pidfile.
3. The mysql-test-run framework will generate a Sphinx config file from a
template in mysql-test/suite/sphinx/my.cnf. This config file will allocate
ports and data directories appropriate for avoiding conflicts between multiple
simultaneous mysql-test-run executions. The Sphinx config file is sufficiently
similar to MySQL my.cnf that we can use the existing framework for generating
the config file, with just a slightly modified variant of the code writing the
file to disk.
4. The mysql-test-run framework will pre-load the mysql database with tables
and data for Sphinx to index. It will then run the `indexer` program to
generate the indexes, and then start the `searchd` daemon. These three steps
must be done in order, as each step depends on the previous. ALTERNATIVE: it
might be possible to pre-generate the necessary data/index files and store
them in the source tree.
Here is a sample test case using this:
--source include/have_sphinx.inc
--source include/have_sphinxse.inc
--replace_result $SPHINXSEARCH_PORT SPHINXSEARCH_PORT
eval create table ts ( id int unsigned not null, w int not null, q varchar(255)
not null, index(q) ) engine=sphinx
connection="sphinx://127.0.0.1:$SPHINXSEARCH_PORT/*";
select * from ts where q='test';
drop table ts;
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v3.5.9)

[Maria-developers] Updated (by Knielsen): Add Sphinx storage engine to MariaDB (42)
by worklog-noreply@askmonty.org 28 May '10
by worklog-noreply@askmonty.org 28 May '10
28 May '10
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: Add Sphinx storage engine to MariaDB
CREATION DATE..: Mon, 10 Aug 2009, 23:57
SUPERVISOR.....: Monty
IMPLEMENTOR....: Knielsen
COPIES TO......:
CATEGORY.......: Server-Sprint
TASK ID........: 42 (http://askmonty.org/worklog/?tid=42)
VERSION........: Server-5.2
STATUS.........: Assigned
PRIORITY.......: 60
WORKED HOURS...: 0
ESTIMATE.......: 16 (hours remain)
ORIG. ESTIMATE.: 16
PROGRESS NOTES:
-=-=(Knielsen - Fri, 28 May 2010, 06:31)=-=-
High-Level Specification modified.
--- /tmp/wklog.42.old.32746 2010-05-28 06:31:24.000000000 +0000
+++ /tmp/wklog.42.new.32746 2010-05-28 06:31:24.000000000 +0000
@@ -1 +1,63 @@
+Code
+----
+
+Andrew Aksyonoff from Sphinx is helping to integrate the SphinxSE plugin into
+the MariaDB tree.
+
+It is a plugin, so it can be added to the tree just by including the
+sub-directory storage/sphinx/.
+
+The Sphinx plugin is already of some maturity, having been used with MySQL for
+some time.
+
+
+Testing
+-------
+
+To get testing in the mysql-test-run framework, some extensions are needed.
+
+To use the Sphinx storage engine, the external Sphinx search daemon needs to
+be running with some data directory containing indexed data. It also needs to
+be allocated a port.
+
+This is the indended approach:
+
+1. Testing will use an external Sphinx setup installed on the machine. Sphinx
+binaries will be searched in typical locations (eg. /usr/bin, /usr/local/bin),
+or can be specified explicitly in the environment with SPHINXSEARCH_INDEXER
+and SPHINXSEARCH_SEARCHD for the two required binaries. If the external Sphinx
+binaries can not be found, then Sphinx tests will be disabled (using some
+--source include/have_sphinx.inc in the test cases).
+
+2. The mysql-test-run framework will install Sphinx search data and start/stop
+the Sphinx search daemon for the test cases, similarly how it is done for the
+other servers mysqld, ndbd, etc. We will run the Sphinx search daemon with
+options --console, --config, and --pidfile.
+
+3. The mysql-test-run framework will generate a Sphinx config file from a
+template in mysql-test/suite/sphinx/my.cnf. This config file will allocate
+ports and data directories appropriate for avoiding conflicts between multiple
+simultaneous mysql-test-run executions. The Sphinx config file is sufficiently
+similar to MySQL my.cnf that we can use the existing framework for generating
+config file, with just a slightly modified variant of the code writing the
+file to disk.
+
+4. The mysql-test-run framework will pre-load the mysql database with tables
+and data for Sphinx to index. It will then run the `indexer` program to
+generate the indexes, and then start the `searchd` daemon. These three steps
+must be done in order, as each step depends on the previous. ALTERNATIVE: it
+might be possible to pre-generate the necessary data/index files and store
+them in the source tree.
+
+Here is a sample test case using this:
+
+--source include/have_sphinx.inc
+--source include/have_sphinxse.inc
+
+--replace_result $SPHINXSEARCH_PORT SPHINXSEARCH_PORT
+eval create table ts ( id int unsigned not null, w int not null, q varchar(255)
+not null, index(q) ) engine=sphinx
+connection="sphinx://127.0.0.1:$SPHINXSEARCH_PORT/*";
+select * from ts where q='test';
+drop table ts;
-=-=(Knielsen - Fri, 28 May 2010, 06:07)=-=-
Version updated.
--- /tmp/wklog.42.old.32184 2010-05-28 06:07:00.000000000 +0000
+++ /tmp/wklog.42.new.32184 2010-05-28 06:07:00.000000000 +0000
@@ -1 +1 @@
-9.x
+Server-5.2
-=-=(Knielsen - Fri, 28 May 2010, 06:06)=-=-
Category updated.
--- /tmp/wklog.42.old.32171 2010-05-28 06:06:23.000000000 +0000
+++ /tmp/wklog.42.new.32171 2010-05-28 06:06:23.000000000 +0000
@@ -1 +1 @@
-Maria-BackLog
+Server-Sprint
-=-=(Knielsen - Fri, 28 May 2010, 06:06)=-=-
Version updated.
--- /tmp/wklog.42.old.32171 2010-05-28 06:06:23.000000000 +0000
+++ /tmp/wklog.42.new.32171 2010-05-28 06:06:23.000000000 +0000
@@ -1 +1 @@
-Maria-2.0
+9.x
-=-=(Knielsen - Fri, 28 May 2010, 06:06)=-=-
Status updated.
--- /tmp/wklog.42.old.32171 2010-05-28 06:06:23.000000000 +0000
+++ /tmp/wklog.42.new.32171 2010-05-28 06:06:23.000000000 +0000
@@ -1 +1 @@
-Un-Assigned
+Assigned
-=-=(Guest - Tue, 15 Sep 2009, 02:25)=-=-
no
Reported zero hours worked. Estimate unchanged.
-=-=(Guest - Tue, 15 Sep 2009, 02:24)=-=-
Version updated.
--- /tmp/wklog.42.old.13241 2009-09-15 02:24:07.000000000 +0300
+++ /tmp/wklog.42.new.13241 2009-09-15 02:24:07.000000000 +0300
@@ -1 +1 @@
-Connector/.NET-5.1
+Maria-2.0
DESCRIPTION:
Add the Sphinx storage engine to the MariaDB tree
HIGH-LEVEL SPECIFICATION:
Code
----
Andrew Aksyonoff from Sphinx is helping to integrate the SphinxSE plugin into
the MariaDB tree.
It is a plugin, so it can be added to the tree just by including the
sub-directory storage/sphinx/.
The Sphinx plugin is already of some maturity, having been used with MySQL for
some time.
Testing
-------
To get testing in the mysql-test-run framework, some extensions are needed.
To use the Sphinx storage engine, the external Sphinx search daemon needs to
be running with some data directory containing indexed data. It also needs to
be allocated a port.
This is the indended approach:
1. Testing will use an external Sphinx setup installed on the machine. Sphinx
binaries will be searched in typical locations (eg. /usr/bin, /usr/local/bin),
or can be specified explicitly in the environment with SPHINXSEARCH_INDEXER
and SPHINXSEARCH_SEARCHD for the two required binaries. If the external Sphinx
binaries can not be found, then Sphinx tests will be disabled (using some
--source include/have_sphinx.inc in the test cases).
2. The mysql-test-run framework will install Sphinx search data and start/stop
the Sphinx search daemon for the test cases, similarly how it is done for the
other servers mysqld, ndbd, etc. We will run the Sphinx search daemon with
options --console, --config, and --pidfile.
3. The mysql-test-run framework will generate a Sphinx config file from a
template in mysql-test/suite/sphinx/my.cnf. This config file will allocate
ports and data directories appropriate for avoiding conflicts between multiple
simultaneous mysql-test-run executions. The Sphinx config file is sufficiently
similar to MySQL my.cnf that we can use the existing framework for generating
config file, with just a slightly modified variant of the code writing the
file to disk.
4. The mysql-test-run framework will pre-load the mysql database with tables
and data for Sphinx to index. It will then run the `indexer` program to
generate the indexes, and then start the `searchd` daemon. These three steps
must be done in order, as each step depends on the previous. ALTERNATIVE: it
might be possible to pre-generate the necessary data/index files and store
them in the source tree.
Here is a sample test case using this setup:

--source include/have_sphinx.inc
--source include/have_sphinxse.inc

# Mask the dynamically allocated Sphinx port in the result file
--replace_result $SPHINXSEARCH_PORT SPHINXSEARCH_PORT
eval create table ts ( id int unsigned not null, w int not null,
  q varchar(255) not null, index(q) ) engine=sphinx
  connection="sphinx://127.0.0.1:$SPHINXSEARCH_PORT/*";
select * from ts where q='test';
drop table ts;
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v3.5.9)

[Maria-developers] Updated (by Knielsen): Add Sphinx storage engine to MariaDB (42)
by worklog-noreply@askmonty.org 28 May '10
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: Add Sphinx storage engine to MariaDB
CREATION DATE..: Mon, 10 Aug 2009, 23:57
SUPERVISOR.....: Monty
IMPLEMENTOR....: Knielsen
COPIES TO......:
CATEGORY.......: Server-Sprint
TASK ID........: 42 (http://askmonty.org/worklog/?tid=42)
VERSION........: Server-5.2
STATUS.........: Assigned
PRIORITY.......: 60
WORKED HOURS...: 0
ESTIMATE.......: 16 (hours remain)
ORIG. ESTIMATE.: 16
PROGRESS NOTES:
-=-=(Knielsen - Fri, 28 May 2010, 06:07)=-=-
Version updated.
--- /tmp/wklog.42.old.32184 2010-05-28 06:07:00.000000000 +0000
+++ /tmp/wklog.42.new.32184 2010-05-28 06:07:00.000000000 +0000
@@ -1 +1 @@
-9.x
+Server-5.2
-=-=(Knielsen - Fri, 28 May 2010, 06:06)=-=-
Category updated.
--- /tmp/wklog.42.old.32171 2010-05-28 06:06:23.000000000 +0000
+++ /tmp/wklog.42.new.32171 2010-05-28 06:06:23.000000000 +0000
@@ -1 +1 @@
-Maria-BackLog
+Server-Sprint
-=-=(Knielsen - Fri, 28 May 2010, 06:06)=-=-
Version updated.
--- /tmp/wklog.42.old.32171 2010-05-28 06:06:23.000000000 +0000
+++ /tmp/wklog.42.new.32171 2010-05-28 06:06:23.000000000 +0000
@@ -1 +1 @@
-Maria-2.0
+9.x
-=-=(Knielsen - Fri, 28 May 2010, 06:06)=-=-
Status updated.
--- /tmp/wklog.42.old.32171 2010-05-28 06:06:23.000000000 +0000
+++ /tmp/wklog.42.new.32171 2010-05-28 06:06:23.000000000 +0000
@@ -1 +1 @@
-Un-Assigned
+Assigned
-=-=(Guest - Tue, 15 Sep 2009, 02:25)=-=-
no
Reported zero hours worked. Estimate unchanged.
-=-=(Guest - Tue, 15 Sep 2009, 02:24)=-=-
Version updated.
--- /tmp/wklog.42.old.13241 2009-09-15 02:24:07.000000000 +0300
+++ /tmp/wklog.42.new.13241 2009-09-15 02:24:07.000000000 +0300
@@ -1 +1 @@
-Connector/.NET-5.1
+Maria-2.0
DESCRIPTION:
Add the Sphinx storage engine to the MariaDB tree
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v3.5.9)

[Maria-developers] Updated (by Knielsen): Add Sphinx storage engine to MariaDB (42)
by worklog-noreply@askmonty.org 28 May '10
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: Add Sphinx storage engine to MariaDB
CREATION DATE..: Mon, 10 Aug 2009, 23:57
SUPERVISOR.....: Monty
IMPLEMENTOR....: Knielsen
COPIES TO......:
CATEGORY.......: Server-Sprint
TASK ID........: 42 (http://askmonty.org/worklog/?tid=42)
VERSION........: 9.x
STATUS.........: Assigned
PRIORITY.......: 60
WORKED HOURS...: 0
ESTIMATE.......: 16 (hours remain)
ORIG. ESTIMATE.: 16
PROGRESS NOTES:
-=-=(Knielsen - Fri, 28 May 2010, 06:06)=-=-
Category updated.
--- /tmp/wklog.42.old.32171 2010-05-28 06:06:23.000000000 +0000
+++ /tmp/wklog.42.new.32171 2010-05-28 06:06:23.000000000 +0000
@@ -1 +1 @@
-Maria-BackLog
+Server-Sprint
-=-=(Knielsen - Fri, 28 May 2010, 06:06)=-=-
Version updated.
--- /tmp/wklog.42.old.32171 2010-05-28 06:06:23.000000000 +0000
+++ /tmp/wklog.42.new.32171 2010-05-28 06:06:23.000000000 +0000
@@ -1 +1 @@
-Maria-2.0
+9.x
-=-=(Knielsen - Fri, 28 May 2010, 06:06)=-=-
Status updated.
--- /tmp/wklog.42.old.32171 2010-05-28 06:06:23.000000000 +0000
+++ /tmp/wklog.42.new.32171 2010-05-28 06:06:23.000000000 +0000
@@ -1 +1 @@
-Un-Assigned
+Assigned
-=-=(Guest - Tue, 15 Sep 2009, 02:25)=-=-
no
Reported zero hours worked. Estimate unchanged.
-=-=(Guest - Tue, 15 Sep 2009, 02:24)=-=-
Version updated.
--- /tmp/wklog.42.old.13241 2009-09-15 02:24:07.000000000 +0300
+++ /tmp/wklog.42.new.13241 2009-09-15 02:24:07.000000000 +0300
@@ -1 +1 @@
-Connector/.NET-5.1
+Maria-2.0
DESCRIPTION:
Add the Sphinx storage engine to the MariaDB tree
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v3.5.9)
