[Maria-developers] WL#10 Updated (by Sergei): Microseconds
by worklog-noreply@askmonty.org 29 Jun '10
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: Microseconds
CREATION DATE..: Thu, 26 Mar 2009, 00:29
SUPERVISOR.....: Monty
IMPLEMENTOR....:
COPIES TO......:
CATEGORY.......: Server-RawIdeaBin
TASK ID........: 10 (http://askmonty.org/worklog/?tid=10)
VERSION........: Server-5.3
STATUS.........: Un-Assigned
PRIORITY.......: 60
WORKED HOURS...: 0
ESTIMATE.......: 80 (hours remain)
ORIG. ESTIMATE.: 80
PROGRESS NOTES:
-=-=(Sergei - Tue, 29 Jun 2010, 14:03)=-=-
Category updated.
--- /tmp/wklog.10.old.31970 2010-06-29 14:03:01.000000000 +0000
+++ /tmp/wklog.10.new.31970 2010-06-29 14:03:01.000000000 +0000
@@ -1 +1 @@
-Server-Sprint
+Server-RawIdeaBin
-=-=(Monty - Fri, 29 Jan 2010, 19:05)=-=-
Version updated.
--- /tmp/wklog.10.old.5698 2010-01-29 19:05:42.000000000 +0200
+++ /tmp/wklog.10.new.5698 2010-01-29 19:05:42.000000000 +0200
@@ -1 +1 @@
-Server-5.2
+Server-5.3
DESCRIPTION:
Add microsecond precision to NOW()
Add new field types for time and datetime with microsecond precision
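To make the goal concrete, here is a minimal SQL sketch of the kind of usage this
task aims to enable; the fractional-seconds syntax shown (TIME(6), DATETIME(6),
NOW(6)) is an assumption for illustration and is not fixed by this entry:

  -- Hypothetical usage sketch; the column types and NOW(6) below are
  -- assumptions about the eventual user-visible interface.
  CREATE TABLE t_events (
    id         INT PRIMARY KEY,
    created_at DATETIME(6),     -- datetime with microsecond precision
    duration   TIME(6)          -- time with microsecond precision
  );

  INSERT INTO t_events VALUES (1, NOW(6), '00:00:01.250000');

  -- MICROSECOND() returns the fractional-seconds part, 0..999999
  SELECT id, created_at, MICROSECOND(created_at) FROM t_events;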
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v4.0.0)

[Maria-developers] WL#24 Updated (by Sergei): index_merge: fair choice between index_merge union and range access
by worklog-noreply@askmonty.org 29 Jun '10
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: index_merge: fair choice between index_merge union and range access
CREATION DATE..: Tue, 26 May 2009, 12:10
SUPERVISOR.....: Monty
IMPLEMENTOR....:
COPIES TO......: Psergey
CATEGORY.......: Server-RawIdeaBin
TASK ID........: 24 (http://askmonty.org/worklog/?tid=24)
VERSION........: Server-9.x
STATUS.........: Un-Assigned
PRIORITY.......: 60
WORKED HOURS...: 0
ESTIMATE.......: 0 (hours remain)
ORIG. ESTIMATE.: 0
PROGRESS NOTES:
-=-=(Sergei - Tue, 29 Jun 2010, 14:00)=-=-
Category updated.
--- /tmp/wklog.24.old.31772 2010-06-29 14:00:05.000000000 +0000
+++ /tmp/wklog.24.new.31772 2010-06-29 14:00:05.000000000 +0000
@@ -1 +1 @@
-Server-Sprint
+Server-RawIdeaBin
-=-=(Guest - Sun, 16 Aug 2009, 02:13)=-=-
Low Level Design modified.
--- /tmp/wklog.24.old.23383 2009-08-16 02:13:54.000000000 +0300
+++ /tmp/wklog.24.new.23383 2009-08-16 02:13:54.000000000 +0300
@@ -125,7 +125,7 @@
The optimizer will generate the plans that only use the "col1=c1" part. The
right side of the AND will be ignored even if it has good selectivity.
-(Here an imerge for col2=c2 OR col3=c3 won't be built since neither col2=c2 nor
+(Here no imerge for col2=c2 OR col3=c3 will be built since neither col2=c2 nor
col3=c3 represent index ranges.)
@@ -199,7 +199,7 @@
O2. "Create index_merge accesses when possible"
Current tree_or() will not create index_merge access when it could create
- non-index merge access (see DISCARD-IMERGE-3 and its example in the "Problems
+ non-index merge access (see DISCARD-IMERGE-2 and its example in the "Problems
in the current implementation" section). This will be changed to work as
follows: we will create index_merge made for index scans that didn't have
their match in the other sel_tree.
-=-=(Guest - Sun, 16 Aug 2009, 01:03)=-=-
Low Level Design modified.
--- /tmp/wklog.24.old.20767 2009-08-16 01:03:11.000000000 +0300
+++ /tmp/wklog.24.new.20767 2009-08-16 01:03:11.000000000 +0300
@@ -18,6 +18,8 @@
# a range tree has range access options, possibly for several keys
range_tree = range(key1) AND range(key2) AND ... AND range(keyN);
+ (here range(keyi) may represent ranges not for initial keyi prefixes,
+ but ranges for any infixes for keyi)
# merge tree represents several way to index_merge
imerge_tree = imerge1 AND imerge2 AND ...
@@ -47,13 +49,13 @@
R.add(range_union(A.range(i), B.range(i)));
if (R has at least one range access)
- return R;
+ return R; // DISCARD-IMERGE-2
else
{
/* could not build any range accesses. construct index_merge */
- remove non-ranges from A; // DISCARD-IMERGE-2
+ remove non-ranges from A;
remove non-ranges from B;
- return new index_merge(A, B);
+ return new index_merge(A, B); // DISCARD-IMERGE-3
}
}
else if (A is range tree and B is index_merge tree (or vice versa))
@@ -65,12 +67,12 @@
(range_treeB_11 OR range_treeB_12 OR ... OR range_treeB_1N) AND
(range_treeB_21 OR range_treeB_22 OR ... OR range_treeB_2N) AND
...
- (range_treeB_K1 OR range_treeB_K2 OR ... OR range_treeB_kN) AND
+ (range_treeB_K1 OR range_treeB_K2 OR ... OR range_treeB_kN)
=
(range_treeA OR range_treeB_11 OR ... OR range_treeB_1N) AND
(range_treeA OR range_treeB_21 OR ... OR range_treeB_2N) AND
...
- (range_treeA OR range_treeB_11 OR ... OR range_treeB_1N) AND
+ (range_treeA OR range_treeB_11 OR ... OR range_treeB_1N)
Now each line represents an index_merge..
}
@@ -82,18 +84,18 @@
OR
imergeB1 AND imergeB2 AND ... AND imergeBN
- -> (discard all imergeA{i=2,3,...} -> // DISCARD-IMERGE-3
+ -> (discard all imergeA{i=2,3,...} -> // DISCARD-IMERGE-4
imergeA1
OR
- imergeB1 AND imergeB2 AND ... AND imergeBN =
+ imergeB1 =
- = (combine imergeA1 with each of the imergeB{i} ) =
+ = (combine imergeA1 with each of the range_treeB_1{i} ) =
- combine(imergeA1 OR imergeB1) AND
- combine(imergeA1 OR imergeB2) AND
+ combine(imergeA1 OR range_treeB_11) AND
+ combine(imergeA1 OR range_treeB_12) AND
... AND
- combine(imergeA1 OR imergeBN)
+ combine(imergeA1 OR range_treeB_1N)
}
}
@@ -109,7 +111,7 @@
DISCARD-IMERGE-2 step will cause index_merge option to be discarded when
the WHERE clause has this form (conditions t.badkey may have abritrary form):
- (t.badkey<c1 AND t.key1=c1) OR (t.key1=c2 AND t.badkey < c2)
+ (t.badkey<c1 AND t.key1=c1) OR (t.key2=c2 AND t.badkey < c2)
DISCARD-IMERGE-3 manifests itself as the following effect: suppose there are
two indexes:
@@ -123,6 +125,8 @@
The optimizer will generate the plans that only use the "col1=c1" part. The
right side of the AND will be ignored even if it has good selectivity.
+(Here an imerge for col2=c2 OR col3=c3 won't be built since neither col2=c2 nor
+col3=c3 represent index ranges.)
2. New implementation
-=-=(Guest - Mon, 20 Jul 2009, 17:13)=-=-
Dependency deleted: 30 no longer depends on 24
-=-=(Guest - Sat, 20 Jun 2009, 09:34)=-=-
Low Level Design modified.
--- /tmp/wklog.24.old.21663 2009-06-20 09:34:48.000000000 +0300
+++ /tmp/wklog.24.new.21663 2009-06-20 09:34:48.000000000 +0300
@@ -4,6 +4,7 @@
2. New implementation
2.1 New tree_and()
2.2 New tree_or()
+3. Testing and required coverage
</contents>
1. Current implementation overview
@@ -240,3 +241,14 @@
In order to limit the impact of this combinatorial explosion, we will
introduce a rule that we won't generate more than #defined
MAX_IMERGE_OPTS options.
+
+3. Testing and required coverage
+================================
+So far could find the following user cases:
+
+* BUG#17259: Query optimizer chooses wrong index
+* BUG#17673: Optimizer does not use Index Merge optimization in some cases
+* BUG#23322: Optimizer sometimes erroniously prefers other index over index merge
+* BUG#30151: optimizer is very reluctant to chose index_merge algorithm
+
+
-=-=(Guest - Thu, 18 Jun 2009, 16:55)=-=-
Low Level Design modified.
--- /tmp/wklog.24.old.19152 2009-06-18 16:55:00.000000000 +0300
+++ /tmp/wklog.24.new.19152 2009-06-18 16:55:00.000000000 +0300
@@ -141,13 +141,15 @@
Operations on SEL_ARG trees will be modified to produce/process the trees of
this kind:
+
2.1 New tree_and()
------------------
In order not to lose plans, we'll make these changes:
-1. Don't remove index_merge part of the tree.
+A1. Don't remove index_merge part of the tree (this will take care of
+ DISCARD-IMERGE-1 problem)
-2. Push range conditions down into index_merge trees that may support them.
+A2. Push range conditions down into index_merge trees that may support them.
if one tree has range(key1) and the other tree has imerge(key1 OR key2)
then perform an equvalent of this operation:
@@ -155,8 +157,86 @@
(rangeA(key1) AND rangeB(key1)) OR (rangeA(key1) AND rangeB(key2))
-3. Just as before: if both sel_tree A and sel_tree B have index_merge options,
+A3. Just as before: if both sel_tree A and sel_tree B have index_merge options,
concatenate them together.
-2.2 New tree_or()
+2.2 New tree_or()
+-----------------
+O1. Dont remove non-range plans:
+ Current tree_or() code will refuse to produce index_merge plans for
+ conditions like
+
+ "t.key1part2=const OR t.key2part1=const"
+
+ (this is marked as DISCARD-IMERGE-3). This was justifed as the left part of
+ the AND condition is not usable for range access, and the operation of
+ tree_and() guaranteed that there was no way it could changed to make a
+ usable range plan. With new tree_and() and rule A2, this is no longer the
+ case. For example for this query:
+
+ (t.key1part2=const OR t.key2part1=const) AND t.key1part1=const
+
+ it will construct a
+
+ imerge(t.key1part2=const OR t.key2part1=const), range(t.key1part1=const)
+
+ then tree_and() will apply rule A2 to push the range down into index merge
+ and after that we'll have:
+
+ range(t.key1part1=const)
+ imerge(
+ t.key1part2=const AND t.key1part1=const,
+ t.key2part1=const
+ )
+ note that imerge(...) describes a usable index_merge plan and it's possible
+ that it will be the best access path.
+
+O2. "Create index_merge accesses when possible"
+ Current tree_or() will not create index_merge access when it could create
+ non-index merge access (see DISCARD-IMERGE-3 and its example in the "Problems
+ in the current implementation" section). This will be changed to work as
+ follows: we will create index_merge made for index scans that didn't have
+ their match in the other sel_tree.
+ Ilustrating it with an example:
+
+ | sel_tree_A | sel_tree_B | A or B | include in index_merge?
+ ------+------------+------------+--------+------------------------
+ key1 | cond1 | cond2 | condM | no
+ key2 | cond3 | cond4 | NULL | no
+ key3 | cond5 | | | yes, A-side
+ key4 | cond6 | | | yes, A-side
+ key5 | | cond7 | | yes, B-side
+ key6 | | cond8 | | yes, B-side
+
+ here we assume that
+ - (cond1 OR cond2) did produce a combined range. Not including them in
+ index_merge.
+ - (cond3 OR cond4) didn't produce a usable range (e.g. they were
+ t.key1part1=c1 AND t.key1part2=c1, respectively, and combining them
+ didn't yield any range list)
+ - All other scand didn't have their counterparts, so we'll end up with a
+ SEL_TREE of:
+
+ range(condM) AND index_merge((cond5 AND cond6),(cond7 AND cond8))
+ .
+
+O4. There is no O4. DISCARD-INDEX-MERGE-4 will remain there. The idea is
+that although DISCARD-INDEX-MERGE-4 does discard plans, so far we haven
+seen any complaints that could be attributed to it.
+If we face the need to lift DISCARD-INDEX-MERGE-4, our answer will be to
+lift it ,and produce a cross-product:
+
+ ((key1p OR key2p) AND (key3p OR key4p))
+ OR
+ ((key5p OR key6p) AND (key7p OR key8p))
+
+ = (key1p OR key2p OR key5p OR key6p) AND // this part is currently
+ (key3p OR key4p OR key5p OR key6p) AND // produced
+
+ (key1p OR key2p OR key5p OR key6p) AND // this part will be added
+ (key3p OR key4p OR key5p OR key6p) //.
+
+In order to limit the impact of this combinatorial explosion, we will
+introduce a rule that we won't generate more than #defined
+MAX_IMERGE_OPTS options.
-=-=(Guest - Thu, 18 Jun 2009, 14:56)=-=-
Low Level Design modified.
--- /tmp/wklog.24.old.15612 2009-06-18 14:56:09.000000000 +0300
+++ /tmp/wklog.24.new.15612 2009-06-18 14:56:09.000000000 +0300
@@ -1 +1,162 @@
+<contents>
+1. Current implementation overview
+1.1. Problems in the current implementation
+2. New implementation
+2.1 New tree_and()
+2.2 New tree_or()
+</contents>
+
+1. Current implementation overview
+==================================
+At the moment, range analyzer works as follows:
+
+SEL_TREE structure represents
+
+ # There are sel_trees, a sel_tree is either range or merge tree
+ sel_tree = range_tree | imerge_tree
+
+ # a range tree has range access options, possibly for several keys
+ range_tree = range(key1) AND range(key2) AND ... AND range(keyN);
+
+ # merge tree represents several way to index_merge
+ imerge_tree = imerge1 AND imerge2 AND ...
+
+ # a way to do index merge == a set to use of different indexes.
+ imergeX = range_tree1 OR range_tree2 OR ..
+ where no pair of range_treeX have ranges over the same index.
+
+
+ tree_and(A, B)
+ {
+ if (both A and B are range trees)
+ return a range_tree with computed intersection for each range;
+ if (only one of A and B is a range tree)
+ return that tree; // DISCARD-IMERGE-1
+ // at this point both trees are index_merge trees
+ return concat_lists( A.imerge1 ... A.imergeN, B.imerge1 ... B.imergeN);
+ }
+
+
+ tree_or(A, B)
+ {
+ if (A and B are range trees)
+ {
+ R = new range_tree;
+ for each index i
+ R.add(range_union(A.range(i), B.range(i)));
+
+ if (R has at least one range access)
+ return R;
+ else
+ {
+ /* could not build any range accesses. construct index_merge */
+ remove non-ranges from A; // DISCARD-IMERGE-2
+ remove non-ranges from B;
+ return new index_merge(A, B);
+ }
+ }
+ else if (A is range tree and B is index_merge tree (or vice versa))
+ {
+ Perform this transformation:
+
+ range_treeA // this is A
+ OR
+ (range_treeB_11 OR range_treeB_12 OR ... OR range_treeB_1N) AND
+ (range_treeB_21 OR range_treeB_22 OR ... OR range_treeB_2N) AND
+ ...
+ (range_treeB_K1 OR range_treeB_K2 OR ... OR range_treeB_kN) AND
+ =
+ (range_treeA OR range_treeB_11 OR ... OR range_treeB_1N) AND
+ (range_treeA OR range_treeB_21 OR ... OR range_treeB_2N) AND
+ ...
+ (range_treeA OR range_treeB_11 OR ... OR range_treeB_1N) AND
+
+ Now each line represents an index_merge..
+ }
+ else if (both A and B are index_merge trees)
+ {
+ Perform this transformation:
+
+ imergeA1 AND imergeA2 AND ... AND imergeAN
+ OR
+ imergeB1 AND imergeB2 AND ... AND imergeBN
+
+ -> (discard all imergeA{i=2,3,...} -> // DISCARD-IMERGE-3
+
+ imergeA1
+ OR
+ imergeB1 AND imergeB2 AND ... AND imergeBN =
+
+ = (combine imergeA1 with each of the imergeB{i} ) =
+
+ combine(imergeA1 OR imergeB1) AND
+ combine(imergeA1 OR imergeB2) AND
+ ... AND
+ combine(imergeA1 OR imergeBN)
+ }
+ }
+
+1.1. Problems in the current implementation
+-------------------------------------------
+As marked in the code above:
+
+DISCARD-IMERGE-1 step will cause index_merge option to be discarded when
+the WHERE clause has this form:
+
+ (t.key1=c1 OR t.key2=c2) AND t.badkey < c3
+
+DISCARD-IMERGE-2 step will cause index_merge option to be discarded when
+the WHERE clause has this form (conditions t.badkey may have abritrary form):
+
+ (t.badkey<c1 AND t.key1=c1) OR (t.key1=c2 AND t.badkey < c2)
+
+DISCARD-IMERGE-3 manifests itself as the following effect: suppose there are
+two indexes:
+
+ INDEX i1(col1, col2),
+ INDEX i2(col1, col3)
+
+and this WHERE clause:
+
+ col1=c1 AND (col2=c2 OR col3=c3)
+
+The optimizer will generate the plans that only use the "col1=c1" part. The
+right side of the AND will be ignored even if it has good selectivity.
+
+
+2. New implementation
+=====================
+
+<general idea>
+* Don't start fighting combinatorial explosion until we've actually got one.
+</>
+
+SEL_TREE structure will be now able to hold both index_merge and range scan
+candidates at the same time. That is,
+
+ sel_tree2 = range_tree AND imerge_tree
+
+where both parts are optional (i.e. can be empty)
+
+Operations on SEL_ARG trees will be modified to produce/process the trees of
+this kind:
+
+2.1 New tree_and()
+------------------
+In order not to lose plans, we'll make these changes:
+
+1. Don't remove index_merge part of the tree.
+
+2. Push range conditions down into index_merge trees that may support them.
+ if one tree has range(key1) and the other tree has imerge(key1 OR key2)
+ then perform an equvalent of this operation:
+
+ rangeA(key1) AND ( rangeB(key1) OR rangeB(key2)) =
+
+ (rangeA(key1) AND rangeB(key1)) OR (rangeA(key1) AND rangeB(key2))
+
+3. Just as before: if both sel_tree A and sel_tree B have index_merge options,
+ concatenate them together.
+
+2.2 New tree_or()
-=-=(Psergey - Wed, 03 Jun 2009, 12:09)=-=-
Dependency created: 30 now depends on 24
-=-=(Guest - Mon, 01 Jun 2009, 23:30)=-=-
High-Level Specification modified.
--- /tmp/wklog.24.old.21580 2009-06-01 23:30:06.000000000 +0300
+++ /tmp/wklog.24.new.21580 2009-06-01 23:30:06.000000000 +0300
@@ -64,6 +64,9 @@
* How strict is the limitation on the form of the WHERE?
+* Which version should this be based on? 5.1? Which patches are should be in
+ (google's/percona's/maria/etc?)
+
* TODO: The optimizer didn't compare costs of index_merge and range before (ok
it did but that was done for accesses to different tables). Will there be any
possible gotchas here?
-=-=(Guest - Wed, 27 May 2009, 13:59)=-=-
Title modified.
--- /tmp/wklog.24.old.9498 2009-05-27 13:59:23.000000000 +0300
+++ /tmp/wklog.24.new.9498 2009-05-27 13:59:23.000000000 +0300
@@ -1 +1 @@
-index_merge optimizer: dont discard index_merge union strategies when range is available
+index_merge: fair choice between index_merge union and range access
------------------------------------------------------------
-=-=(View All Progress Notes, 11 total)=-=-
http://askmonty.org/worklog/index.pl?tid=24&nolimit=1
DESCRIPTION:
The current range optimizer will discard possible index_merge/[sort]union
strategies when there is a possible range plan. This is part of the
measures we take to avoid a combinatorial explosion of possible range/
index_merge strategies.
A bad side effect of this is that for WHERE clauses of the form
  t.key1='very-frequent-value' AND (t.key2='rare-value1' OR t.key3='rare-value2')
the optimizer will
- discard union(key2,key3) in favor of range(key1)
- consider the cost of using range(key1) and discard that plan as well
and the overall effect is that a potentially poor range access will cause a
potentially good index_merge access not to be considered.
This WL is about lifting this limitation, at least for some subset of WHERE
clauses.
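As an illustration of the pattern above (a hypothetical sketch; the table,
indexes and constants are invented for the example):

  -- key1 holds a very frequent value; the key2/key3 values below are rare.
  CREATE TABLE t (
    key1 VARCHAR(32), key2 VARCHAR(32), key3 VARCHAR(32),
    INDEX (key1), INDEX (key2), INDEX (key3)
  );

  EXPLAIN SELECT * FROM t
  WHERE t.key1 = 'very-frequent-value'
    AND (t.key2 = 'rare-value1' OR t.key3 = 'rare-value2');

  -- Current behaviour: the union(key2,key3) candidate is discarded early in
  -- favour of range(key1), even though the union would read far fewer rows.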
HIGH-LEVEL SPECIFICATION:
(Not a finished HLS, just a draft)
<contents>
Solution overview
Limitations
TODO
</contents>
Solution overview
=================
The idea is to delay discarding potential index_merge plans until the point
where it is really necessary.
This way, we won't have to make many changes in the range analyzer, but will be
able to keep potential index_merge plans around just long enough that they can
be taken into consideration together with range access plans.
Since there are no changes in the optimizer, the ability to consider both
range and index_merge options will be limited to WHERE clauses of this form:
  WHERE := range_cond(key1_1) AND
           range_cond(key2_1) AND
           other_cond AND
           index_merge_OR_cond1(key3_1, key3_2, ...)
           index_merge_OR_cond2(key4_1, key4_2, ...)

  where
    index_merge_OR_cond{N} := (range_cond(keyN_1) OR
                               range_cond(keyN_2) OR ...)

    range_cond(keyX) := condition that allows to construct range access of keyX
                        and doesn't allow to construct range/index_merge accesses
                        for any keys of the table in question.
For such WHERE clauses, the range analyzer will produce SEL_TREE of this form:
  SEL_TREE(
    range(key1_1),
    ...
    range(key2_1),
    SEL_IMERGE(                                    (1)
      SEL_TREE(key3_1)
      SEL_TREE(key3_2)
      ...
    )
    ...
  )
which can be used to make a cost-based choice between range and index_merge.
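For instance, a query of the following shape fits the restricted form (a
hypothetical sketch; table and column names are invented, and every referenced
column except status is assumed to have an index on it):

  SELECT * FROM orders
  WHERE  customer_id = 10                           -- range_cond(key1_1)
    AND  order_date  = '2009-05-26'                 -- range_cond(key2_1)
    AND  status <> 'cancelled'                      -- other_cond
    AND  (promo_code = 'X1' OR referrer_id = 42);   -- index_merge_OR_cond1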
Limitations
-----------
This will not be a full solution, in the sense that the range analyzer will not
be able to produce sel_tree (1) if the WHERE clause is written in another form
(e.g. with the brackets expanded).
TODO
----
* is it a problem if there are keys that are referred to both from
index_merge and from range access?
* How strict is the limitation on the form of the WHERE?
* Which version should this be based on? 5.1? Which patches should be in
(google's/percona's/maria/etc?)
* TODO: The optimizer didn't compare costs of index_merge and range before (ok
it did but that was done for accesses to different tables). Will there be any
possible gotchas here?
LOW-LEVEL DESIGN:
<contents>
1. Current implementation overview
1.1. Problems in the current implementation
2. New implementation
2.1 New tree_and()
2.2 New tree_or()
3. Testing and required coverage
</contents>
1. Current implementation overview
==================================
At the moment, the range analyzer works as follows:

The SEL_TREE structure represents:

  # There are sel_trees; a sel_tree is either a range tree or a merge tree
  sel_tree = range_tree | imerge_tree

  # a range tree has range access options, possibly for several keys
  range_tree = range(key1) AND range(key2) AND ... AND range(keyN);
    (here range(keyi) may represent ranges not for initial keyi prefixes,
     but ranges for any infixes of keyi)

  # a merge tree represents several ways to do index_merge
  imerge_tree = imerge1 AND imerge2 AND ...

  # a way to do index merge == a set of different indexes to use.
  imergeX = range_tree1 OR range_tree2 OR ..
    where no pair of range_treeX have ranges over the same index.
  tree_and(A, B)
  {
    if (both A and B are range trees)
      return a range_tree with computed intersection for each range;
    if (only one of A and B is a range tree)
      return that tree; // DISCARD-IMERGE-1
    // at this point both trees are index_merge trees
    return concat_lists( A.imerge1 ... A.imergeN, B.imerge1 ... B.imergeN);
  }

  tree_or(A, B)
  {
    if (A and B are range trees)
    {
      R = new range_tree;
      for each index i
        R.add(range_union(A.range(i), B.range(i)));

      if (R has at least one range access)
        return R; // DISCARD-IMERGE-2
      else
      {
        /* could not build any range accesses. construct index_merge */
        remove non-ranges from A;
        remove non-ranges from B;
        return new index_merge(A, B); // DISCARD-IMERGE-3
      }
    }
    else if (A is range tree and B is index_merge tree (or vice versa))
    {
      Perform this transformation:

        range_treeA // this is A
        OR
        (range_treeB_11 OR range_treeB_12 OR ... OR range_treeB_1N) AND
        (range_treeB_21 OR range_treeB_22 OR ... OR range_treeB_2N) AND
        ...
        (range_treeB_K1 OR range_treeB_K2 OR ... OR range_treeB_kN)
        =
        (range_treeA OR range_treeB_11 OR ... OR range_treeB_1N) AND
        (range_treeA OR range_treeB_21 OR ... OR range_treeB_2N) AND
        ...
        (range_treeA OR range_treeB_11 OR ... OR range_treeB_1N)

      Now each line represents an index_merge.
    }
    else if (both A and B are index_merge trees)
    {
      Perform this transformation:

        imergeA1 AND imergeA2 AND ... AND imergeAN
        OR
        imergeB1 AND imergeB2 AND ... AND imergeBN

        -> (discard all imergeA{i=2,3,...}) -> // DISCARD-IMERGE-4

        imergeA1
        OR
        imergeB1 =

        = (combine imergeA1 with each of the range_treeB_1{i} ) =

        combine(imergeA1 OR range_treeB_11) AND
        combine(imergeA1 OR range_treeB_12) AND
        ... AND
        combine(imergeA1 OR range_treeB_1N)
    }
  }
1.1. Problems in the current implementation
-------------------------------------------
As marked in the code above:
DISCARD-IMERGE-1 step will cause index_merge option to be discarded when
the WHERE clause has this form:
(t.key1=c1 OR t.key2=c2) AND t.badkey < c3
DISCARD-IMERGE-2 step will cause index_merge option to be discarded when
the WHERE clause has this form (the conditions on t.badkey may have arbitrary form):
(t.badkey<c1 AND t.key1=c1) OR (t.key2=c2 AND t.badkey < c2)
DISCARD-IMERGE-3 manifests itself as the following effect: suppose there are
two indexes:
INDEX i1(col1, col2),
INDEX i2(col1, col3)
and this WHERE clause:
col1=c1 AND (col2=c2 OR col3=c3)
The optimizer will generate the plans that only use the "col1=c1" part. The
right side of the AND will be ignored even if it has good selectivity.
(Here no imerge for col2=c2 OR col3=c3 will be built since neither col2=c2 nor
col3=c3 represent index ranges.)
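A concrete way to observe the DISCARD-IMERGE-3 effect (a hypothetical sketch;
the table definition and constants are invented):

  CREATE TABLE t3 (
    col1 INT, col2 INT, col3 INT,
    INDEX i1 (col1, col2),
    INDEX i2 (col1, col3)
  );

  -- With the current implementation, EXPLAIN is expected to show a plan using
  -- only the col1=1 prefix; no index_merge over (col2=2 OR col3=3) is built.
  EXPLAIN SELECT * FROM t3
  WHERE col1 = 1 AND (col2 = 2 OR col3 = 3);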
2. New implementation
=====================
<general idea>
* Don't start fighting combinatorial explosion until we've actually got one.
</>
The SEL_TREE structure will now be able to hold both index_merge and range scan
candidates at the same time. That is,

  sel_tree2 = range_tree AND imerge_tree

where both parts are optional (i.e. can be empty).
Operations on SEL_ARG trees will be modified to produce/process the trees of
this kind:
2.1 New tree_and()
------------------
In order not to lose plans, we'll make these changes:
A1. Don't remove the index_merge part of the tree (this will take care of the
    DISCARD-IMERGE-1 problem).

A2. Push range conditions down into index_merge trees that may support them.
    If one tree has range(key1) and the other tree has imerge(key1 OR key2),
    then perform an equivalent of this operation:

      rangeA(key1) AND ( rangeB(key1) OR rangeB(key2)) =
      (rangeA(key1) AND rangeB(key1)) OR (rangeA(key1) AND rangeB(key2))

A3. Just as before: if both sel_tree A and sel_tree B have index_merge options,
    concatenate them together.
2.2 New tree_or()
-----------------
O1. Don't remove non-range plans:
    The current tree_or() code will refuse to produce index_merge plans for
    conditions like

      "t.key1part2=const OR t.key2part1=const"

    (this is marked as DISCARD-IMERGE-3). This was justified, as the left part of
    the AND condition is not usable for range access, and the operation of
    tree_and() guaranteed that there was no way it could be changed to make a
    usable range plan. With the new tree_and() and rule A2, this is no longer the
    case. For example, for this query:

      (t.key1part2=const OR t.key2part1=const) AND t.key1part1=const

    it will construct

      imerge(t.key1part2=const OR t.key2part1=const), range(t.key1part1=const)

    then tree_and() will apply rule A2 to push the range down into the index
    merge, and after that we'll have:

      range(t.key1part1=const)
      imerge(
        t.key1part2=const AND t.key1part1=const,
        t.key2part1=const
      )

    Note that imerge(...) describes a usable index_merge plan, and it's possible
    that it will be the best access path.
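    A hypothetical concrete form of the query above (table, index names and
    constants are invented for illustration):

      CREATE TABLE t (
        key1part1 INT, key1part2 INT, key2part1 INT,
        INDEX key1 (key1part1, key1part2),
        INDEX key2 (key2part1)
      );

      -- With rule A2, the range on key1part1 is pushed into the OR, so an
      -- index_merge over (key1part1=3 AND key1part2=5) and (key2part1=7)
      -- becomes available alongside range(key1part1=3).
      SELECT * FROM t
      WHERE (t.key1part2 = 5 OR t.key2part1 = 7) AND t.key1part1 = 3;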
O2. "Create index_merge accesses when possible"
    The current tree_or() will not create an index_merge access when it could
    create a non-index_merge access (see DISCARD-IMERGE-2 and its example in the
    "Problems in the current implementation" section). This will be changed to
    work as follows: we will create an index_merge made of the index scans that
    didn't have a match in the other sel_tree.
    Illustrating it with an example:

          | sel_tree_A | sel_tree_B | A or B | include in index_merge?
    ------+------------+------------+--------+------------------------
    key1  | cond1      | cond2      | condM  | no
    key2  | cond3      | cond4      | NULL   | no
    key3  | cond5      |            |        | yes, A-side
    key4  | cond6      |            |        | yes, A-side
    key5  |            | cond7      |        | yes, B-side
    key6  |            | cond8      |        | yes, B-side

    Here we assume that
    - (cond1 OR cond2) did produce a combined range, so we are not including
      them in the index_merge.
    - (cond3 OR cond4) didn't produce a usable range (e.g. they were
      t.key1part1=c1 AND t.key1part2=c1, respectively, and combining them
      didn't yield any range list).
    - All other scans didn't have their counterparts, so we'll end up with a
      SEL_TREE of:

        range(condM) AND index_merge((cond5 AND cond6), (cond7 AND cond8))

O4. There is no O4. DISCARD-INDEX-MERGE-4 will remain there. The idea is
that although DISCARD-INDEX-MERGE-4 does discard plans, so far we haven't
seen any complaints that could be attributed to it.
If we face the need to lift DISCARD-INDEX-MERGE-4, our answer will be to
lift it, and produce a cross-product:
  ((key1p OR key2p) AND (key3p OR key4p))
  OR
  ((key5p OR key6p) AND (key7p OR key8p))

  = (key1p OR key2p OR key5p OR key6p) AND  // this part is currently
    (key3p OR key4p OR key5p OR key6p) AND  // produced

    (key1p OR key2p OR key5p OR key6p) AND  // this part will be added
    (key3p OR key4p OR key5p OR key6p)      //.
In order to limit the impact of this combinatorial explosion, we will
introduce a rule that we won't generate more than #defined
MAX_IMERGE_OPTS options.
3. Testing and required coverage
================================
So far, the following use cases have been found:
* BUG#17259: Query optimizer chooses wrong index
* BUG#17673: Optimizer does not use Index Merge optimization in some cases
* BUG#23322: Optimizer sometimes erroniously prefers other index over index merge
* BUG#30151: optimizer is very reluctant to chose index_merge algorithm
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v4.0.0)

[Maria-developers] WL#24 Updated (by Sergei): index_merge: fair choice between index_merge union and range access
by worklog-noreply@askmonty.org 29 Jun '10
by worklog-noreply@askmonty.org 29 Jun '10
29 Jun '10
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: index_merge: fair choice between index_merge union and range access
CREATION DATE..: Tue, 26 May 2009, 12:10
SUPERVISOR.....: Monty
IMPLEMENTOR....:
COPIES TO......: Psergey
CATEGORY.......: Server-RawIdeaBin
TASK ID........: 24 (http://askmonty.org/worklog/?tid=24)
VERSION........: Server-9.x
STATUS.........: Un-Assigned
PRIORITY.......: 60
WORKED HOURS...: 0
ESTIMATE.......: 0 (hours remain)
ORIG. ESTIMATE.: 0
PROGRESS NOTES:
-=-=(Sergei - Tue, 29 Jun 2010, 14:00)=-=-
Category updated.
--- /tmp/wklog.24.old.31772 2010-06-29 14:00:05.000000000 +0000
+++ /tmp/wklog.24.new.31772 2010-06-29 14:00:05.000000000 +0000
@@ -1 +1 @@
-Server-Sprint
+Server-RawIdeaBin
-=-=(Guest - Sun, 16 Aug 2009, 02:13)=-=-
Low Level Design modified.
--- /tmp/wklog.24.old.23383 2009-08-16 02:13:54.000000000 +0300
+++ /tmp/wklog.24.new.23383 2009-08-16 02:13:54.000000000 +0300
@@ -125,7 +125,7 @@
The optimizer will generate the plans that only use the "col1=c1" part. The
right side of the AND will be ignored even if it has good selectivity.
-(Here an imerge for col2=c2 OR col3=c3 won't be built since neither col2=c2 nor
+(Here no imerge for col2=c2 OR col3=c3 will be built since neither col2=c2 nor
col3=c3 represent index ranges.)
@@ -199,7 +199,7 @@
O2. "Create index_merge accesses when possible"
Current tree_or() will not create index_merge access when it could create
- non-index merge access (see DISCARD-IMERGE-3 and its example in the "Problems
+ non-index merge access (see DISCARD-IMERGE-2 and its example in the "Problems
in the current implementation" section). This will be changed to work as
follows: we will create index_merge made for index scans that didn't have
their match in the other sel_tree.
-=-=(Guest - Sun, 16 Aug 2009, 01:03)=-=-
Low Level Design modified.
--- /tmp/wklog.24.old.20767 2009-08-16 01:03:11.000000000 +0300
+++ /tmp/wklog.24.new.20767 2009-08-16 01:03:11.000000000 +0300
@@ -18,6 +18,8 @@
# a range tree has range access options, possibly for several keys
range_tree = range(key1) AND range(key2) AND ... AND range(keyN);
+ (here range(keyi) may represent ranges not for initial keyi prefixes,
+ but ranges for any infixes for keyi)
# merge tree represents several way to index_merge
imerge_tree = imerge1 AND imerge2 AND ...
@@ -47,13 +49,13 @@
R.add(range_union(A.range(i), B.range(i)));
if (R has at least one range access)
- return R;
+ return R; // DISCARD-IMERGE-2
else
{
/* could not build any range accesses. construct index_merge */
- remove non-ranges from A; // DISCARD-IMERGE-2
+ remove non-ranges from A;
remove non-ranges from B;
- return new index_merge(A, B);
+ return new index_merge(A, B); // DISCARD-IMERGE-3
}
}
else if (A is range tree and B is index_merge tree (or vice versa))
@@ -65,12 +67,12 @@
(range_treeB_11 OR range_treeB_12 OR ... OR range_treeB_1N) AND
(range_treeB_21 OR range_treeB_22 OR ... OR range_treeB_2N) AND
...
- (range_treeB_K1 OR range_treeB_K2 OR ... OR range_treeB_kN) AND
+ (range_treeB_K1 OR range_treeB_K2 OR ... OR range_treeB_kN)
=
(range_treeA OR range_treeB_11 OR ... OR range_treeB_1N) AND
(range_treeA OR range_treeB_21 OR ... OR range_treeB_2N) AND
...
- (range_treeA OR range_treeB_11 OR ... OR range_treeB_1N) AND
+ (range_treeA OR range_treeB_11 OR ... OR range_treeB_1N)
Now each line represents an index_merge..
}
@@ -82,18 +84,18 @@
OR
imergeB1 AND imergeB2 AND ... AND imergeBN
- -> (discard all imergeA{i=2,3,...} -> // DISCARD-IMERGE-3
+ -> (discard all imergeA{i=2,3,...} -> // DISCARD-IMERGE-4
imergeA1
OR
- imergeB1 AND imergeB2 AND ... AND imergeBN =
+ imergeB1 =
- = (combine imergeA1 with each of the imergeB{i} ) =
+ = (combine imergeA1 with each of the range_treeB_1{i} ) =
- combine(imergeA1 OR imergeB1) AND
- combine(imergeA1 OR imergeB2) AND
+ combine(imergeA1 OR range_treeB_11) AND
+ combine(imergeA1 OR range_treeB_12) AND
... AND
- combine(imergeA1 OR imergeBN)
+ combine(imergeA1 OR range_treeB_1N)
}
}
@@ -109,7 +111,7 @@
DISCARD-IMERGE-2 step will cause index_merge option to be discarded when
the WHERE clause has this form (conditions t.badkey may have abritrary form):
- (t.badkey<c1 AND t.key1=c1) OR (t.key1=c2 AND t.badkey < c2)
+ (t.badkey<c1 AND t.key1=c1) OR (t.key2=c2 AND t.badkey < c2)
DISCARD-IMERGE-3 manifests itself as the following effect: suppose there are
two indexes:
@@ -123,6 +125,8 @@
The optimizer will generate the plans that only use the "col1=c1" part. The
right side of the AND will be ignored even if it has good selectivity.
+(Here an imerge for col2=c2 OR col3=c3 won't be built since neither col2=c2 nor
+col3=c3 represent index ranges.)
2. New implementation
-=-=(Guest - Mon, 20 Jul 2009, 17:13)=-=-
Dependency deleted: 30 no longer depends on 24
-=-=(Guest - Sat, 20 Jun 2009, 09:34)=-=-
Low Level Design modified.
--- /tmp/wklog.24.old.21663 2009-06-20 09:34:48.000000000 +0300
+++ /tmp/wklog.24.new.21663 2009-06-20 09:34:48.000000000 +0300
@@ -4,6 +4,7 @@
2. New implementation
2.1 New tree_and()
2.2 New tree_or()
+3. Testing and required coverage
</contents>
1. Current implementation overview
@@ -240,3 +241,14 @@
In order to limit the impact of this combinatorial explosion, we will
introduce a rule that we won't generate more than #defined
MAX_IMERGE_OPTS options.
+
+3. Testing and required coverage
+================================
+So far could find the following user cases:
+
+* BUG#17259: Query optimizer chooses wrong index
+* BUG#17673: Optimizer does not use Index Merge optimization in some cases
+* BUG#23322: Optimizer sometimes erroniously prefers other index over index merge
+* BUG#30151: optimizer is very reluctant to chose index_merge algorithm
+
+
-=-=(Guest - Thu, 18 Jun 2009, 16:55)=-=-
Low Level Design modified.
--- /tmp/wklog.24.old.19152 2009-06-18 16:55:00.000000000 +0300
+++ /tmp/wklog.24.new.19152 2009-06-18 16:55:00.000000000 +0300
@@ -141,13 +141,15 @@
Operations on SEL_ARG trees will be modified to produce/process the trees of
this kind:
+
2.1 New tree_and()
------------------
In order not to lose plans, we'll make these changes:
-1. Don't remove index_merge part of the tree.
+A1. Don't remove index_merge part of the tree (this will take care of
+ DISCARD-IMERGE-1 problem)
-2. Push range conditions down into index_merge trees that may support them.
+A2. Push range conditions down into index_merge trees that may support them.
if one tree has range(key1) and the other tree has imerge(key1 OR key2)
then perform an equvalent of this operation:
@@ -155,8 +157,86 @@
(rangeA(key1) AND rangeB(key1)) OR (rangeA(key1) AND rangeB(key2))
-3. Just as before: if both sel_tree A and sel_tree B have index_merge options,
+A3. Just as before: if both sel_tree A and sel_tree B have index_merge options,
concatenate them together.
-2.2 New tree_or()
+2.2 New tree_or()
+-----------------
+O1. Dont remove non-range plans:
+ Current tree_or() code will refuse to produce index_merge plans for
+ conditions like
+
+ "t.key1part2=const OR t.key2part1=const"
+
+ (this is marked as DISCARD-IMERGE-3). This was justifed as the left part of
+ the AND condition is not usable for range access, and the operation of
+ tree_and() guaranteed that there was no way it could changed to make a
+ usable range plan. With new tree_and() and rule A2, this is no longer the
+ case. For example for this query:
+
+ (t.key1part2=const OR t.key2part1=const) AND t.key1part1=const
+
+ it will construct a
+
+ imerge(t.key1part2=const OR t.key2part1=const), range(t.key1part1=const)
+
+ then tree_and() will apply rule A2 to push the range down into index merge
+ and after that we'll have:
+
+ range(t.key1part1=const)
+ imerge(
+ t.key1part2=const AND t.key1part1=const,
+ t.key2part1=const
+ )
+ note that imerge(...) describes a usable index_merge plan and it's possible
+ that it will be the best access path.
+
+O2. "Create index_merge accesses when possible"
+ Current tree_or() will not create index_merge access when it could create
+ non-index merge access (see DISCARD-IMERGE-3 and its example in the "Problems
+ in the current implementation" section). This will be changed to work as
+ follows: we will create index_merge made for index scans that didn't have
+ their match in the other sel_tree.
+ Ilustrating it with an example:
+
+ | sel_tree_A | sel_tree_B | A or B | include in index_merge?
+ ------+------------+------------+--------+------------------------
+ key1 | cond1 | cond2 | condM | no
+ key2 | cond3 | cond4 | NULL | no
+ key3 | cond5 | | | yes, A-side
+ key4 | cond6 | | | yes, A-side
+ key5 | | cond7 | | yes, B-side
+ key6 | | cond8 | | yes, B-side
+
+ here we assume that
+ - (cond1 OR cond2) did produce a combined range. Not including them in
+ index_merge.
+ - (cond3 OR cond4) didn't produce a usable range (e.g. they were
+ t.key1part1=c1 AND t.key1part2=c1, respectively, and combining them
+ didn't yield any range list)
+ - All other scand didn't have their counterparts, so we'll end up with a
+ SEL_TREE of:
+
+ range(condM) AND index_merge((cond5 AND cond6),(cond7 AND cond8))
+ .
+
+O4. There is no O4. DISCARD-INDEX-MERGE-4 will remain there. The idea is
+that although DISCARD-INDEX-MERGE-4 does discard plans, so far we haven
+seen any complaints that could be attributed to it.
+If we face the need to lift DISCARD-INDEX-MERGE-4, our answer will be to
+lift it ,and produce a cross-product:
+
+ ((key1p OR key2p) AND (key3p OR key4p))
+ OR
+ ((key5p OR key6p) AND (key7p OR key8p))
+
+ = (key1p OR key2p OR key5p OR key6p) AND // this part is currently
+ (key3p OR key4p OR key5p OR key6p) AND // produced
+
+ (key1p OR key2p OR key5p OR key6p) AND // this part will be added
+ (key3p OR key4p OR key5p OR key6p) //.
+
+In order to limit the impact of this combinatorial explosion, we will
+introduce a rule that we won't generate more than #defined
+MAX_IMERGE_OPTS options.
-=-=(Guest - Thu, 18 Jun 2009, 14:56)=-=-
Low Level Design modified.
--- /tmp/wklog.24.old.15612 2009-06-18 14:56:09.000000000 +0300
+++ /tmp/wklog.24.new.15612 2009-06-18 14:56:09.000000000 +0300
@@ -1 +1,162 @@
+<contents>
+1. Current implementation overview
+1.1. Problems in the current implementation
+2. New implementation
+2.1 New tree_and()
+2.2 New tree_or()
+</contents>
+
+1. Current implementation overview
+==================================
+At the moment, range analyzer works as follows:
+
+SEL_TREE structure represents
+
+ # There are sel_trees, a sel_tree is either range or merge tree
+ sel_tree = range_tree | imerge_tree
+
+ # a range tree has range access options, possibly for several keys
+ range_tree = range(key1) AND range(key2) AND ... AND range(keyN);
+
+ # merge tree represents several way to index_merge
+ imerge_tree = imerge1 AND imerge2 AND ...
+
+ # a way to do index merge == a set to use of different indexes.
+ imergeX = range_tree1 OR range_tree2 OR ..
+ where no pair of range_treeX have ranges over the same index.
+
+
+ tree_and(A, B)
+ {
+ if (both A and B are range trees)
+ return a range_tree with computed intersection for each range;
+ if (only one of A and B is a range tree)
+ return that tree; // DISCARD-IMERGE-1
+ // at this point both trees are index_merge trees
+ return concat_lists( A.imerge1 ... A.imergeN, B.imerge1 ... B.imergeN);
+ }
+
+
+ tree_or(A, B)
+ {
+ if (A and B are range trees)
+ {
+ R = new range_tree;
+ for each index i
+ R.add(range_union(A.range(i), B.range(i)));
+
+ if (R has at least one range access)
+ return R;
+ else
+ {
+ /* could not build any range accesses. construct index_merge */
+ remove non-ranges from A; // DISCARD-IMERGE-2
+ remove non-ranges from B;
+ return new index_merge(A, B);
+ }
+ }
+ else if (A is range tree and B is index_merge tree (or vice versa))
+ {
+ Perform this transformation:
+
+ range_treeA // this is A
+ OR
+ (range_treeB_11 OR range_treeB_12 OR ... OR range_treeB_1N) AND
+ (range_treeB_21 OR range_treeB_22 OR ... OR range_treeB_2N) AND
+ ...
+ (range_treeB_K1 OR range_treeB_K2 OR ... OR range_treeB_kN) AND
+ =
+ (range_treeA OR range_treeB_11 OR ... OR range_treeB_1N) AND
+ (range_treeA OR range_treeB_21 OR ... OR range_treeB_2N) AND
+ ...
+ (range_treeA OR range_treeB_11 OR ... OR range_treeB_1N) AND
+
+ Now each line represents an index_merge..
+ }
+ else if (both A and B are index_merge trees)
+ {
+ Perform this transformation:
+
+ imergeA1 AND imergeA2 AND ... AND imergeAN
+ OR
+ imergeB1 AND imergeB2 AND ... AND imergeBN
+
+ -> (discard all imergeA{i=2,3,...} -> // DISCARD-IMERGE-3
+
+ imergeA1
+ OR
+ imergeB1 AND imergeB2 AND ... AND imergeBN =
+
+ = (combine imergeA1 with each of the imergeB{i} ) =
+
+ combine(imergeA1 OR imergeB1) AND
+ combine(imergeA1 OR imergeB2) AND
+ ... AND
+ combine(imergeA1 OR imergeBN)
+ }
+ }
+
+1.1. Problems in the current implementation
+-------------------------------------------
+As marked in the code above:
+
+DISCARD-IMERGE-1 step will cause index_merge option to be discarded when
+the WHERE clause has this form:
+
+ (t.key1=c1 OR t.key2=c2) AND t.badkey < c3
+
+DISCARD-IMERGE-2 step will cause index_merge option to be discarded when
+the WHERE clause has this form (conditions t.badkey may have abritrary form):
+
+ (t.badkey<c1 AND t.key1=c1) OR (t.key1=c2 AND t.badkey < c2)
+
+DISCARD-IMERGE-3 manifests itself as the following effect: suppose there are
+two indexes:
+
+ INDEX i1(col1, col2),
+ INDEX i2(col1, col3)
+
+and this WHERE clause:
+
+ col1=c1 AND (col2=c2 OR col3=c3)
+
+The optimizer will generate the plans that only use the "col1=c1" part. The
+right side of the AND will be ignored even if it has good selectivity.
+
+
+2. New implementation
+=====================
+
+<general idea>
+* Don't start fighting combinatorial explosion until we've actually got one.
+</>
+
+SEL_TREE structure will be now able to hold both index_merge and range scan
+candidates at the same time. That is,
+
+ sel_tree2 = range_tree AND imerge_tree
+
+where both parts are optional (i.e. can be empty)
+
+Operations on SEL_ARG trees will be modified to produce/process the trees of
+this kind:
+
+2.1 New tree_and()
+------------------
+In order not to lose plans, we'll make these changes:
+
+1. Don't remove index_merge part of the tree.
+
+2. Push range conditions down into index_merge trees that may support them.
+ if one tree has range(key1) and the other tree has imerge(key1 OR key2)
+ then perform an equvalent of this operation:
+
+ rangeA(key1) AND ( rangeB(key1) OR rangeB(key2)) =
+
+ (rangeA(key1) AND rangeB(key1)) OR (rangeA(key1) AND rangeB(key2))
+
+3. Just as before: if both sel_tree A and sel_tree B have index_merge options,
+ concatenate them together.
+
+2.2 New tree_or()
-=-=(Psergey - Wed, 03 Jun 2009, 12:09)=-=-
Dependency created: 30 now depends on 24
-=-=(Guest - Mon, 01 Jun 2009, 23:30)=-=-
High-Level Specification modified.
--- /tmp/wklog.24.old.21580 2009-06-01 23:30:06.000000000 +0300
+++ /tmp/wklog.24.new.21580 2009-06-01 23:30:06.000000000 +0300
@@ -64,6 +64,9 @@
* How strict is the limitation on the form of the WHERE?
+* Which version should this be based on? 5.1? Which patches are should be in
+ (google's/percona's/maria/etc?)
+
* TODO: The optimizer didn't compare costs of index_merge and range before (ok
it did but that was done for accesses to different tables). Will there be any
possible gotchas here?
-=-=(Guest - Wed, 27 May 2009, 13:59)=-=-
Title modified.
--- /tmp/wklog.24.old.9498 2009-05-27 13:59:23.000000000 +0300
+++ /tmp/wklog.24.new.9498 2009-05-27 13:59:23.000000000 +0300
@@ -1 +1 @@
-index_merge optimizer: dont discard index_merge union strategies when range is available
+index_merge: fair choice between index_merge union and range access
------------------------------------------------------------
-=-=(View All Progress Notes, 11 total)=-=-
http://askmonty.org/worklog/index.pl?tid=24&nolimit=1
DESCRIPTION:
Current range optimizer will discard possible index_merge/[sort]union
strategies when there is a possible range plan. This action is a part of
measures we take to avoid combinatorial explosion of possible range/
index_merge strategies.
A bad side effect of this is that for WHERE clauses in form
t.key1= 'very-frequent-value' AND (t.key2='rare-value1' OR t.key3='rare-value2')
the optimizer will
- discard union(key2,key3) in favor of range(key1)
- consider costs of using range(key1) and discard that plan also
and the overall effect is that possible poor range access will cause possible
good index_merge access not to be considered.
This WL is to about lifting this limitation at least for some subset of WHERE
clauses.
HIGH-LEVEL SPECIFICATION:
(Not a ready HLS but draft)
<contents>
Solution overview
Limitations
TODO
</contents>
Solution overview
=================
The idea is to delay discarding potential index_merge plans until the point
where it is really necessary.
This way, we won't have to do much changes in the range analyzer, but will be
able to keep potential index_merge plan just enough so that it's possible to
take it into consideration together with range access plans.
Since there are no changes in the optimizer, the ability to consider both
range and index_merge options will be limited to WHERE clauses of this form:
WHERE := range_cond(key1_1) AND
range_cond(key2_1) AND
other_cond AND
index_merge_OR_cond1(key3_1, key3_2, ...)
index_merge_OR_cond2(key4_1, key4_2, ...)
where
index_merge_OR_cond{N} := (range_cond(keyN_1) OR
range_cond(keyN_2) OR ...)
range_cond(keyX) := condition that allows to construct range access of keyX
and doesn't allow to construct range/index_merge accesses
for any keys of the table in question.
For such WHERE clauses, the range analyzer will produce SEL_TREE of this form:
SEL_TREE(
range(key1_1),
...
range(key2_1),
SEL_IMERGE( (1)
SEL_TREE(key3_1})
SEL_TREE(key3_2})
...
)
...
)
which can be used to make a cost-based choice between range and index_merge.
Limitations
-----------
This will not be a full solution in a sense that the range analyzer will not
be able to produce sel_tree (1) if the WHERE clause is specified in other form
(e.g. brackets were opened).
TODO
----
* is it a problem if there are keys that are referred to both from
index_merge and from range access?
* How strict is the limitation on the form of the WHERE?
* Which version should this be based on? 5.1? Which patches are should be in
(google's/percona's/maria/etc?)
* TODO: The optimizer didn't compare costs of index_merge and range before (ok
it did but that was done for accesses to different tables). Will there be any
possible gotchas here?
LOW-LEVEL DESIGN:
<contents>
1. Current implementation overview
1.1. Problems in the current implementation
2. New implementation
2.1 New tree_and()
2.2 New tree_or()
3. Testing and required coverage
</contents>
1. Current implementation overview
==================================
At the moment, range analyzer works as follows:
SEL_TREE structure represents
# There are sel_trees, a sel_tree is either range or merge tree
sel_tree = range_tree | imerge_tree
# a range tree has range access options, possibly for several keys
range_tree = range(key1) AND range(key2) AND ... AND range(keyN);
(here range(keyi) may represent ranges not for initial keyi prefixes,
but ranges for any infixes for keyi)
# merge tree represents several way to index_merge
imerge_tree = imerge1 AND imerge2 AND ...
# a way to do index merge == a set to use of different indexes.
imergeX = range_tree1 OR range_tree2 OR ..
where no pair of range_treeX have ranges over the same index.
tree_and(A, B)
{
if (both A and B are range trees)
return a range_tree with computed intersection for each range;
if (only one of A and B is a range tree)
return that tree; // DISCARD-IMERGE-1
// at this point both trees are index_merge trees
return concat_lists( A.imerge1 ... A.imergeN, B.imerge1 ... B.imergeN);
}
tree_or(A, B)
{
if (A and B are range trees)
{
R = new range_tree;
for each index i
R.add(range_union(A.range(i), B.range(i)));
if (R has at least one range access)
return R; // DISCARD-IMERGE-2
else
{
/* could not build any range accesses. construct index_merge */
remove non-ranges from A;
remove non-ranges from B;
return new index_merge(A, B); // DISCARD-IMERGE-3
}
}
else if (A is range tree and B is index_merge tree (or vice versa))
{
Perform this transformation:
range_treeA // this is A
OR
(range_treeB_11 OR range_treeB_12 OR ... OR range_treeB_1N) AND
(range_treeB_21 OR range_treeB_22 OR ... OR range_treeB_2N) AND
...
(range_treeB_K1 OR range_treeB_K2 OR ... OR range_treeB_kN)
=
(range_treeA OR range_treeB_11 OR ... OR range_treeB_1N) AND
(range_treeA OR range_treeB_21 OR ... OR range_treeB_2N) AND
...
(range_treeA OR range_treeB_11 OR ... OR range_treeB_1N)
Now each line represents an index_merge..
}
else if (both A and B are index_merge trees)
{
Perform this transformation:
imergeA1 AND imergeA2 AND ... AND imergeAN
OR
imergeB1 AND imergeB2 AND ... AND imergeBN
-> (discard all imergeA{i=2,3,...} -> // DISCARD-IMERGE-4
imergeA1
OR
imergeB1 =
= (combine imergeA1 with each of the range_treeB_1{i} ) =
combine(imergeA1 OR range_treeB_11) AND
combine(imergeA1 OR range_treeB_12) AND
... AND
combine(imergeA1 OR range_treeB_1N)
}
}
1.1. Problems in the current implementation
-------------------------------------------
As marked in the code above:
DISCARD-IMERGE-1 step will cause index_merge option to be discarded when
the WHERE clause has this form:
(t.key1=c1 OR t.key2=c2) AND t.badkey < c3
DISCARD-IMERGE-2 step will cause index_merge option to be discarded when
the WHERE clause has this form (conditions t.badkey may have abritrary form):
(t.badkey<c1 AND t.key1=c1) OR (t.key2=c2 AND t.badkey < c2)
DISCARD-IMERGE-3 manifests itself as the following effect: suppose there are
two indexes:
INDEX i1(col1, col2),
INDEX i2(col1, col3)
and this WHERE clause:
col1=c1 AND (col2=c2 OR col3=c3)
The optimizer will generate the plans that only use the "col1=c1" part. The
right side of the AND will be ignored even if it has good selectivity.
(Here no imerge for col2=c2 OR col3=c3 will be built since neither col2=c2 nor
col3=c3 represent index ranges.)
2. New implementation
=====================
<general idea>
* Don't start fighting combinatorial explosion until we've actually got one.
</>
SEL_TREE structure will be now able to hold both index_merge and range scan
candidates at the same time. That is,
sel_tree2 = range_tree AND imerge_tree
where both parts are optional (i.e. can be empty)
Operations on SEL_ARG trees will be modified to produce/process the trees of
this kind:
2.1 New tree_and()
------------------
In order not to lose plans, we'll make these changes:
A1. Don't remove index_merge part of the tree (this will take care of
DISCARD-IMERGE-1 problem)
A2. Push range conditions down into index_merge trees that may support them.
if one tree has range(key1) and the other tree has imerge(key1 OR key2)
then perform an equvalent of this operation:
rangeA(key1) AND ( rangeB(key1) OR rangeB(key2)) =
(rangeA(key1) AND rangeB(key1)) OR (rangeA(key1) AND rangeB(key2))
A3. Just as before: if both sel_tree A and sel_tree B have index_merge options,
concatenate them together.
2.2 New tree_or()
-----------------
O1. Dont remove non-range plans:
Current tree_or() code will refuse to produce index_merge plans for
conditions like
"t.key1part2=const OR t.key2part1=const"
(this is marked as DISCARD-IMERGE-3). This was justifed as the left part of
the AND condition is not usable for range access, and the operation of
tree_and() guaranteed that there was no way it could changed to make a
usable range plan. With new tree_and() and rule A2, this is no longer the
case. For example for this query:
(t.key1part2=const OR t.key2part1=const) AND t.key1part1=const
it will construct a
imerge(t.key1part2=const OR t.key2part1=const), range(t.key1part1=const)
then tree_and() will apply rule A2 to push the range down into index merge
and after that we'll have:
range(t.key1part1=const)
imerge(
t.key1part2=const AND t.key1part1=const,
t.key2part1=const
)
note that imerge(...) describes a usable index_merge plan and it's possible
that it will be the best access path.
O2. "Create index_merge accesses when possible"
Current tree_or() will not create index_merge access when it could create
non-index merge access (see DISCARD-IMERGE-2 and its example in the "Problems
in the current implementation" section). This will be changed to work as
follows: we will create index_merge made for index scans that didn't have
their match in the other sel_tree.
Ilustrating it with an example:
| sel_tree_A | sel_tree_B | A or B | include in index_merge?
------+------------+------------+--------+------------------------
key1 | cond1 | cond2 | condM | no
key2 | cond3 | cond4 | NULL | no
key3 | cond5 | | | yes, A-side
key4 | cond6 | | | yes, A-side
key5 | | cond7 | | yes, B-side
key6 | | cond8 | | yes, B-side
here we assume that
- (cond1 OR cond2) did produce a combined range. Not including them in
index_merge.
- (cond3 OR cond4) didn't produce a usable range (e.g. they were
t.key1part1=c1 AND t.key1part2=c1, respectively, and combining them
didn't yield any range list)
- All other scand didn't have their counterparts, so we'll end up with a
SEL_TREE of:
range(condM) AND index_merge((cond5 AND cond6),(cond7 AND cond8))
.
O4. There is no O4. DISCARD-INDEX-MERGE-4 will remain there. The idea is
that although DISCARD-INDEX-MERGE-4 does discard plans, so far we haven
seen any complaints that could be attributed to it.
If we face the need to lift DISCARD-INDEX-MERGE-4, our answer will be to
lift it ,and produce a cross-product:
((key1p OR key2p) AND (key3p OR key4p))
OR
((key5p OR key6p) AND (key7p OR key8p))
= (key1p OR key2p OR key5p OR key6p) AND // this part is currently
(key3p OR key4p OR key5p OR key6p) AND // produced
(key1p OR key2p OR key5p OR key6p) AND // this part will be added
(key3p OR key4p OR key5p OR key6p) //.
In order to limit the impact of this combinatorial explosion, we will
introduce a rule that we won't generate more than #defined
MAX_IMERGE_OPTS options.
3. Testing and required coverage
================================
So far could find the following user cases:
* BUG#17259: Query optimizer chooses wrong index
* BUG#17673: Optimizer does not use Index Merge optimization in some cases
* BUG#23322: Optimizer sometimes erroniously prefers other index over index merge
* BUG#30151: optimizer is very reluctant to chose index_merge algorithm
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v4.0.0)
[Maria-developers] WL#67 Updated (by Psergey): ICP/MRR backport
by worklog-noreply@askmonty.org 29 Jun '10
by worklog-noreply@askmonty.org 29 Jun '10
29 Jun '10
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: ICP/MRR backport
CREATION DATE..: Thu, 26 Nov 2009, 15:19
SUPERVISOR.....: Monty
IMPLEMENTOR....: Psergey
COPIES TO......:
CATEGORY.......: Server-Sprint
TASK ID........: 67 (http://askmonty.org/worklog/?tid=67)
VERSION........: Server-9.x
STATUS.........: Complete
PRIORITY.......: 60
WORKED HOURS...: 0
ESTIMATE.......: 0 (hours remain)
ORIG. ESTIMATE.: 0
PROGRESS NOTES:
-=-=(Psergey - Tue, 29 Jun 2010, 13:57)=-=-
Status updated.
--- /tmp/wklog.67.old.31561 2010-06-29 13:57:50.000000000 +0000
+++ /tmp/wklog.67.new.31561 2010-06-29 13:57:50.000000000 +0000
@@ -1 +1 @@
-Un-Assigned
+Complete
-=-=(Guest - Sun, 13 Jun 2010, 16:57)=-=-
Dependency deleted: 91 no longer depends on 67
-=-=(Igor - Wed, 10 Mar 2010, 19:14)=-=-
High Level Description modified.
--- /tmp/wklog.67.old.25641 2010-03-10 19:14:45.000000000 +0000
+++ /tmp/wklog.67.new.25641 2010-03-10 19:14:45.000000000 +0000
@@ -1,2 +1,2 @@
-Backport DS-MRR into MariaDB-5.2 codebase, also adding certain extra features to
-make it more usable.
+Backport ICP and DS-MRR into MariaDB-5.2 codebase, also adding certain extra
+features to make it more usable.
-=-=(Guest - Wed, 10 Mar 2010, 19:12)=-=-
Title modified.
--- /tmp/wklog.67.old.25456 2010-03-10 19:12:57.000000000 +0000
+++ /tmp/wklog.67.new.25456 2010-03-10 19:12:57.000000000 +0000
@@ -1 +1 @@
-MRR backport
+ICP/MRR backport
-=-=(Psergey - Sun, 28 Feb 2010, 14:56)=-=-
Dependency created: 91 now depends on 67
-=-=(Psergey - Sun, 28 Feb 2010, 14:54)=-=-
Dependency deleted: 94 no longer depends on 67
-=-=(Psergey - Sun, 28 Feb 2010, 14:09)=-=-
Dependency created: 94 now depends on 67
-=-=(Psergey - Thu, 26 Nov 2009, 20:21)=-=-
High-Level Specification modified.
--- /tmp/wklog.67.old.9329 2009-11-26 20:21:28.000000000 +0200
+++ /tmp/wklog.67.new.9329 2009-11-26 20:21:28.000000000 +0200
@@ -65,17 +65,19 @@
2.5 Make MRR code more of a module
----------------------------------
-Some code in handler.cc can be moved to separate file.
-But changes in opt_range.cc can't.
-TODO: Sort out how much we really can do here. Initial guess is not much as the
-code consists of:
+It is not possible to make MRR to be a totally separate module, as its code
+consists of :
- Default MRR implementation in handler.cc
- Changes in opt_range.cc to use MRR instead of multiple records_in_range()
- calls. These rely on opt_range.cc's internal structures like SEL_ARG trees and
+ calls. These rely on opt_range.cc's internal stuctures like SEL_ARG trees and
so there is not much point in moving them out.
-- DS-MRR implementations which are spread over storage engines.
-and the only modularization we see is to move #1 into a separate file which
-won't achieve much.
+- DS-MRR impelementations which are spread over storage engines.
+
+We'll try to modularize what we can:
+- Move out default MRR implementation from handler.cc
+- Move possible parts out of opt_range.cc into a separate file.
+
+
2.6 Improve the cost model
--------------------------
-=-=(Psergey - Thu, 26 Nov 2009, 19:06)=-=-
High-Level Specification modified.
--- /tmp/wklog.67.old.6449 2009-11-26 19:06:04.000000000 +0200
+++ /tmp/wklog.67.new.6449 2009-11-26 19:06:04.000000000 +0200
@@ -1,4 +1,3 @@
-
<contents>
1. Requirements
2. Required actions
@@ -44,6 +43,7 @@
http://bugs.mysql.com/search.php?cmd=display&status=Active&tags=index_condi…
http://bugs.mysql.com/search.php?cmd=display&status=Active&tags=mrr
+http://bugs.mysql.com/search.php?cmd=display&status=Active&tags=icp
2.2 Backport DS-MRR code to MariaDB 5.2
---------------------------------------
-=-=(Psergey - Thu, 26 Nov 2009, 18:15)=-=-
High-Level Specification modified.
--- /tmp/wklog.67.old.4161 2009-11-26 18:15:36.000000000 +0200
+++ /tmp/wklog.67.new.4161 2009-11-26 18:15:36.000000000 +0200
@@ -1,3 +1,17 @@
+
+<contents>
+1. Requirements
+2. Required actions
+2.1 Fix DS-MRR/InnoDB bugs
+2.2 Backport DS-MRR code to MariaDB 5.2
+2.3 Introduce control variables
+2.4 Other backport issues
+2.5 Make MRR code more of a module
+2.6 Improve the cost model
+2.7 Let DS-MRR support clustered primary keys
+</contents>
+
+
1. Requirements
===============
@@ -63,4 +77,28 @@
and the only modularization we see is to move #1 into a separate file which
won't achieve much.
+2.6 Improve the cost model
+--------------------------
+At the moment DS-MRR cost formula re-uses non-MRR scan costs, which uses
+records_in_range() calls, followed by index_only_read_time() or read_time()
+calls to produce the estimate for read cost.
+
+ We should change this (TODO sort out how exactly)
+
+Note: this means that the query plans will change from MariaDB 5.2.
+
+2.7 Let DS-MRR support clustered primary keys
+---------------------------------------------
+At the moment DS-MRR is not supported for clustered primary keys. It is not
+needed when MRR is used for range access, because range access is done over
+an ordered list of ranges, but it is useful for BKA.
+
+TODO:
+ it's useful for BKA because BKA makes MRR scans over un-orderered
+ non-disjoint lists of ranges. Then we can sort these and do ordered scans.
+ There is still no use for DS-MRR over clustered primary key for range
+ access, where the ranges are disjoint and ordered.
+ How about postponing this item until BKA is backported?
+
+
------------------------------------------------------------
-=-=(View All Progress Notes, 11 total)=-=-
http://askmonty.org/worklog/index.pl?tid=67&nolimit=1
DESCRIPTION:
Backport ICP and DS-MRR into MariaDB-5.2 codebase, also adding certain extra
features to make it more usable.
HIGH-LEVEL SPECIFICATION:
<contents>
1. Requirements
2. Required actions
2.1 Fix DS-MRR/InnoDB bugs
2.2 Backport DS-MRR code to MariaDB 5.2
2.3 Introduce control variables
2.4 Other backport issues
2.5 Make MRR code more of a module
2.6 Improve the cost model
2.7 Let DS-MRR support clustered primary keys
</contents>
1. Requirements
===============
We need the following:
1. Latest MRR interface support, including extensions to support ICP when
using BKA.
2. Let DS-MRR support clustered primary keys (needed when using BKA).
3. Remove conditions used for key access from the condition pushed to index
(ATM this manifests itself as "Using index condition" appearing where there
was no "Using where". TODO: example of this?)
4. Introduce a separate @@optimizer_switch flag for turning ICP on/off (ATM it
is switched on/off by @@engine_condition_pushdown).
5. Introduce a separate @@mrr_buffer_size variable to control the MRR buffer
size for range+MRR scans. ATM it is controlled by the @@read_rnd_size
variable, which is non-obvious for a number of users.
6. Rename multi_range_read_info_const() to look like it is not a part of the
MRR interface.
7. Improve MRR's cost model.
8. Try to make MRR more of a module.
2. Required actions
===================
Roughly in the order in which it will be done:
2.1 Fix DS-MRR/InnoDB bugs
--------------------------
We need to fix the bugs listed here:
http://bugs.mysql.com/search.php?cmd=display&status=Active&tags=index_condi…
http://bugs.mysql.com/search.php?cmd=display&status=Active&tags=mrr
http://bugs.mysql.com/search.php?cmd=display&status=Active&tags=icp
2.2 Backport DS-MRR code to MariaDB 5.2
---------------------------------------
The easiest way seems to be to manually move the needed code from mysql-6.0
(or whatever it's called now) to MariaDB.
2.3 Introduce control variables
-------------------------------
Act on items #4 and #5 from the requirements. Should be easy as
@@optimizer_switch is supported in 5.1 codebase.
2.4 Other backport issues
-------------------------
* Figure out what to do with NDB/MRR. 5.1 codebase has "old" NDB/MRR
implementation. mysql-6.0 (and NDB's branch) have the updated NDB/MRR
but merging it into 5.1 can be very labor-intensive.
Will it be ok to disable NDB/MRR altogether?
2.5 Make MRR code more of a module
----------------------------------
It is not possible to make MRR a totally separate module, as its code
consists of:
- Default MRR implementation in handler.cc
- Changes in opt_range.cc to use MRR instead of multiple records_in_range()
calls. These rely on opt_range.cc's internal structures like SEL_ARG trees, so
there is not much point in moving them out.
- DS-MRR implementations which are spread over storage engines.
We'll try to modularize what we can:
- Move out default MRR implementation from handler.cc
- Move possible parts out of opt_range.cc into a separate file.
2.6 Improve the cost model
--------------------------
At the moment the DS-MRR cost formula re-uses the non-MRR scan cost, which uses
records_in_range() calls, followed by index_only_read_time() or read_time()
calls, to produce the read cost estimate.
We should change this (TODO: sort out how exactly).
Note: this means that query plans will change compared to MariaDB 5.2.
2.7 Let DS-MRR support clustered primary keys
---------------------------------------------
At the moment DS-MRR is not supported for clustered primary keys. It is not
needed when MRR is used for range access, because range access is done over
an ordered list of ranges, but it is useful for BKA.
TODO:
It's useful for BKA because BKA makes MRR scans over un-ordered,
non-disjoint lists of ranges. Then we can sort these and do ordered scans.
There is still no use for DS-MRR over a clustered primary key for range
access, where the ranges are disjoint and ordered.
How about postponing this item until BKA is backported?
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v4.0.0)
[Maria-developers] WL#120 Updated (by Knielsen): Replication API for stacked event generators
by worklog-noreply@askmonty.org 29 Jun '10
by worklog-noreply@askmonty.org 29 Jun '10
29 Jun '10
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: Replication API for stacked event generators
CREATION DATE..: Mon, 07 Jun 2010, 13:13
SUPERVISOR.....: Knielsen
IMPLEMENTOR....: Knielsen
COPIES TO......:
CATEGORY.......: Server-Sprint
TASK ID........: 120 (http://askmonty.org/worklog/?tid=120)
VERSION........: Server-9.x
STATUS.........: In-Progress
PRIORITY.......: 60
WORKED HOURS...: 2
ESTIMATE.......: 0 (hours remain)
ORIG. ESTIMATE.: 0
PROGRESS NOTES:
-=-=(Knielsen - Tue, 29 Jun 2010, 13:51)=-=-
Status updated.
--- /tmp/wklog.120.old.31179 2010-06-29 13:51:20.000000000 +0000
+++ /tmp/wklog.120.new.31179 2010-06-29 13:51:20.000000000 +0000
@@ -1 +1 @@
-Assigned
+In-Progress
-=-=(Knielsen - Mon, 28 Jun 2010, 07:11)=-=-
High-Level Specification modified.
--- /tmp/wklog.120.old.18416 2010-06-28 07:11:09.000000000 +0000
+++ /tmp/wklog.120.new.18416 2010-06-28 07:11:09.000000000 +0000
@@ -14,10 +14,14 @@
An example of an event consumer is the writing of the binlog on a master.
-Event generators are not really plugins. Rather, there are specific points in
-the server where events are generated. However, a generator can be part of a
-plugin, for example a PBXT engine-level replication event generator would be
-part of the PBXT storage engine plugin.
+Some event generators are not really plugins. Rather, there are specific
+points in the server where events are generated. However, a generator can be
+part of a plugin, for example a PBXT engine-level replication event generator
+would be part of the PBXT storage engine plugin. And for example we could
+write a filter plugin, which would be stacked on top of an existing generator
+and provide the same event types and interfaces, but filtered in some way (for
+example by removing certain events on the master side, or by re-writing events
+in certain ways).
Event consumers on the other hand could be a plugin.
@@ -85,6 +89,9 @@
some is kept as reference to context (eg. THD) only. This however looses most
of the mentioned advantages for materialisation.
+Also, the non-materialising interface should be a good interface on top of
+which to build a materialising interface.
+
The design proposed here aims for as little materialisation as possible.
-=-=(Knielsen - Thu, 24 Jun 2010, 14:29)=-=-
Category updated.
--- /tmp/wklog.120.old.7440 2010-06-24 14:29:38.000000000 +0000
+++ /tmp/wklog.120.new.7440 2010-06-24 14:29:38.000000000 +0000
@@ -1 +1 @@
-Server-RawIdeaBin
+Server-Sprint
-=-=(Knielsen - Thu, 24 Jun 2010, 14:29)=-=-
Status updated.
--- /tmp/wklog.120.old.7440 2010-06-24 14:29:38.000000000 +0000
+++ /tmp/wklog.120.new.7440 2010-06-24 14:29:38.000000000 +0000
@@ -1 +1 @@
-Un-Assigned
+Assigned
-=-=(Knielsen - Thu, 24 Jun 2010, 14:29)=-=-
Low Level Design modified.
--- /tmp/wklog.120.old.7355 2010-06-24 14:29:11.000000000 +0000
+++ /tmp/wklog.120.new.7355 2010-06-24 14:29:11.000000000 +0000
@@ -1 +1,385 @@
+A consumer is implented as a virtual class (interface). There is one virtual
+function for every event that can be received. A consumer would derive from
+the base class and override methods for the events it wants to receive.
+
+There is one consumer interface for each generator. When a generator A is
+stacked on B, the consumer interface for A inherits from the interface for
+B. This way, when A defers an event to B, the consumer for A will receive the
+corresponding event from B.
+
+There are methods for a consumer to register itself to receive events from
+each generator. I still need to find a way for a consumer in one plugin to
+register itself with a generator implemented in another plugin (eg. PBXT
+engine-level replication). I also need to add a way for consumers to
+de-register themselves.
+
+The current design has consumer callbacks return 0 for success and error code
+otherwise. I still need to think more about whether this is useful (ie. what
+is the semantics of returning an error from a consumer callback).
+
+Each event passed to consumers is defined as a class with public accessor
+methods to a private context (which is mostly the THD).
+
+My intension is to make all events passed around const, so that the same event
+can be passed to each of multiple registered consumers (and to emphasise that
+consumers do not have the ability to modify events). It still needs to be seen
+whether that const-ness will be feasible in practise without very heavy
+modification/constification of exiting code.
+
+What follows is a partial draft of a possible definition of the API as
+concrete C++ class definitions.
+
+-----------------------------------------------------------------------
+
+/*
+ Virtual base class for generated replication events.
+
+ This is the parent of events generated from all kinds of generators. Only
+ child classes can be instantiated.
+
+ This class can be used by code that wants to treat events in a generic way,
+ without any knowledge of event details. I still need to decide whether such
+ generic code is sensible.
+*/
+class rpl_event_base
+{
+ /*
+ Maybe we will want the ability to materialise an event to a standard
+ binary format. This could be achieved with a base method like this. The
+ actual materialisation would be implemented in each deriving class. The
+ public methods would provide different interfaces for specifying the
+ buffer or for writing directly into IO_CACHE or file.
+ */
+
+ /* Return 0 on success, -1 on error, -2 on out-of-buffer. */
+ int materialise(uchar *buffer, size_t buflen) const;
+ /*
+ Returns NULL on error or else malloc()ed buffer with materialised event,
+ caller must free().
+ */
+ uchar *materialise() const;
+ /* Same but using passed in memroot. */
+ uchar *materialise(mem_root *memroot) const;
+ /*
+ Materialise to user-supplied writer function (could write directly to file
+ or the like).
+ */
+ int materialise(int (*writer)(uchar *data, size_t len, void *context)) const;
+
+ /*
+ As to for what to do with a materialised event, there are a couple of
+ possibilities.
+
+ One is to have a de_materialise() method somewhere that can construct an
+ rpl_event_base (really a derived class of course) from a buffer or writer
+ function. This would require each accessor function to conditionally read
+ its data from either THD context or buffer (GCC is able to optimise
+ several such conditionals in multiple accessor function calls into one
+ conditional), or we can make all accessors virtual if the performance hit
+ is acceptable.
+
+ Another is to have different classes for accessing events read from
+ materialised event data.
+
+ Also, I still need to think about whether it is at all useful to be able
+ to generically materialise an event at this level. It may be that any
+ binlog/transport will in any case need to undertand more of the format of
+ events, so that such materialisation/transport is better done at a
+ different layer.
+ */
+
+protected:
+ /* Implementation which is the basis for materialise(). */
+ virtual int do_materialise(int (*writer)(uchar *data, size_t len,
+ void *context)) const = 0;
+
+private:
+ /* Virtual base class, private constructor to prevent instantiation. */
+ rpl_event_base();
+};
+
+
+/*
+ These are the event types output from the transaction event generator.
+
+ This generator is not stacked on anything.
+
+ The transaction event generator marks the start and end (commit or rollback)
+ of transactions. It also gives information about whether the transaction was
+ a full transaction or autocommitted statement, whether transactional tables
+ were involved, whether non-transactional tables were involved, and XA
+ information (ToDo).
+*/
+
+/* Base class for transaction events. */
+class rpl_event_transaction_base : public rpl_event_base
+{
+public:
+ /*
+ Get the local transaction id. This idea is only unique within one server.
+ It is allocated whenever a new transaction is started.
+ Can be used to identify events belonging to the same transaction in a
+ binlog-like stream of events streamed in parallel among multiple
+ transactions.
+ */
+ uint64_t get_local_trx_id() const { return thd->local_trx_id; };
+
+ bool get_is_autocommit() const;
+
+private:
+ /* The context is the THD. */
+ THD *thd;
+
+ rpl_event_transaction_base(THD *_thd) : thd(_thd) { };
+};
+
+/* Transaction start event. */
+class rpl_event_transaction_start : public rpl_event_transaction_base
+{
+
+};
+
+/* Transaction commit. */
+class rpl_event_transaction_commit : public rpl_event_transaction_base
+{
+public:
+ /*
+ The global transaction id is unique cross-server.
+
+ It can be used to identify the position from which to start a slave
+ replicating from a master.
+
+ This global ID is only available once the transaction is decided to commit
+ by the TC manager / primary redundancy service. This TC also allocates the
+ ID and decides the exact semantics (can there be gaps, etc); however the
+ format is fixed (cluster_id, running_counter).
+ */
+ struct global_transaction_id
+ {
+ uint32_t cluster_id;
+ uint64_t counter;
+ };
+
+ const global_transaction_id *get_global_transaction_id() const;
+};
+
+/* Transaction rollback. */
+class rpl_event_transaction_rollback : public rpl_event_transaction_base
+{
+
+};
+
+
+/* Base class for statement events. */
+class rpl_event_statement_base : public rpl_event_base
+{
+public:
+ LEX_STRING get_current_db() const;
+};
+
+class rpl_event_statement_start : public rpl_event_statement_base
+{
+
+};
+
+class rpl_event_statement_end : public rpl_event_statement_base
+{
+public:
+ int get_errorcode() const;
+};
+
+class rpl_event_statement_query : public rpl_event_statement_base
+{
+public:
+ LEX_STRING get_query_string();
+ ulong get_sql_mode();
+ const CHARSET_INFO *get_character_set_client();
+ const CHARSET_INFO *get_collation_connection();
+ const CHARSET_INFO *get_collation_server();
+ const CHARSET_INFO *get_collation_default_db();
+
+ /*
+ Access to relevant flags that affect query execution.
+
+ Use as if (ev->get_flags() & (uint32)ROW_FOREIGN_KEY_CHECKS) { ... }
+ */
+ enum flag_bits
+ {
+ STMT_FOREIGN_KEY_CHECKS, // @@foreign_key_checks
+ STMT_UNIQUE_KEY_CHECKS, // @@unique_checks
+ STMT_AUTO_IS_NULL, // @@sql_auto_is_null
+ };
+ uint32_t get_flags();
+
+ ulong get_auto_increment_offset();
+ ulong get_auto_increment_increment();
+
+ // And so on for: time zone; day/month names; connection id; LAST_INSERT_ID;
+ // INSERT_ID; random seed; user variables.
+ //
+ // We probably also need get_uses_temporary_table(), get_used_user_vars(),
+ // get_uses_auto_increment() and so on, so a consumer can get more
+ // information about what kind of context information a query will need when
+ // executed on a slave.
+};
+
+class rpl_event_statement_load_query : public rpl_event_statement_query
+{
+
+};
+
+/*
+ This event is fired with blocks of data for files read (from server-local
+ file or client connection) for LOAD DATA.
+*/
+class rpl_event_statement_load_data_block : public rpl_event_statement_base
+{
+public:
+ struct block
+ {
+ const uchar *ptr;
+ size_t size;
+ };
+ block get_block() const;
+};
+
+/* Base class for row-based replication events. */
+class rpl_event_row_base : public rpl_event_base
+{
+public:
+ /*
+ Access to relevant handler extra flags and other flags that affect row
+ operations.
+
+ Use as if (ev->get_flags() & (uint32)ROW_WRITE_CAN_REPLACE) { ... }
+ */
+ enum flag_bits
+ {
+ ROW_WRITE_CAN_REPLACE, // HA_EXTRA_WRITE_CAN_REPLACE
+ ROW_IGNORE_DUP_KEY, // HA_EXTRA_IGNORE_DUP_KEY
+ ROW_IGNORE_NO_KEY, // HA_EXTRA_IGNORE_NO_KEY
+ ROW_DISABLE_FOREIGN_KEY_CHECKS, // ! @@foreign_key_checks
+ ROW_DISABLE_UNIQUE_KEY_CHECKS, // ! @@unique_checks
+ };
+ uint32_t get_flags();
+
+ /* Access to list of tables modified. */
+ class table_iterator
+ {
+ public:
+ /* Returns table, NULL after last. */
+ const TABLE *get_next();
+ private:
+ // ...
+ };
+ table_iterator get_modified_tables() const;
+
+private:
+ /* Context used to provide accessors. */
+ THD *thd;
+
+protected:
+ rpl_event_row_base(THD *_thd) : thd(_thd) { }
+};
+
+
+class rpl_event_row_write : public rpl_event_row_base
+{
+public:
+ const BITMAP *get_write_set() const;
+ const uchar *get_after_image() const;
+};
+
+class rpl_event_row_update : public rpl_event_row_base
+{
+public:
+ const BITMAP *get_read_set() const;
+ const BITMAP *get_write_set() const;
+ const uchar *get_before_image() const;
+ const uchar *get_after_image() const;
+};
+
+class rpl_event_row_delete : public rpl_event_row_base
+{
+public:
+ const BITMAP *get_read_set() const;
+ const uchar *get_before_image() const;
+};
+
+
+/*
+ Event consumer callbacks.
+
+ An event consumer registers with an event generator to receive event
+ notifications from that generator.
+
+ The consumer has callbacks (in the form of virtual functions) for the
+ individual event types the consumer is interested in. Only callbacks that
+ are non-NULL will be invoked. If an event applies to multiple callbacks in a
+ single callback struct, it will only be passed to the most specific non-NULL
+ callback (so events never fire more than once per registration).
+
+ The lifetime of the memory holding the event is only for the duration of the
+ callback invocation, unless otherwise noted.
+
+ Callbacks return 0 for success or error code (ToDo: does this make sense?).
+*/
+
+struct rpl_event_consumer_transaction
+{
+ virtual int trx_start(const rpl_event_transaction_start *) { return 0; }
+ virtual int trx_commit(const rpl_event_transaction_commit *) { return 0; }
+ virtual int trx_rollback(const rpl_event_transaction_rollback *) { return 0; }
+};
+
+/*
+ Consuming statement-based events.
+
+ The statement event generator is stacked on top of the transaction event
+ generator, so we can receive those events as well.
+*/
+struct rpl_event_consumer_statement : public rpl_event_consumer_transaction
+{
+ virtual int stmt_start(const rpl_event_statement_start *) { return 0; }
+ virtual int stmt_end(const rpl_event_statement_end *) { return 0; }
+
+ virtual int stmt_query(const rpl_event_statement_query *) { return 0; }
+
+ /* Data for a file used in LOAD DATA [LOCAL] INFILE. */
+ virtual int stmt_load_data_block(const rpl_event_statement_load_data_block *)
+ { return 0; }
+
+ /*
+ These are specific kinds of statements; if specified they override
+ consume_stmt_query() for the corresponding event.
+ */
+ virtual int stmt_load_query(const rpl_event_statement_load_query *ev)
+ { return stmt_query(ev); }
+};
+
+/*
+ Consuming row-based events.
+
+ The row event generator is stacked on top of the statement event generator.
+*/
+struct rpl_event_consumer_row : public rpl_event_consumer_statement
+{
+ virtual int row_write(const rpl_event_row_write *) { return 0; }
+ virtual int row_update(const rpl_event_row_update *) { return 0; }
+ virtual int row_delete(const rpl_event_row_delete *) { return 0; }
+};
+
+
+/*
+ Registration functions.
+
+ ToDo: Make a way to de-register.
+
+ ToDo: Find a good way for a plugin (eg. PBXT) to publish a generator
+ registration method.
+*/
+
+int rpl_event_transaction_register(const rpl_event_consumer_transaction *cbs);
+int rpl_event_statement_register(const rpl_event_consumer_statement *cbs);
+int rpl_event_row_register(const rpl_event_consumer_row *cbs);
-=-=(Knielsen - Thu, 24 Jun 2010, 14:28)=-=-
High-Level Specification modified.
--- /tmp/wklog.120.old.7341 2010-06-24 14:28:17.000000000 +0000
+++ /tmp/wklog.120.new.7341 2010-06-24 14:28:17.000000000 +0000
@@ -1 +1,159 @@
+Generators and consumbers
+-------------------------
+
+We have the two concepts:
+
+1. Event _generators_, that produce events describing all changes to data in a
+ server.
+
+2. Event consumers, that receive such events and use them in various ways.
+
+Examples of event generators is execution of SQL statements, which generates
+events like those used for statement-based replication. Another example is
+PBXT engine-level replication.
+
+An example of an event consumer is the writing of the binlog on a master.
+
+Event generators are not really plugins. Rather, there are specific points in
+the server where events are generated. However, a generator can be part of a
+plugin, for example a PBXT engine-level replication event generator would be
+part of the PBXT storage engine plugin.
+
+Event consumers on the other hand could be a plugin.
+
+One generator can be stacked on top of another. This means that a generator on
+top (for example row-based events) will handle some events itself
+(eg. non-deterministic update in mixed-mode binlogging). Other events that it
+does not want to or cannot handle (for example deterministic delete or DDL)
+will be defered to the generator below (for example statement-based events).
+
+
+Materialisation (or not)
+------------------------
+
+A central decision is how to represent events that are generated in the API at
+the point of generation.
+
+I want to avoid making the API require that events are materialised. By
+"Materialised" I mean that all (or most) of the data for the event is written
+into memory in a struct/class used inside the server or serialised in a data
+buffer (byte buffer) in a format suitable for network transport or disk
+storage.
+
+Using a non-materialised event means storing just a reference to appropriate
+context that allows to retrieve all information for the event using
+accessors. Ie. typically this would be based on getting the event information
+from the THD pointer.
+
+Some reasons to avoid using materialised events in the API:
+
+ - Replication events have a _lot_ of detailed context information that can be
+ needed in events: user-defined variables, random seed, character sets,
+ table column names and types, etc. etc. If we make the API based on
+ materialisation, then the initial decision about which context information
+ to include with which events will have to be done in the API, while ideally
+ we want this decision to be done by the individual consumer plugin. There
+ will this be a conflict between what to include (to allow consumers access)
+ and what to exclude (to avoid excessive needless work).
+
+ - Materialising means defining a very specific format, which will tend to
+ make the API less generic and flexible.
+
+ - Unless the materialised format is made _very_ specific (and thus very
+ inflexible), it is unlikely to be directly useful for transport
+ (eg. binlog), so it will need to be re-materialised into a different format
+ anyway, wasting work.
+
+ - If a generator on top handles an event, then we want to avoid wasting work
+ materialising an event in a generator below which would be completely
+ unused. Thus there would be a need for the upper generator to somehow
+ notify the lower generator ahead of event generation time to not fire an
+ event, complicating the API.
+
+Some advantages for materialisation:
+
+ - Using an API based on passing around some well-defined struct event (or
+ byte buffer) will be simpler than the complex class hierarchy proposed here
+ with no requirement for materialisation.
+
+ - Defining a materialised format would allow an easy way to use the same
+ consumer code on a generator that produces events at the source of
+ execution and on a generator that produces events from eg. reading them
+ from an event log.
+
+Note that there can be some middle way, where some data is materialised and
+some is kept as reference to context (eg. THD) only. This however looses most
+of the mentioned advantages for materialisation.
+
+The design proposed here aims for as little materialisation as possible.
+
+
+Default materialisation format
+------------------------------
+
+
+While the proposed API doesn't _require_ materialisation, we can still think
+about providing the _option_ for built-in materialisation. This could be
+useful if such materialisation is made suitable for transport to a different
+server (eg. no endian-dependance etc). If there is a facility for such
+materialisation built-in to the API, it becomes possible to write something
+like a generic binlog plugin or generic network transport plugin. This would
+be really useful for eg. PBXT engine-level replication, as it could be
+implemented without having to re-invent a binlog format.
+
+I added in the proposed API a simple facility to materialise every event as a
+string of bytes. To use this, I still need to add a suitable facility to
+de-materialise the event.
+
+However, it is still an open question whether such a facility will be at all
+useful. It still has some of the problems with materialisation mentioned
+above. And I think it is likely that a good binlog implementation will need
+to do more than just blindly copy opaque events from one endpoint to
+another. For example, it might need different event boundaries (merge and/or
+split events); it might need to augment or modify events, or inject new
+events, etc.
+
+So I think maybe it is better to add such a generic materialisation facility
+on top of the basic event generator API. Such a facility would provide
+materialisation of an replication event stream, not of individual events, so
+would be more flexible in providing a good implementation. It would be
+implemented for all generators. It would separate from both the event
+generator API (so we have flexibility to put a filter class in-between
+generator and materialisation), and could also be separate from the actual
+transport handling stuff like fsync() of binlog files and socket connections
+etc. It would be paired with a corresponding applier API which would handle
+executing events on a slave.
+
+Then we can have a default materialised event format, which is available, but
+not mandatory. So there can still be other formats alongside (like legacy
+MySQL 5.1 binlog event format and maybe Tungsten would have its own format).
+
+
+Encapsulation
+-------------
+
+Another fundamental question about the design is the level of encapsulation
+used for the API.
+
+At the implementation level, a lot of the work is basically to pull out all of
+the needed information from the THD object/context. The API I propose tries to
+_not_ expose the THD to consumers. Instead it provides accessor functions for
+all the bits and pieces relevant to each replication event, while the event
+class itself likely will be more or less just an encapsulated THD.
+
+So an alternative would be to have a generic event that was just (type, THD).
+Then consumers could just pull out whatever information they want from the
+THD. The THD implementation is already exposed to storage engines. This would
+of course greatly reduce the size of the API, eliminating lots of class
+definitions and accessor functions. Though arguably it wouldn't really
+simplify the API, as the complexity would just be in understanding the THD
+class.
+
+Note that we do not have to take any performance hit from using encapsulated
+accessors since compilers can inline them (though if inlining then we do not
+get any ABI stability with respect to THD implemetation).
+
+For now, the API is proposed without exposing the THD class. (Similar
+encapsulation could be added in actual implementation to also not expose TABLE
+and similar classes).
-=-=(Knielsen - Thu, 24 Jun 2010, 12:04)=-=-
Dependency created: 107 now depends on 120
-=-=(Knielsen - Thu, 24 Jun 2010, 11:59)=-=-
High Level Description modified.
--- /tmp/wklog.120.old.516 2010-06-24 11:59:24.000000000 +0000
+++ /tmp/wklog.120.new.516 2010-06-24 11:59:24.000000000 +0000
@@ -11,4 +11,4 @@
Event generators can be stacked, and a generator may defer event generation to
the next one down the stack. For example, the row-level replication event
-generator may defer DLL to the statement-level replication event generator.
+generator may defer DDL to the statement-level replication event generator.
-=-=(Knielsen - Mon, 21 Jun 2010, 08:35)=-=-
Research and design thoughts.
DESCRIPTION:
A part of the replication project, MWL#107.
Events are produced by event Generators. Examples are
- Generation of statement-based replication events
- Generation of row-based events
- Generation of PBXT engine-level replication events
and maybe reading events from the relay log on a slave is also an example of
generating events.
Event generators can be stacked, and a generator may defer event generation to
the next one down the stack. For example, the row-level replication event
generator may defer DDL to the statement-level replication event generator.
HIGH-LEVEL SPECIFICATION:
Generators and consumers
------------------------
We have the two concepts:
1. Event _generators_, that produce events describing all changes to data in a
server.
2. Event consumers, that receive such events and use them in various ways.
An example of an event generator is the execution of SQL statements, which
generates events like those used for statement-based replication. Another
example is PBXT engine-level replication.
An example of an event consumer is the writing of the binlog on a master.
Some event generators are not really plugins. Rather, there are specific
points in the server where events are generated. However, a generator can be
part of a plugin, for example a PBXT engine-level replication event generator
would be part of the PBXT storage engine plugin. And for example we could
write a filter plugin, which would be stacked on top of an existing generator
and provide the same event types and interfaces, but filtered in some way (for
example by removing certain events on the master side, or by re-writing events
in certain ways).
Event consumers on the other hand could be a plugin.
One generator can be stacked on top of another. This means that a generator on
top (for example row-based events) will handle some events itself
(eg. non-deterministic update in mixed-mode binlogging). Other events that it
does not want to or cannot handle (for example deterministic delete or DDL)
will be deferred to the generator below (for example statement-based events).
Materialisation (or not)
------------------------
A central decision is how to represent events that are generated in the API at
the point of generation.
I want to avoid making the API require that events are materialised. By
"Materialised" I mean that all (or most) of the data for the event is written
into memory in a struct/class used inside the server or serialised in a data
buffer (byte buffer) in a format suitable for network transport or disk
storage.
Using a non-materialised event means storing just a reference to appropriate
context that allows to retrieve all information for the event using
accessors. Ie. typically this would be based on getting the event information
from the THD pointer.
Some reasons to avoid using materialised events in the API:
- Replication events have a _lot_ of detailed context information that can be
needed in events: user-defined variables, random seed, character sets,
table column names and types, etc. etc. If we make the API based on
materialisation, then the initial decision about which context information
to include with which events will have to be done in the API, while ideally
we want this decision to be made by the individual consumer plugin. There
will thus be a conflict between what to include (to allow consumers access)
and what to exclude (to avoid excessive needless work).
- Materialising means defining a very specific format, which will tend to
make the API less generic and flexible.
- Unless the materialised format is made _very_ specific (and thus very
inflexible), it is unlikely to be directly useful for transport
(eg. binlog), so it will need to be re-materialised into a different format
anyway, wasting work.
- If a generator on top handles an event, then we want to avoid wasting work
materialising an event in a generator below which would be completely
unused. Thus there would be a need for the upper generator to somehow
notify the lower generator ahead of event generation time to not fire an
event, complicating the API.
Some advantages for materialisation:
- Using an API based on passing around some well-defined struct event (or
byte buffer) will be simpler than the complex class hierarchy proposed here
with no requirement for materialisation.
- Defining a materialised format would allow an easy way to use the same
consumer code on a generator that produces events at the source of
execution and on a generator that produces events from eg. reading them
from an event log.
Note that there can be some middle way, where some data is materialised and
some is kept as a reference to context (eg. THD) only. This however loses most
of the mentioned advantages of materialisation.
Also, the non-materialising interface should be a good interface on top of
which to build a materialising interface.
The design proposed here aims for as little materialisation as possible.
Default materialisation format
------------------------------
While the proposed API doesn't _require_ materialisation, we can still think
about providing the _option_ for built-in materialisation. This could be
useful if such materialisation is made suitable for transport to a different
server (eg. no endian-dependence etc). If there is a facility for such
materialisation built-in to the API, it becomes possible to write something
like a generic binlog plugin or generic network transport plugin. This would
be really useful for eg. PBXT engine-level replication, as it could be
implemented without having to re-invent a binlog format.
I added in the proposed API a simple facility to materialise every event as a
string of bytes. To use this, I still need to add a suitable facility to
de-materialise the event.
However, it is still an open question whether such a facility will be at all
useful. It still has some of the problems with materialisation mentioned
above. And I think it is likely that a good binlog implementation will need
to do more than just blindly copy opaque events from one endpoint to
another. For example, it might need different event boundaries (merge and/or
split events); it might need to augment or modify events, or inject new
events, etc.
So I think maybe it is better to add such a generic materialisation facility
on top of the basic event generator API. Such a facility would provide
materialisation of a replication event stream, not of individual events, so
would be more flexible in providing a good implementation. It would be
implemented for all generators. It would be separate from both the event
generator API (so we have flexibility to put a filter class in-between
generator and materialisation), and could also be separate from the actual
transport handling stuff like fsync() of binlog files and socket connections
etc. It would be paired with a corresponding applier API which would handle
executing events on a slave.
Then we can have a default materialised event format, which is available, but
not mandatory. So there can still be other formats alongside (like legacy
MySQL 5.1 binlog event format and maybe Tungsten would have its own format).
Encapsulation
-------------
Another fundamental question about the design is the level of encapsulation
used for the API.
At the implementation level, a lot of the work is basically to pull out all of
the needed information from the THD object/context. The API I propose tries to
_not_ expose the THD to consumers. Instead it provides accessor functions for
all the bits and pieces relevant to each replication event, while the event
class itself likely will be more or less just an encapsulated THD.
So an alternative would be to have a generic event that was just (type, THD).
Then consumers could just pull out whatever information they want from the
THD. The THD implementation is already exposed to storage engines. This would
of course greatly reduce the size of the API, eliminating lots of class
definitions and accessor functions. Though arguably it wouldn't really
simplify the API, as the complexity would just be in understanding the THD
class.
Note that we do not have to take any performance hit from using encapsulated
accessors since compilers can inline them (though when inlining we do not
get any ABI stability with respect to the THD implementation).
For now, the API is proposed without exposing the THD class. (Similar
encapsulation could be added in actual implementation to also not expose TABLE
and similar classes).
LOW-LEVEL DESIGN:
A consumer is implemented as a virtual class (interface). There is one virtual
function for every event that can be received. A consumer would derive from
the base class and override methods for the events it wants to receive.
There is one consumer interface for each generator. When a generator A is
stacked on B, the consumer interface for A inherits from the interface for
B. This way, when A defers an event to B, the consumer for A will receive the
corresponding event from B.
There are methods for a consumer to register itself to receive events from
each generator. I still need to find a way for a consumer in one plugin to
register itself with a generator implemented in another plugin (eg. PBXT
engine-level replication). I also need to add a way for consumers to
de-register themselves.
The current design has consumer callbacks return 0 for success and error code
otherwise. I still need to think more about whether this is useful (ie. what
is the semantics of returning an error from a consumer callback).
Each event passed to consumers is defined as a class with public accessor
methods to a private context (which is mostly the THD).
My intention is to make all events passed around const, so that the same event
can be passed to each of multiple registered consumers (and to emphasise that
consumers do not have the ability to modify events). It still needs to be seen
whether that const-ness will be feasible in practice without very heavy
modification/constification of existing code.
What follows is a partial draft of a possible definition of the API as
concrete C++ class definitions.
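For orientation, here is a minimal sketch (not part of the draft itself) of
what a consumer written against these interfaces could look like. The class
name my_gtid_logger and what it does with the event are hypothetical; only the
rpl_event_* names are taken from the definitions that follow, and registration
is shown as proposed even though registration/de-registration details are
still open, as noted above.

/*
  Illustrative consumer sketch; assumes the draft declarations below are
  available. my_gtid_logger and its behaviour are hypothetical.
*/
class my_gtid_logger : public rpl_event_consumer_transaction
{
public:
  /* Override only the events we care about; the rest keep the no-op default. */
  virtual int trx_commit(const rpl_event_transaction_commit *ev)
  {
    const rpl_event_transaction_commit::global_transaction_id *gtid=
      ev->get_global_transaction_id();
    /* ... write (gtid->cluster_id, gtid->counter) to some log here ... */
    (void)gtid;
    return 0;           /* 0 = success, per the draft's callback convention */
  }
};

/* Registration as proposed in the draft; de-registration is still a ToDo. */
static my_gtid_logger gtid_logger;
int init_gtid_logger()
{
  return rpl_event_transaction_register(&gtid_logger);
}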
-----------------------------------------------------------------------
/*
Virtual base class for generated replication events.
This is the parent of events generated from all kinds of generators. Only
child classes can be instantiated.
This class can be used by code that wants to treat events in a generic way,
without any knowledge of event details. I still need to decide whether such
generic code is sensible.
*/
class rpl_event_base
{
/*
Maybe we will want the ability to materialise an event to a standard
binary format. This could be achieved with a base method like this. The
actual materialisation would be implemented in each deriving class. The
public methods would provide different interfaces for specifying the
buffer or for writing directly into IO_CACHE or file.
*/
/* Return 0 on success, -1 on error, -2 on out-of-buffer. */
int materialise(uchar *buffer, size_t buflen) const;
/*
Returns NULL on error or else malloc()ed buffer with materialised event,
caller must free().
*/
uchar *materialise() const;
/* Same but using passed in memroot. */
uchar *materialise(mem_root *memroot) const;
/*
Materialise to user-supplied writer function (could write directly to file
or the like).
*/
int materialise(int (*writer)(uchar *data, size_t len, void *context)) const;
/*
As for what to do with a materialised event, there are a couple of
possibilities.
One is to have a de_materialise() method somewhere that can construct an
rpl_event_base (really a derived class of course) from a buffer or writer
function. This would require each accessor function to conditionally read
its data from either THD context or buffer (GCC is able to optimise
several such conditionals in multiple accessor function calls into one
conditional), or we can make all accessors virtual if the performance hit
is acceptable.
Another is to have different classes for accessing events read from
materialised event data.
Also, I still need to think about whether it is at all useful to be able
to generically materialise an event at this level. It may be that any
binlog/transport will in any case need to understand more of the format of
events, so that such materialisation/transport is better done at a
different layer.
*/
protected:
/* Implementation which is the basis for materialise(). */
virtual int do_materialise(int (*writer)(uchar *data, size_t len,
void *context)) const = 0;
private:
/* Virtual base class, private constructor to prevent instantiation. */
rpl_event_base();
};
/*
These are the event types output from the transaction event generator.
This generator is not stacked on anything.
The transaction event generator marks the start and end (commit or rollback)
of transactions. It also gives information about whether the transaction was
a full transaction or autocommitted statement, whether transactional tables
were involved, whether non-transactional tables were involved, and XA
information (ToDo).
*/
/* Base class for transaction events. */
class rpl_event_transaction_base : public rpl_event_base
{
public:
/*
Get the local transaction id. This id is only unique within one server.
It is allocated whenever a new transaction is started.
Can be used to identify events belonging to the same transaction in a
binlog-like stream of events streamed in parallel among multiple
transactions.
*/
uint64_t get_local_trx_id() const { return thd->local_trx_id; };
bool get_is_autocommit() const;
private:
/* The context is the THD. */
THD *thd;
rpl_event_transaction_base(THD *_thd) : thd(_thd) { };
};
/* Transaction start event. */
class rpl_event_transaction_start : public rpl_event_transaction_base
{
};
/* Transaction commit. */
class rpl_event_transaction_commit : public rpl_event_transaction_base
{
public:
/*
The global transaction id is unique cross-server.
It can be used to identify the position from which to start a slave
replicating from a master.
This global ID is only available once the TC manager / primary redundancy
service has decided to commit the transaction. This TC also allocates the
ID and decides the exact semantics (can there be gaps, etc); however the
format is fixed (cluster_id, running_counter).
*/
struct global_transaction_id
{
uint32_t cluster_id;
uint64_t counter;
};
const global_transaction_id *get_global_transaction_id() const;
};
/* Transaction rollback. */
class rpl_event_transaction_rollback : public rpl_event_transaction_base
{
};
/* Base class for statement events. */
class rpl_event_statement_base : public rpl_event_base
{
public:
LEX_STRING get_current_db() const;
};
class rpl_event_statement_start : public rpl_event_statement_base
{
};
class rpl_event_statement_end : public rpl_event_statement_base
{
public:
int get_errorcode() const;
};
class rpl_event_statement_query : public rpl_event_statement_base
{
public:
LEX_STRING get_query_string();
ulong get_sql_mode();
const CHARSET_INFO *get_character_set_client();
const CHARSET_INFO *get_collation_connection();
const CHARSET_INFO *get_collation_server();
const CHARSET_INFO *get_collation_default_db();
/*
Access to relevant flags that affect query execution.
Use as if (ev->get_flags() & (uint32)STMT_FOREIGN_KEY_CHECKS) { ... }
*/
enum flag_bits
{
STMT_FOREIGN_KEY_CHECKS, // @@foreign_key_checks
STMT_UNIQUE_KEY_CHECKS, // @@unique_checks
STMT_AUTO_IS_NULL, // @@sql_auto_is_null
};
uint32_t get_flags();
ulong get_auto_increment_offset();
ulong get_auto_increment_increment();
// And so on for: time zone; day/month names; connection id; LAST_INSERT_ID;
// INSERT_ID; random seed; user variables.
//
// We probably also need get_uses_temporary_table(), get_used_user_vars(),
// get_uses_auto_increment() and so on, so a consumer can get more
// information about what kind of context information a query will need when
// executed on a slave.
};
class rpl_event_statement_load_query : public rpl_event_statement_query
{
};
/*
This event is fired with blocks of data for files read (from server-local
file or client connection) for LOAD DATA.
*/
class rpl_event_statement_load_data_block : public rpl_event_statement_base
{
public:
struct block
{
const uchar *ptr;
size_t size;
};
block get_block() const;
};
/* Base class for row-based replication events. */
class rpl_event_row_base : public rpl_event_base
{
public:
/*
Access to relevant handler extra flags and other flags that affect row
operations.
Use as if (ev->get_flags() & (uint32)ROW_WRITE_CAN_REPLACE) { ... }
*/
enum flag_bits
{
ROW_WRITE_CAN_REPLACE, // HA_EXTRA_WRITE_CAN_REPLACE
ROW_IGNORE_DUP_KEY, // HA_EXTRA_IGNORE_DUP_KEY
ROW_IGNORE_NO_KEY, // HA_EXTRA_IGNORE_NO_KEY
ROW_DISABLE_FOREIGN_KEY_CHECKS, // ! @@foreign_key_checks
ROW_DISABLE_UNIQUE_KEY_CHECKS, // ! @@unique_checks
};
uint32_t get_flags();
/* Access to list of tables modified. */
class table_iterator
{
public:
/* Returns table, NULL after last. */
const TABLE *get_next();
private:
// ...
};
table_iterator get_modified_tables() const;
private:
/* Context used to provide accessors. */
THD *thd;
protected:
rpl_event_row_base(THD *_thd) : thd(_thd) { }
};
class rpl_event_row_write : public rpl_event_row_base
{
public:
const BITMAP *get_write_set() const;
const uchar *get_after_image() const;
};
class rpl_event_row_update : public rpl_event_row_base
{
public:
const BITMAP *get_read_set() const;
const BITMAP *get_write_set() const;
const uchar *get_before_image() const;
const uchar *get_after_image() const;
};
class rpl_event_row_delete : public rpl_event_row_base
{
public:
const BITMAP *get_read_set() const;
const uchar *get_before_image() const;
};
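/*
  Illustration only (not part of the draft): how a consumer might use the
  row-event accessors declared above. The helper name is invented; only
  accessors declared in the draft are used.
*/
static size_t count_modified_tables(const rpl_event_row_write *ev)
{
  size_t n_tables= 0;
  rpl_event_row_base::table_iterator it= ev->get_modified_tables();
  while (it.get_next() != NULL)
    n_tables++;

  /* The write set and after image would be consumed/applied here. */
  const BITMAP *cols= ev->get_write_set();
  const uchar *after= ev->get_after_image();
  (void)cols;
  (void)after;
  return n_tables;
}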
/*
Event consumer callbacks.
An event consumer registers with an event generator to receive event
notifications from that generator.
The consumer has callbacks (in the form of virtual functions) for the
individual event types the consumer is interested in. Only callbacks that
are non-NULL will be invoked. If an event applies to multiple callbacks in a
single callback struct, it will only be passed to the most specific non-NULL
callback (so events never fire more than once per registration).
The lifetime of the memory holding the event is only for the duration of the
callback invocation, unless otherwise noted.
Callbacks return 0 for success or error code (ToDo: does this make sense?).
*/
struct rpl_event_consumer_transaction
{
virtual int trx_start(const rpl_event_transaction_start *) { return 0; }
virtual int trx_commit(const rpl_event_transaction_commit *) { return 0; }
virtual int trx_rollback(const rpl_event_transaction_rollback *) { return 0; }
};
/*
Consuming statement-based events.
The statement event generator is stacked on top of the transaction event
generator, so we can receive those events as well.
*/
struct rpl_event_consumer_statement : public rpl_event_consumer_transaction
{
virtual int stmt_start(const rpl_event_statement_start *) { return 0; }
virtual int stmt_end(const rpl_event_statement_end *) { return 0; }
virtual int stmt_query(const rpl_event_statement_query *) { return 0; }
/* Data for a file used in LOAD DATA [LOCAL] INFILE. */
virtual int stmt_load_data_block(const rpl_event_statement_load_data_block *)
{ return 0; }
/*
These are specific kinds of statements; if specified they override
stmt_query() for the corresponding event.
*/
virtual int stmt_load_query(const rpl_event_statement_load_query *ev)
{ return stmt_query(ev); }
};
/*
Consuming row-based events.
The row event generator is stacked on top of the statement event generator.
*/
struct rpl_event_consumer_row : public rpl_event_consumer_statement
{
virtual int row_write(const rpl_event_row_write *) { return 0; }
virtual int row_update(const rpl_event_row_update *) { return 0; }
virtual int row_delete(const rpl_event_row_delete *) { return 0; }
};
/*
Registration functions.
ToDo: Make a way to de-register.
ToDo: Find a good way for a plugin (eg. PBXT) to publish a generator
registration method.
*/
int rpl_event_transaction_register(const rpl_event_consumer_transaction *cbs);
int rpl_event_statement_register(const rpl_event_consumer_statement *cbs);
int rpl_event_row_register(const rpl_event_consumer_row *cbs);
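To show how the pieces above are intended to fit together, here is a minimal
usage sketch (illustrative only; the consumer class, its counters and the
plugin init function are invented names, and the registration/de-registration
mechanics are still open as noted above):
/* A trivial consumer that counts row writes and transaction commits. */
struct my_counting_consumer : public rpl_event_consumer_row
{
  uint64_t n_row_writes;
  uint64_t n_commits;

  my_counting_consumer() : n_row_writes(0), n_commits(0) { }

  virtual int row_write(const rpl_event_row_write *)
  {
    n_row_writes++;
    return 0;                           /* 0 = success, per the draft. */
  }

  virtual int trx_commit(const rpl_event_transaction_commit *ev)
  {
    /* A real consumer could record ev->get_global_transaction_id() here. */
    (void)ev;
    n_commits++;
    return 0;
  }
};

static my_counting_consumer the_consumer;

/* Called from some (hypothetical) plugin initialisation hook. Registering
   with the row generator also delivers the statement and transaction events
   that the row generator defers to the generators below it. */
static int my_plugin_init()
{
  return rpl_event_row_register(&the_consumer);
}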
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v4.0.0)
[Maria-developers] WL#120 Updated (by Knielsen): Replication API for stacked event generators
by worklog-noreply@askmonty.org 29 Jun '10
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: Replication API for stacked event generators
CREATION DATE..: Mon, 07 Jun 2010, 13:13
SUPERVISOR.....: Knielsen
IMPLEMENTOR....: Knielsen
COPIES TO......:
CATEGORY.......: Server-Sprint
TASK ID........: 120 (http://askmonty.org/worklog/?tid=120)
VERSION........: Server-9.x
STATUS.........: In-Progress
PRIORITY.......: 60
WORKED HOURS...: 2
ESTIMATE.......: 0 (hours remain)
ORIG. ESTIMATE.: 0
PROGRESS NOTES:
-=-=(Knielsen - Tue, 29 Jun 2010, 13:51)=-=-
Status updated.
--- /tmp/wklog.120.old.31179 2010-06-29 13:51:20.000000000 +0000
+++ /tmp/wklog.120.new.31179 2010-06-29 13:51:20.000000000 +0000
@@ -1 +1 @@
-Assigned
+In-Progress
-=-=(Knielsen - Mon, 28 Jun 2010, 07:11)=-=-
High-Level Specification modified.
--- /tmp/wklog.120.old.18416 2010-06-28 07:11:09.000000000 +0000
+++ /tmp/wklog.120.new.18416 2010-06-28 07:11:09.000000000 +0000
@@ -14,10 +14,14 @@
An example of an event consumer is the writing of the binlog on a master.
-Event generators are not really plugins. Rather, there are specific points in
-the server where events are generated. However, a generator can be part of a
-plugin, for example a PBXT engine-level replication event generator would be
-part of the PBXT storage engine plugin.
+Some event generators are not really plugins. Rather, there are specific
+points in the server where events are generated. However, a generator can be
+part of a plugin, for example a PBXT engine-level replication event generator
+would be part of the PBXT storage engine plugin. And for example we could
+write a filter plugin, which would be stacked on top of an existing generator
+and provide the same event types and interfaces, but filtered in some way (for
+example by removing certain events on the master side, or by re-writing events
+in certain ways).
Event consumers on the other hand could be a plugin.
@@ -85,6 +89,9 @@
some is kept as reference to context (eg. THD) only. This however looses most
of the mentioned advantages for materialisation.
+Also, the non-materialising interface should be a good interface on top of
+which to build a materialising interface.
+
The design proposed here aims for as little materialisation as possible.
-=-=(Knielsen - Thu, 24 Jun 2010, 14:29)=-=-
Category updated.
--- /tmp/wklog.120.old.7440 2010-06-24 14:29:38.000000000 +0000
+++ /tmp/wklog.120.new.7440 2010-06-24 14:29:38.000000000 +0000
@@ -1 +1 @@
-Server-RawIdeaBin
+Server-Sprint
-=-=(Knielsen - Thu, 24 Jun 2010, 14:29)=-=-
Status updated.
--- /tmp/wklog.120.old.7440 2010-06-24 14:29:38.000000000 +0000
+++ /tmp/wklog.120.new.7440 2010-06-24 14:29:38.000000000 +0000
@@ -1 +1 @@
-Un-Assigned
+Assigned
-=-=(Knielsen - Thu, 24 Jun 2010, 14:29)=-=-
Low Level Design modified.
--- /tmp/wklog.120.old.7355 2010-06-24 14:29:11.000000000 +0000
+++ /tmp/wklog.120.new.7355 2010-06-24 14:29:11.000000000 +0000
@@ -1 +1,385 @@
+A consumer is implented as a virtual class (interface). There is one virtual
+function for every event that can be received. A consumer would derive from
+the base class and override methods for the events it wants to receive.
+
+There is one consumer interface for each generator. When a generator A is
+stacked on B, the consumer interface for A inherits from the interface for
+B. This way, when A defers an event to B, the consumer for A will receive the
+corresponding event from B.
+
+There are methods for a consumer to register itself to receive events from
+each generator. I still need to find a way for a consumer in one plugin to
+register itself with a generator implemented in another plugin (eg. PBXT
+engine-level replication). I also need to add a way for consumers to
+de-register themselves.
+
+The current design has consumer callbacks return 0 for success and error code
+otherwise. I still need to think more about whether this is useful (ie. what
+is the semantics of returning an error from a consumer callback).
+
+Each event passed to consumers is defined as a class with public accessor
+methods to a private context (which is mostly the THD).
+
+My intension is to make all events passed around const, so that the same event
+can be passed to each of multiple registered consumers (and to emphasise that
+consumers do not have the ability to modify events). It still needs to be seen
+whether that const-ness will be feasible in practise without very heavy
+modification/constification of exiting code.
+
+What follows is a partial draft of a possible definition of the API as
+concrete C++ class definitions.
+
+-----------------------------------------------------------------------
+
+/*
+ Virtual base class for generated replication events.
+
+ This is the parent of events generated from all kinds of generators. Only
+ child classes can be instantiated.
+
+ This class can be used by code that wants to treat events in a generic way,
+ without any knowledge of event details. I still need to decide whether such
+ generic code is sensible.
+*/
+class rpl_event_base
+{
+ /*
+ Maybe we will want the ability to materialise an event to a standard
+ binary format. This could be achieved with a base method like this. The
+ actual materialisation would be implemented in each deriving class. The
+ public methods would provide different interfaces for specifying the
+ buffer or for writing directly into IO_CACHE or file.
+ */
+
+ /* Return 0 on success, -1 on error, -2 on out-of-buffer. */
+ int materialise(uchar *buffer, size_t buflen) const;
+ /*
+ Returns NULL on error or else malloc()ed buffer with materialised event,
+ caller must free().
+ */
+ uchar *materialise() const;
+ /* Same but using passed in memroot. */
+ uchar *materialise(mem_root *memroot) const;
+ /*
+ Materialise to user-supplied writer function (could write directly to file
+ or the like).
+ */
+ int materialise(int (*writer)(uchar *data, size_t len, void *context)) const;
+
+ /*
+ As to for what to do with a materialised event, there are a couple of
+ possibilities.
+
+ One is to have a de_materialise() method somewhere that can construct an
+ rpl_event_base (really a derived class of course) from a buffer or writer
+ function. This would require each accessor function to conditionally read
+ its data from either THD context or buffer (GCC is able to optimise
+ several such conditionals in multiple accessor function calls into one
+ conditional), or we can make all accessors virtual if the performance hit
+ is acceptable.
+
+ Another is to have different classes for accessing events read from
+ materialised event data.
+
+ Also, I still need to think about whether it is at all useful to be able
+ to generically materialise an event at this level. It may be that any
+ binlog/transport will in any case need to undertand more of the format of
+ events, so that such materialisation/transport is better done at a
+ different layer.
+ */
+
+protected:
+ /* Implementation which is the basis for materialise(). */
+ virtual int do_materialise(int (*writer)(uchar *data, size_t len,
+ void *context)) const = 0;
+
+private:
+ /* Virtual base class, private constructor to prevent instantiation. */
+ rpl_event_base();
+};
+
+
+/*
+ These are the event types output from the transaction event generator.
+
+ This generator is not stacked on anything.
+
+ The transaction event generator marks the start and end (commit or rollback)
+ of transactions. It also gives information about whether the transaction was
+ a full transaction or autocommitted statement, whether transactional tables
+ were involved, whether non-transactional tables were involved, and XA
+ information (ToDo).
+*/
+
+/* Base class for transaction events. */
+class rpl_event_transaction_base : public rpl_event_base
+{
+public:
+ /*
+ Get the local transaction id. This idea is only unique within one server.
+ It is allocated whenever a new transaction is started.
+ Can be used to identify events belonging to the same transaction in a
+ binlog-like stream of events streamed in parallel among multiple
+ transactions.
+ */
+ uint64_t get_local_trx_id() const { return thd->local_trx_id; };
+
+ bool get_is_autocommit() const;
+
+private:
+ /* The context is the THD. */
+ THD *thd;
+
+ rpl_event_transaction_base(THD *_thd) : thd(_thd) { };
+};
+
+/* Transaction start event. */
+class rpl_event_transaction_start : public rpl_event_transaction_base
+{
+
+};
+
+/* Transaction commit. */
+class rpl_event_transaction_commit : public rpl_event_transaction_base
+{
+public:
+ /*
+ The global transaction id is unique cross-server.
+
+ It can be used to identify the position from which to start a slave
+ replicating from a master.
+
+ This global ID is only available once the transaction is decided to commit
+ by the TC manager / primary redundancy service. This TC also allocates the
+ ID and decides the exact semantics (can there be gaps, etc); however the
+ format is fixed (cluster_id, running_counter).
+ */
+ struct global_transaction_id
+ {
+ uint32_t cluster_id;
+ uint64_t counter;
+ };
+
+ const global_transaction_id *get_global_transaction_id() const;
+};
+
+/* Transaction rollback. */
+class rpl_event_transaction_rollback : public rpl_event_transaction_base
+{
+
+};
+
+
+/* Base class for statement events. */
+class rpl_event_statement_base : public rpl_event_base
+{
+public:
+ LEX_STRING get_current_db() const;
+};
+
+class rpl_event_statement_start : public rpl_event_statement_base
+{
+
+};
+
+class rpl_event_statement_end : public rpl_event_statement_base
+{
+public:
+ int get_errorcode() const;
+};
+
+class rpl_event_statement_query : public rpl_event_statement_base
+{
+public:
+ LEX_STRING get_query_string();
+ ulong get_sql_mode();
+ const CHARSET_INFO *get_character_set_client();
+ const CHARSET_INFO *get_collation_connection();
+ const CHARSET_INFO *get_collation_server();
+ const CHARSET_INFO *get_collation_default_db();
+
+ /*
+ Access to relevant flags that affect query execution.
+
+ Use as if (ev->get_flags() & (uint32)ROW_FOREIGN_KEY_CHECKS) { ... }
+ */
+ enum flag_bits
+ {
+ STMT_FOREIGN_KEY_CHECKS, // @@foreign_key_checks
+ STMT_UNIQUE_KEY_CHECKS, // @@unique_checks
+ STMT_AUTO_IS_NULL, // @@sql_auto_is_null
+ };
+ uint32_t get_flags();
+
+ ulong get_auto_increment_offset();
+ ulong get_auto_increment_increment();
+
+ // And so on for: time zone; day/month names; connection id; LAST_INSERT_ID;
+ // INSERT_ID; random seed; user variables.
+ //
+ // We probably also need get_uses_temporary_table(), get_used_user_vars(),
+ // get_uses_auto_increment() and so on, so a consumer can get more
+ // information about what kind of context information a query will need when
+ // executed on a slave.
+};
+
+class rpl_event_statement_load_query : public rpl_event_statement_query
+{
+
+};
+
+/*
+ This event is fired with blocks of data for files read (from server-local
+ file or client connection) for LOAD DATA.
+*/
+class rpl_event_statement_load_data_block : public rpl_event_statement_base
+{
+public:
+ struct block
+ {
+ const uchar *ptr;
+ size_t size;
+ };
+ block get_block() const;
+};
+
+/* Base class for row-based replication events. */
+class rpl_event_row_base : public rpl_event_base
+{
+public:
+ /*
+ Access to relevant handler extra flags and other flags that affect row
+ operations.
+
+ Use as if (ev->get_flags() & (uint32)ROW_WRITE_CAN_REPLACE) { ... }
+ */
+ enum flag_bits
+ {
+ ROW_WRITE_CAN_REPLACE, // HA_EXTRA_WRITE_CAN_REPLACE
+ ROW_IGNORE_DUP_KEY, // HA_EXTRA_IGNORE_DUP_KEY
+ ROW_IGNORE_NO_KEY, // HA_EXTRA_IGNORE_NO_KEY
+ ROW_DISABLE_FOREIGN_KEY_CHECKS, // ! @@foreign_key_checks
+ ROW_DISABLE_UNIQUE_KEY_CHECKS, // ! @@unique_checks
+ };
+ uint32_t get_flags();
+
+ /* Access to list of tables modified. */
+ class table_iterator
+ {
+ public:
+ /* Returns table, NULL after last. */
+ const TABLE *get_next();
+ private:
+ // ...
+ };
+ table_iterator get_modified_tables() const;
+
+private:
+ /* Context used to provide accessors. */
+ THD *thd;
+
+protected:
+ rpl_event_row_base(THD *_thd) : thd(_thd) { }
+};
+
+
+class rpl_event_row_write : public rpl_event_row_base
+{
+public:
+ const BITMAP *get_write_set() const;
+ const uchar *get_after_image() const;
+};
+
+class rpl_event_row_update : public rpl_event_row_base
+{
+public:
+ const BITMAP *get_read_set() const;
+ const BITMAP *get_write_set() const;
+ const uchar *get_before_image() const;
+ const uchar *get_after_image() const;
+};
+
+class rpl_event_row_delete : public rpl_event_row_base
+{
+public:
+ const BITMAP *get_read_set() const;
+ const uchar *get_before_image() const;
+};
+
+
+/*
+ Event consumer callbacks.
+
+ An event consumer registers with an event generator to receive event
+ notifications from that generator.
+
+ The consumer has callbacks (in the form of virtual functions) for the
+ individual event types the consumer is interested in. Only callbacks that
+ are non-NULL will be invoked. If an event applies to multiple callbacks in a
+ single callback struct, it will only be passed to the most specific non-NULL
+ callback (so events never fire more than once per registration).
+
+ The lifetime of the memory holding the event is only for the duration of the
+ callback invocation, unless otherwise noted.
+
+ Callbacks return 0 for success or error code (ToDo: does this make sense?).
+*/
+
+struct rpl_event_consumer_transaction
+{
+ virtual int trx_start(const rpl_event_transaction_start *) { return 0; }
+ virtual int trx_commit(const rpl_event_transaction_commit *) { return 0; }
+ virtual int trx_rollback(const rpl_event_transaction_rollback *) { return 0; }
+};
+
+/*
+ Consuming statement-based events.
+
+ The statement event generator is stacked on top of the transaction event
+ generator, so we can receive those events as well.
+*/
+struct rpl_event_consumer_statement : public rpl_event_consumer_transaction
+{
+ virtual int stmt_start(const rpl_event_statement_start *) { return 0; }
+ virtual int stmt_end(const rpl_event_statement_end *) { return 0; }
+
+ virtual int stmt_query(const rpl_event_statement_query *) { return 0; }
+
+ /* Data for a file used in LOAD DATA [LOCAL] INFILE. */
+ virtual int stmt_load_data_block(const rpl_event_statement_load_data_block *)
+ { return 0; }
+
+ /*
+ These are specific kinds of statements; if specified they override
+ consume_stmt_query() for the corresponding event.
+ */
+ virtual int stmt_load_query(const rpl_event_statement_load_query *ev)
+ { return stmt_query(ev); }
+};
+
+/*
+ Consuming row-based events.
+
+ The row event generator is stacked on top of the statement event generator.
+*/
+struct rpl_event_consumer_row : public rpl_event_consumer_statement
+{
+ virtual int row_write(const rpl_event_row_write *) { return 0; }
+ virtual int row_update(const rpl_event_row_update *) { return 0; }
+ virtual int row_delete(const rpl_event_row_delete *) { return 0; }
+};
+
+
+/*
+ Registration functions.
+
+ ToDo: Make a way to de-register.
+
+ ToDo: Find a good way for a plugin (eg. PBXT) to publish a generator
+ registration method.
+*/
+
+int rpl_event_transaction_register(const rpl_event_consumer_transaction *cbs);
+int rpl_event_statement_register(const rpl_event_consumer_statement *cbs);
+int rpl_event_row_register(const rpl_event_consumer_row *cbs);
-=-=(Knielsen - Thu, 24 Jun 2010, 14:28)=-=-
High-Level Specification modified.
--- /tmp/wklog.120.old.7341 2010-06-24 14:28:17.000000000 +0000
+++ /tmp/wklog.120.new.7341 2010-06-24 14:28:17.000000000 +0000
@@ -1 +1,159 @@
+Generators and consumbers
+-------------------------
+
+We have the two concepts:
+
+1. Event _generators_, that produce events describing all changes to data in a
+ server.
+
+2. Event consumers, that receive such events and use them in various ways.
+
+Examples of event generators is execution of SQL statements, which generates
+events like those used for statement-based replication. Another example is
+PBXT engine-level replication.
+
+An example of an event consumer is the writing of the binlog on a master.
+
+Event generators are not really plugins. Rather, there are specific points in
+the server where events are generated. However, a generator can be part of a
+plugin, for example a PBXT engine-level replication event generator would be
+part of the PBXT storage engine plugin.
+
+Event consumers on the other hand could be a plugin.
+
+One generator can be stacked on top of another. This means that a generator on
+top (for example row-based events) will handle some events itself
+(eg. non-deterministic update in mixed-mode binlogging). Other events that it
+does not want to or cannot handle (for example deterministic delete or DDL)
+will be defered to the generator below (for example statement-based events).
+
+
+Materialisation (or not)
+------------------------
+
+A central decision is how to represent events that are generated in the API at
+the point of generation.
+
+I want to avoid making the API require that events are materialised. By
+"Materialised" I mean that all (or most) of the data for the event is written
+into memory in a struct/class used inside the server or serialised in a data
+buffer (byte buffer) in a format suitable for network transport or disk
+storage.
+
+Using a non-materialised event means storing just a reference to appropriate
+context that allows to retrieve all information for the event using
+accessors. Ie. typically this would be based on getting the event information
+from the THD pointer.
+
+Some reasons to avoid using materialised events in the API:
+
+ - Replication events have a _lot_ of detailed context information that can be
+ needed in events: user-defined variables, random seed, character sets,
+ table column names and types, etc. etc. If we make the API based on
+ materialisation, then the initial decision about which context information
+ to include with which events will have to be done in the API, while ideally
+ we want this decision to be done by the individual consumer plugin. There
+ will this be a conflict between what to include (to allow consumers access)
+ and what to exclude (to avoid excessive needless work).
+
+ - Materialising means defining a very specific format, which will tend to
+ make the API less generic and flexible.
+
+ - Unless the materialised format is made _very_ specific (and thus very
+ inflexible), it is unlikely to be directly useful for transport
+ (eg. binlog), so it will need to be re-materialised into a different format
+ anyway, wasting work.
+
+ - If a generator on top handles an event, then we want to avoid wasting work
+ materialising an event in a generator below which would be completely
+ unused. Thus there would be a need for the upper generator to somehow
+ notify the lower generator ahead of event generation time to not fire an
+ event, complicating the API.
+
+Some advantages for materialisation:
+
+ - Using an API based on passing around some well-defined struct event (or
+ byte buffer) will be simpler than the complex class hierarchy proposed here
+ with no requirement for materialisation.
+
+ - Defining a materialised format would allow an easy way to use the same
+ consumer code on a generator that produces events at the source of
+ execution and on a generator that produces events from eg. reading them
+ from an event log.
+
+Note that there can be some middle way, where some data is materialised and
+some is kept as reference to context (eg. THD) only. This however looses most
+of the mentioned advantages for materialisation.
+
+The design proposed here aims for as little materialisation as possible.
+
+
+Default materialisation format
+------------------------------
+
+
+While the proposed API doesn't _require_ materialisation, we can still think
+about providing the _option_ for built-in materialisation. This could be
+useful if such materialisation is made suitable for transport to a different
+server (eg. no endian-dependance etc). If there is a facility for such
+materialisation built-in to the API, it becomes possible to write something
+like a generic binlog plugin or generic network transport plugin. This would
+be really useful for eg. PBXT engine-level replication, as it could be
+implemented without having to re-invent a binlog format.
+
+I added in the proposed API a simple facility to materialise every event as a
+string of bytes. To use this, I still need to add a suitable facility to
+de-materialise the event.
+
+However, it is still an open question whether such a facility will be at all
+useful. It still has some of the problems with materialisation mentioned
+above. And I think it is likely that a good binlog implementation will need
+to do more than just blindly copy opaque events from one endpoint to
+another. For example, it might need different event boundaries (merge and/or
+split events); it might need to augment or modify events, or inject new
+events, etc.
+
+So I think maybe it is better to add such a generic materialisation facility
+on top of the basic event generator API. Such a facility would provide
+materialisation of an replication event stream, not of individual events, so
+would be more flexible in providing a good implementation. It would be
+implemented for all generators. It would separate from both the event
+generator API (so we have flexibility to put a filter class in-between
+generator and materialisation), and could also be separate from the actual
+transport handling stuff like fsync() of binlog files and socket connections
+etc. It would be paired with a corresponding applier API which would handle
+executing events on a slave.
+
+Then we can have a default materialised event format, which is available, but
+not mandatory. So there can still be other formats alongside (like legacy
+MySQL 5.1 binlog event format and maybe Tungsten would have its own format).
+
+
+Encapsulation
+-------------
+
+Another fundamental question about the design is the level of encapsulation
+used for the API.
+
+At the implementation level, a lot of the work is basically to pull out all of
+the needed information from the THD object/context. The API I propose tries to
+_not_ expose the THD to consumers. Instead it provides accessor functions for
+all the bits and pieces relevant to each replication event, while the event
+class itself likely will be more or less just an encapsulated THD.
+
+So an alternative would be to have a generic event that was just (type, THD).
+Then consumers could just pull out whatever information they want from the
+THD. The THD implementation is already exposed to storage engines. This would
+of course greatly reduce the size of the API, eliminating lots of class
+definitions and accessor functions. Though arguably it wouldn't really
+simplify the API, as the complexity would just be in understanding the THD
+class.
+
+Note that we do not have to take any performance hit from using encapsulated
+accessors since compilers can inline them (though if inlining then we do not
+get any ABI stability with respect to THD implemetation).
+
+For now, the API is proposed without exposing the THD class. (Similar
+encapsulation could be added in actual implementation to also not expose TABLE
+and similar classes).
-=-=(Knielsen - Thu, 24 Jun 2010, 12:04)=-=-
Dependency created: 107 now depends on 120
-=-=(Knielsen - Thu, 24 Jun 2010, 11:59)=-=-
High Level Description modified.
--- /tmp/wklog.120.old.516 2010-06-24 11:59:24.000000000 +0000
+++ /tmp/wklog.120.new.516 2010-06-24 11:59:24.000000000 +0000
@@ -11,4 +11,4 @@
Event generators can be stacked, and a generator may defer event generation to
the next one down the stack. For example, the row-level replication event
-generator may defer DLL to the statement-level replication event generator.
+generator may defer DDL to the statement-level replication event generator.
-=-=(Knielsen - Mon, 21 Jun 2010, 08:35)=-=-
Research and design thoughts.
DESCRIPTION:
A part of the replication project, MWL#107.
Events are produced by event Generators. Examples are
- Generation of statement-based replication events
- Generation of row-based events
- Generation of PBXT engine-level replication events
and maybe reading of events from relay log on slave may also be an example of
generating events.
Event generators can be stacked, and a generator may defer event generation to
the next one down the stack. For example, the row-level replication event
generator may defer DDL to the statement-level replication event generator.
HIGH-LEVEL SPECIFICATION:
Generators and consumers
-------------------------
We have the two concepts:
1. Event _generators_, that produce events describing all changes to data in a
server.
2. Event consumers, that receive such events and use them in various ways.
An example of an event generator is the execution of SQL statements, which
generates events like those used for statement-based replication. Another
example is PBXT engine-level replication.
An example of an event consumer is the writing of the binlog on a master.
Some event generators are not really plugins. Rather, there are specific
points in the server where events are generated. However, a generator can be
part of a plugin, for example a PBXT engine-level replication event generator
would be part of the PBXT storage engine plugin. And for example we could
write a filter plugin, which would be stacked on top of an existing generator
and provide the same event types and interfaces, but filtered in some way (for
example by removing certain events on the master side, or by re-writing events
in certain ways).
Event consumers, on the other hand, could be plugins.
One generator can be stacked on top of another. This means that a generator on
top (for example row-based events) will handle some events itself
(eg. non-deterministic update in mixed-mode binlogging). Other events that it
does not want to or cannot handle (for example deterministic delete or DDL)
will be deferred to the generator below (for example statement-based events).
Materialisation (or not)
------------------------
A central decision is how to represent events that are generated in the API at
the point of generation.
I want to avoid making the API require that events are materialised. By
"Materialised" I mean that all (or most) of the data for the event is written
into memory in a struct/class used inside the server or serialised in a data
buffer (byte buffer) in a format suitable for network transport or disk
storage.
Using a non-materialised event means storing just a reference to appropriate
context that allows to retrieve all information for the event using
accessors. Ie. typically this would be based on getting the event information
from the THD pointer.
Some reasons to avoid using materialised events in the API:
- Replication events have a _lot_ of detailed context information that can be
needed in events: user-defined variables, random seed, character sets,
table column names and types, etc. etc. If we make the API based on
materialisation, then the initial decision about which context information
to include with which events will have to be done in the API, while ideally
we want this decision to be done by the individual consumer plugin. There
will thus be a conflict between what to include (to allow consumers access)
and what to exclude (to avoid excessive needless work).
- Materialising means defining a very specific format, which will tend to
make the API less generic and flexible.
- Unless the materialised format is made _very_ specific (and thus very
inflexible), it is unlikely to be directly useful for transport
(eg. binlog), so it will need to be re-materialised into a different format
anyway, wasting work.
- If a generator on top handles an event, then we want to avoid wasting work
materialising an event in a generator below which would be completely
unused. Thus there would be a need for the upper generator to somehow
notify the lower generator ahead of event generation time to not fire an
event, complicating the API.
Some advantages for materialisation:
- Using an API based on passing around some well-defined struct event (or
byte buffer) will be simpler than the complex class hierarchy proposed here
with no requirement for materialisation.
- Defining a materialised format would allow an easy way to use the same
consumer code on a generator that produces events at the source of
execution and on a generator that produces events from eg. reading them
from an event log.
Note that there can be some middle way, where some data is materialised and
some is kept as reference to context (eg. THD) only. This however loses most
of the mentioned advantages for materialisation.
Also, the non-materialising interface should be a good interface on top of
which to build a materialising interface.
The design proposed here aims for as little materialisation as possible.
Default materialisation format
------------------------------
While the proposed API doesn't _require_ materialisation, we can still think
about providing the _option_ for built-in materialisation. This could be
useful if such materialisation is made suitable for transport to a different
server (eg. no endian-dependence etc). If there is a facility for such
materialisation built-in to the API, it becomes possible to write something
like a generic binlog plugin or generic network transport plugin. This would
be really useful for eg. PBXT engine-level replication, as it could be
implemented without having to re-invent a binlog format.
I added in the proposed API a simple facility to materialise every event as a
string of bytes. To use this, I still need to add a suitable facility to
de-materialise the event.
However, it is still an open question whether such a facility will be at all
useful. It still has some of the problems with materialisation mentioned
above. And I think it is likely that a good binlog implementation will need
to do more than just blindly copy opaque events from one endpoint to
another. For example, it might need different event boundaries (merge and/or
split events); it might need to augment or modify events, or inject new
events, etc.
So I think maybe it is better to add such a generic materialisation facility
on top of the basic event generator API. Such a facility would provide
materialisation of a replication event stream, not of individual events, so
would be more flexible in providing a good implementation. It would be
implemented for all generators. It would be separate from both the event
generator API (so we have flexibility to put a filter class in-between
generator and materialisation), and could also be separate from the actual
transport handling stuff like fsync() of binlog files and socket connections
etc. It would be paired with a corresponding applier API which would handle
executing events on a slave.
Then we can have a default materialised event format, which is available, but
not mandatory. So there can still be other formats alongside (like legacy
MySQL 5.1 binlog event format and maybe Tungsten would have its own format).
Encapsulation
-------------
Another fundamental question about the design is the level of encapsulation
used for the API.
At the implementation level, a lot of the work is basically to pull out all of
the needed information from the THD object/context. The API I propose tries to
_not_ expose the THD to consumers. Instead it provides accessor functions for
all the bits and pieces relevant to each replication event, while the event
class itself likely will be more or less just an encapsulated THD.
So an alternative would be to have a generic event that was just (type, THD).
Then consumers could just pull out whatever information they want from the
THD. The THD implementation is already exposed to storage engines. This would
of course greatly reduce the size of the API, eliminating lots of class
definitions and accessor functions. Though arguably it wouldn't really
simplify the API, as the complexity would just be in understanding the THD
class.
Note that we do not have to take any performance hit from using encapsulated
accessors since compilers can inline them (though with inlining we do not
get any ABI stability with respect to the THD implementation).
For now, the API is proposed without exposing the THD class. (Similar
encapsulation could be added in actual implementation to also not expose TABLE
and similar classes).
LOW-LEVEL DESIGN:
A consumer is implemented as a virtual class (interface). There is one virtual
function for every event that can be received. A consumer would derive from
the base class and override methods for the events it wants to receive.
There is one consumer interface for each generator. When a generator A is
stacked on B, the consumer interface for A inherits from the interface for
B. This way, when A defers an event to B, the consumer for A will receive the
corresponding event from B.
There are methods for a consumer to register itself to receive events from
each generator. I still need to find a way for a consumer in one plugin to
register itself with a generator implemented in another plugin (eg. PBXT
engine-level replication). I also need to add a way for consumers to
de-register themselves.
The current design has consumer callbacks return 0 for success and an error
code otherwise. I still need to think more about whether this is useful
(ie. what the semantics of returning an error from a consumer callback should be).
Each event passed to consumers is defined as a class with public accessor
methods to a private context (which is mostly the THD).
My intention is to make all events passed around const, so that the same event
can be passed to each of multiple registered consumers (and to emphasise that
consumers do not have the ability to modify events). It remains to be seen
whether that const-ness will be feasible in practice without very heavy
modification/constification of existing code.
What follows is a partial draft of a possible definition of the API as
concrete C++ class definitions.
-----------------------------------------------------------------------
/*
Virtual base class for generated replication events.
This is the parent of events generated from all kinds of generators. Only
child classes can be instantiated.
This class can be used by code that wants to treat events in a generic way,
without any knowledge of event details. I still need to decide whether such
generic code is sensible.
*/
class rpl_event_base
{
/*
Maybe we will want the ability to materialise an event to a standard
binary format. This could be achieved with a base method like this. The
actual materialisation would be implemented in each deriving class. The
public methods would provide different interfaces for specifying the
buffer or for writing directly into IO_CACHE or file.
*/
/* Return 0 on success, -1 on error, -2 on out-of-buffer. */
int materialise(uchar *buffer, size_t buflen) const;
/*
Returns NULL on error or else malloc()ed buffer with materialised event,
caller must free().
*/
uchar *materialise() const;
/* Same but using passed in memroot. */
uchar *materialise(mem_root *memroot) const;
/*
Materialise to user-supplied writer function (could write directly to file
or the like).
*/
int materialise(int (*writer)(uchar *data, size_t len, void *context)) const;
/*
As for what to do with a materialised event, there are a couple of
possibilities.
One is to have a de_materialise() method somewhere that can construct an
rpl_event_base (really a derived class of course) from a buffer or writer
function. This would require each accessor function to conditionally read
its data from either THD context or buffer (GCC is able to optimise
several such conditionals in multiple accessor function calls into one
conditional), or we can make all accessors virtual if the performance hit
is acceptable.
Another is to have different classes for accessing events read from
materialised event data.
Also, I still need to think about whether it is at all useful to be able
to generically materialise an event at this level. It may be that any
binlog/transport will in any case need to understand more of the format of
events, so that such materialisation/transport is better done at a
different layer.
*/
protected:
/* Implementation which is the basis for materialise(). */
virtual int do_materialise(int (*writer)(uchar *data, size_t len,
void *context)) const = 0;
private:
/* Virtual base class, private constructor to prevent instantiation. */
rpl_event_base();
};
/*
These are the event types output from the transaction event generator.
This generator is not stacked on anything.
The transaction event generator marks the start and end (commit or rollback)
of transactions. It also gives information about whether the transaction was
a full transaction or autocommitted statement, whether transactional tables
were involved, whether non-transactional tables were involved, and XA
information (ToDo).
*/
/* Base class for transaction events. */
class rpl_event_transaction_base : public rpl_event_base
{
public:
/*
Get the local transaction id. This id is only unique within one server.
It is allocated whenever a new transaction is started.
Can be used to identify events belonging to the same transaction in a
binlog-like stream of events streamed in parallel among multiple
transactions.
*/
uint64_t get_local_trx_id() const { return thd->local_trx_id; };
bool get_is_autocommit() const;
private:
/* The context is the THD. */
THD *thd;
rpl_event_transaction_base(THD *_thd) : thd(_thd) { };
};
/* Transaction start event. */
class rpl_event_transaction_start : public rpl_event_transaction_base
{
};
/* Transaction commit. */
class rpl_event_transaction_commit : public rpl_event_transaction_base
{
public:
/*
The global transaction id is unique cross-server.
It can be used to identify the position from which to start a slave
replicating from a master.
This global ID is only available once the TC manager / primary redundancy
service has decided to commit the transaction. This TC also allocates the
ID and decides the exact semantics (can there be gaps, etc); however the
format is fixed (cluster_id, running_counter).
*/
struct global_transaction_id
{
uint32_t cluster_id;
uint64_t counter;
};
const global_transaction_id *get_global_transaction_id() const;
};
/* Transaction rollback. */
class rpl_event_transaction_rollback : public rpl_event_transaction_base
{
};
/* Base class for statement events. */
class rpl_event_statement_base : public rpl_event_base
{
public:
LEX_STRING get_current_db() const;
};
class rpl_event_statement_start : public rpl_event_statement_base
{
};
class rpl_event_statement_end : public rpl_event_statement_base
{
public:
int get_errorcode() const;
};
class rpl_event_statement_query : public rpl_event_statement_base
{
public:
LEX_STRING get_query_string();
ulong get_sql_mode();
const CHARSET_INFO *get_character_set_client();
const CHARSET_INFO *get_collation_connection();
const CHARSET_INFO *get_collation_server();
const CHARSET_INFO *get_collation_default_db();
/*
Access to relevant flags that affect query execution.
Use as if (ev->get_flags() & (uint32)STMT_FOREIGN_KEY_CHECKS) { ... }
*/
enum flag_bits
{
STMT_FOREIGN_KEY_CHECKS, // @@foreign_key_checks
STMT_UNIQUE_KEY_CHECKS, // @@unique_checks
STMT_AUTO_IS_NULL, // @@sql_auto_is_null
};
uint32_t get_flags();
ulong get_auto_increment_offset();
ulong get_auto_increment_increment();
// And so on for: time zone; day/month names; connection id; LAST_INSERT_ID;
// INSERT_ID; random seed; user variables.
//
// We probably also need get_uses_temporary_table(), get_used_user_vars(),
// get_uses_auto_increment() and so on, so a consumer can get more
// information about what kind of context information a query will need when
// executed on a slave.
};
class rpl_event_statement_load_query : public rpl_event_statement_query
{
};
/*
This event is fired with blocks of data for files read (from server-local
file or client connection) for LOAD DATA.
*/
class rpl_event_statement_load_data_block : public rpl_event_statement_base
{
public:
struct block
{
const uchar *ptr;
size_t size;
};
block get_block() const;
};
/* Base class for row-based replication events. */
class rpl_event_row_base : public rpl_event_base
{
public:
/*
Access to relevant handler extra flags and other flags that affect row
operations.
Use as if (ev->get_flags() & (uint32)ROW_WRITE_CAN_REPLACE) { ... }
*/
enum flag_bits
{
ROW_WRITE_CAN_REPLACE, // HA_EXTRA_WRITE_CAN_REPLACE
ROW_IGNORE_DUP_KEY, // HA_EXTRA_IGNORE_DUP_KEY
ROW_IGNORE_NO_KEY, // HA_EXTRA_IGNORE_NO_KEY
ROW_DISABLE_FOREIGN_KEY_CHECKS, // ! @@foreign_key_checks
ROW_DISABLE_UNIQUE_KEY_CHECKS, // ! @@unique_checks
};
uint32_t get_flags();
/* Access to list of tables modified. */
class table_iterator
{
public:
/* Returns table, NULL after last. */
const TABLE *get_next();
private:
// ...
};
table_iterator get_modified_tables() const;
private:
/* Context used to provide accessors. */
THD *thd;
protected:
rpl_event_row_base(THD *_thd) : thd(_thd) { }
};
class rpl_event_row_write : public rpl_event_row_base
{
public:
const BITMAP *get_write_set() const;
const uchar *get_after_image() const;
};
class rpl_event_row_update : public rpl_event_row_base
{
public:
const BITMAP *get_read_set() const;
const BITMAP *get_write_set() const;
const uchar *get_before_image() const;
const uchar *get_after_image() const;
};
class rpl_event_row_delete : public rpl_event_row_base
{
public:
const BITMAP *get_read_set() const;
const uchar *get_before_image() const;
};
/*
Event consumer callbacks.
An event consumer registers with an event generator to receive event
notifications from that generator.
The consumer has callbacks (in the form of virtual functions) for the
individual event types the consumer is interested in. Only callbacks that
are non-NULL will be invoked. If an event applies to multiple callbacks in a
single callback struct, it will only be passed to the most specific non-NULL
callback (so events never fire more than once per registration).
The lifetime of the memory holding the event is only for the duration of the
callback invocation, unless otherwise noted.
Callbacks return 0 for success or error code (ToDo: does this make sense?).
*/
struct rpl_event_consumer_transaction
{
virtual int trx_start(const rpl_event_transaction_start *) { return 0; }
virtual int trx_commit(const rpl_event_transaction_commit *) { return 0; }
virtual int trx_rollback(const rpl_event_transaction_rollback *) { return 0; }
};
/*
Consuming statement-based events.
The statement event generator is stacked on top of the transaction event
generator, so we can receive those events as well.
*/
struct rpl_event_consumer_statement : public rpl_event_consumer_transaction
{
virtual int stmt_start(const rpl_event_statement_start *) { return 0; }
virtual int stmt_end(const rpl_event_statement_end *) { return 0; }
virtual int stmt_query(const rpl_event_statement_query *) { return 0; }
/* Data for a file used in LOAD DATA [LOCAL] INFILE. */
virtual int stmt_load_data_block(const rpl_event_statement_load_data_block *)
{ return 0; }
/*
These are specific kinds of statements; if specified they override
stmt_query() for the corresponding event.
*/
virtual int stmt_load_query(const rpl_event_statement_load_query *ev)
{ return stmt_query(ev); }
};
/*
Consuming row-based events.
The row event generator is stacked on top of the statement event generator.
*/
struct rpl_event_consumer_row : public rpl_event_consumer_statement
{
virtual int row_write(const rpl_event_row_write *) { return 0; }
virtual int row_update(const rpl_event_row_update *) { return 0; }
virtual int row_delete(const rpl_event_row_delete *) { return 0; }
};
/*
Registration functions.
ToDo: Make a way to de-register.
ToDo: Find a good way for a plugin (eg. PBXT) to publish a generator
registration method.
*/
int rpl_event_transaction_register(const rpl_event_consumer_transaction *cbs);
int rpl_event_statement_register(const rpl_event_consumer_statement *cbs);
int rpl_event_row_register(const rpl_event_consumer_row *cbs);
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v4.0.0)
[Maria-developers] WL#107 Updated (by Sergei): New replication APIs
by worklog-noreply@askmonty.org 29 Jun '10
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: New replication APIs
CREATION DATE..: Mon, 15 Mar 2010, 13:55
SUPERVISOR.....: Knielsen
IMPLEMENTOR....: Knielsen
COPIES TO......: Sergei
CATEGORY.......: Server-Sprint
TASK ID........: 107 (http://askmonty.org/worklog/?tid=107)
VERSION........: Server-9.x
STATUS.........: In-Progress
PRIORITY.......: 60
WORKED HOURS...: 69
ESTIMATE.......: 0 (hours remain)
ORIG. ESTIMATE.: 0
PROGRESS NOTES:
-=-=(Sergei - Tue, 29 Jun 2010, 13:50)=-=-
Status updated.
--- /tmp/wklog.107.old.31164 2010-06-29 13:50:15.000000000 +0000
+++ /tmp/wklog.107.new.31164 2010-06-29 13:50:15.000000000 +0000
@@ -1 +1 @@
-Un-Assigned
+In-Progress
-=-=(Knielsen - Thu, 24 Jun 2010, 12:04)=-=-
Dependency created: 107 now depends on 120
-=-=(Knielsen - Mon, 21 Jun 2010, 08:36)=-=-
Research and design thoughts.
-=-=(Knielsen - Mon, 07 Jun 2010, 12:11)=-=-
High Level Description modified.
--- /tmp/wklog.107.old.31097 2010-06-07 12:11:57.000000000 +0000
+++ /tmp/wklog.107.new.31097 2010-06-07 12:11:57.000000000 +0000
@@ -7,3 +7,6 @@
https://lists.launchpad.net/maria-developers/msg01998.html
+Wiki page for the project:
+
+ http://askmonty.org/wiki/ReplicationProject
-=-=(Knielsen - Mon, 29 Mar 2010, 07:33)=-=-
Research and design discussions: Galera, 2pc/XA, group commit, multi-engine transactions.
-=-=(Knielsen - Wed, 24 Mar 2010, 10:39)=-=-
Design discussions
-=-=(Knielsen - Mon, 15 Mar 2010, 14:28)=-=-
Research into the problem, and discussions on phone/mailing list
-=-=(Guest - Mon, 15 Mar 2010, 14:18)=-=-
High-Level Specification modified.
--- /tmp/wklog.107.old.9086 2010-03-15 14:18:18.000000000 +0000
+++ /tmp/wklog.107.new.9086 2010-03-15 14:18:18.000000000 +0000
@@ -1 +1,43 @@
+Current ideas/status after discussions on the mailing list:
+
+ - Implement a set of plugin APIs and use them to move all of the existing
+ MySQL replication into a (set of) plugins.
+
+ - Design the APIs so that they can support full MySQL replication, but also
+ so that they do not hardcode assumptions about how this replication
+ implementation is done, and so that they will be suitable for other types of
+ replication (Tungsten, Galera, parallel replication, ...).
+
+ - APIs need to include the concept of a global transaction ID. Need to
+ determine the extent to which the semantics of such ID will be defined
+ by the API, and to which extend it will be defined by the plugin
+ implementations.
+
+ - APIs should properly support reliable crash-recovery with decent
+ performance (eg. not require multiple mandatory fsync()s per commit, and
+ not make group commit impossible).
+
+ - Would be nice if the API provided facilities for implementing good
+ consistency checking support (mainly checking master tables against slave
+ tables is hard here I think, but also applying wrong binlog data and
+ individual event checksums).
+
+
+Steps to make this more concrete:
+
+ - Investigate the current MySQL replication, and list all of the places where
+ a plugin implementation will need to connect/hook into the MySQL server.
+ * handler::{write,update,delete}_row()
+ * Statement execution
+ * Transaction start/commit
+ * Table open
+ * Query safe/not/safe for statement based replication
+ * Statement-based logging details (user variables, random seed, etc.)
+ * ...
+
+ - Use this list to make an initial sketch of the set of APIs we need.
+
+ - Use the list to determine the feasibility of this project and the level of
+ detail in the API needed to support a full replication implementation as a
+ plugin.
-=-=(Sergei - Mon, 15 Mar 2010, 14:13)=-=-
Observers changed: Sergei
DESCRIPTION:
This is a top-level task for the project of designing a new set of replication
APIs for MariaDB.
This task is for the initial discussion of what to do and where to focus.
The project is started in this email thread:
https://lists.launchpad.net/maria-developers/msg01998.html
Wiki page for the project:
http://askmonty.org/wiki/ReplicationProject
HIGH-LEVEL SPECIFICATION:
Current ideas/status after discussions on the mailing list:
- Implement a set of plugin APIs and use them to move all of the existing
MySQL replication into a (set of) plugins.
- Design the APIs so that they can support full MySQL replication, but also
so that they do not hardcode assumptions about how this replication
implementation is done, and so that they will be suitable for other types of
replication (Tungsten, Galera, parallel replication, ...).
- APIs need to include the concept of a global transaction ID. Need to
determine the extent to which the semantics of such an ID will be defined
by the API, and the extent to which it will be defined by the plugin
implementations.
- APIs should properly support reliable crash-recovery with decent
performance (e.g. not require multiple mandatory fsync()s per commit, and
not make group commit impossible).
- It would be nice if the API provided facilities for implementing good
consistency checking support (mainly comparing master tables against slave
tables, which I think is the hard part here, but also detecting application
of wrong binlog data, and individual event checksums).
Steps to make this more concrete:
- Investigate the current MySQL replication, and list all of the places where
a plugin implementation will need to connect/hook into the MySQL server
(an illustrative sketch of such hook points as a plugin interface follows
at the end of this specification).
* handler::{write,update,delete}_row()
* Statement execution
* Transaction start/commit
* Table open
* Query safe/not safe for statement-based replication
* Statement-based logging details (user variables, random seed, etc.)
* ...
- Use this list to make an initial sketch of the set of APIs we need.
- Use the list to determine the feasibility of this project and the level of
detail in the API needed to support a full replication implementation as a
plugin.
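Purely as an illustration of the kind of plugin interface this investigation
could lead to, here is a minimal C++ sketch of an observer-style hook API
carrying a global transaction ID. All names here (rpl_gtid,
rpl_event_observer, register_rpl_event_observer) are invented for the sketch
and are not existing server symbols; the real API can only be designed after
the investigation listed above.

  #include <stddef.h>
  #include <stdint.h>

  /* Hypothetical global transaction ID: originating server plus a
     per-server monotonically increasing sequence number. */
  struct rpl_gtid
  {
    uint32_t server_id;
    uint64_t seq_no;
  };

  /*
    Hypothetical observer interface a replication plugin would implement;
    each method corresponds to one of the hook points listed above.
    Return 0 for success, non-zero to report an error to the server.
  */
  class rpl_event_observer
  {
  public:
    virtual ~rpl_event_observer() {}

    /* Row-level hooks, called from handler::{write,update,delete}_row(). */
    virtual int row_write(const char *db, const char *table,
                          const uint8_t *record, size_t length)= 0;
    virtual int row_update(const char *db, const char *table,
                           const uint8_t *before, const uint8_t *after,
                           size_t length)= 0;
    virtual int row_delete(const char *db, const char *table,
                           const uint8_t *record, size_t length)= 0;

    /* Statement-level hook, with the server's verdict on whether the
       statement is safe for statement-based replication. */
    virtual int statement_executed(const char *query, size_t query_length,
                                   bool safe_for_statement_logging)= 0;

    /* Table open, so a plugin can capture metadata for row events. */
    virtual int table_opened(const char *db, const char *table)= 0;

    /* Transaction boundaries; commit receives the assigned global ID. */
    virtual int trans_start()= 0;
    virtual int trans_commit(const rpl_gtid &gtid)= 0;
    virtual int trans_rollback()= 0;
  };

  /* The server would expose a registration call along these lines: */
  int register_rpl_event_observer(rpl_event_observer *observer);

A row-based implementation would mainly use the row_* and trans_* hooks,
while a statement-based one would rely on statement_executed() and the
statement logging details; the point of the sketch is only to show that one
observer interface can cover both.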
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v4.0.0)

[Maria-developers] WL#107 Updated (by Sergei): New replication APIs
by worklog-noreply@askmonty.org 29 Jun '10
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: New replication APIs
CREATION DATE..: Mon, 15 Mar 2010, 13:55
SUPERVISOR.....: Knielsen
IMPLEMENTOR....: Knielsen
COPIES TO......: Sergei
CATEGORY.......: Server-Sprint
TASK ID........: 107 (http://askmonty.org/worklog/?tid=107)
VERSION........: Server-9.x
STATUS.........: In-Progress
PRIORITY.......: 60
WORKED HOURS...: 69
ESTIMATE.......: 0 (hours remain)
ORIG. ESTIMATE.: 0
PROGRESS NOTES:
-=-=(Sergei - Tue, 29 Jun 2010, 13:50)=-=-
Status updated.
--- /tmp/wklog.107.old.31164 2010-06-29 13:50:15.000000000 +0000
+++ /tmp/wklog.107.new.31164 2010-06-29 13:50:15.000000000 +0000
@@ -1 +1 @@
-Un-Assigned
+In-Progress
-=-=(Knielsen - Thu, 24 Jun 2010, 12:04)=-=-
Dependency created: 107 now depends on 120
-=-=(Knielsen - Mon, 21 Jun 2010, 08:36)=-=-
Research and design thoughts.
-=-=(Knielsen - Mon, 07 Jun 2010, 12:11)=-=-
High Level Description modified.
--- /tmp/wklog.107.old.31097 2010-06-07 12:11:57.000000000 +0000
+++ /tmp/wklog.107.new.31097 2010-06-07 12:11:57.000000000 +0000
@@ -7,3 +7,6 @@
https://lists.launchpad.net/maria-developers/msg01998.html
+Wiki page for the project:
+
+ http://askmonty.org/wiki/ReplicationProject
-=-=(Knielsen - Mon, 29 Mar 2010, 07:33)=-=-
Research and design discussions: Galera, 2pc/XA, group commit, multi-engine transactions.
-=-=(Knielsen - Wed, 24 Mar 2010, 10:39)=-=-
Design discussions
-=-=(Knielsen - Mon, 15 Mar 2010, 14:28)=-=-
Research into the problem, and discussions on phone/mailing list
-=-=(Guest - Mon, 15 Mar 2010, 14:18)=-=-
High-Level Specification modified.
--- /tmp/wklog.107.old.9086 2010-03-15 14:18:18.000000000 +0000
+++ /tmp/wklog.107.new.9086 2010-03-15 14:18:18.000000000 +0000
@@ -1 +1,43 @@
+Current ideas/status after discussions on the mailing list:
+
+ - Implement a set of plugin APIs and use them to move all of the existing
+ MySQL replication into a (set of) plugins.
+
+ - Design the APIs so that they can support full MySQL replication, but also
+ so that they do not hardcode assumptions about how this replication
+ implementation is done, and so that they will be suitable for other types of
+ replication (Tungsten, Galera, parallel replication, ...).
+
+ - APIs need to include the concept of a global transaction ID. Need to
+ determine the extent to which the semantics of such ID will be defined
+ by the API, and to which extend it will be defined by the plugin
+ implementations.
+
+ - APIs should properly support reliable crash-recovery with decent
+ performance (eg. not require multiple mandatory fsync()s per commit, and
+ not make group commit impossible).
+
+ - Would be nice if the API provided facilities for implementing good
+ consistency checking support (mainly checking master tables against slave
+ tables is hard here I think, but also applying wrong binlog data and
+ individual event checksums).
+
+
+Steps to make this more concrete:
+
+ - Investigate the current MySQL replication, and list all of the places where
+ a plugin implementation will need to connect/hook into the MySQL server.
+ * handler::{write,update,delete}_row()
+ * Statement execution
+ * Transaction start/commit
+ * Table open
+ * Query safe/not/safe for statement based replication
+ * Statement-based logging details (user variables, random seed, etc.)
+ * ...
+
+ - Use this list to make an initial sketch of the set of APIs we need.
+
+ - Use the list to determine the feasibility of this project and the level of
+ detail in the API needed to support a full replication implementation as a
+ plugin.
-=-=(Sergei - Mon, 15 Mar 2010, 14:13)=-=-
Observers changed: Sergei
DESCRIPTION:
This is a top-level task for the project of designing a new set of replication
APIs for MariaDB.
This task is for the initial discussion of what to do and where to focus.
The project was started in this email thread:
https://lists.launchpad.net/maria-developers/msg01998.html
Wiki page for the project:
http://askmonty.org/wiki/ReplicationProject
HIGH-LEVEL SPECIFICATION:
Current ideas/status after discussions on the mailing list:
- Implement a set of plugin APIs and use them to move all of the existing
MySQL replication into a (set of) plugins.
- Design the APIs so that they can support full MySQL replication, but also
so that they do not hardcode assumptions about how this replication
implementation is done, and so that they will be suitable for other types of
replication (Tungsten, Galera, parallel replication, ...).
- APIs need to include the concept of a global transaction ID. Need to
determine the extent to which the semantics of such an ID will be defined
by the API, and the extent to which it will be defined by the plugin
implementations.
- APIs should properly support reliable crash-recovery with decent
performance (e.g. not require multiple mandatory fsync()s per commit, and
not make group commit impossible); a sketch of a group-commit-friendly log
write follows at the end of this specification.
- It would be nice if the API provided facilities for implementing good
consistency checking support (mainly comparing master tables against slave
tables, which I think is the hard part here, but also detecting application
of wrong binlog data, and individual event checksums).
Steps to make this more concrete:
- Investigate the current MySQL replication, and list all of the places where
a plugin implementation will need to connect/hook into the MySQL server.
* handler::{write,update,delete}_row()
* Statement execution
* Transaction start/commit
* Table open
* Query safe/not safe for statement-based replication
* Statement-based logging details (user variables, random seed, etc.)
* ...
- Use this list to make an initial sketch of the set of APIs we need.
- Use the list to determine the feasibility of this project and the level of
detail in the API needed to support a full replication implementation as a
plugin.
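To make the group commit requirement concrete, below is a minimal C++
sketch, under the assumption that a log-writing plugin serializes each
transaction into a string before commit: one "leader" thread writes and
fsync()s a whole batch of queued commits while the other committing threads
wait, so the cost of a durable write is shared by the group. The class and
member names (group_commit_log, write_and_fsync, ...) are invented for this
sketch and are not server or API symbols.

  #include <condition_variable>
  #include <mutex>
  #include <stdint.h>
  #include <string>
  #include <vector>

  /* Hypothetical group-commit queue inside a binlog-like plugin. */
  class group_commit_log
  {
    std::mutex mtx;
    std::condition_variable cv;
    std::vector<std::string> pending;  // serialized, not yet durable entries
    bool flush_in_progress= false;
    uint64_t next_seq= 0;              // sequence of the last enqueued entry
    uint64_t flushed_upto= 0;          // sequence of the last durable entry

  public:
    /* Called once per committing transaction with its serialized entry. */
    void commit(std::string entry)
    {
      std::unique_lock<std::mutex> lk(mtx);
      uint64_t my_seq= ++next_seq;
      pending.push_back(std::move(entry));

      if (!flush_in_progress)
      {
        /* Leader: flush every queued entry, one fsync() per batch. */
        flush_in_progress= true;
        while (!pending.empty())
        {
          std::vector<std::string> batch;
          batch.swap(pending);
          uint64_t batch_end= next_seq; // highest sequence in this batch
          lk.unlock();
          write_and_fsync(batch);       // one fsync() covers N transactions
          lk.lock();
          flushed_upto= batch_end;
          cv.notify_all();
        }
        flush_in_progress= false;
      }
      else
      {
        /* Follower: wait until a leader has made our entry durable. */
        cv.wait(lk, [&] { return flushed_upto >= my_seq; });
      }
    }

  private:
    /* Stand-in for appending the batch to the log and calling fsync(). */
    void write_and_fsync(const std::vector<std::string> &batch)
    {
      (void) batch;
    }
  };

The only property the replication API needs to guarantee is that nothing in
it forces a plugin to issue its own fsync() once per transaction; how (or
whether) a plugin batches durable writes stays the plugin's own business.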
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v4.0.0)
This patch installs all the files that were missing from the installer
package. Now, the installer has the same set of files as the zip file.
Diff'ed against the current 5.1 tree.
Bo Thorsen.
Monty Program AB.
--
MariaDB: MySQL replacement
Community developed. Feature enhanced. Backward compatible.