[Maria-developers] WL#85 Updated (by Sergei): Partitioned Key Cache for MyISAM
by worklog-noreply@askmonty.org 29 Jun '10
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: Partitioned Key Cache for MyISAM
CREATION DATE..: Sun, 14 Feb 2010, 00:10
SUPERVISOR.....: Monty
IMPLEMENTOR....:
COPIES TO......: Igor, Monty, Sergei
CATEGORY.......: Server-Sprint
TASK ID........: 85 (http://askmonty.org/worklog/?tid=85)
VERSION........: Server-5.2
STATUS.........: Complete
PRIORITY.......: 60
WORKED HOURS...: 0
ESTIMATE.......: 80 (hours remain)
ORIG. ESTIMATE.: 80
PROGRESS NOTES:
-=-=(Sergei - Tue, 29 Jun 2010, 14:05)=-=-
Status updated.
--- /tmp/wklog.85.old.32131 2010-06-29 14:05:44.000000000 +0000
+++ /tmp/wklog.85.new.32131 2010-06-29 14:05:44.000000000 +0000
@@ -1 +1 @@
-Assigned
+Complete
-=-=(Igor - Tue, 16 Mar 2010, 19:34)=-=-
High Level Description modified.
--- /tmp/wklog.85.old.22371 2010-03-16 19:34:33.000000000 +0000
+++ /tmp/wklog.85.new.22371 2010-03-16 19:34:33.000000000 +0000
@@ -15,4 +15,5 @@
the chances for threads not compete for the same key cache lock better.
The idea and the original of the partitioned key cache was provided by one of
-our external contributers.
+our external contributers (see the attached file segmented_keycache_v2.diff with
+the original patch from the contributor).
-=-=(Igor - Sun, 14 Feb 2010, 00:15)=-=-
Category updated.
--- /tmp/wklog.85.old.9810 2010-02-13 22:15:43.000000000 +0000
+++ /tmp/wklog.85.new.9810 2010-02-13 22:15:43.000000000 +0000
@@ -1 +1 @@
-Server-BackLog
+Server-Sprint
-=-=(Igor - Sun, 14 Feb 2010, 00:15)=-=-
Version updated.
--- /tmp/wklog.85.old.9810 2010-02-13 22:15:43.000000000 +0000
+++ /tmp/wklog.85.new.9810 2010-02-13 22:15:43.000000000 +0000
@@ -1 +1 @@
-Benchmarks-3.0
+Server-5.2
-=-=(Igor - Sun, 14 Feb 2010, 00:12)=-=-
New attachment: 'segmented_keycache_v2.diff'
DESCRIPTION:
A partitioned key cache is a collection of structures for regular MyISAM key
caches called key cache partitions. Any page from a file can be placed into a
buffer of only one partition. The number of the partition is calculated from the
file number and the position of the page in the file, and it is always the same
for the page. The function that maps pages into partitions takes care of even
distribution of pages among partitions.
A partitioned key cache mitigates one of the major problems of the simple key
cache: thread contention for the key cache lock (mutex). Every call of a key
cache interface function must acquire this lock, so threads compete for it even
when they have acquired shared locks for the file and the pages they want to
read are already in the key cache buffers. When working with a partitioned key
cache, any key cache interface function that needs only one page has to acquire
the key cache lock only for the partition the page is ascribed to. This reduces
the chances that threads compete for the same key cache lock.
The idea and the original patch for the partitioned key cache were provided by
one of our external contributors (see the attached file segmented_keycache_v2.diff
with the original patch from the contributor).
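As a usage sketch only (not part of the worklog): once the partitioned key
cache is in place, a dedicated cache could be created and assigned to a MyISAM
table with standard statements. The per-cache partition count variable name
below is an assumption (the attached patch calls the feature a segmented key
cache); CACHE INDEX and LOAD INDEX INTO CACHE are existing MyISAM syntax.

  -- create a dedicated key cache and split it into several partitions
  SET GLOBAL hot_cache.key_buffer_size = 128*1024*1024;
  SET GLOBAL hot_cache.key_cache_segments = 16;  -- variable name assumed
  -- assign a table's indexes to the cache and preload them
  CACHE INDEX orders IN hot_cache;
  LOAD INDEX INTO CACHE orders;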
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v4.0.0)
[Maria-developers] WL#112 Updated (by Sergei): Merge OQGraph into MariaDB
by worklog-noreply@askmonty.org 29 Jun '10
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: Merge OQGraph into MariaDB
CREATION DATE..: Mon, 29 Mar 2010, 18:00
SUPERVISOR.....: Knielsen
IMPLEMENTOR....: Knielsen
COPIES TO......:
CATEGORY.......: Server-Sprint
TASK ID........: 112 (http://askmonty.org/worklog/?tid=112)
VERSION........: Server-5.2
STATUS.........: Complete
PRIORITY.......: 60
WORKED HOURS...: 13
ESTIMATE.......: 2 (hours remain)
ORIG. ESTIMATE.: 15
PROGRESS NOTES:
-=-=(Sergei - Tue, 29 Jun 2010, 14:04)=-=-
Status updated.
--- /tmp/wklog.112.old.32115 2010-06-29 14:04:40.000000000 +0000
+++ /tmp/wklog.112.new.32115 2010-06-29 14:04:40.000000000 +0000
@@ -1 +1 @@
-Code-Review
+Complete
-=-=(Knielsen - Tue, 06 Apr 2010, 15:28)=-=-
Fixed all issues from first code review.
Implement packaging for OQGraph in bakery.
Set up buildbot hosts for including OQGraph, including binary packaging.
-=-=(Knielsen - Wed, 31 Mar 2010, 13:38)=-=-
Status updated.
--- /tmp/wklog.112.old.12166 2010-03-31 13:38:25.000000000 +0000
+++ /tmp/wklog.112.new.12166 2010-03-31 13:38:25.000000000 +0000
@@ -1 +1 @@
-Assigned
+Code-Review
-=-=(Knielsen - Wed, 31 Mar 2010, 13:38)=-=-
High-Level Specification modified.
--- /tmp/wklog.112.old.12070 2010-03-31 13:38:08.000000000 +0000
+++ /tmp/wklog.112.new.12070 2010-03-31 13:38:08.000000000 +0000
@@ -15,3 +15,5 @@
Fix OQGraph plug.in to detect boost version >= 1.40.0, and only enable OQGraph
if such boost is found.
+Update the packaging in ourdelta/bakery to include the oqgraph_engine.so and
+link with g++ rather than gcc.
-=-=(Knielsen - Mon, 29 Mar 2010, 21:46)=-=-
High-Level Specification modified.
--- /tmp/wklog.112.old.31142 2010-03-29 21:46:11.000000000 +0000
+++ /tmp/wklog.112.new.31142 2010-03-29 21:46:11.000000000 +0000
@@ -1,28 +1,17 @@
Tasks:
-Find the latest version of OQGraph to base this on (there should be a
-Launchpad branch somewhere, match it up with what is in the OQGraph patch for
-MySQL 5.0 in the ourdelta stuff).
-
-Extract the correct version of Boost from the MySQL 5.0 ourdelta patch. This
-is a patched version of Boost fixing a bug that is supposedly fatal for
-OQGraph (details are not known at the time of writing).
+Base work on the Launchpad branch lp:~knielsen/maria/mariadb-5.1-oqgraph
-Document in OQGraph README the need for boost of a specific version, and point
-to where it can be obtained. Also include the patch for boost if the correct
-base version of boost to do this against can be determined.
+OQGraph requires Boost >= 1.40.0 (earlier versions have a bug that affects
+OQGraph).
-Install the patched boost in /usr/local/ on the build machines (release builds
-and selected Buildbot slaves).
+Document in OQGraph README the need for boost of a specific version, and point
+to where it can be obtained.
-Fix OQGraph plug.in to detect correct version of OQGraph that makes the build
-not break. Check which version in Ubuntu starts working (I think it was
-Jaunty), and require at least that version.
-
-Setup some repository or source tarball of the patched boost
-somewhere. Preferably a Launchpad branch or similar (if upstream project can
-be found).
+Install the patched boost in /usr/local/include/boost on the build machines
+(release builds and selected Buildbot slaves). G++ seems to by default look in
+/usr/local/include, so that is sufficient to find it.
-Setup in plug.in or /configure.in appropriate --with-boost=xxx. Or in a pinch,
-we can make do with CFLAGS=-Ixxx, or even default look in /usr/local/.
+Fix OQGraph plug.in to detect boost version >= 1.40.0, and only enable OQGraph
+if such boost is found.
-=-=(Knielsen - Mon, 29 Mar 2010, 18:09)=-=-
High-Level Specification modified.
--- /tmp/wklog.112.old.23061 2010-03-29 18:09:27.000000000 +0000
+++ /tmp/wklog.112.new.23061 2010-03-29 18:09:27.000000000 +0000
@@ -1 +1,28 @@
+Tasks:
+
+Find the latest version of OQGraph to base this on (there should be a
+Launchpad branch somewhere, match it up with what is in the OQGraph patch for
+MySQL 5.0 in the ourdelta stuff).
+
+Extract the correct version of Boost from the MySQL 5.0 ourdelta patch. This
+is a patched version of Boost fixing a bug that is supposedly fatal for
+OQGraph (details are not known at the time of writing).
+
+Document in OQGraph README the need for boost of a specific version, and point
+to where it can be obtained. Also include the patch for boost if the correct
+base version of boost to do this against can be determined.
+
+Install the patched boost in /usr/local/ on the build machines (release builds
+and selected Buildbot slaves).
+
+Fix OQGraph plug.in to detect correct version of OQGraph that makes the build
+not break. Check which version in Ubuntu starts working (I think it was
+Jaunty), and require at least that version.
+
+Setup some repository or source tarball of the patched boost
+somewhere. Preferably a Launchpad branch or similar (if upstream project can
+be found).
+
+Setup in plug.in or /configure.in appropriate --with-boost=xxx. Or in a pinch,
+we can make do with CFLAGS=-Ixxx, or even default look in /usr/local/.
DESCRIPTION:
Get the OQGraph storage engine merged into MariaDB, fixing the remaining
problems blocking the merge.
HIGH-LEVEL SPECIFICATION:
Tasks:
Base work on the Launchpad branch lp:~knielsen/maria/mariadb-5.1-oqgraph
OQGraph requires Boost >= 1.40.0 (earlier versions have a bug that affects
OQGraph).
Document in OQGraph README the need for boost of a specific version, and point
to where it can be obtained.
Install the patched boost in /usr/local/include/boost on the build machines
(release builds and selected Buildbot slaves). G++ looks in /usr/local/include
by default, so that is sufficient to find it.
Fix OQGraph plug.in to detect boost version >= 1.40.0, and only enable OQGraph
if such boost is found.
Update the packaging in ourdelta/bakery to include the oqgraph_engine.so and
link with g++ rather than gcc.
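As an illustration of what the packaged oqgraph_engine.so enables (not part of
the task list): once the shared object is in the server's plugin directory,
the engine can be loaded with the standard plugin-loading statement. The plugin
name below is assumed; check SHOW PLUGINS for the exact registered name.

  INSTALL PLUGIN oqgraph SONAME 'oqgraph_engine.so';
  SHOW PLUGINS;  -- the OQGraph storage engine should now be listed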
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v4.0.0)
[Maria-developers] WL#11 Updated (by Sergei): Connect by
by worklog-noreply@askmonty.org 29 Jun '10
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: Connect by
CREATION DATE..: Thu, 26 Mar 2009, 00:30
SUPERVISOR.....: Monty
IMPLEMENTOR....:
COPIES TO......:
CATEGORY.......: Server-BackLog
TASK ID........: 11 (http://askmonty.org/worklog/?tid=11)
VERSION........: Server-9.x
STATUS.........: Un-Assigned
PRIORITY.......: 60
WORKED HOURS...: 0
ESTIMATE.......: 220 (hours remain)
ORIG. ESTIMATE.: 220
PROGRESS NOTES:
-=-=(Sergei - Tue, 29 Jun 2010, 14:03)=-=-
Category updated.
--- /tmp/wklog.11.old.32063 2010-06-29 14:03:35.000000000 +0000
+++ /tmp/wklog.11.new.32063 2010-06-29 14:03:35.000000000 +0000
@@ -1 +1 @@
-Server-Sprint
+Server-BackLog
-=-=(Guest - Tue, 19 May 2009, 18:27)=-=-
High Level Description modified.
--- /tmp/wklog.11.old.21953 2009-05-19 18:27:14.000000000 +0300
+++ /tmp/wklog.11.new.21953 2009-05-19 18:27:14.000000000 +0300
@@ -1 +1,360 @@
-Add CONNECT BY syntax
+<contents>
+1. Background information
+2. CONNECT BY semantics, properties and limitations
+2.1 Additional CONNECT BY features
+2.2 Limitations
+3. Our implementation
+3.1 Scope Questions
+3.2 CONNECT BY execution
+3.2.1 Straightforward (recursive) evaluation algorithm
+3.2.2 Transitive-closure evaluation algorithms
+3.2.3 Other algorithms
+3.2.4 Loop detection
+3.2.4.1 The upper bound of produced records
+3.2.4.1 Straightforward approach: track chains
+3.2.3 Improvements for straightforward execution strategy
+3.3. Optimization
+4. Use-cases dump
+</contents>
+
+1. Background information
+-------------------------
+* CONNECT BY is a non-standard, Oracle's syntax. It is also supported by
+ EnterpriseDB (Q: any other implementations?)
+
+* PostgreSQL 8.4 (now beta) has support for SQL-standard compliant WITH
+ RECURSIVE (aka Common Table Expressions, CTE) query syntax:
+ http://www.postgresql.org/docs/8.4/static/queries-with.html
+ http://www.postgresql.org/about/news.1074
+ http://archives.postgresql.org/pgsql-hackers/2008-02/msg00642.php
+ http://archives.postgresql.org/pgsql-patches/2008-05/msg00362.php
+
+* Evgen's attempt:
+ http://lists.mysql.com/internals/15569
+
+DB2 and MS SQL support SQL standard's WITH RECURSIVE clause.
+
+2. CONNECT BY semantics, properties and limitations
+---------------------------------------------------
+From Oracle's manual:
+
+<almost-quote>
+
+ SELECT ...
+ FROM ...
+ WHERE ...
+ START WITH cond
+ CONNECT BY connect_cond
+ ORDER [SIBLINGS] BY
+
+In oracle, one expression in connect_cond must be
+
+ PRIOR expr = expr
+
+ or
+
+ expr = PRIOR expr
+
+The manner in which Oracle processes a WHERE clause (if any) in a hierarchical
+query depends on whether the WHERE clause contains a join:
+
+ * If the WHERE predicate contains a join, Oracle applies the join predicates
+ before doing the CONNECT BY processing.
+ * If the WHERE clause does not contain a join, Oracle applies all predicates
+ other than the CONNECT BY predicates after doing the CONNECT BY processing
+ without affecting the other rows of the hierarchy.
+</almost-quote>
+
+See http://www.adp-gmbh.ch/ora/sql/connect_by.html
+http://download-uk.oracle.com/docs/cd/B10501_01/server.920/a96540/queries4a.htm
+
+
+2.1 Additional CONNECT BY features
+----------------------------------
+
+LEVEL pseudocolumn
+ indicates ancestry depth of the record (inital row has level=1, its children
+ have level=2 and so forth). Can be used in CONNECT BY clause to limit
+ traversal depth.
+
+SYS_CONNECT_BY_PATH(column, 'char')
+ returns path from root to the node.
+
+NOCYCLE and CONNECT_BY_ISCYCLE
+ "With the 10g keyword NOCYCLE, hierarchical queries detect loops and do not
+ generate errors. CONNECT_BY_ISCYCLE pseudo-column is a flag that can be used
+ to detect which row is cycling"
+ http://www.dba-oracle.com/t_advanced_sql_connect_by_loop.htm
+
+ORDER SIBLINGS BY
+ CONNECT BY produces records in "children follow parents" order, with order
+ of the siblings unspecified. ORDER SIBLINGS BY orders siblings within each
+ "generation".
+
+2.2 Limitations
+---------------
+Other limitations (which we might or might not want to replicate)
+
+* There is this error:
+ ORA-01437: cannot have join with CONNECT BY
+ Cause: A join operation was specified with a CONNECT BY clause. If a
+ CONNECT BY clause is used in a SELECT statement for a tree-
+ structured query, only one table may be referenced in the query.
+ Action: Remove either the CONNECT BY clause or the join operation from
+ the SQL statement.
+ It seems oracle had this limitation before version 10G
+
+* LEVEL cannot be used on the left side of IN-comparison if the right side is a
+ subquery
+http://download.oracle.com/docs/cd/B10501_01/server.920/a96540/sql_elements6a.htm#9547
+ This seems to have been lifted in version 10?
+
+3. Our implementation
+---------------------
+
+3.1 Scope Questions
+-------------------
+* Are we sure we want CONNECT BY syntax and not SQL standard' one? (I'm not
+ suggesting one or the other, just want to make sure we've made a conscious
+ decision)
+
+* Any use-cases we need to make sure to handle well?
+
+Will we implement any of these features:
+
+* Output is ordered (children follow parents)
+* "ORDER SIBLINGS BY" variant of ORDER BY
+* NOCYCLE/CONNECT_BY_ISCYCLE
+ - It seems any checking for cycles will cause overhead. Do we implement a
+ mode for those who know what they are doing, where the server doesn't
+ actually check cycles but only reports error if it happened to enumerate,
+ say MAX(1M, #records_in_table * 10) records? (This doesn't guarantee that
+ there are no cycles, but this is just beyond what one could logically want)
+
+* Oracle's treatment of WHERE (if there's a join - the WHERE is applied after
+ connect by, otherwise before) [Yes]
+* Can one use SYS_CONNECT_BY_PATH in the CONNECT BY expression?
+
+
+3.2 CONNECT BY execution
+------------------------
+
+3.2.1 Straightforward (recursive) evaluation algorithm
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+As specified in CONNECT BY definition, breadth-first, parent-to-children
+traversal:
+
+ start with a scan that retrieves records using the START WITH condition;
+ pass rows to ouptut and also record them (i.e. needed columns) in
+ some sort of growable, overflow-to-disk buffer in_buf;
+
+ while(in_buf is not empty)
+ {
+ for each record in the buffer
+ {
+ do a scan based on CONNECT BY condition;
+ pass rows to output and also record them (i.e. needed columns) in
+ a growable, overflow-to-disk buffer out_buf;
+ }
+ in_buf= out_buf;
+ }
+
+This algorithm will produce rows in the required order.
+
+3.2.2 Transitive-closure evaluation algorithms
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+When CONNECT BY clause refers only to current and PRIOR records (and doesn't
+refer to connect path using LEVEL or SYS_CONNECT_BY_PATH functions), then
+evaluation of CONNECT BY operation is equivalent to building a transitive
+closure of a certain relation.
+
+TODO: can we use LEVEL/SYS_CONNECT_BY_PATH in select list with these
+ algorithms? looks like no?
+
+There are special algorithms to build transitive closure of relation that is
+represented as a table of edges, e.g. Blocked Warshall Algorithm.
+
+Q: Do we investigate further in this direction?
+
+3.2.3 Other algorithms
+----------------------
+To be resolved: Do we always start from the first clause and go to children?
+Does it make sense to proceed in other direction, from children to parents?
+Looks like no? TODO need definite answer.
+
+3.2.4 Loop detection
+~~~~~~~~~~~~~~~~~~~~
+Transitive-closure algorithms can detect loops (it seems some of them can also
+handle loop avoidance but that needs to be verified).
+
+Straightforward-evaluation algorithm will work forever if there is a loop,
+hence will need assistance in loop detection/avoidance.
+
+3.2.4.1 The upper bound of produced records
+-------------------------------------------
+There is an upper bound of the amount of records CONNECT BY runtime can
+generate without generating a loop.
+
+The worst case is when
+ * every record in a source table was in the parent generation (and thus has
+ started a parent->child->child->... chain)
+ * every chain is of #table-records length.
+
+example of such case:
+
+ SELECT * FROM employees
+ START WITH true
+ CONNECT BY
+ PRIOR emp_id = (emp_id + 1) MOD $n_employees AND
+ length(SYS_CONNECT_BY_PATH('-')) = $n_employees -- guard againist
+ -- forming loops
+
+this gives that we can at most generate O(#table_records^2) records. This
+limitation can be used as a primitive way to stop evaluation.
+
+
+3.2.4.1 Straightforward approach: track chains
+----------------------------------------------
+In general case, we will have to track which records we have seen across each
+of the parent-child chains. The same record can show up in different chains
+at different times and this won't form a loop:
+
+ parent generation1 generation2
+
+ row1- --+---row2---- ---row3-- (chain1)
+ |
+ \--row3-+-- ---row2-- (chain2)
+ |
+ \- ---row4-- (chain3)
+ row4- ...
+
+Tracking can be done by
+- Numbering the chains and using one structure (e.g temptable) to store
+ (rowid, chain#) pairs and check them for uniqueness.
+
+- Using per-chain data structure which we could serialize/deserialize. This
+ could be
+ - serializable hashtable
+ - ordered rowid list
+ - serializable sparse bitmap
+
+One can expect a lot of chains to have common starts (eg. look at chain2 and
+chain3). I don't see how one could take advantage of that, though.
+
+3.2.3 Improvements for straightforward execution strategy
+---------------------------------------------------------
+
+* If the query is a join, it may make sense to materialize it join result
+ (including creation of appropriate index) so we're able to make
+ parent-to-child transitions faster.
+ This seems to be connected to Evgen's work on FROM subqueries.
+
+* If there is a suitable index, we can employ a variant of BatchedKeyAccess.
+
+* Part of CONNECT BY expression that places restrictions on subsequent
+ generation can be moved to the WHERE. If we do that, we get two recordsets:
+
+1. Initial START WITH recordset
+
+2. A recordset to be used to advance to subsequent generation
+
+
+3.3. Optimization
+-----------------
+It seems it is nearly impossible to estimate how many iterations we'll have
+to make and how many records we will end up producing.
+
+TODO: some bad estimates.. assume a fixed number of generations, reuse ref
+accces estimations for fanount, which gives
+
+ access_method_estimate ^ number_of_generations
+
+estimate?
+
+4. Use-cases dump
+=================
+
+http://www.orafaq.com/usenet/comp.databases.oracle.server/2007/01/05/0264.htm:
+ select mak_xx,nr_porz,level lvl from spacer_strona
+ where nvl(dervlvl,0)<3
+ start with mak_xx=125414 and nr_porz=0
+ connect by mak_xx = prior derv_mak_xx and nr_porz = prior derv_nr_porz
+ and prior dervlvl=3
+
+
+http://www.orafaq.com/usenet/comp.databases.oracle.server/2007/01/04/0196.htm:
+ SELECT OPM_N_ID FROM RAFC_ADM.AFC_T_OPERATION_METIER START WITH OPM_N_ID IN
+ (
+ SELECT OPM_N_ID FROM RAFC_ADM.AFC_T_OPERATION_METIER x
+ START WITH x.OPM_N_ID IN (4846)
+ CONNECT BY ((PRIOR x.OPM_MERE_OPM_N_ID = x.OPM_N_ID)
+ OR (PRIOR x.OPM_ANNULEE_OPM_N_ID = x.OPM_N_ID))
+ )
+ CONNECT BY ((PRIOR OPM_N_ID = OPM_MERE_OPM_N_ID) OR (PRIOR OPM_N_ID =
+OPM_ANNULEE_OPM_N_ID))
+
+http://forums.enterprisedb.com/posts/list/737.page:
+ select lpad(' ',2*(level-1)) || to_char(child) s
+ from x
+ start with parent is null
+ connect by prior child = parent;
+
+ select *
+ from emp, dept
+ where dept.deptno = emp.deptno
+ start with mgr is null
+ connect by mgr = prior empno
+
+http://forums.oracle.com/forums/thread.jspa?threadID=623173:
+ SELECT cust_number
+ FROM customer
+ START WITH cust_number = '5568677999'
+ CONNECT BY PRIOR cust_number = cust_group_code.
+
+http://www.orafaq.com/forum/t/118879/0/
+ SELECT COUNT(a.dataid), c.name
+ FROM dauditnew a, dtree b, kuaf c
+ WHERE a.auditdate > SYSDATE-10 AND a.auditstr IN ('Create', 'AddVersion')
+ AND a.dataid = b.dataid AND c.id = a.performerid
+ AND a.SUBTYPE = 0
+ START WITH b.dataid = 6132086 CONNECT BY PRIOR a.dataid = b.parentid GROUP BY
+c.name
+
+
+http://www.postgresql-support.de/blog/blog_hans.html
+ SELECT METIER_ID||'|'||ORGANISATION_ID AS JOBORG
+ FROM INTRA_METIER,INTRA_ORGANISATION
+ WHERE METIER_ID IN(
+ SELECT METIER_ID
+ FROM INTRA_METIER
+ START WITH METIER_ID= '99533220-e8b2-4121-998c-808ea8ca2da7'
+ CONNECT BY METIER_ID= PRIOR PARENT_METIER_ID
+ ) AND ORGANISATION_ID IN (
+ SELECT ORGANISATION_ID
+ FROM INTRA_ORGANISATION
+ START WITH ORGANISATION_ID='025ee58f-35a3-4183-8679-01472838f753'
+ CONNECT BY ORGANISATION_ID= PRIOR PARENT_ORGANISATION_ID
+ );
+
+http://oracle.com
+ Oracle database uses CONNECT BY to generate EXPLAINs.
+
+http://practical-sql-tuning.blogspot.com/2009/01/use-of-statistically-incorrect.html
+
+ select sum(human_cnt) from facts
+ where territory_id in (select territory_id
+ from dic$territory
+ start with territory_code = :code
+ connect by prior territory_id = territory_parent);
+
+http://www.dbasupport.com/forums/archive/index.php/t-30008.html
+
+
+ SELECT LEVEL,LPAD(' ',8*(LEVEL-1))||T_COM_OBJ.OBJ_NAME, T_COM_OBJ.OBJ_PARENT,
+T_COM_OBJ.OBJ_ID
+ FROM VDR.T_COM_OBJ
+ START WITH T_COM_OBJ.OBJ_ID in (select obj_id obj_main from vdr.t_com_obj
+where obj_id=obj_parent)
+ CONNECT BY PRIOR T_COM_OBJ.OBJ_ID = T_COM_OBJ.OBJ_PARENT
+
+
DESCRIPTION:
<contents>
1. Background information
2. CONNECT BY semantics, properties and limitations
2.1 Additional CONNECT BY features
2.2 Limitations
3. Our implementation
3.1 Scope Questions
3.2 CONNECT BY execution
3.2.1 Straightforward (recursive) evaluation algorithm
3.2.2 Transitive-closure evaluation algorithms
3.2.3 Other algorithms
3.2.4 Loop detection
3.2.4.1 The upper bound of produced records
3.2.4.1 Straightforward approach: track chains
3.2.3 Improvements for straightforward execution strategy
3.3. Optimization
4. Use-cases dump
</contents>
1. Background information
-------------------------
* CONNECT BY is non-standard, Oracle-specific syntax. It is also supported by
EnterpriseDB (Q: any other implementations?)
* PostgreSQL 8.4 (now beta) has support for SQL-standard compliant WITH
RECURSIVE (aka Common Table Expressions, CTE) query syntax:
http://www.postgresql.org/docs/8.4/static/queries-with.html
http://www.postgresql.org/about/news.1074
http://archives.postgresql.org/pgsql-hackers/2008-02/msg00642.php
http://archives.postgresql.org/pgsql-patches/2008-05/msg00362.php
* Evgen's attempt:
http://lists.mysql.com/internals/15569
DB2 and MS SQL support SQL standard's WITH RECURSIVE clause.
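For comparison, the same parent/child traversal in both syntaxes (table and
column names are made up for illustration):

  -- Oracle-style hierarchical query
  SELECT emp_id, mgr_id
  FROM emp
  START WITH mgr_id IS NULL
  CONNECT BY PRIOR emp_id = mgr_id;

  -- SQL-standard equivalent (PostgreSQL 8.4 syntax)
  WITH RECURSIVE tree AS (
    SELECT emp_id, mgr_id FROM emp WHERE mgr_id IS NULL    -- START WITH part
    UNION ALL
    SELECT e.emp_id, e.mgr_id
    FROM emp e JOIN tree t ON e.mgr_id = t.emp_id          -- CONNECT BY part
  )
  SELECT emp_id, mgr_id FROM tree;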
2. CONNECT BY semantics, properties and limitations
---------------------------------------------------
From Oracle's manual:
<almost-quote>
SELECT ...
FROM ...
WHERE ...
START WITH cond
CONNECT BY connect_cond
ORDER [SIBLINGS] BY
In Oracle, one expression in connect_cond must be
PRIOR expr = expr
or
expr = PRIOR expr
The manner in which Oracle processes a WHERE clause (if any) in a hierarchical
query depends on whether the WHERE clause contains a join:
* If the WHERE predicate contains a join, Oracle applies the join predicates
before doing the CONNECT BY processing.
* If the WHERE clause does not contain a join, Oracle applies all predicates
other than the CONNECT BY predicates after doing the CONNECT BY processing
without affecting the other rows of the hierarchy.
</almost-quote>
See http://www.adp-gmbh.ch/ora/sql/connect_by.html
http://download-uk.oracle.com/docs/cd/B10501_01/server.920/a96540/queries4a…
2.1 Additional CONNECT BY features
----------------------------------
LEVEL pseudocolumn
indicates ancestry depth of the record (initial row has level=1, its children
have level=2 and so forth). Can be used in CONNECT BY clause to limit
traversal depth.
SYS_CONNECT_BY_PATH(column, 'char')
returns path from root to the node.
NOCYCLE and CONNECT_BY_ISCYCLE
"With the 10g keyword NOCYCLE, hierarchical queries detect loops and do not
generate errors. CONNECT_BY_ISCYCLE pseudo-column is a flag that can be used
to detect which row is cycling"
http://www.dba-oracle.com/t_advanced_sql_connect_by_loop.htm
ORDER SIBLINGS BY
CONNECT BY produces records in "children follow parents" order, with order
of the siblings unspecified. ORDER SIBLINGS BY orders siblings within each
"generation".
2.2 Limitations
---------------
Other limitations (which we might or might not want to replicate)
* There is this error:
ORA-01437: cannot have join with CONNECT BY
Cause: A join operation was specified with a CONNECT BY clause. If a
CONNECT BY clause is used in a SELECT statement for a tree-
structured query, only one table may be referenced in the query.
Action: Remove either the CONNECT BY clause or the join operation from
the SQL statement.
It seems Oracle had this limitation before version 10g
* LEVEL cannot be used on the left side of IN-comparison if the right side is a
subquery
http://download.oracle.com/docs/cd/B10501_01/server.920/a96540/sql_elements…
This seems to have been lifted in version 10?
3. Our implementation
---------------------
3.1 Scope Questions
-------------------
* Are we sure we want CONNECT BY syntax and not the SQL-standard one? (I'm not
suggesting one or the other, just want to make sure we've made a conscious
decision)
* Any use-cases we need to make sure to handle well?
Will we implement any of these features:
* Output is ordered (children follow parents)
* "ORDER SIBLINGS BY" variant of ORDER BY
* NOCYCLE/CONNECT_BY_ISCYCLE
- It seems any checking for cycles will cause overhead. Do we implement a
mode for those who know what they are doing, where the server doesn't
actually check cycles but only reports error if it happened to enumerate,
say MAX(1M, #records_in_table * 10) records? (This doesn't guarantee that
there are no cycles, but this is just beyond what one could logically want)
* Oracle's treatment of WHERE (if there's a join - the WHERE is applied after
connect by, otherwise before) [Yes]
* Can one use SYS_CONNECT_BY_PATH in the CONNECT BY expression?
3.2 CONNECT BY execution
------------------------
3.2.1 Straightforward (recursive) evaluation algorithm
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
As specified in CONNECT BY definition, breadth-first, parent-to-children
traversal:
start with a scan that retrieves records using the START WITH condition;
pass rows to output and also record them (i.e. needed columns) in
some sort of growable, overflow-to-disk buffer in_buf;
while(in_buf is not empty)
{
for each record in the buffer
{
do a scan based on CONNECT BY condition;
pass rows to output and also record them (i.e. needed columns) in
a growable, overflow-to-disk buffer out_buf;
}
in_buf= out_buf;
}
This algorithm will produce rows in the required order.
3.2.2 Transitive-closure evaluation algorithms
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
When the CONNECT BY clause refers only to current and PRIOR records (and doesn't
refer to connect path using LEVEL or SYS_CONNECT_BY_PATH functions), then
evaluation of CONNECT BY operation is equivalent to building a transitive
closure of a certain relation.
TODO: can we use LEVEL/SYS_CONNECT_BY_PATH in select list with these
algorithms? looks like no?
There are special algorithms to build the transitive closure of a relation that is
represented as a table of edges, e.g. Blocked Warshall Algorithm.
Q: Do we investigate further in this direction?
3.2.3 Other algorithms
----------------------
To be resolved: Do we always start from the first clause and go to children?
Does it make sense to proceed in other direction, from children to parents?
Looks like no? TODO need definite answer.
3.2.4 Loop detection
~~~~~~~~~~~~~~~~~~~~
Transitive-closure algorithms can detect loops (it seems some of them can also
handle loop avoidance but that needs to be verified).
The straightforward-evaluation algorithm will work forever if there is a loop,
hence will need assistance in loop detection/avoidance.
3.2.4.1 The upper bound of produced records
-------------------------------------------
There is an upper bound on the number of records the CONNECT BY runtime can
generate when there is no loop.
The worst case is when
* every record in a source table was in the parent generation (and thus has
started a parent->child->child->... chain)
* every chain is of #table-records length.
An example of such a case:
SELECT * FROM employees
START WITH true
CONNECT BY
PRIOR emp_id = (emp_id + 1) MOD $n_employees AND
length(SYS_CONNECT_BY_PATH('-')) = $n_employees -- guard against
-- forming loops
This gives that we can generate at most O(#table_records^2) records. This
limitation can be used as a primitive way to stop evaluation.
3.2.4.1 Straightforward approach: track chains
----------------------------------------------
In the general case, we will have to track which records we have seen across each
of the parent-child chains. The same record can show up in different chains
at different times and this won't form a loop:
parent generation1 generation2
row1- --+---row2---- ---row3-- (chain1)
|
\--row3-+-- ---row2-- (chain2)
|
\- ---row4-- (chain3)
row4- ...
Tracking can be done by
- Numbering the chains and using one structure (e.g. temptable) to store
(rowid, chain#) pairs and check them for uniqueness.
- Using per-chain data structure which we could serialize/deserialize. This
could be
- serializable hashtable
- ordered rowid list
- serializable sparse bitmap
One can expect a lot of chains to have common starts (e.g. look at chain2 and
chain3). I don't see how one could take advantage of that, though.
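A sketch of the first option, as a temporary table with a uniqueness constraint
(names are hypothetical; the real structure would live inside the executor
rather than be visible as SQL):

  -- one row per (chain, record) pair seen so far; a duplicate-key error on
  -- insert means the chain has revisited a record, i.e. a loop was formed
  CREATE TEMPORARY TABLE connect_by_seen (
    chain_no INT UNSIGNED NOT NULL,
    row_id   BIGINT UNSIGNED NOT NULL,  -- stand-in for the engine's rowid
    PRIMARY KEY (chain_no, row_id)
  );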
3.2.3 Improvements for straightforward execution strategy
---------------------------------------------------------
* If the query is a join, it may make sense to materialize its join result
(including creation of appropriate index) so we're able to make
parent-to-child transitions faster.
This seems to be connected to Evgen's work on FROM subqueries.
* If there is a suitable index, we can employ a variant of BatchedKeyAccess.
* Part of CONNECT BY expression that places restrictions on subsequent
generation can be moved to the WHERE. If we do that, we get two recordsets:
1. Initial START WITH recordset
2. A recordset to be used to advance to subsequent generation
3.3. Optimization
-----------------
It seems it is nearly impossible to estimate how many iterations we'll have
to make and how many records we will end up producing.
TODO: some crude estimates: assume a fixed number of generations, reuse ref
access estimations for fanout, which gives
access_method_estimate ^ number_of_generations
estimate?
4. Use-cases dump
=================
http://www.orafaq.com/usenet/comp.databases.oracle.server/2007/01/05/0264.h…:
select mak_xx,nr_porz,level lvl from spacer_strona
where nvl(dervlvl,0)<3
start with mak_xx=125414 and nr_porz=0
connect by mak_xx = prior derv_mak_xx and nr_porz = prior derv_nr_porz
and prior dervlvl=3
http://www.orafaq.com/usenet/comp.databases.oracle.server/2007/01/04/0196.h…:
SELECT OPM_N_ID FROM RAFC_ADM.AFC_T_OPERATION_METIER START WITH OPM_N_ID IN
(
SELECT OPM_N_ID FROM RAFC_ADM.AFC_T_OPERATION_METIER x
START WITH x.OPM_N_ID IN (4846)
CONNECT BY ((PRIOR x.OPM_MERE_OPM_N_ID = x.OPM_N_ID)
OR (PRIOR x.OPM_ANNULEE_OPM_N_ID = x.OPM_N_ID))
)
CONNECT BY ((PRIOR OPM_N_ID = OPM_MERE_OPM_N_ID) OR (PRIOR OPM_N_ID =
OPM_ANNULEE_OPM_N_ID))
http://forums.enterprisedb.com/posts/list/737.page:
select lpad(' ',2*(level-1)) || to_char(child) s
from x
start with parent is null
connect by prior child = parent;
select *
from emp, dept
where dept.deptno = emp.deptno
start with mgr is null
connect by mgr = prior empno
http://forums.oracle.com/forums/thread.jspa?threadID=623173:
SELECT cust_number
FROM customer
START WITH cust_number = '5568677999'
CONNECT BY PRIOR cust_number = cust_group_code.
http://www.orafaq.com/forum/t/118879/0/
SELECT COUNT(a.dataid), c.name
FROM dauditnew a, dtree b, kuaf c
WHERE a.auditdate > SYSDATE-10 AND a.auditstr IN ('Create', 'AddVersion')
AND a.dataid = b.dataid AND c.id = a.performerid
AND a.SUBTYPE = 0
START WITH b.dataid = 6132086 CONNECT BY PRIOR a.dataid = b.parentid GROUP BY
c.name
http://www.postgresql-support.de/blog/blog_hans.html
SELECT METIER_ID||'|'||ORGANISATION_ID AS JOBORG
FROM INTRA_METIER,INTRA_ORGANISATION
WHERE METIER_ID IN(
SELECT METIER_ID
FROM INTRA_METIER
START WITH METIER_ID= '99533220-e8b2-4121-998c-808ea8ca2da7'
CONNECT BY METIER_ID= PRIOR PARENT_METIER_ID
) AND ORGANISATION_ID IN (
SELECT ORGANISATION_ID
FROM INTRA_ORGANISATION
START WITH ORGANISATION_ID='025ee58f-35a3-4183-8679-01472838f753'
CONNECT BY ORGANISATION_ID= PRIOR PARENT_ORGANISATION_ID
);
http://oracle.com
Oracle database uses CONNECT BY to generate EXPLAINs.
http://practical-sql-tuning.blogspot.com/2009/01/use-of-statistically-incor…
select sum(human_cnt) from facts
where territory_id in (select territory_id
from dic$territory
start with territory_code = :code
connect by prior territory_id = territory_parent);
http://www.dbasupport.com/forums/archive/index.php/t-30008.html
SELECT LEVEL,LPAD(' ',8*(LEVEL-1))||T_COM_OBJ.OBJ_NAME, T_COM_OBJ.OBJ_PARENT,
T_COM_OBJ.OBJ_ID
FROM VDR.T_COM_OBJ
START WITH T_COM_OBJ.OBJ_ID in (select obj_id obj_main from vdr.t_com_obj
where obj_id=obj_parent)
CONNECT BY PRIOR T_COM_OBJ.OBJ_ID = T_COM_OBJ.OBJ_PARENT
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v4.0.0)
[Maria-developers] WL#10 Updated (by Sergei): Microseconds
by worklog-noreply@askmonty.org 29 Jun '10
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: Microseconds
CREATION DATE..: Thu, 26 Mar 2009, 00:29
SUPERVISOR.....: Monty
IMPLEMENTOR....:
COPIES TO......:
CATEGORY.......: Server-BackLog
TASK ID........: 10 (http://askmonty.org/worklog/?tid=10)
VERSION........: Server-5.3
STATUS.........: Un-Assigned
PRIORITY.......: 60
WORKED HOURS...: 0
ESTIMATE.......: 80 (hours remain)
ORIG. ESTIMATE.: 80
PROGRESS NOTES:
-=-=(Sergei - Tue, 29 Jun 2010, 14:03)=-=-
Category updated.
--- /tmp/wklog.10.old.32058 2010-06-29 14:03:11.000000000 +0000
+++ /tmp/wklog.10.new.32058 2010-06-29 14:03:11.000000000 +0000
@@ -1 +1 @@
-Server-RawIdeaBin
+Server-BackLog
-=-=(Sergei - Tue, 29 Jun 2010, 14:03)=-=-
Category updated.
--- /tmp/wklog.10.old.31970 2010-06-29 14:03:01.000000000 +0000
+++ /tmp/wklog.10.new.31970 2010-06-29 14:03:01.000000000 +0000
@@ -1 +1 @@
-Server-Sprint
+Server-RawIdeaBin
-=-=(Monty - Fri, 29 Jan 2010, 19:05)=-=-
Version updated.
--- /tmp/wklog.10.old.5698 2010-01-29 19:05:42.000000000 +0200
+++ /tmp/wklog.10.new.5698 2010-01-29 19:05:42.000000000 +0200
@@ -1 +1 @@
-Server-5.2
+Server-5.3
DESCRIPTION:
Add microsecond precision to NOW()
Add new field types for time and datetime with microsecond precision
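A sketch of the intended user-visible syntax, assuming a fractional-seconds
precision argument on the types and on NOW() (as eventually shipped in later
MariaDB versions):

  CREATE TABLE events (
    id         INT NOT NULL,
    started_at DATETIME(6) NOT NULL,  -- microsecond-precision datetime
    took       TIME(6) NOT NULL       -- microsecond-precision time
  );
  INSERT INTO events VALUES (1, NOW(6), '00:00:01.000123');
  SELECT id, started_at, MICROSECOND(started_at) FROM events;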
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v4.0.0)
[Maria-developers] WL#10 Updated (by Sergei): Microseconds
by worklog-noreply@askmonty.org 29 Jun '10
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: Microseconds
CREATION DATE..: Thu, 26 Mar 2009, 00:29
SUPERVISOR.....: Monty
IMPLEMENTOR....:
COPIES TO......:
CATEGORY.......: Server-RawIdeaBin
TASK ID........: 10 (http://askmonty.org/worklog/?tid=10)
VERSION........: Server-5.3
STATUS.........: Un-Assigned
PRIORITY.......: 60
WORKED HOURS...: 0
ESTIMATE.......: 80 (hours remain)
ORIG. ESTIMATE.: 80
PROGRESS NOTES:
-=-=(Sergei - Tue, 29 Jun 2010, 14:03)=-=-
Category updated.
--- /tmp/wklog.10.old.31970 2010-06-29 14:03:01.000000000 +0000
+++ /tmp/wklog.10.new.31970 2010-06-29 14:03:01.000000000 +0000
@@ -1 +1 @@
-Server-Sprint
+Server-RawIdeaBin
-=-=(Monty - Fri, 29 Jan 2010, 19:05)=-=-
Version updated.
--- /tmp/wklog.10.old.5698 2010-01-29 19:05:42.000000000 +0200
+++ /tmp/wklog.10.new.5698 2010-01-29 19:05:42.000000000 +0200
@@ -1 +1 @@
-Server-5.2
+Server-5.3
DESCRIPTION:
Add microsecond precision to NOW()
Add new field types for time and datetime with microsecond precision
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v4.0.0)
[Maria-developers] WL#24 Updated (by Sergei): index_merge: fair choice between index_merge union and range access
by worklog-noreply@askmonty.org 29 Jun '10
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: index_merge: fair choice between index_merge union and range access
CREATION DATE..: Tue, 26 May 2009, 12:10
SUPERVISOR.....: Monty
IMPLEMENTOR....:
COPIES TO......: Psergey
CATEGORY.......: Server-RawIdeaBin
TASK ID........: 24 (http://askmonty.org/worklog/?tid=24)
VERSION........: Server-9.x
STATUS.........: Un-Assigned
PRIORITY.......: 60
WORKED HOURS...: 0
ESTIMATE.......: 0 (hours remain)
ORIG. ESTIMATE.: 0
PROGRESS NOTES:
-=-=(Sergei - Tue, 29 Jun 2010, 14:00)=-=-
Category updated.
--- /tmp/wklog.24.old.31772 2010-06-29 14:00:05.000000000 +0000
+++ /tmp/wklog.24.new.31772 2010-06-29 14:00:05.000000000 +0000
@@ -1 +1 @@
-Server-Sprint
+Server-RawIdeaBin
-=-=(Guest - Sun, 16 Aug 2009, 02:13)=-=-
Low Level Design modified.
--- /tmp/wklog.24.old.23383 2009-08-16 02:13:54.000000000 +0300
+++ /tmp/wklog.24.new.23383 2009-08-16 02:13:54.000000000 +0300
@@ -125,7 +125,7 @@
The optimizer will generate the plans that only use the "col1=c1" part. The
right side of the AND will be ignored even if it has good selectivity.
-(Here an imerge for col2=c2 OR col3=c3 won't be built since neither col2=c2 nor
+(Here no imerge for col2=c2 OR col3=c3 will be built since neither col2=c2 nor
col3=c3 represent index ranges.)
@@ -199,7 +199,7 @@
O2. "Create index_merge accesses when possible"
Current tree_or() will not create index_merge access when it could create
- non-index merge access (see DISCARD-IMERGE-3 and its example in the "Problems
+ non-index merge access (see DISCARD-IMERGE-2 and its example in the "Problems
in the current implementation" section). This will be changed to work as
follows: we will create index_merge made for index scans that didn't have
their match in the other sel_tree.
-=-=(Guest - Sun, 16 Aug 2009, 01:03)=-=-
Low Level Design modified.
--- /tmp/wklog.24.old.20767 2009-08-16 01:03:11.000000000 +0300
+++ /tmp/wklog.24.new.20767 2009-08-16 01:03:11.000000000 +0300
@@ -18,6 +18,8 @@
# a range tree has range access options, possibly for several keys
range_tree = range(key1) AND range(key2) AND ... AND range(keyN);
+ (here range(keyi) may represent ranges not for initial keyi prefixes,
+ but ranges for any infixes for keyi)
# merge tree represents several way to index_merge
imerge_tree = imerge1 AND imerge2 AND ...
@@ -47,13 +49,13 @@
R.add(range_union(A.range(i), B.range(i)));
if (R has at least one range access)
- return R;
+ return R; // DISCARD-IMERGE-2
else
{
/* could not build any range accesses. construct index_merge */
- remove non-ranges from A; // DISCARD-IMERGE-2
+ remove non-ranges from A;
remove non-ranges from B;
- return new index_merge(A, B);
+ return new index_merge(A, B); // DISCARD-IMERGE-3
}
}
else if (A is range tree and B is index_merge tree (or vice versa))
@@ -65,12 +67,12 @@
(range_treeB_11 OR range_treeB_12 OR ... OR range_treeB_1N) AND
(range_treeB_21 OR range_treeB_22 OR ... OR range_treeB_2N) AND
...
- (range_treeB_K1 OR range_treeB_K2 OR ... OR range_treeB_kN) AND
+ (range_treeB_K1 OR range_treeB_K2 OR ... OR range_treeB_kN)
=
(range_treeA OR range_treeB_11 OR ... OR range_treeB_1N) AND
(range_treeA OR range_treeB_21 OR ... OR range_treeB_2N) AND
...
- (range_treeA OR range_treeB_11 OR ... OR range_treeB_1N) AND
+ (range_treeA OR range_treeB_11 OR ... OR range_treeB_1N)
Now each line represents an index_merge..
}
@@ -82,18 +84,18 @@
OR
imergeB1 AND imergeB2 AND ... AND imergeBN
- -> (discard all imergeA{i=2,3,...} -> // DISCARD-IMERGE-3
+ -> (discard all imergeA{i=2,3,...} -> // DISCARD-IMERGE-4
imergeA1
OR
- imergeB1 AND imergeB2 AND ... AND imergeBN =
+ imergeB1 =
- = (combine imergeA1 with each of the imergeB{i} ) =
+ = (combine imergeA1 with each of the range_treeB_1{i} ) =
- combine(imergeA1 OR imergeB1) AND
- combine(imergeA1 OR imergeB2) AND
+ combine(imergeA1 OR range_treeB_11) AND
+ combine(imergeA1 OR range_treeB_12) AND
... AND
- combine(imergeA1 OR imergeBN)
+ combine(imergeA1 OR range_treeB_1N)
}
}
@@ -109,7 +111,7 @@
DISCARD-IMERGE-2 step will cause index_merge option to be discarded when
the WHERE clause has this form (conditions t.badkey may have abritrary form):
- (t.badkey<c1 AND t.key1=c1) OR (t.key1=c2 AND t.badkey < c2)
+ (t.badkey<c1 AND t.key1=c1) OR (t.key2=c2 AND t.badkey < c2)
DISCARD-IMERGE-3 manifests itself as the following effect: suppose there are
two indexes:
@@ -123,6 +125,8 @@
The optimizer will generate the plans that only use the "col1=c1" part. The
right side of the AND will be ignored even if it has good selectivity.
+(Here an imerge for col2=c2 OR col3=c3 won't be built since neither col2=c2 nor
+col3=c3 represent index ranges.)
2. New implementation
-=-=(Guest - Mon, 20 Jul 2009, 17:13)=-=-
Dependency deleted: 30 no longer depends on 24
-=-=(Guest - Sat, 20 Jun 2009, 09:34)=-=-
Low Level Design modified.
--- /tmp/wklog.24.old.21663 2009-06-20 09:34:48.000000000 +0300
+++ /tmp/wklog.24.new.21663 2009-06-20 09:34:48.000000000 +0300
@@ -4,6 +4,7 @@
2. New implementation
2.1 New tree_and()
2.2 New tree_or()
+3. Testing and required coverage
</contents>
1. Current implementation overview
@@ -240,3 +241,14 @@
In order to limit the impact of this combinatorial explosion, we will
introduce a rule that we won't generate more than #defined
MAX_IMERGE_OPTS options.
+
+3. Testing and required coverage
+================================
+So far could find the following user cases:
+
+* BUG#17259: Query optimizer chooses wrong index
+* BUG#17673: Optimizer does not use Index Merge optimization in some cases
+* BUG#23322: Optimizer sometimes erroniously prefers other index over index merge
+* BUG#30151: optimizer is very reluctant to chose index_merge algorithm
+
+
-=-=(Guest - Thu, 18 Jun 2009, 16:55)=-=-
Low Level Design modified.
--- /tmp/wklog.24.old.19152 2009-06-18 16:55:00.000000000 +0300
+++ /tmp/wklog.24.new.19152 2009-06-18 16:55:00.000000000 +0300
@@ -141,13 +141,15 @@
Operations on SEL_ARG trees will be modified to produce/process the trees of
this kind:
+
2.1 New tree_and()
------------------
In order not to lose plans, we'll make these changes:
-1. Don't remove index_merge part of the tree.
+A1. Don't remove index_merge part of the tree (this will take care of
+ DISCARD-IMERGE-1 problem)
-2. Push range conditions down into index_merge trees that may support them.
+A2. Push range conditions down into index_merge trees that may support them.
if one tree has range(key1) and the other tree has imerge(key1 OR key2)
then perform an equvalent of this operation:
@@ -155,8 +157,86 @@
(rangeA(key1) AND rangeB(key1)) OR (rangeA(key1) AND rangeB(key2))
-3. Just as before: if both sel_tree A and sel_tree B have index_merge options,
+A3. Just as before: if both sel_tree A and sel_tree B have index_merge options,
concatenate them together.
-2.2 New tree_or()
+2.2 New tree_or()
+-----------------
+O1. Dont remove non-range plans:
+ Current tree_or() code will refuse to produce index_merge plans for
+ conditions like
+
+ "t.key1part2=const OR t.key2part1=const"
+
+ (this is marked as DISCARD-IMERGE-3). This was justifed as the left part of
+ the AND condition is not usable for range access, and the operation of
+ tree_and() guaranteed that there was no way it could changed to make a
+ usable range plan. With new tree_and() and rule A2, this is no longer the
+ case. For example for this query:
+
+ (t.key1part2=const OR t.key2part1=const) AND t.key1part1=const
+
+ it will construct a
+
+ imerge(t.key1part2=const OR t.key2part1=const), range(t.key1part1=const)
+
+ then tree_and() will apply rule A2 to push the range down into index merge
+ and after that we'll have:
+
+ range(t.key1part1=const)
+ imerge(
+ t.key1part2=const AND t.key1part1=const,
+ t.key2part1=const
+ )
+ note that imerge(...) describes a usable index_merge plan and it's possible
+ that it will be the best access path.
+
+O2. "Create index_merge accesses when possible"
+ Current tree_or() will not create index_merge access when it could create
+ non-index merge access (see DISCARD-IMERGE-3 and its example in the "Problems
+ in the current implementation" section). This will be changed to work as
+ follows: we will create index_merge made for index scans that didn't have
+ their match in the other sel_tree.
+ Ilustrating it with an example:
+
+ | sel_tree_A | sel_tree_B | A or B | include in index_merge?
+ ------+------------+------------+--------+------------------------
+ key1 | cond1 | cond2 | condM | no
+ key2 | cond3 | cond4 | NULL | no
+ key3 | cond5 | | | yes, A-side
+ key4 | cond6 | | | yes, A-side
+ key5 | | cond7 | | yes, B-side
+ key6 | | cond8 | | yes, B-side
+
+ here we assume that
+ - (cond1 OR cond2) did produce a combined range. Not including them in
+ index_merge.
+ - (cond3 OR cond4) didn't produce a usable range (e.g. they were
+ t.key1part1=c1 AND t.key1part2=c1, respectively, and combining them
+ didn't yield any range list)
+ - All other scand didn't have their counterparts, so we'll end up with a
+ SEL_TREE of:
+
+ range(condM) AND index_merge((cond5 AND cond6),(cond7 AND cond8))
+ .
+
+O4. There is no O4. DISCARD-INDEX-MERGE-4 will remain there. The idea is
+that although DISCARD-INDEX-MERGE-4 does discard plans, so far we haven
+seen any complaints that could be attributed to it.
+If we face the need to lift DISCARD-INDEX-MERGE-4, our answer will be to
+lift it ,and produce a cross-product:
+
+ ((key1p OR key2p) AND (key3p OR key4p))
+ OR
+ ((key5p OR key6p) AND (key7p OR key8p))
+
+ = (key1p OR key2p OR key5p OR key6p) AND // this part is currently
+ (key3p OR key4p OR key5p OR key6p) AND // produced
+
+ (key1p OR key2p OR key5p OR key6p) AND // this part will be added
+ (key3p OR key4p OR key5p OR key6p) //.
+
+In order to limit the impact of this combinatorial explosion, we will
+introduce a rule that we won't generate more than #defined
+MAX_IMERGE_OPTS options.
-=-=(Guest - Thu, 18 Jun 2009, 14:56)=-=-
Low Level Design modified.
--- /tmp/wklog.24.old.15612 2009-06-18 14:56:09.000000000 +0300
+++ /tmp/wklog.24.new.15612 2009-06-18 14:56:09.000000000 +0300
@@ -1 +1,162 @@
+<contents>
+1. Current implementation overview
+1.1. Problems in the current implementation
+2. New implementation
+2.1 New tree_and()
+2.2 New tree_or()
+</contents>
+
+1. Current implementation overview
+==================================
+At the moment, range analyzer works as follows:
+
+SEL_TREE structure represents
+
+ # There are sel_trees, a sel_tree is either range or merge tree
+ sel_tree = range_tree | imerge_tree
+
+ # a range tree has range access options, possibly for several keys
+ range_tree = range(key1) AND range(key2) AND ... AND range(keyN);
+
+ # merge tree represents several way to index_merge
+ imerge_tree = imerge1 AND imerge2 AND ...
+
+ # a way to do index merge == a set to use of different indexes.
+ imergeX = range_tree1 OR range_tree2 OR ..
+ where no pair of range_treeX have ranges over the same index.
+
+
+ tree_and(A, B)
+ {
+ if (both A and B are range trees)
+ return a range_tree with computed intersection for each range;
+ if (only one of A and B is a range tree)
+ return that tree; // DISCARD-IMERGE-1
+ // at this point both trees are index_merge trees
+ return concat_lists( A.imerge1 ... A.imergeN, B.imerge1 ... B.imergeN);
+ }
+
+
+ tree_or(A, B)
+ {
+ if (A and B are range trees)
+ {
+ R = new range_tree;
+ for each index i
+ R.add(range_union(A.range(i), B.range(i)));
+
+ if (R has at least one range access)
+ return R;
+ else
+ {
+ /* could not build any range accesses. construct index_merge */
+ remove non-ranges from A; // DISCARD-IMERGE-2
+ remove non-ranges from B;
+ return new index_merge(A, B);
+ }
+ }
+ else if (A is range tree and B is index_merge tree (or vice versa))
+ {
+ Perform this transformation:
+
+ range_treeA // this is A
+ OR
+ (range_treeB_11 OR range_treeB_12 OR ... OR range_treeB_1N) AND
+ (range_treeB_21 OR range_treeB_22 OR ... OR range_treeB_2N) AND
+ ...
+ (range_treeB_K1 OR range_treeB_K2 OR ... OR range_treeB_kN) AND
+ =
+ (range_treeA OR range_treeB_11 OR ... OR range_treeB_1N) AND
+ (range_treeA OR range_treeB_21 OR ... OR range_treeB_2N) AND
+ ...
+ (range_treeA OR range_treeB_11 OR ... OR range_treeB_1N) AND
+
+ Now each line represents an index_merge..
+ }
+ else if (both A and B are index_merge trees)
+ {
+ Perform this transformation:
+
+ imergeA1 AND imergeA2 AND ... AND imergeAN
+ OR
+ imergeB1 AND imergeB2 AND ... AND imergeBN
+
+ -> (discard all imergeA{i=2,3,...} -> // DISCARD-IMERGE-3
+
+ imergeA1
+ OR
+ imergeB1 AND imergeB2 AND ... AND imergeBN =
+
+ = (combine imergeA1 with each of the imergeB{i} ) =
+
+ combine(imergeA1 OR imergeB1) AND
+ combine(imergeA1 OR imergeB2) AND
+ ... AND
+ combine(imergeA1 OR imergeBN)
+ }
+ }
+
+1.1. Problems in the current implementation
+-------------------------------------------
+As marked in the code above:
+
+DISCARD-IMERGE-1 step will cause index_merge option to be discarded when
+the WHERE clause has this form:
+
+ (t.key1=c1 OR t.key2=c2) AND t.badkey < c3
+
+DISCARD-IMERGE-2 step will cause index_merge option to be discarded when
+the WHERE clause has this form (conditions t.badkey may have abritrary form):
+
+ (t.badkey<c1 AND t.key1=c1) OR (t.key1=c2 AND t.badkey < c2)
+
+DISCARD-IMERGE-3 manifests itself as the following effect: suppose there are
+two indexes:
+
+ INDEX i1(col1, col2),
+ INDEX i2(col1, col3)
+
+and this WHERE clause:
+
+ col1=c1 AND (col2=c2 OR col3=c3)
+
+The optimizer will generate the plans that only use the "col1=c1" part. The
+right side of the AND will be ignored even if it has good selectivity.
+
+
+2. New implementation
+=====================
+
+<general idea>
+* Don't start fighting combinatorial explosion until we've actually got one.
+</>
+
+SEL_TREE structure will be now able to hold both index_merge and range scan
+candidates at the same time. That is,
+
+ sel_tree2 = range_tree AND imerge_tree
+
+where both parts are optional (i.e. can be empty)
+
+Operations on SEL_ARG trees will be modified to produce/process the trees of
+this kind:
+
+2.1 New tree_and()
+------------------
+In order not to lose plans, we'll make these changes:
+
+1. Don't remove index_merge part of the tree.
+
+2. Push range conditions down into index_merge trees that may support them.
+ if one tree has range(key1) and the other tree has imerge(key1 OR key2)
+ then perform an equvalent of this operation:
+
+ rangeA(key1) AND ( rangeB(key1) OR rangeB(key2)) =
+
+ (rangeA(key1) AND rangeB(key1)) OR (rangeA(key1) AND rangeB(key2))
+
+3. Just as before: if both sel_tree A and sel_tree B have index_merge options,
+ concatenate them together.
+
+2.2 New tree_or()
-=-=(Psergey - Wed, 03 Jun 2009, 12:09)=-=-
Dependency created: 30 now depends on 24
-=-=(Guest - Mon, 01 Jun 2009, 23:30)=-=-
High-Level Specification modified.
--- /tmp/wklog.24.old.21580 2009-06-01 23:30:06.000000000 +0300
+++ /tmp/wklog.24.new.21580 2009-06-01 23:30:06.000000000 +0300
@@ -64,6 +64,9 @@
* How strict is the limitation on the form of the WHERE?
+* Which version should this be based on? 5.1? Which patches are should be in
+ (google's/percona's/maria/etc?)
+
* TODO: The optimizer didn't compare costs of index_merge and range before (ok
it did but that was done for accesses to different tables). Will there be any
possible gotchas here?
-=-=(Guest - Wed, 27 May 2009, 13:59)=-=-
Title modified.
--- /tmp/wklog.24.old.9498 2009-05-27 13:59:23.000000000 +0300
+++ /tmp/wklog.24.new.9498 2009-05-27 13:59:23.000000000 +0300
@@ -1 +1 @@
-index_merge optimizer: dont discard index_merge union strategies when range is available
+index_merge: fair choice between index_merge union and range access
------------------------------------------------------------
-=-=(View All Progress Notes, 11 total)=-=-
http://askmonty.org/worklog/index.pl?tid=24&nolimit=1
DESCRIPTION:
Current range optimizer will discard possible index_merge/[sort]union
strategies when there is a possible range plan. This action is part of the
measures we take to avoid a combinatorial explosion of possible range/
index_merge strategies.
A bad side effect of this is that for WHERE clauses of the form
t.key1= 'very-frequent-value' AND (t.key2='rare-value1' OR t.key3='rare-value2')
the optimizer will
- discard union(key2,key3) in favor of range(key1)
- consider costs of using range(key1) and discard that plan also
and the overall effect is that possible poor range access will cause possible
good index_merge access not to be considered.
This WL is to about lifting this limitation at least for some subset of WHERE
clauses.
HIGH-LEVEL SPECIFICATION:
(Not a finished HLS, but a draft.)
<contents>
Solution overview
Limitations
TODO
</contents>
Solution overview
=================
The idea is to delay discarding potential index_merge plans until the point
where it is really necessary.
This way, we won't have to make many changes in the range analyzer, but we
will be able to keep potential index_merge plans around just long enough to
take them into consideration together with range access plans.

Since there are no changes in the optimizer, the ability to consider both
range and index_merge options will be limited to WHERE clauses of this form:

  WHERE := range_cond(key1_1) AND
           range_cond(key2_1) AND
           other_cond AND
           index_merge_OR_cond1(key3_1, key3_2, ...) AND
           index_merge_OR_cond2(key4_1, key4_2, ...)

where

  index_merge_OR_cond{N} := (range_cond(keyN_1) OR
                             range_cond(keyN_2) OR ...)

  range_cond(keyX) := a condition that allows constructing range access over
                      keyX and does not allow constructing range/index_merge
                      accesses over any other keys of the table in question.
For such WHERE clauses, the range analyzer will produce a SEL_TREE of this form:

  SEL_TREE(
    range(key1_1),
    ...
    range(key2_1),
    SEL_IMERGE(                    (1)
      SEL_TREE(key3_1)
      SEL_TREE(key3_2)
      ...
    )
    ...
  )

which can be used to make a cost-based choice between range and index_merge.
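
To make the shape of tree (1) concrete, here is a minimal Python model of such
a SEL_TREE. This is only an illustration: the class and field names mirror the
server's SEL_TREE/SEL_IMERGE structures, but none of this is the actual C++
code, and conditions are represented as plain strings.

  # Minimal illustrative model of SEL_TREE (1) for the WHERE form above.
  from dataclasses import dataclass, field
  from typing import List

  @dataclass
  class SelImerge:
      # each alternative is a range condition usable for one index scan
      alternatives: List[str]

  @dataclass
  class SelTree:
      ranges: List[str] = field(default_factory=list)        # per-key range conditions
      imerges: List[SelImerge] = field(default_factory=list)

  # range_cond(key1_1) AND range_cond(key2_1) AND other_cond AND
  #   (range_cond(key3_1) OR range_cond(key3_2))
  tree = SelTree(
      ranges=["range(key1_1)", "range(key2_1)"],
      imerges=[SelImerge(["range(key3_1)", "range(key3_2)"])],   # this is (1)
  )
  print(tree)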
Limitations
-----------
This will not be a full solution, in the sense that the range analyzer will not
be able to produce sel_tree (1) if the WHERE clause is specified in another form
(e.g. after brackets have been opened).

TODO
----
* Is it a problem if there are keys that are referred to both from
  index_merge and from range access?

* How strict is the limitation on the form of the WHERE?

* Which version should this be based on? 5.1? Which patches should be included
  (google's/percona's/maria/etc)?

* TODO: The optimizer didn't compare costs of index_merge and range before (ok,
  it did, but that was done for accesses to different tables). Will there be any
  possible gotchas here?
LOW-LEVEL DESIGN:
<contents>
1. Current implementation overview
1.1. Problems in the current implementation
2. New implementation
2.1 New tree_and()
2.2 New tree_or()
3. Testing and required coverage
</contents>
1. Current implementation overview
==================================
At the moment, the range analyzer works as follows:

A SEL_TREE structure represents:

  # There are sel_trees; a sel_tree is either a range tree or a merge tree
  sel_tree = range_tree | imerge_tree

  # a range tree has range access options, possibly for several keys
  range_tree = range(key1) AND range(key2) AND ... AND range(keyN);
  (here range(keyi) may represent ranges not just for initial keyi prefixes,
  but ranges for any infixes of keyi)

  # a merge tree represents several ways to do index_merge
  imerge_tree = imerge1 AND imerge2 AND ...

  # a way to do index merge == a set of different indexes to use
  imergeX = range_tree1 OR range_tree2 OR ..
  where no pair of range_treeX have ranges over the same index.
  tree_and(A, B)
  {
    if (both A and B are range trees)
      return a range_tree with computed intersection for each range;
    if (only one of A and B is a range tree)
      return that tree;  // DISCARD-IMERGE-1
    // at this point both trees are index_merge trees
    return concat_lists( A.imerge1 ... A.imergeN, B.imerge1 ... B.imergeN);
  }
  tree_or(A, B)
  {
    if (A and B are range trees)
    {
      R = new range_tree;
      for each index i
        R.add(range_union(A.range(i), B.range(i)));

      if (R has at least one range access)
        return R;  // DISCARD-IMERGE-2
      else
      {
        /* could not build any range accesses. construct index_merge */
        remove non-ranges from A;
        remove non-ranges from B;
        return new index_merge(A, B);  // DISCARD-IMERGE-3
      }
    }
    else if (A is range tree and B is index_merge tree (or vice versa))
    {
      Perform this transformation:

        range_treeA  // this is A
          OR
        (range_treeB_11 OR range_treeB_12 OR ... OR range_treeB_1N) AND
        (range_treeB_21 OR range_treeB_22 OR ... OR range_treeB_2N) AND
        ...
        (range_treeB_K1 OR range_treeB_K2 OR ... OR range_treeB_KN)
          =
        (range_treeA OR range_treeB_11 OR ... OR range_treeB_1N) AND
        (range_treeA OR range_treeB_21 OR ... OR range_treeB_2N) AND
        ...
        (range_treeA OR range_treeB_K1 OR ... OR range_treeB_KN)

      Now each line represents an index_merge.
    }
    else if (both A and B are index_merge trees)
    {
      Perform this transformation:

        imergeA1 AND imergeA2 AND ... AND imergeAN
          OR
        imergeB1 AND imergeB2 AND ... AND imergeBN

        -> (discard all imergeA{i=2,3,...}) ->  // DISCARD-IMERGE-4

        imergeA1
          OR
        imergeB1 =

        = (combine imergeA1 with each of the range_treeB_1{i} ) =

        combine(imergeA1 OR range_treeB_11) AND
        combine(imergeA1 OR range_treeB_12) AND
        ... AND
        combine(imergeA1 OR range_treeB_1N)
    }
  }
1.1. Problems in the current implementation
-------------------------------------------
As marked in the code above:

The DISCARD-IMERGE-1 step will cause the index_merge option to be discarded
when the WHERE clause has this form:

  (t.key1=c1 OR t.key2=c2) AND t.badkey < c3

The DISCARD-IMERGE-2 step will cause the index_merge option to be discarded
when the WHERE clause has this form (the conditions on t.badkey may have
arbitrary form):

  (t.badkey<c1 AND t.key1=c1) OR (t.key2=c2 AND t.badkey < c2)

DISCARD-IMERGE-3 manifests itself as the following effect: suppose there are
two indexes:

  INDEX i1(col1, col2),
  INDEX i2(col1, col3)

and this WHERE clause:

  col1=c1 AND (col2=c2 OR col3=c3)

The optimizer will generate plans that only use the "col1=c1" part. The
right side of the AND will be ignored even if it has good selectivity.
(Here no imerge for col2=c2 OR col3=c3 will be built, since neither col2=c2 nor
col3=c3 represents an index range.)
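
To see the first of these problems in isolation, here is a toy Python model of
the current tree_and() behaviour. It is a sketch of the pseudocode above, not
the server code; the dict representation and the helper is_range_tree() are
invented for the example.

  # Toy model of the *current* tree_and(): when only one argument is a range
  # tree, the other argument's index_merge options are thrown away
  # (DISCARD-IMERGE-1).

  def is_range_tree(tree):
      return not tree["imerges"]

  def tree_and_current(a, b):
      if is_range_tree(a) and is_range_tree(b):
          # the real code intersects per-key SEL_ARG ranges; the toy just merges
          return {"ranges": {**a["ranges"], **b["ranges"]}, "imerges": []}
      if is_range_tree(a):
          return a                    # DISCARD-IMERGE-1: b's imerges are lost
      if is_range_tree(b):
          return b                    # DISCARD-IMERGE-1: a's imerges are lost
      return {"ranges": {}, "imerges": a["imerges"] + b["imerges"]}

  # WHERE (t.key1=c1 OR t.key2=c2) AND t.badkey < c3
  or_part  = {"ranges": {}, "imerges": [["key1=c1", "key2=c2"]]}
  bad_part = {"ranges": {"badkey": "badkey < c3"}, "imerges": []}
  print(tree_and_current(or_part, bad_part))
  # -> {'ranges': {'badkey': 'badkey < c3'}, 'imerges': []} -- the union plan is gone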
2. New implementation
=====================

<general idea>
* Don't start fighting combinatorial explosion until we've actually got one.
</>

The SEL_TREE structure will now be able to hold both index_merge and range scan
candidates at the same time. That is,

  sel_tree2 = range_tree AND imerge_tree

where both parts are optional (i.e. either can be empty).

Operations on SEL_ARG trees will be modified to produce/process trees of
this kind:
2.1 New tree_and()
------------------
In order not to lose plans, we'll make these changes:

A1. Don't remove the index_merge part of the tree (this will take care of the
    DISCARD-IMERGE-1 problem).

A2. Push range conditions down into index_merge trees that may support them:
    if one tree has range(key1) and the other tree has imerge(key1 OR key2),
    then perform an equivalent of this operation:

      rangeA(key1) AND ( rangeB(key1) OR rangeB(key2)) =

      (rangeA(key1) AND rangeB(key1)) OR (rangeA(key1) AND rangeB(key2))

A3. Just as before: if both sel_tree A and sel_tree B have index_merge options,
    concatenate them together.
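
A small sketch of the A2 push-down follows; it reproduces the result of the
worked example in O1 below. The list-of-pairs representation of an imerge and
the function name are made up for illustration only.

  # Sketch of rule A2: push a range condition down into an index_merge.
  # An imerge is modelled as a list of (index, condition) alternatives, one
  # alternative per index scan; the pushed range only tightens alternatives
  # over the same index, other alternatives stay unchanged.

  def push_range_into_imerge(index, range_cond, imerge):
      pushed = []
      for idx, cond in imerge:
          if idx == index:
              pushed.append((idx, f"{cond} AND {range_cond}"))
          else:
              pushed.append((idx, cond))
      return pushed

  # (t.key1part2=const OR t.key2part1=const) AND t.key1part1=const
  imerge = [("key1", "t.key1part2=const"), ("key2", "t.key2part1=const")]
  for idx, cond in push_range_into_imerge("key1", "t.key1part1=const", imerge):
      print(idx, ":", cond)
  # key1 : t.key1part2=const AND t.key1part1=const
  # key2 : t.key2part1=const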
2.2 New tree_or()
-----------------
O1. Don't remove non-range plans:
    The current tree_or() code will refuse to produce index_merge plans for
    conditions like

      "t.key1part2=const OR t.key2part1=const"

    (this is marked as DISCARD-IMERGE-3). This was justified because the left
    part of the AND condition is not usable for range access, and the operation
    of tree_and() guaranteed that there was no way it could be changed to make
    a usable range plan. With the new tree_and() and rule A2, this is no longer
    the case. For example, for this query:

      (t.key1part2=const OR t.key2part1=const) AND t.key1part1=const

    it will construct

      imerge(t.key1part2=const OR t.key2part1=const), range(t.key1part1=const)

    then tree_and() will apply rule A2 to push the range down into the index
    merge, and after that we'll have:

      range(t.key1part1=const)
      imerge(
        t.key1part2=const AND t.key1part1=const,
        t.key2part1=const
      )

    Note that imerge(...) describes a usable index_merge plan, and it's possible
    that it will be the best access path.
O2. "Create index_merge accesses when possible":
    The current tree_or() will not create index_merge access when it could
    create non-index_merge access (see DISCARD-IMERGE-2 and its example in the
    "Problems in the current implementation" section). This will be changed to
    work as follows: we will create an index_merge made of the index scans that
    didn't have their match in the other sel_tree.
    Illustrating it with an example:

          | sel_tree_A | sel_tree_B | A or B | include in index_merge?
    ------+------------+------------+--------+------------------------
     key1 | cond1      | cond2      | condM  | no
     key2 | cond3      | cond4      | NULL   | no
     key3 | cond5      |            |        | yes, A-side
     key4 | cond6      |            |        | yes, A-side
     key5 |            | cond7      |        | yes, B-side
     key6 |            | cond8      |        | yes, B-side

    Here we assume that
    - (cond1 OR cond2) did produce a combined range, so they are not included
      in the index_merge.
    - (cond3 OR cond4) didn't produce a usable range (e.g. they were
      t.key1part1=c1 and t.key1part2=c1, respectively, and combining them
      didn't yield any range list).
    - All other scans didn't have their counterparts, so we'll end up with a
      SEL_TREE of:

      range(condM) AND index_merge((cond5 AND cond6),(cond7 AND cond8))
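
The per-key matching in O2 can be sketched as follows. range_union() is a
stand-in that only succeeds for the key1 pair, so the sketch reproduces the
table above; the dict representation and function names are invented.

  # Sketch of the O2 rule in the new tree_or(): conditions that appear on only
  # one side are collected into an index_merge, grouped by side; keys whose
  # conditions exist on both sides but don't combine into a range are dropped.

  def range_union(a, b):
      # stand-in: pretend only the key1 conditions combine into a usable range
      return "condM" if (a, b) == ("cond1", "cond2") else None

  def new_tree_or(side_a, side_b):
      ranges, a_scans, b_scans = {}, [], []
      for key in sorted(set(side_a) | set(side_b)):
          ca, cb = side_a.get(key), side_b.get(key)
          if ca and cb:
              combined = range_union(ca, cb)
              if combined:
                  ranges[key] = combined    # key1 -> condM
              # key2: on both sides but no combined range -> not usable at all
          elif ca:
              a_scans.append(ca)            # key3, key4: A-side only
          else:
              b_scans.append(cb)            # key5, key6: B-side only
      imerge = [a_scans, b_scans] if a_scans and b_scans else []
      return {"ranges": ranges, "index_merge": imerge}

  A = {"key1": "cond1", "key2": "cond3", "key3": "cond5", "key4": "cond6"}
  B = {"key1": "cond2", "key2": "cond4", "key5": "cond7", "key6": "cond8"}
  print(new_tree_or(A, B))
  # {'ranges': {'key1': 'condM'},
  #  'index_merge': [['cond5', 'cond6'], ['cond7', 'cond8']]}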
O4. There is no O4. DISCARD-IMERGE-4 will remain in place. The idea is
    that although DISCARD-IMERGE-4 does discard plans, so far we haven't
    seen any complaints that could be attributed to it.
    If we face the need to lift DISCARD-IMERGE-4, our answer will be to
    lift it, and produce a cross-product:

      ((key1p OR key2p) AND (key3p OR key4p))
        OR
      ((key5p OR key6p) AND (key7p OR key8p))

      = (key1p OR key2p OR key5p OR key6p) AND  // this part is currently
        (key3p OR key4p OR key5p OR key6p) AND  // produced
        (key1p OR key2p OR key7p OR key8p) AND  // this part will be added
        (key3p OR key4p OR key7p OR key8p)      //.

    In order to limit the impact of this combinatorial explosion, we will
    introduce a rule that we won't generate more than a #defined
    MAX_IMERGE_OPTS options.
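
If DISCARD-IMERGE-4 were lifted as described, the cross-product could be
generated roughly as below. This is a sketch under the assumption that
combine() is simply the union of the two imerges' scans; MAX_IMERGE_OPTS and
its value here are the proposed limit, not an existing #define.

  # Sketch of the full cross-product: OR-ing two ANDed lists of imerges yields
  # one combined imerge per (imergeA_i, imergeB_j) pair, and generation stops
  # once the proposed MAX_IMERGE_OPTS limit is reached.

  MAX_IMERGE_OPTS = 16      # illustrative value only

  def or_of_imerge_lists(imerges_a, imerges_b):
      """Each imerge is a list of single-index scans; combine(a OR b) is
      modelled as the concatenation of the two scan lists."""
      result = []
      for a in imerges_a:
          for b in imerges_b:
              if len(result) >= MAX_IMERGE_OPTS:
                  return result         # cap the combinatorial explosion
              result.append(a + b)
      return result

  A = [["key1p", "key2p"], ["key3p", "key4p"]]   # (key1p OR key2p) AND (key3p OR key4p)
  B = [["key5p", "key6p"], ["key7p", "key8p"]]   # (key5p OR key6p) AND (key7p OR key8p)
  for imerge in or_of_imerge_lists(A, B):
      print("(" + " OR ".join(imerge) + ")")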
3. Testing and required coverage
================================
So far we could find the following use cases:

* BUG#17259: Query optimizer chooses wrong index
* BUG#17673: Optimizer does not use Index Merge optimization in some cases
* BUG#23322: Optimizer sometimes erroniously prefers other index over index merge
* BUG#30151: optimizer is very reluctant to chose index_merge algorithm
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v4.0.0)
[Maria-developers] WL#67 Updated (by Psergey): ICP/MRR backport
by worklog-noreply@askmonty.org 29 Jun '10
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: ICP/MRR backport
CREATION DATE..: Thu, 26 Nov 2009, 15:19
SUPERVISOR.....: Monty
IMPLEMENTOR....: Psergey
COPIES TO......:
CATEGORY.......: Server-Sprint
TASK ID........: 67 (http://askmonty.org/worklog/?tid=67)
VERSION........: Server-9.x
STATUS.........: Complete
PRIORITY.......: 60
WORKED HOURS...: 0
ESTIMATE.......: 0 (hours remain)
ORIG. ESTIMATE.: 0
PROGRESS NOTES:
-=-=(Psergey - Tue, 29 Jun 2010, 13:57)=-=-
Status updated.
--- /tmp/wklog.67.old.31561 2010-06-29 13:57:50.000000000 +0000
+++ /tmp/wklog.67.new.31561 2010-06-29 13:57:50.000000000 +0000
@@ -1 +1 @@
-Un-Assigned
+Complete
-=-=(Guest - Sun, 13 Jun 2010, 16:57)=-=-
Dependency deleted: 91 no longer depends on 67
-=-=(Igor - Wed, 10 Mar 2010, 19:14)=-=-
High Level Description modified.
--- /tmp/wklog.67.old.25641 2010-03-10 19:14:45.000000000 +0000
+++ /tmp/wklog.67.new.25641 2010-03-10 19:14:45.000000000 +0000
@@ -1,2 +1,2 @@
-Backport DS-MRR into MariaDB-5.2 codebase, also adding certain extra features to
-make it more usable.
+Backport ICP and DS-MRR into MariaDB-5.2 codebase, also adding certain extra
+features to make it more usable.
-=-=(Guest - Wed, 10 Mar 2010, 19:12)=-=-
Title modified.
--- /tmp/wklog.67.old.25456 2010-03-10 19:12:57.000000000 +0000
+++ /tmp/wklog.67.new.25456 2010-03-10 19:12:57.000000000 +0000
@@ -1 +1 @@
-MRR backport
+ICP/MRR backport
-=-=(Psergey - Sun, 28 Feb 2010, 14:56)=-=-
Dependency created: 91 now depends on 67
-=-=(Psergey - Sun, 28 Feb 2010, 14:54)=-=-
Dependency deleted: 94 no longer depends on 67
-=-=(Psergey - Sun, 28 Feb 2010, 14:09)=-=-
Dependency created: 94 now depends on 67
-=-=(Psergey - Thu, 26 Nov 2009, 20:21)=-=-
High-Level Specification modified.
--- /tmp/wklog.67.old.9329 2009-11-26 20:21:28.000000000 +0200
+++ /tmp/wklog.67.new.9329 2009-11-26 20:21:28.000000000 +0200
@@ -65,17 +65,19 @@
2.5 Make MRR code more of a module
----------------------------------
-Some code in handler.cc can be moved to separate file.
-But changes in opt_range.cc can't.
-TODO: Sort out how much we really can do here. Initial guess is not much as the
-code consists of:
+It is not possible to make MRR to be a totally separate module, as its code
+consists of :
- Default MRR implementation in handler.cc
- Changes in opt_range.cc to use MRR instead of multiple records_in_range()
- calls. These rely on opt_range.cc's internal structures like SEL_ARG trees and
+ calls. These rely on opt_range.cc's internal stuctures like SEL_ARG trees and
so there is not much point in moving them out.
-- DS-MRR implementations which are spread over storage engines.
-and the only modularization we see is to move #1 into a separate file which
-won't achieve much.
+- DS-MRR impelementations which are spread over storage engines.
+
+We'll try to modularize what we can:
+- Move out default MRR implementation from handler.cc
+- Move possible parts out of opt_range.cc into a separate file.
+
+
2.6 Improve the cost model
--------------------------
-=-=(Psergey - Thu, 26 Nov 2009, 19:06)=-=-
High-Level Specification modified.
--- /tmp/wklog.67.old.6449 2009-11-26 19:06:04.000000000 +0200
+++ /tmp/wklog.67.new.6449 2009-11-26 19:06:04.000000000 +0200
@@ -1,4 +1,3 @@
-
<contents>
1. Requirements
2. Required actions
@@ -44,6 +43,7 @@
http://bugs.mysql.com/search.php?cmd=display&status=Active&tags=index_condi…
http://bugs.mysql.com/search.php?cmd=display&status=Active&tags=mrr
+http://bugs.mysql.com/search.php?cmd=display&status=Active&tags=icp
2.2 Backport DS-MRR code to MariaDB 5.2
---------------------------------------
-=-=(Psergey - Thu, 26 Nov 2009, 18:15)=-=-
High-Level Specification modified.
--- /tmp/wklog.67.old.4161 2009-11-26 18:15:36.000000000 +0200
+++ /tmp/wklog.67.new.4161 2009-11-26 18:15:36.000000000 +0200
@@ -1,3 +1,17 @@
+
+<contents>
+1. Requirements
+2. Required actions
+2.1 Fix DS-MRR/InnoDB bugs
+2.2 Backport DS-MRR code to MariaDB 5.2
+2.3 Introduce control variables
+2.4 Other backport issues
+2.5 Make MRR code more of a module
+2.6 Improve the cost model
+2.7 Let DS-MRR support clustered primary keys
+</contents>
+
+
1. Requirements
===============
@@ -63,4 +77,28 @@
and the only modularization we see is to move #1 into a separate file which
won't achieve much.
+2.6 Improve the cost model
+--------------------------
+At the moment DS-MRR cost formula re-uses non-MRR scan costs, which uses
+records_in_range() calls, followed by index_only_read_time() or read_time()
+calls to produce the estimate for read cost.
+
+ We should change this (TODO sort out how exactly)
+
+Note: this means that the query plans will change from MariaDB 5.2.
+
+2.7 Let DS-MRR support clustered primary keys
+---------------------------------------------
+At the moment DS-MRR is not supported for clustered primary keys. It is not
+needed when MRR is used for range access, because range access is done over
+an ordered list of ranges, but it is useful for BKA.
+
+TODO:
+ it's useful for BKA because BKA makes MRR scans over un-orderered
+ non-disjoint lists of ranges. Then we can sort these and do ordered scans.
+ There is still no use for DS-MRR over clustered primary key for range
+ access, where the ranges are disjoint and ordered.
+ How about postponing this item until BKA is backported?
+
+
------------------------------------------------------------
-=-=(View All Progress Notes, 11 total)=-=-
http://askmonty.org/worklog/index.pl?tid=67&nolimit=1
DESCRIPTION:
Backport ICP and DS-MRR into MariaDB-5.2 codebase, also adding certain extra
features to make it more usable.
HIGH-LEVEL SPECIFICATION:
<contents>
1. Requirements
2. Required actions
2.1 Fix DS-MRR/InnoDB bugs
2.2 Backport DS-MRR code to MariaDB 5.2
2.3 Introduce control variables
2.4 Other backport issues
2.5 Make MRR code more of a module
2.6 Improve the cost model
2.7 Let DS-MRR support clustered primary keys
</contents>
1. Requirements
===============
We need the following:
1. Latest MRR interface support, including extensions to support ICP when
using BKA.
2. Let DS-MRR support clustered primary keys (needed when using BKA).
3. Remove conditions used for key access from the condition pushed to index
(ATM this manifests itself as "Using index condition" appearing where there
was no "Using where". TODO: example of this?)
4. Introduce a separate @@optimizer_switch flag for turning ICP on/off (atm it
   is switched on/off by @@engine_condition_pushdown).
5. Introduce a separate @@mrr_buffer_size variable to control MRR buffer size
   for range+MRR scans. ATM it is controlled by the @@read_rnd_size flag and
   that makes it non-obvious for a number of users.
6. Rename multi_range_read_info_const() to look like it is not a part of the
   MRR interface.
7. Try to make MRR more of a module.
8. Improve MRR's cost model.
2. Required actions
===================
Roughly in the order in which it will be done:
2.1 Fix DS-MRR/InnoDB bugs
--------------------------
We need to fix the bugs listed here:
http://bugs.mysql.com/search.php?cmd=display&status=Active&tags=index_condi…
http://bugs.mysql.com/search.php?cmd=display&status=Active&tags=mrr
http://bugs.mysql.com/search.php?cmd=display&status=Active&tags=icp
2.2 Backport DS-MRR code to MariaDB 5.2
---------------------------------------
The easiest way seems to be to manually move the needed code from mysql-6.0
(or whatever it's called now) to MariaDB.
2.3 Introduce control variables
-------------------------------
Act on items #4 and #5 from the requirements. Should be easy as
@@optimizer_switch is supported in 5.1 codebase.
2.4 Other backport issues
-------------------------
* Figure out what to do with NDB/MRR. 5.1 codebase has "old" NDB/MRR
implementation. mysql-6.0 (and NDB's branch) have the updated NDB/MRR
but merging it into 5.1 can be very labor-intensive.
Will it be ok to disable NDB/MRR altogether?
2.5 Make MRR code more of a module
----------------------------------
It is not possible to make MRR a totally separate module, as its code
consists of:
- Default MRR implementation in handler.cc
- Changes in opt_range.cc to use MRR instead of multiple records_in_range()
  calls. These rely on opt_range.cc's internal structures like SEL_ARG trees and
  so there is not much point in moving them out.
- DS-MRR implementations which are spread over storage engines.
We'll try to modularize what we can:
- Move out default MRR implementation from handler.cc
- Move possible parts out of opt_range.cc into a separate file.
2.6 Improve the cost model
--------------------------
At the moment the DS-MRR cost formula re-uses the non-MRR scan costs, which use
records_in_range() calls, followed by index_only_read_time() or read_time()
calls, to produce the estimate for read cost.
We should change this (TODO: sort out how exactly).
Note: this means that the query plans will change from MariaDB 5.2.
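
For orientation only, here is a toy model of the estimate described above:
per-range row counts come from records_in_range(), and the row total is then
turned into a cost with index_only_read_time() or read_time(). All three
functions below are made-up stubs with invented constants; the real handler
methods have different signatures and the actual formula lives in the server.

  # Toy model of the current (non-MRR-aware) read cost estimate that DS-MRR
  # re-uses. Everything here is an illustrative stub, not the handler interface.

  def records_in_range(rng):
      return rng["estimated_rows"]      # per-range row estimate

  def index_only_read_time(rows):
      return rows * 0.5                 # made-up constant for index-only reads

  def read_time(rows):
      return rows * 1.0                 # made-up constant for full row reads

  def current_range_read_cost(ranges, index_only):
      rows = sum(records_in_range(r) for r in ranges)
      return index_only_read_time(rows) if index_only else read_time(rows)

  print(current_range_read_cost([{"estimated_rows": 120},
                                 {"estimated_rows": 30}], index_only=False))   # 150.0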
2.7 Let DS-MRR support clustered primary keys
---------------------------------------------
At the moment DS-MRR is not supported for clustered primary keys. It is not
needed when MRR is used for range access, because range access is done over
an ordered list of ranges, but it is useful for BKA.
TODO:
  It's useful for BKA because BKA makes MRR scans over un-ordered,
  non-disjoint lists of ranges. Then we can sort these and do ordered scans.
  There is still no use for DS-MRR over a clustered primary key for range
  access, where the ranges are disjoint and ordered.
  How about postponing this item until BKA is backported?
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v4.0.0)
[Maria-developers] WL#67 Updated (by Psergey): ICP/MRR backport
by worklog-noreply@askmonty.org 29 Jun '10
by worklog-noreply@askmonty.org 29 Jun '10
29 Jun '10
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: ICP/MRR backport
CREATION DATE..: Thu, 26 Nov 2009, 15:19
SUPERVISOR.....: Monty
IMPLEMENTOR....: Psergey
COPIES TO......:
CATEGORY.......: Server-Sprint
TASK ID........: 67 (http://askmonty.org/worklog/?tid=67)
VERSION........: Server-9.x
STATUS.........: Complete
PRIORITY.......: 60
WORKED HOURS...: 0
ESTIMATE.......: 0 (hours remain)
ORIG. ESTIMATE.: 0
PROGRESS NOTES:
-=-=(Psergey - Tue, 29 Jun 2010, 13:57)=-=-
Status updated.
--- /tmp/wklog.67.old.31561 2010-06-29 13:57:50.000000000 +0000
+++ /tmp/wklog.67.new.31561 2010-06-29 13:57:50.000000000 +0000
@@ -1 +1 @@
-Un-Assigned
+Complete
-=-=(Guest - Sun, 13 Jun 2010, 16:57)=-=-
Dependency deleted: 91 no longer depends on 67
-=-=(Igor - Wed, 10 Mar 2010, 19:14)=-=-
High Level Description modified.
--- /tmp/wklog.67.old.25641 2010-03-10 19:14:45.000000000 +0000
+++ /tmp/wklog.67.new.25641 2010-03-10 19:14:45.000000000 +0000
@@ -1,2 +1,2 @@
-Backport DS-MRR into MariaDB-5.2 codebase, also adding certain extra features to
-make it more usable.
+Backport ICP and DS-MRR into MariaDB-5.2 codebase, also adding certain extra
+features to make it more usable.
-=-=(Guest - Wed, 10 Mar 2010, 19:12)=-=-
Title modified.
--- /tmp/wklog.67.old.25456 2010-03-10 19:12:57.000000000 +0000
+++ /tmp/wklog.67.new.25456 2010-03-10 19:12:57.000000000 +0000
@@ -1 +1 @@
-MRR backport
+ICP/MRR backport
-=-=(Psergey - Sun, 28 Feb 2010, 14:56)=-=-
Dependency created: 91 now depends on 67
-=-=(Psergey - Sun, 28 Feb 2010, 14:54)=-=-
Dependency deleted: 94 no longer depends on 67
-=-=(Psergey - Sun, 28 Feb 2010, 14:09)=-=-
Dependency created: 94 now depends on 67
-=-=(Psergey - Thu, 26 Nov 2009, 20:21)=-=-
High-Level Specification modified.
--- /tmp/wklog.67.old.9329 2009-11-26 20:21:28.000000000 +0200
+++ /tmp/wklog.67.new.9329 2009-11-26 20:21:28.000000000 +0200
@@ -65,17 +65,19 @@
2.5 Make MRR code more of a module
----------------------------------
-Some code in handler.cc can be moved to separate file.
-But changes in opt_range.cc can't.
-TODO: Sort out how much we really can do here. Initial guess is not much as the
-code consists of:
+It is not possible to make MRR to be a totally separate module, as its code
+consists of :
- Default MRR implementation in handler.cc
- Changes in opt_range.cc to use MRR instead of multiple records_in_range()
- calls. These rely on opt_range.cc's internal structures like SEL_ARG trees and
+ calls. These rely on opt_range.cc's internal stuctures like SEL_ARG trees and
so there is not much point in moving them out.
-- DS-MRR implementations which are spread over storage engines.
-and the only modularization we see is to move #1 into a separate file which
-won't achieve much.
+- DS-MRR impelementations which are spread over storage engines.
+
+We'll try to modularize what we can:
+- Move out default MRR implementation from handler.cc
+- Move possible parts out of opt_range.cc into a separate file.
+
+
2.6 Improve the cost model
--------------------------
-=-=(Psergey - Thu, 26 Nov 2009, 19:06)=-=-
High-Level Specification modified.
--- /tmp/wklog.67.old.6449 2009-11-26 19:06:04.000000000 +0200
+++ /tmp/wklog.67.new.6449 2009-11-26 19:06:04.000000000 +0200
@@ -1,4 +1,3 @@
-
<contents>
1. Requirements
2. Required actions
@@ -44,6 +43,7 @@
http://bugs.mysql.com/search.php?cmd=display&status=Active&tags=index_condi…
http://bugs.mysql.com/search.php?cmd=display&status=Active&tags=mrr
+http://bugs.mysql.com/search.php?cmd=display&status=Active&tags=icp
2.2 Backport DS-MRR code to MariaDB 5.2
---------------------------------------
-=-=(Psergey - Thu, 26 Nov 2009, 18:15)=-=-
High-Level Specification modified.
--- /tmp/wklog.67.old.4161 2009-11-26 18:15:36.000000000 +0200
+++ /tmp/wklog.67.new.4161 2009-11-26 18:15:36.000000000 +0200
@@ -1,3 +1,17 @@
+
+<contents>
+1. Requirements
+2. Required actions
+2.1 Fix DS-MRR/InnoDB bugs
+2.2 Backport DS-MRR code to MariaDB 5.2
+2.3 Introduce control variables
+2.4 Other backport issues
+2.5 Make MRR code more of a module
+2.6 Improve the cost model
+2.7 Let DS-MRR support clustered primary keys
+</contents>
+
+
1. Requirements
===============
@@ -63,4 +77,28 @@
and the only modularization we see is to move #1 into a separate file which
won't achieve much.
+2.6 Improve the cost model
+--------------------------
+At the moment DS-MRR cost formula re-uses non-MRR scan costs, which uses
+records_in_range() calls, followed by index_only_read_time() or read_time()
+calls to produce the estimate for read cost.
+
+ We should change this (TODO sort out how exactly)
+
+Note: this means that the query plans will change from MariaDB 5.2.
+
+2.7 Let DS-MRR support clustered primary keys
+---------------------------------------------
+At the moment DS-MRR is not supported for clustered primary keys. It is not
+needed when MRR is used for range access, because range access is done over
+an ordered list of ranges, but it is useful for BKA.
+
+TODO:
+ it's useful for BKA because BKA makes MRR scans over un-orderered
+ non-disjoint lists of ranges. Then we can sort these and do ordered scans.
+ There is still no use for DS-MRR over clustered primary key for range
+ access, where the ranges are disjoint and ordered.
+ How about postponing this item until BKA is backported?
+
+
------------------------------------------------------------
-=-=(View All Progress Notes, 11 total)=-=-
http://askmonty.org/worklog/index.pl?tid=67&nolimit=1
DESCRIPTION:
Backport ICP and DS-MRR into MariaDB-5.2 codebase, also adding certain extra
features to make it more usable.
HIGH-LEVEL SPECIFICATION:
<contents>
1. Requirements
2. Required actions
2.1 Fix DS-MRR/InnoDB bugs
2.2 Backport DS-MRR code to MariaDB 5.2
2.3 Introduce control variables
2.4 Other backport issues
2.5 Make MRR code more of a module
2.6 Improve the cost model
2.7 Let DS-MRR support clustered primary keys
</contents>
1. Requirements
===============
We need the following:
1. Latest MRR interface support, including extensions to support ICP when
using BKA.
2. Let DS-MRR support clustered primary keys (needed when using BKA).
3. Remove conditions used for key access from the condition pushed to index
(ATM this manifests itself as "Using index condition" appearing where there
was no "Using where". TODO: example of this?)
4. Introduce a separate @@optimizer_switch flag for turning on/out ICP (atm it
is switched on/off by @@engine_condition_pushdown)
5. Introduce a separate @@mrr_buffer_size variable to control MRR buffer size
for range+MRR scans. ATM it is controlled by @@read_rnd_size flag and that
makes it unobvious for a number of users.
6. Rename multi_range_read_info_const() to look like it is not a part of MRR
interface.
8. Try to make MRR to be more of a module
7. Improve MRR's cost model.
2. Required actions
===================
Roughly in the order in which it will be done:
2.1 Fix DS-MRR/InnoDB bugs
--------------------------
We need to fix the bugs listed here:
http://bugs.mysql.com/search.php?cmd=display&status=Active&tags=index_condi…
http://bugs.mysql.com/search.php?cmd=display&status=Active&tags=mrr
http://bugs.mysql.com/search.php?cmd=display&status=Active&tags=icp
2.2 Backport DS-MRR code to MariaDB 5.2
---------------------------------------
The easiest way seems to be to to manually move the needed code from mysql-6.0
(or whatever it's called now) to MariaDB.
2.3 Introduce control variables
-------------------------------
Act on items #4 and #5 from the requirements. Should be easy as
@@optimizer_switch is supported in 5.1 codebase.
2.4 Other backport issues
-------------------------
* Figure out what to do with NDB/MRR. 5.1 codebase has "old" NDB/MRR
implementation. mysql-6.0 (and NDB's branch) have the updated NDB/MRR
but merging it into 5.1 can be very labor-intensive.
Will it be ok to disable NDB/MRR altogether?
2.5 Make MRR code more of a module
----------------------------------
It is not possible to make MRR to be a totally separate module, as its code
consists of :
- Default MRR implementation in handler.cc
- Changes in opt_range.cc to use MRR instead of multiple records_in_range()
calls. These rely on opt_range.cc's internal stuctures like SEL_ARG trees and
so there is not much point in moving them out.
- DS-MRR impelementations which are spread over storage engines.
We'll try to modularize what we can:
- Move out default MRR implementation from handler.cc
- Move possible parts out of opt_range.cc into a separate file.
2.6 Improve the cost model
--------------------------
At the moment DS-MRR cost formula re-uses non-MRR scan costs, which uses
records_in_range() calls, followed by index_only_read_time() or read_time()
calls to produce the estimate for read cost.
We should change this (TODO sort out how exactly)
Note: this means that the query plans will change from MariaDB 5.2.
2.7 Let DS-MRR support clustered primary keys
---------------------------------------------
At the moment DS-MRR is not supported for clustered primary keys. It is not
needed when MRR is used for range access, because range access is done over
an ordered list of ranges, but it is useful for BKA.
TODO:
it's useful for BKA because BKA makes MRR scans over un-orderered
non-disjoint lists of ranges. Then we can sort these and do ordered scans.
There is still no use for DS-MRR over clustered primary key for range
access, where the ranges are disjoint and ordered.
How about postponing this item until BKA is backported?
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v4.0.0)
[Maria-developers] WL#120 Updated (by Knielsen): Replication API for stacked event generators
by worklog-noreply@askmonty.org 29 Jun '10
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: Replication API for stacked event generators
CREATION DATE..: Mon, 07 Jun 2010, 13:13
SUPERVISOR.....: Knielsen
IMPLEMENTOR....: Knielsen
COPIES TO......:
CATEGORY.......: Server-Sprint
TASK ID........: 120 (http://askmonty.org/worklog/?tid=120)
VERSION........: Server-9.x
STATUS.........: In-Progress
PRIORITY.......: 60
WORKED HOURS...: 2
ESTIMATE.......: 0 (hours remain)
ORIG. ESTIMATE.: 0
PROGRESS NOTES:
-=-=(Knielsen - Tue, 29 Jun 2010, 13:51)=-=-
Status updated.
--- /tmp/wklog.120.old.31179 2010-06-29 13:51:20.000000000 +0000
+++ /tmp/wklog.120.new.31179 2010-06-29 13:51:20.000000000 +0000
@@ -1 +1 @@
-Assigned
+In-Progress
-=-=(Knielsen - Mon, 28 Jun 2010, 07:11)=-=-
High-Level Specification modified.
--- /tmp/wklog.120.old.18416 2010-06-28 07:11:09.000000000 +0000
+++ /tmp/wklog.120.new.18416 2010-06-28 07:11:09.000000000 +0000
@@ -14,10 +14,14 @@
An example of an event consumer is the writing of the binlog on a master.
-Event generators are not really plugins. Rather, there are specific points in
-the server where events are generated. However, a generator can be part of a
-plugin, for example a PBXT engine-level replication event generator would be
-part of the PBXT storage engine plugin.
+Some event generators are not really plugins. Rather, there are specific
+points in the server where events are generated. However, a generator can be
+part of a plugin, for example a PBXT engine-level replication event generator
+would be part of the PBXT storage engine plugin. And for example we could
+write a filter plugin, which would be stacked on top of an existing generator
+and provide the same event types and interfaces, but filtered in some way (for
+example by removing certain events on the master side, or by re-writing events
+in certain ways).
Event consumers on the other hand could be a plugin.
@@ -85,6 +89,9 @@
some is kept as reference to context (eg. THD) only. This however looses most
of the mentioned advantages for materialisation.
+Also, the non-materialising interface should be a good interface on top of
+which to build a materialising interface.
+
The design proposed here aims for as little materialisation as possible.
-=-=(Knielsen - Thu, 24 Jun 2010, 14:29)=-=-
Category updated.
--- /tmp/wklog.120.old.7440 2010-06-24 14:29:38.000000000 +0000
+++ /tmp/wklog.120.new.7440 2010-06-24 14:29:38.000000000 +0000
@@ -1 +1 @@
-Server-RawIdeaBin
+Server-Sprint
-=-=(Knielsen - Thu, 24 Jun 2010, 14:29)=-=-
Status updated.
--- /tmp/wklog.120.old.7440 2010-06-24 14:29:38.000000000 +0000
+++ /tmp/wklog.120.new.7440 2010-06-24 14:29:38.000000000 +0000
@@ -1 +1 @@
-Un-Assigned
+Assigned
-=-=(Knielsen - Thu, 24 Jun 2010, 14:29)=-=-
Low Level Design modified.
--- /tmp/wklog.120.old.7355 2010-06-24 14:29:11.000000000 +0000
+++ /tmp/wklog.120.new.7355 2010-06-24 14:29:11.000000000 +0000
@@ -1 +1,385 @@
+A consumer is implented as a virtual class (interface). There is one virtual
+function for every event that can be received. A consumer would derive from
+the base class and override methods for the events it wants to receive.
+
+There is one consumer interface for each generator. When a generator A is
+stacked on B, the consumer interface for A inherits from the interface for
+B. This way, when A defers an event to B, the consumer for A will receive the
+corresponding event from B.
+
+There are methods for a consumer to register itself to receive events from
+each generator. I still need to find a way for a consumer in one plugin to
+register itself with a generator implemented in another plugin (eg. PBXT
+engine-level replication). I also need to add a way for consumers to
+de-register themselves.
+
+The current design has consumer callbacks return 0 for success and error code
+otherwise. I still need to think more about whether this is useful (ie. what
+is the semantics of returning an error from a consumer callback).
+
+Each event passed to consumers is defined as a class with public accessor
+methods to a private context (which is mostly the THD).
+
+My intension is to make all events passed around const, so that the same event
+can be passed to each of multiple registered consumers (and to emphasise that
+consumers do not have the ability to modify events). It still needs to be seen
+whether that const-ness will be feasible in practise without very heavy
+modification/constification of exiting code.
+
+What follows is a partial draft of a possible definition of the API as
+concrete C++ class definitions.
+
+-----------------------------------------------------------------------
+
+/*
+ Virtual base class for generated replication events.
+
+ This is the parent of events generated from all kinds of generators. Only
+ child classes can be instantiated.
+
+ This class can be used by code that wants to treat events in a generic way,
+ without any knowledge of event details. I still need to decide whether such
+ generic code is sensible.
+*/
+class rpl_event_base
+{
+ /*
+ Maybe we will want the ability to materialise an event to a standard
+ binary format. This could be achieved with a base method like this. The
+ actual materialisation would be implemented in each deriving class. The
+ public methods would provide different interfaces for specifying the
+ buffer or for writing directly into IO_CACHE or file.
+ */
+
+ /* Return 0 on success, -1 on error, -2 on out-of-buffer. */
+ int materialise(uchar *buffer, size_t buflen) const;
+ /*
+ Returns NULL on error or else malloc()ed buffer with materialised event,
+ caller must free().
+ */
+ uchar *materialise() const;
+ /* Same but using passed in memroot. */
+ uchar *materialise(mem_root *memroot) const;
+ /*
+ Materialise to user-supplied writer function (could write directly to file
+ or the like).
+ */
+ int materialise(int (*writer)(uchar *data, size_t len, void *context)) const;
+
+ /*
+ As to for what to do with a materialised event, there are a couple of
+ possibilities.
+
+ One is to have a de_materialise() method somewhere that can construct an
+ rpl_event_base (really a derived class of course) from a buffer or writer
+ function. This would require each accessor function to conditionally read
+ its data from either THD context or buffer (GCC is able to optimise
+ several such conditionals in multiple accessor function calls into one
+ conditional), or we can make all accessors virtual if the performance hit
+ is acceptable.
+
+ Another is to have different classes for accessing events read from
+ materialised event data.
+
+ Also, I still need to think about whether it is at all useful to be able
+ to generically materialise an event at this level. It may be that any
+ binlog/transport will in any case need to undertand more of the format of
+ events, so that such materialisation/transport is better done at a
+ different layer.
+ */
+
+protected:
+ /* Implementation which is the basis for materialise(). */
+ virtual int do_materialise(int (*writer)(uchar *data, size_t len,
+ void *context)) const = 0;
+
+private:
+ /* Virtual base class, private constructor to prevent instantiation. */
+ rpl_event_base();
+};
+
+
+/*
+ These are the event types output from the transaction event generator.
+
+ This generator is not stacked on anything.
+
+ The transaction event generator marks the start and end (commit or rollback)
+ of transactions. It also gives information about whether the transaction was
+ a full transaction or autocommitted statement, whether transactional tables
+ were involved, whether non-transactional tables were involved, and XA
+ information (ToDo).
+*/
+
+/* Base class for transaction events. */
+class rpl_event_transaction_base : public rpl_event_base
+{
+public:
+ /*
+ Get the local transaction id. This idea is only unique within one server.
+ It is allocated whenever a new transaction is started.
+ Can be used to identify events belonging to the same transaction in a
+ binlog-like stream of events streamed in parallel among multiple
+ transactions.
+ */
+ uint64_t get_local_trx_id() const { return thd->local_trx_id; };
+
+ bool get_is_autocommit() const;
+
+private:
+ /* The context is the THD. */
+ THD *thd;
+
+ rpl_event_transaction_base(THD *_thd) : thd(_thd) { };
+};
+
+/* Transaction start event. */
+class rpl_event_transaction_start : public rpl_event_transaction_base
+{
+
+};
+
+/* Transaction commit. */
+class rpl_event_transaction_commit : public rpl_event_transaction_base
+{
+public:
+ /*
+ The global transaction id is unique cross-server.
+
+ It can be used to identify the position from which to start a slave
+ replicating from a master.
+
+ This global ID is only available once the transaction is decided to commit
+ by the TC manager / primary redundancy service. This TC also allocates the
+ ID and decides the exact semantics (can there be gaps, etc); however the
+ format is fixed (cluster_id, running_counter).
+ */
+ struct global_transaction_id
+ {
+ uint32_t cluster_id;
+ uint64_t counter;
+ };
+
+ const global_transaction_id *get_global_transaction_id() const;
+};
+
+/* Transaction rollback. */
+class rpl_event_transaction_rollback : public rpl_event_transaction_base
+{
+
+};
+
+
+/* Base class for statement events. */
+class rpl_event_statement_base : public rpl_event_base
+{
+public:
+ LEX_STRING get_current_db() const;
+};
+
+class rpl_event_statement_start : public rpl_event_statement_base
+{
+
+};
+
+class rpl_event_statement_end : public rpl_event_statement_base
+{
+public:
+ int get_errorcode() const;
+};
+
+class rpl_event_statement_query : public rpl_event_statement_base
+{
+public:
+ LEX_STRING get_query_string();
+ ulong get_sql_mode();
+ const CHARSET_INFO *get_character_set_client();
+ const CHARSET_INFO *get_collation_connection();
+ const CHARSET_INFO *get_collation_server();
+ const CHARSET_INFO *get_collation_default_db();
+
+ /*
+ Access to relevant flags that affect query execution.
+
+ Use as if (ev->get_flags() & (uint32)ROW_FOREIGN_KEY_CHECKS) { ... }
+ */
+ enum flag_bits
+ {
+ STMT_FOREIGN_KEY_CHECKS, // @@foreign_key_checks
+ STMT_UNIQUE_KEY_CHECKS, // @@unique_checks
+ STMT_AUTO_IS_NULL, // @@sql_auto_is_null
+ };
+ uint32_t get_flags();
+
+ ulong get_auto_increment_offset();
+ ulong get_auto_increment_increment();
+
+ // And so on for: time zone; day/month names; connection id; LAST_INSERT_ID;
+ // INSERT_ID; random seed; user variables.
+ //
+ // We probably also need get_uses_temporary_table(), get_used_user_vars(),
+ // get_uses_auto_increment() and so on, so a consumer can get more
+ // information about what kind of context information a query will need when
+ // executed on a slave.
+};
+
+class rpl_event_statement_load_query : public rpl_event_statement_query
+{
+
+};
+
+/*
+ This event is fired with blocks of data for files read (from server-local
+ file or client connection) for LOAD DATA.
+*/
+class rpl_event_statement_load_data_block : public rpl_event_statement_base
+{
+public:
+ struct block
+ {
+ const uchar *ptr;
+ size_t size;
+ };
+ block get_block() const;
+};
+
+/* Base class for row-based replication events. */
+class rpl_event_row_base : public rpl_event_base
+{
+public:
+ /*
+ Access to relevant handler extra flags and other flags that affect row
+ operations.
+
+ Use as if (ev->get_flags() & (uint32)ROW_WRITE_CAN_REPLACE) { ... }
+ */
+ enum flag_bits
+ {
+ ROW_WRITE_CAN_REPLACE, // HA_EXTRA_WRITE_CAN_REPLACE
+ ROW_IGNORE_DUP_KEY, // HA_EXTRA_IGNORE_DUP_KEY
+ ROW_IGNORE_NO_KEY, // HA_EXTRA_IGNORE_NO_KEY
+ ROW_DISABLE_FOREIGN_KEY_CHECKS, // ! @@foreign_key_checks
+ ROW_DISABLE_UNIQUE_KEY_CHECKS, // ! @@unique_checks
+ };
+ uint32_t get_flags();
+
+ /* Access to list of tables modified. */
+ class table_iterator
+ {
+ public:
+ /* Returns table, NULL after last. */
+ const TABLE *get_next();
+ private:
+ // ...
+ };
+ table_iterator get_modified_tables() const;
+
+private:
+ /* Context used to provide accessors. */
+ THD *thd;
+
+protected:
+ rpl_event_row_base(THD *_thd) : thd(_thd) { }
+};
+
+
+class rpl_event_row_write : public rpl_event_row_base
+{
+public:
+ const BITMAP *get_write_set() const;
+ const uchar *get_after_image() const;
+};
+
+class rpl_event_row_update : public rpl_event_row_base
+{
+public:
+ const BITMAP *get_read_set() const;
+ const BITMAP *get_write_set() const;
+ const uchar *get_before_image() const;
+ const uchar *get_after_image() const;
+};
+
+class rpl_event_row_delete : public rpl_event_row_base
+{
+public:
+ const BITMAP *get_read_set() const;
+ const uchar *get_before_image() const;
+};
+
+
+/*
+ Event consumer callbacks.
+
+ An event consumer registers with an event generator to receive event
+ notifications from that generator.
+
+ The consumer has callbacks (in the form of virtual functions) for the
+ individual event types the consumer is interested in. Only callbacks that
+ are non-NULL will be invoked. If an event applies to multiple callbacks in a
+ single callback struct, it will only be passed to the most specific non-NULL
+ callback (so events never fire more than once per registration).
+
+ The lifetime of the memory holding the event is only for the duration of the
+ callback invocation, unless otherwise noted.
+
+ Callbacks return 0 for success or error code (ToDo: does this make sense?).
+*/
+
+struct rpl_event_consumer_transaction
+{
+ virtual int trx_start(const rpl_event_transaction_start *) { return 0; }
+ virtual int trx_commit(const rpl_event_transaction_commit *) { return 0; }
+ virtual int trx_rollback(const rpl_event_transaction_rollback *) { return 0; }
+};
+
+/*
+ Consuming statement-based events.
+
+ The statement event generator is stacked on top of the transaction event
+ generator, so we can receive those events as well.
+*/
+struct rpl_event_consumer_statement : public rpl_event_consumer_transaction
+{
+ virtual int stmt_start(const rpl_event_statement_start *) { return 0; }
+ virtual int stmt_end(const rpl_event_statement_end *) { return 0; }
+
+ virtual int stmt_query(const rpl_event_statement_query *) { return 0; }
+
+ /* Data for a file used in LOAD DATA [LOCAL] INFILE. */
+ virtual int stmt_load_data_block(const rpl_event_statement_load_data_block *)
+ { return 0; }
+
+ /*
+ These are specific kinds of statements; if specified they override
+ consume_stmt_query() for the corresponding event.
+ */
+ virtual int stmt_load_query(const rpl_event_statement_load_query *ev)
+ { return stmt_query(ev); }
+};
+
+/*
+ Consuming row-based events.
+
+ The row event generator is stacked on top of the statement event generator.
+*/
+struct rpl_event_consumer_row : public rpl_event_consumer_statement
+{
+ virtual int row_write(const rpl_event_row_write *) { return 0; }
+ virtual int row_update(const rpl_event_row_update *) { return 0; }
+ virtual int row_delete(const rpl_event_row_delete *) { return 0; }
+};
+
+
+/*
+ Registration functions.
+
+ ToDo: Make a way to de-register.
+
+ ToDo: Find a good way for a plugin (eg. PBXT) to publish a generator
+ registration method.
+*/
+
+int rpl_event_transaction_register(const rpl_event_consumer_transaction *cbs);
+int rpl_event_statement_register(const rpl_event_consumer_statement *cbs);
+int rpl_event_row_register(const rpl_event_consumer_row *cbs);
-=-=(Knielsen - Thu, 24 Jun 2010, 14:28)=-=-
High-Level Specification modified.
--- /tmp/wklog.120.old.7341 2010-06-24 14:28:17.000000000 +0000
+++ /tmp/wklog.120.new.7341 2010-06-24 14:28:17.000000000 +0000
@@ -1 +1,159 @@
+Generators and consumbers
+-------------------------
+
+We have the two concepts:
+
+1. Event _generators_, that produce events describing all changes to data in a
+ server.
+
+2. Event consumers, that receive such events and use them in various ways.
+
+Examples of event generators is execution of SQL statements, which generates
+events like those used for statement-based replication. Another example is
+PBXT engine-level replication.
+
+An example of an event consumer is the writing of the binlog on a master.
+
+Event generators are not really plugins. Rather, there are specific points in
+the server where events are generated. However, a generator can be part of a
+plugin, for example a PBXT engine-level replication event generator would be
+part of the PBXT storage engine plugin.
+
+Event consumers on the other hand could be a plugin.
+
+One generator can be stacked on top of another. This means that a generator on
+top (for example row-based events) will handle some events itself
+(eg. non-deterministic update in mixed-mode binlogging). Other events that it
+does not want to or cannot handle (for example deterministic delete or DDL)
+will be defered to the generator below (for example statement-based events).
+
+
+Materialisation (or not)
+------------------------
+
+A central decision is how to represent events that are generated in the API at
+the point of generation.
+
+I want to avoid making the API require that events are materialised. By
+"Materialised" I mean that all (or most) of the data for the event is written
+into memory in a struct/class used inside the server or serialised in a data
+buffer (byte buffer) in a format suitable for network transport or disk
+storage.
+
+Using a non-materialised event means storing just a reference to appropriate
+context that allows to retrieve all information for the event using
+accessors. Ie. typically this would be based on getting the event information
+from the THD pointer.
+
+Some reasons to avoid using materialised events in the API:
+
+ - Replication events have a _lot_ of detailed context information that can be
+ needed in events: user-defined variables, random seed, character sets,
+ table column names and types, etc. etc. If we make the API based on
+ materialisation, then the initial decision about which context information
+ to include with which events will have to be done in the API, while ideally
+ we want this decision to be done by the individual consumer plugin. There
+ will this be a conflict between what to include (to allow consumers access)
+ and what to exclude (to avoid excessive needless work).
+
+ - Materialising means defining a very specific format, which will tend to
+ make the API less generic and flexible.
+
+ - Unless the materialised format is made _very_ specific (and thus very
+ inflexible), it is unlikely to be directly useful for transport
+ (eg. binlog), so it will need to be re-materialised into a different format
+ anyway, wasting work.
+
+ - If a generator on top handles an event, then we want to avoid wasting work
+ materialising an event in a generator below which would be completely
+ unused. Thus there would be a need for the upper generator to somehow
+ notify the lower generator ahead of event generation time to not fire an
+ event, complicating the API.
+
+Some advantages for materialisation:
+
+ - Using an API based on passing around some well-defined struct event (or
+ byte buffer) will be simpler than the complex class hierarchy proposed here
+ with no requirement for materialisation.
+
+ - Defining a materialised format would allow an easy way to use the same
+ consumer code on a generator that produces events at the source of
+ execution and on a generator that produces events from eg. reading them
+ from an event log.
+
+Note that there can be some middle way, where some data is materialised and
+some is kept as reference to context (eg. THD) only. This however looses most
+of the mentioned advantages for materialisation.
+
+The design proposed here aims for as little materialisation as possible.
+
+
+Default materialisation format
+------------------------------
+
+
+While the proposed API doesn't _require_ materialisation, we can still think
+about providing the _option_ for built-in materialisation. This could be
+useful if such materialisation is made suitable for transport to a different
+server (eg. no endian-dependance etc). If there is a facility for such
+materialisation built-in to the API, it becomes possible to write something
+like a generic binlog plugin or generic network transport plugin. This would
+be really useful for eg. PBXT engine-level replication, as it could be
+implemented without having to re-invent a binlog format.
+
+I added in the proposed API a simple facility to materialise every event as a
+string of bytes. To use this, I still need to add a suitable facility to
+de-materialise the event.
+
+However, it is still an open question whether such a facility will be at all
+useful. It still has some of the problems with materialisation mentioned
+above. And I think it is likely that a good binlog implementation will need
+to do more than just blindly copy opaque events from one endpoint to
+another. For example, it might need different event boundaries (merge and/or
+split events); it might need to augment or modify events, or inject new
+events, etc.
+
+So I think maybe it is better to add such a generic materialisation facility
+on top of the basic event generator API. Such a facility would provide
+materialisation of an replication event stream, not of individual events, so
+would be more flexible in providing a good implementation. It would be
+implemented for all generators. It would separate from both the event
+generator API (so we have flexibility to put a filter class in-between
+generator and materialisation), and could also be separate from the actual
+transport handling stuff like fsync() of binlog files and socket connections
+etc. It would be paired with a corresponding applier API which would handle
+executing events on a slave.
+
+Then we can have a default materialised event format, which is available, but
+not mandatory. So there can still be other formats alongside (like legacy
+MySQL 5.1 binlog event format and maybe Tungsten would have its own format).
+
+
+Encapsulation
+-------------
+
+Another fundamental question about the design is the level of encapsulation
+used for the API.
+
+At the implementation level, a lot of the work is basically to pull out all of
+the needed information from the THD object/context. The API I propose tries to
+_not_ expose the THD to consumers. Instead it provides accessor functions for
+all the bits and pieces relevant to each replication event, while the event
+class itself likely will be more or less just an encapsulated THD.
+
+So an alternative would be to have a generic event that was just (type, THD).
+Then consumers could just pull out whatever information they want from the
+THD. The THD implementation is already exposed to storage engines. This would
+of course greatly reduce the size of the API, eliminating lots of class
+definitions and accessor functions. Though arguably it wouldn't really
+simplify the API, as the complexity would just be in understanding the THD
+class.
+
+Note that we do not have to take any performance hit from using encapsulated
+accessors since compilers can inline them (though if inlining then we do not
+get any ABI stability with respect to THD implemetation).
+
+For now, the API is proposed without exposing the THD class. (Similar
+encapsulation could be added in actual implementation to also not expose TABLE
+and similar classes).
-=-=(Knielsen - Thu, 24 Jun 2010, 12:04)=-=-
Dependency created: 107 now depends on 120
-=-=(Knielsen - Thu, 24 Jun 2010, 11:59)=-=-
High Level Description modified.
--- /tmp/wklog.120.old.516 2010-06-24 11:59:24.000000000 +0000
+++ /tmp/wklog.120.new.516 2010-06-24 11:59:24.000000000 +0000
@@ -11,4 +11,4 @@
Event generators can be stacked, and a generator may defer event generation to
the next one down the stack. For example, the row-level replication event
-generator may defer DLL to the statement-level replication event generator.
+generator may defer DDL to the statement-level replication event generator.
-=-=(Knielsen - Mon, 21 Jun 2010, 08:35)=-=-
Research and design thoughts.
DESCRIPTION:
A part of the replication project, MWL#107.
Events are produced by event Generators. Examples are
- Generation of statement-based replication events
- Generation of row-based events
- Generation of PBXT engine-level replication events
and reading of events from the relay log on a slave may also be an example of
generating events.
Event generators can be stacked, and a generator may defer event generation to
the next one down the stack. For example, the row-level replication event
generator may defer DDL to the statement-level replication event generator.
HIGH-LEVEL SPECIFICATION:
Generators and consumers
-------------------------
We have the two concepts:
1. Event _generators_, that produce events describing all changes to data in a
server.
2. Event consumers, that receive such events and use them in various ways.
An example of an event generator is the execution of SQL statements, which generates
events like those used for statement-based replication. Another example is
PBXT engine-level replication.
An example of an event consumer is the writing of the binlog on a master.
Some event generators are not really plugins. Rather, there are specific
points in the server where events are generated. However, a generator can be
part of a plugin, for example a PBXT engine-level replication event generator
would be part of the PBXT storage engine plugin. And for example we could
write a filter plugin, which would be stacked on top of an existing generator
and provide the same event types and interfaces, but filtered in some way (for
example by removing certain events on the master side, or by re-writing events
in certain ways).
Event consumers on the other hand could be a plugin.
One generator can be stacked on top of another. This means that a generator on
top (for example row-based events) will handle some events itself
(eg. non-deterministic update in mixed-mode binlogging). Other events that it
does not want to or cannot handle (for example deterministic delete or DDL)
will be deferred to the generator below (for example statement-based events).
Materialisation (or not)
------------------------
A central decision is how to represent events that are generated in the API at
the point of generation.
I want to avoid making the API require that events are materialised. By
"Materialised" I mean that all (or most) of the data for the event is written
into memory in a struct/class used inside the server or serialised in a data
buffer (byte buffer) in a format suitable for network transport or disk
storage.
Using a non-materialised event means storing just a reference to appropriate
context that allows all information for the event to be retrieved using
accessors. I.e. typically this would be based on getting the event information
from the THD pointer.
Some reasons to avoid using materialised events in the API:
- Replication events have a _lot_ of detailed context information that can be
needed in events: user-defined variables, random seed, character sets,
table column names and types, etc. etc. If we make the API based on
materialisation, then the initial decision about which context information
to include with which events will have to be done in the API, while ideally
we want this decision to be done by the individual consumer plugin. There
will thus be a conflict between what to include (to allow consumers access)
and what to exclude (to avoid excessive needless work).
- Materialising means defining a very specific format, which will tend to
make the API less generic and flexible.
- Unless the materialised format is made _very_ specific (and thus very
inflexible), it is unlikely to be directly useful for transport
(eg. binlog), so it will need to be re-materialised into a different format
anyway, wasting work.
- If a generator on top handles an event, then we want to avoid wasting work
materialising an event in a generator below which would be completely
unused. Thus there would be a need for the upper generator to somehow
notify the lower generator ahead of event generation time to not fire an
event, complicating the API.
Some advantages for materialisation:
- Using an API based on passing around some well-defined struct event (or
byte buffer) will be simpler than the complex class hierarchy proposed here
with no requirement for materialisation.
- Defining a materialised format would allow an easy way to use the same
consumer code on a generator that produces events at the source of
execution and on a generator that produces events from eg. reading them
from an event log.
Note that there can be some middle way, where some data is materialised and
some is kept as reference to context (eg. THD) only. This however loses most
of the mentioned advantages of materialisation.
Also, the non-materialising interface should be a good interface on top of
which to build a materialising interface.
The design proposed here aims for as little materialisation as possible.
Default materialisation format
------------------------------
While the proposed API doesn't _require_ materialisation, we can still think
about providing the _option_ for built-in materialisation. This could be
useful if such materialisation is made suitable for transport to a different
server (eg. no endian-dependence etc). If there is a facility for such
materialisation built-in to the API, it becomes possible to write something
like a generic binlog plugin or generic network transport plugin. This would
be really useful for eg. PBXT engine-level replication, as it could be
implemented without having to re-invent a binlog format.
I added in the proposed API a simple facility to materialise every event as a
string of bytes. To use this, I still need to add a suitable facility to
de-materialise the event.
However, it is still an open question whether such a facility will be at all
useful. It still has some of the problems with materialisation mentioned
above. And I think it is likely that a good binlog implementation will need
to do more than just blindly copy opaque events from one endpoint to
another. For example, it might need different event boundaries (merge and/or
split events); it might need to augment or modify events, or inject new
events, etc.
So I think maybe it is better to add such a generic materialisation facility
on top of the basic event generator API. Such a facility would provide
materialisation of a replication event stream, not of individual events, so
would be more flexible in providing a good implementation. It would be
implemented for all generators. It would be separate from both the event
generator API (so we have flexibility to put a filter class in-between
generator and materialisation), and could also be separate from the actual
transport handling stuff like fsync() of binlog files and socket connections
etc. It would be paired with a corresponding applier API which would handle
executing events on a slave.
Then we can have a default materialised event format, which is available, but
not mandatory. So there can still be other formats alongside (like legacy
MySQL 5.1 binlog event format and maybe Tungsten would have its own format).
Encapsulation
-------------
Another fundamental question about the design is the level of encapsulation
used for the API.
At the implementation level, a lot of the work is basically to pull out all of
the needed information from the THD object/context. The API I propose tries to
_not_ expose the THD to consumers. Instead it provides accessor functions for
all the bits and pieces relevant to each replication event, while the event
class itself likely will be more or less just an encapsulated THD.
So an alternative would be to have a generic event that was just (type, THD).
Then consumers could just pull out whatever information they want from the
THD. The THD implementation is already exposed to storage engines. This would
of course greatly reduce the size of the API, eliminating lots of class
definitions and accessor functions. Though arguably it wouldn't really
simplify the API, as the complexity would just be in understanding the THD
class.
Note that we do not have to take any performance hit from using encapsulated
accessors, since compilers can inline them (though if they are inlined we do not
get any ABI stability with respect to the THD implementation).
For now, the API is proposed without exposing the THD class. (Similar
encapsulation could be added in actual implementation to also not expose TABLE
and similar classes).
LOW-LEVEL DESIGN:
A consumer is implemented as a virtual class (interface). There is one virtual
function for every event that can be received. A consumer would derive from
the base class and override methods for the events it wants to receive.
There is one consumer interface for each generator. When a generator A is
stacked on B, the consumer interface for A inherits from the interface for
B. This way, when A defers an event to B, the consumer for A will receive the
corresponding event from B.
There are methods for a consumer to register itself to receive events from
each generator. I still need to find a way for a consumer in one plugin to
register itself with a generator implemented in another plugin (eg. PBXT
engine-level replication). I also need to add a way for consumers to
de-register themselves.
The current design has consumer callbacks return 0 for success and error code
otherwise. I still need to think more about whether this is useful (ie. what
is the semantics of returning an error from a consumer callback).
Each event passed to consumers is defined as a class with public accessor
methods to a private context (which is mostly the THD).
My intention is to make all events passed around const, so that the same event
can be passed to each of multiple registered consumers (and to emphasise that
consumers do not have the ability to modify events). It still needs to be seen
whether that const-ness will be feasible in practice without very heavy
modification/constification of existing code.
What follows is a partial draft of a possible definition of the API as
concrete C++ class definitions.
-----------------------------------------------------------------------
/*
Virtual base class for generated replication events.
This is the parent of events generated from all kinds of generators. Only
child classes can be instantiated.
This class can be used by code that wants to treat events in a generic way,
without any knowledge of event details. I still need to decide whether such
generic code is sensible.
*/
class rpl_event_base
{
/*
Maybe we will want the ability to materialise an event to a standard
binary format. This could be achieved with a base method like this. The
actual materialisation would be implemented in each deriving class. The
public methods would provide different interfaces for specifying the
buffer or for writing directly into IO_CACHE or file.
*/
/* Return 0 on success, -1 on error, -2 on out-of-buffer. */
int materialise(uchar *buffer, size_t buflen) const;
/*
Returns NULL on error or else malloc()ed buffer with materialised event,
caller must free().
*/
uchar *materialise() const;
/* Same but using passed in memroot. */
uchar *materialise(mem_root *memroot) const;
/*
Materialise to user-supplied writer function (could write directly to file
or the like).
*/
int materialise(int (*writer)(uchar *data, size_t len, void *context)) const;
/*
As for what to do with a materialised event, there are a couple of
possibilities.
One is to have a de_materialise() method somewhere that can construct an
rpl_event_base (really a derived class of course) from a buffer or writer
function. This would require each accessor function to conditionally read
its data from either THD context or buffer (GCC is able to optimise
several such conditionals in multiple accessor function calls into one
conditional), or we can make all accessors virtual if the performance hit
is acceptable.
Another is to have different classes for accessing events read from
materialised event data.
Also, I still need to think about whether it is at all useful to be able
to generically materialise an event at this level. It may be that any
binlog/transport will in any case need to understand more of the format of
events, so that such materialisation/transport is better done at a
different layer.
*/
protected:
/* Implementation which is the basis for materialise(). */
virtual int do_materialise(int (*writer)(uchar *data, size_t len,
void *context)) const = 0;
private:
/* Virtual base class, private constructor to prevent instantiation. */
rpl_event_base();
};
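As a usage illustration of the writer-function variant of materialise() above
(not part of the draft: the callback name and the use of fwrite()/stderr are
invented for the example; also note that the draft signature passes no
caller-supplied context pointer to the writer):
#include <stdio.h>
/* Hypothetical writer callback: dumps the materialised bytes to stderr.
   Follows the draft's convention of returning 0 on success. uchar is assumed
   to be the usual server typedef for unsigned char. */
static int write_to_stderr(uchar *data, size_t len, void *context)
{
  (void) context;      /* materialise(writer) passes no explicit context. */
  return (fwrite(data, 1, len, stderr) == len) ? 0 : -1;
}
/* Usage, assuming ev points at some concrete rpl_event_base subclass:
   int err= ev->materialise(write_to_stderr);
*/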
/*
These are the event types output from the transaction event generator.
This generator is not stacked on anything.
The transaction event generator marks the start and end (commit or rollback)
of transactions. It also gives information about whether the transaction was
a full transaction or autocommitted statement, whether transactional tables
were involved, whether non-transactional tables were involved, and XA
information (ToDo).
*/
/* Base class for transaction events. */
class rpl_event_transaction_base : public rpl_event_base
{
public:
/*
Get the local transaction id. This id is only unique within one server.
It is allocated whenever a new transaction is started.
Can be used to identify events belonging to the same transaction in a
binlog-like stream of events streamed in parallel among multiple
transactions.
*/
uint64_t get_local_trx_id() const { return thd->local_trx_id; };
bool get_is_autocommit() const;
private:
/* The context is the THD. */
THD *thd;
rpl_event_transaction_base(THD *_thd) : thd(_thd) { };
};
/* Transaction start event. */
class rpl_event_transaction_start : public rpl_event_transaction_base
{
};
/* Transaction commit. */
class rpl_event_transaction_commit : public rpl_event_transaction_base
{
public:
/*
The global transaction id is unique cross-server.
It can be used to identify the position from which to start a slave
replicating from a master.
This global ID is only available once the transaction is decided to commit
by the TC manager / primary redundancy service. This TC also allocates the
ID and decides the exact semantics (can there be gaps, etc); however the
format is fixed (cluster_id, running_counter).
*/
struct global_transaction_id
{
uint32_t cluster_id;
uint64_t counter;
};
const global_transaction_id *get_global_transaction_id() const;
};
/* Transaction rollback. */
class rpl_event_transaction_rollback : public rpl_event_transaction_base
{
};
/* Base class for statement events. */
class rpl_event_statement_base : public rpl_event_base
{
public:
LEX_STRING get_current_db() const;
};
class rpl_event_statement_start : public rpl_event_statement_base
{
};
class rpl_event_statement_end : public rpl_event_statement_base
{
public:
int get_errorcode() const;
};
class rpl_event_statement_query : public rpl_event_statement_base
{
public:
LEX_STRING get_query_string();
ulong get_sql_mode();
const CHARSET_INFO *get_character_set_client();
const CHARSET_INFO *get_collation_connection();
const CHARSET_INFO *get_collation_server();
const CHARSET_INFO *get_collation_default_db();
/*
Access to relevant flags that affect query execution.
Use as if (ev->get_flags() & (uint32)STMT_FOREIGN_KEY_CHECKS) { ... }
*/
enum flag_bits
{
STMT_FOREIGN_KEY_CHECKS, // @@foreign_key_checks
STMT_UNIQUE_KEY_CHECKS, // @@unique_checks
STMT_AUTO_IS_NULL, // @@sql_auto_is_null
};
uint32_t get_flags();
ulong get_auto_increment_offset();
ulong get_auto_increment_increment();
// And so on for: time zone; day/month names; connection id; LAST_INSERT_ID;
// INSERT_ID; random seed; user variables.
//
// We probably also need get_uses_temporary_table(), get_used_user_vars(),
// get_uses_auto_increment() and so on, so a consumer can get more
// information about what kind of context information a query will need when
// executed on a slave.
};
class rpl_event_statement_load_query : public rpl_event_statement_query
{
};
/*
This event is fired with blocks of data for files read (from server-local
file or client connection) for LOAD DATA.
*/
class rpl_event_statement_load_data_block : public rpl_event_statement_base
{
public:
struct block
{
const uchar *ptr;
size_t size;
};
block get_block() const;
};
/* Base class for row-based replication events. */
class rpl_event_row_base : public rpl_event_base
{
public:
/*
Access to relevant handler extra flags and other flags that affect row
operations.
Use as if (ev->get_flags() & (uint32)ROW_WRITE_CAN_REPLACE) { ... }
*/
enum flag_bits
{
ROW_WRITE_CAN_REPLACE, // HA_EXTRA_WRITE_CAN_REPLACE
ROW_IGNORE_DUP_KEY, // HA_EXTRA_IGNORE_DUP_KEY
ROW_IGNORE_NO_KEY, // HA_EXTRA_IGNORE_NO_KEY
ROW_DISABLE_FOREIGN_KEY_CHECKS, // ! @@foreign_key_checks
ROW_DISABLE_UNIQUE_KEY_CHECKS, // ! @@unique_checks
};
uint32_t get_flags();
/* Access to list of tables modified. */
class table_iterator
{
public:
/* Returns table, NULL after last. */
const TABLE *get_next();
private:
// ...
};
table_iterator get_modified_tables() const;
private:
/* Context used to provide accessors. */
THD *thd;
protected:
rpl_event_row_base(THD *_thd) : thd(_thd) { }
};
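A short usage sketch for the table iterator above (the helper function is
hypothetical and not part of the draft):
/* Hypothetical helper inside a consumer: walk the tables a row event
   modified. get_next() returns NULL after the last table. */
static void inspect_modified_tables(const rpl_event_row_base *ev)
{
  rpl_event_row_base::table_iterator it= ev->get_modified_tables();
  while (const TABLE *table= it.get_next())
  {
    /* Inspect the table here, e.g. its name or column layout. */
  }
}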
class rpl_event_row_write : public rpl_event_row_base
{
public:
const BITMAP *get_write_set() const;
const uchar *get_after_image() const;
};
class rpl_event_row_update : public rpl_event_row_base
{
public:
const BITMAP *get_read_set() const;
const BITMAP *get_write_set() const;
const uchar *get_before_image() const;
const uchar *get_after_image() const;
};
class rpl_event_row_delete : public rpl_event_row_base
{
public:
const BITMAP *get_read_set() const;
const uchar *get_before_image() const;
};
/*
Event consumer callbacks.
An event consumer registers with an event generator to receive event
notifications from that generator.
The consumer has callbacks (in the form of virtual functions) for the
individual event types the consumer is interested in. Only callbacks that
are non-NULL will be invoked. If an event applies to multiple callbacks in a
single callback struct, it will only be passed to the most specific non-NULL
callback (so events never fire more than once per registration).
The lifetime of the memory holding the event is only for the duration of the
callback invocation, unless otherwise noted.
Callbacks return 0 for success or error code (ToDo: does this make sense?).
*/
struct rpl_event_consumer_transaction
{
virtual int trx_start(const rpl_event_transaction_start *) { return 0; }
virtual int trx_commit(const rpl_event_transaction_commit *) { return 0; }
virtual int trx_rollback(const rpl_event_transaction_rollback *) { return 0; }
};
/*
Consuming statement-based events.
The statement event generator is stacked on top of the transaction event
generator, so we can receive those events as well.
*/
struct rpl_event_consumer_statement : public rpl_event_consumer_transaction
{
virtual int stmt_start(const rpl_event_statement_start *) { return 0; }
virtual int stmt_end(const rpl_event_statement_end *) { return 0; }
virtual int stmt_query(const rpl_event_statement_query *) { return 0; }
/* Data for a file used in LOAD DATA [LOCAL] INFILE. */
virtual int stmt_load_data_block(const rpl_event_statement_load_data_block *)
{ return 0; }
/*
These are specific kinds of statements; if specified they override
stmt_query() for the corresponding event.
*/
virtual int stmt_load_query(const rpl_event_statement_load_query *ev)
{ return stmt_query(ev); }
};
/*
Consuming row-based events.
The row event generator is stacked on top of the statement event generator.
*/
struct rpl_event_consumer_row : public rpl_event_consumer_statement
{
virtual int row_write(const rpl_event_row_write *) { return 0; }
virtual int row_update(const rpl_event_row_update *) { return 0; }
virtual int row_delete(const rpl_event_row_delete *) { return 0; }
};
/*
Registration functions.
ToDo: Make a way to de-register.
ToDo: Find a good way for a plugin (eg. PBXT) to publish a generator
registration method.
*/
int rpl_event_transaction_register(const rpl_event_consumer_transaction *cbs);
int rpl_event_statement_register(const rpl_event_consumer_statement *cbs);
int rpl_event_row_register(const rpl_event_consumer_row *cbs);
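Finally, a minimal sketch of how a consumer might use the interfaces above
(the class and function names are invented for illustration; this is not part
of the proposed API):
/* Hypothetical consumer: counts commits and looks at query events. It derives
   from rpl_event_consumer_statement, so it also inherits the transaction
   callbacks (see the stacking description above). */
struct example_statement_consumer : public rpl_event_consumer_statement
{
  uint64_t commit_count;
  example_statement_consumer() : commit_count(0) { }
  virtual int trx_commit(const rpl_event_transaction_commit *ev)
  {
    /* The (cluster_id, counter) pair identifies the commit cross-server. */
    const rpl_event_transaction_commit::global_transaction_id *gtid=
      ev->get_global_transaction_id();
    (void) gtid;
    commit_count++;
    return 0;                                   /* 0 = success. */
  }
  virtual int stmt_query(const rpl_event_statement_query *)
  {
    /* Examine e.g. get_query_string() / get_sql_mode() here. */
    return 0;
  }
};
static example_statement_consumer example_consumer;
/* Registration, e.g. from plugin initialisation (de-registration is still an
   open ToDo in the draft): */
int register_example_consumer()
{
  return rpl_event_statement_register(&example_consumer);
}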
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v4.0.0)