developers
Threads by month
- ----- 2024 -----
- December
- November
- October
- September
- August
- July
- June
- May
- April
- March
- February
- January
- ----- 2023 -----
- December
- November
- October
- September
- August
- July
- June
- May
- April
- March
- February
- January
- ----- 2022 -----
- December
- November
- October
- September
- August
- July
- June
- May
- April
- March
- February
- January
- ----- 2021 -----
- December
- November
- October
- September
- August
- July
- June
- May
- April
- March
- February
- January
- ----- 2020 -----
- December
- November
- October
- September
- August
- July
- June
- May
- April
- March
- February
- January
- ----- 2019 -----
- December
- November
- October
- September
- August
- July
- June
- May
- April
- March
- February
- January
- ----- 2018 -----
- December
- November
- October
- September
- August
- July
- June
- May
- April
- March
- February
- January
- ----- 2017 -----
- December
- November
- October
- September
- August
- July
- June
- May
- April
- March
- February
- January
- ----- 2016 -----
- December
- November
- October
- September
- August
- July
- June
- May
- April
- March
- February
- January
- ----- 2015 -----
- December
- November
- October
- September
- August
- July
- June
- May
- April
- March
- February
- January
- ----- 2014 -----
- December
- November
- October
- September
- August
- July
- June
- May
- April
- March
- February
- January
- ----- 2013 -----
- December
- November
- October
- September
- August
- July
- June
- May
- April
- March
- February
- January
- ----- 2012 -----
- December
- November
- October
- September
- August
- July
- June
- May
- April
- March
- February
- January
- ----- 2011 -----
- December
- November
- October
- September
- August
- July
- June
- May
- April
- March
- February
- January
- ----- 2010 -----
- December
- November
- October
- September
- August
- July
- June
- May
- April
- March
- February
- January
- ----- 2009 -----
- December
- November
- October
- September
- August
- July
- June
- May
- April
- March
- February
March 2010
- 23 participants
- 366 discussions
[Maria-developers] Rev 28: Added note about noop IO scheduler for Linux. Refactor variable name. in file:///Users/hakan/work/monty_program/mariadb-tools/
by Hakan Kuecuekyilmaz 25 Mar '10
by Hakan Kuecuekyilmaz 25 Mar '10
25 Mar '10
At file:///Users/hakan/work/monty_program/mariadb-tools/
------------------------------------------------------------
revno: 28
revision-id: hakan(a)askmonty.org-20100325014205-kwsruwixwlymz1ti
parent: hakan(a)askmonty.org-20100310010046-hwv56n4wfn4t4odp
committer: Hakan Kuecuekyilmaz <hakan(a)askmonty.org>
branch nick: mariadb-tools
timestamp: Thu 2010-03-25 02:42:05 +0100
message:
Added note about noop IO scheduler for Linux. Refactor variable name.
For MyISAM related tests added key_cache statistics dump out.
=== modified file 'sysbench/run-sysbench-myisam.sh'
--- a/sysbench/run-sysbench-myisam.sh 2010-03-10 01:00:46 +0000
+++ b/sysbench/run-sysbench-myisam.sh 2010-03-25 01:42:05 +0000
@@ -6,16 +6,20 @@
# * Do not run this script with root privileges. We use
# killall -9, which can cause severe side effects!
# * By bzr pull we mean bzr merge --pull
+# * For reasonable performance set your IO scheduler to noop or deadline, for
+# reference please check
+# http://www.mysqlperformanceblog.com/2009/01/30/linux-schedulers-in-tpcc-lik…
#
# Index sizes for 20 mio rows (--table-size=20000000).
-# * delete.lua: 313M sbtest.MYI
-# * insert.lua: 4.0K sbtest.MYI
-# * oltp_complex_ro.lua: 313M sbtest.MYI
-# * oltp_complex_rw.lua: 313M sbtest.MYI
-# * oltp_simple.lua: 325M sbtest.MYI
-# * select.lua: 313M sbtest.MYI
-# * update_index.lua: 313M sbtest.MYI
-# * update_non_index.lua: 313M sbtest.MYI
+# * delete.lua 313M sbtest.MYI
+# * insert.lua 4.0K sbtest.MYI
+# * oltp_complex_ro.lua 313M sbtest.MYI
+# * oltp_complex_rw.lua 313M sbtest.MYI
+# * oltp_simple.lua 325M sbtest.MYI
+# * select.lua 313M sbtest.MYI
+# * select_random_ranges.lua 313M sbtest.MYI
+# * update_index.lua 313M sbtest.MYI
+# * update_non_index.lua 313M sbtest.MYI
#
# Hakan Kuecuekyilmaz <hakan at askmonty dot org> 2010-02-19.
#
@@ -60,13 +64,14 @@
# change these, except you exactly know what you are doing.
#
MYSQLADMIN='client/mysqladmin'
+MYSQL='client/mysql'
#
# Variables.
#
MY_SOCKET="/tmp/mysql.sock"
MYSQLADMIN_OPTIONS="--no-defaults -uroot --socket=$MY_SOCKET"
-MYSQL_OPTIONS="--no-defaults \
+MYSQLD_OPTIONS="--no-defaults \
--datadir=$DATA_DIR \
--language=./sql/share/english \
--key_buffer_size=32M \
@@ -104,6 +109,7 @@
oltp_complex_rw.lua \
oltp_simple.lua \
select.lua \
+ select_random_ranges.lua \
update_index.lua \
update_non_index.lua"
@@ -240,7 +246,7 @@
}
function start_mysqld {
- sql/mysqld $MYSQL_OPTIONS &
+ sql/mysqld $MYSQLD_OPTIONS &
j=0
STARTED=-1
@@ -269,7 +275,7 @@
#
# Write out configurations used for future refernce.
#
-echo $MYSQL_OPTIONS > ${RESULT_DIR}/${TODAY}/${PRODUCT}/mysqld_options.txt
+echo $MYSQLD_OPTIONS > ${RESULT_DIR}/${TODAY}/${PRODUCT}/mysqld_options.txt
echo $SYSBENCH_OPTIONS > ${RESULT_DIR}/${TODAY}/${PRODUCT}/sysbench_options.txt
echo '' >> ${RESULT_DIR}/${TODAY}/${PRODUCT}/sysbench_options.txt
echo "Warm up time is: $WARM_UP_TIME" >> ${RESULT_DIR}/${TODAY}/${PRODUCT}/sysbench_options.txt
@@ -331,6 +337,7 @@
echo "[$(date "+%Y-%m-%d %H:%M:%S")] Starting warm up of $WARM_UP_TIME seconds."
$SYSBENCH $SYSBENCH_OPTIONS_WARM_UP run
sync
+ echo 'FLUSH STATUS' | $MYSQL -uroot
echo "[$(date "+%Y-%m-%d %H:%M:%S")] Finnished warm up."
echo "[$(date "+%Y-%m-%d %H:%M:%S")] Starting actual sysbench run."
@@ -338,6 +345,8 @@
grep "write requests:" ${THIS_RESULT_DIR}/result${k}.txt | awk '{ print $4 }' | sed -e 's/(//' >> ${THIS_RESULT_DIR}/results.txt
+ echo 'SELECT * FROM INFORMATION_SCHEMA.KEY_CACHES' | $MYSQL -uroot > ${THIS_RESULT_DIR}/key_cache_stats{k}.txt
+
k=$(($k + 1))
done
=== modified file 'sysbench/run-sysbench.sh'
--- a/sysbench/run-sysbench.sh 2010-03-10 00:02:01 +0000
+++ b/sysbench/run-sysbench.sh 2010-03-25 01:42:05 +0000
@@ -6,6 +6,9 @@
# * Do not run this script with root privileges. We use
# killall -9, which can cause severe side effects!
# * By bzr pull we mean bzr merge --pull
+# * For reasonable performance set your IO scheduler to noop or deadline, for
+# reference please check
+# http://www.mysqlperformanceblog.com/2009/01/30/linux-schedulers-in-tpcc-lik…
#
# Hakan Kuecuekyilmaz <hakan at askmonty dot org> 2010-02-19.
#
@@ -56,7 +59,7 @@
#
MY_SOCKET="/tmp/mysql.sock"
MYSQLADMIN_OPTIONS="--no-defaults -uroot --socket=$MY_SOCKET"
-MYSQL_OPTIONS="--no-defaults \
+MYSQLD_OPTIONS="--no-defaults \
--datadir=$DATA_DIR \
--language=./sql/share/english \
--max_connections=256 \
@@ -236,7 +239,7 @@
}
function start_mysqld {
- sql/mysqld $MYSQL_OPTIONS &
+ sql/mysqld $MYSQLD_OPTIONS &
j=0
STARTED=-1
@@ -265,7 +268,7 @@
#
# Write out configurations used for future refernce.
#
-echo $MYSQL_OPTIONS > ${RESULT_DIR}/${TODAY}/${PRODUCT}/mysqld_options.txt
+echo $MYSQLD_OPTIONS > ${RESULT_DIR}/${TODAY}/${PRODUCT}/mysqld_options.txt
echo $SYSBENCH_OPTIONS > ${RESULT_DIR}/${TODAY}/${PRODUCT}/sysbench_options.txt
echo '' >> ${RESULT_DIR}/${TODAY}/${PRODUCT}/sysbench_options.txt
echo "Warm up time is: $WARM_UP_TIME" >> ${RESULT_DIR}/${TODAY}/${PRODUCT}/sysbench_options.txt
1
0
[Maria-developers] [Branch ~maria-captains/maria/5.1] Rev 2833: two crashes in the TC_LOG_MMAP:
by noreply@launchpad.net 24 Mar '10
by noreply@launchpad.net 24 Mar '10
24 Mar '10
------------------------------------------------------------
revno: 2833
fixes bug(s): https://launchpad.net/bugs/544173
committer: Sergei Golubchik <sergii(a)pisem.net>
branch nick: maria-5.1
timestamp: Wed 2010-03-24 23:12:39 +0100
message:
two crashes in the TC_LOG_MMAP:
1. don't forget to initialize page->ptr
2. don't signal active->cond, if active is NULL
added:
mysql-test/suite/pbxt/r/pbxt_xa.result
mysql-test/suite/pbxt/t/pbxt_xa.test
modified:
sql/log.cc
--
lp:maria
https://code.launchpad.net/~maria-captains/maria/5.1
Your team Maria developers is subscribed to branch lp:maria.
To unsubscribe from this branch go to https://code.launchpad.net/~maria-captains/maria/5.1/+edit-subscription.
1
0
[Maria-developers] Updated (by Psergey): Subqueries: Inside-out execution for non-semijoin materialized subqueries that are AND-parts of the WHERE (90)
by worklog-noreply@askmonty.org 24 Mar '10
by worklog-noreply@askmonty.org 24 Mar '10
24 Mar '10
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: Subqueries: Inside-out execution for non-semijoin materialized
subqueries that are AND-parts of the WHERE
CREATION DATE..: Sun, 28 Feb 2010, 13:45
SUPERVISOR.....: Monty
IMPLEMENTOR....: Psergey
COPIES TO......: Igor, Psergey, Timour
CATEGORY.......: Server-RawIdeaBin
TASK ID........: 90 (http://askmonty.org/worklog/?tid=90)
VERSION........: Server-5.3
STATUS.........: Assigned
PRIORITY.......: 60
WORKED HOURS...: 0
ESTIMATE.......: -1 (hours remain)
ORIG. ESTIMATE.: 0
PROGRESS NOTES:
-=-=(Psergey - Wed, 24 Mar 2010, 14:42)=-=-
Low Level Design modified.
--- /tmp/wklog.90.old.19182 2010-03-24 14:42:54.000000000 +0000
+++ /tmp/wklog.90.new.19182 2010-03-24 14:42:54.000000000 +0000
@@ -1 +1,140 @@
+<contents>
+1. Applicability check
+2. Representation
+2.1 Option #1: Convert to TABLE_LIST
+2.2 On subquery predicate removal
+2.3 Option #2: No TABLE_LIST, convert to JOIN_TAB right away
+2.3 What is expected of the result of conversion
+3. Pre-optimization steps
+3.1 Constant detection
+3.3 update_ref_and_keys
+3.4 JOIN_TAB sorting criteria
+4. Optimization
+5. Execution
+User interface.
+</contents>
+
+We'll call the new execution strategy "jtbm-materialization", for the lack of
+better name.
+
+1. Applicability check
+======================
+The criteria for checking whether a subquery can be processed with
+jtbm-materialization can be checked at JOIN::prepare stage (like it
+happens with semi-join check)
+
+2. Representation
+=================
+
+2.1 Option #1: Convert to TABLE_LIST
+------------------------------------
+Make it work like semi-join nests: each jtbm-predicate is converted into a
+TABLE_LIST object. This will make it
+
+ - uniform with semi-joins (we've stepped on all rakes there)
+ - allow to process JTBM-subqueries in ON expressions
+
+simplify_joins() will handle jtbm TABLE_LISTs as some kinds of opaque base
+tables.
+
+for EXPLAIN EXTENDED, it would be natural to print something semi-join like,
+i.e. for
+
+ SELECT ... FROM ot WHERE oe IN (SELECT ie FROM materialized-non-sj-select)
+
+we'll print
+
+ SELECT ... FROM ot SJ (SELECT ie FROM materialized-non-sj-select) ON oe=XX
+
+the XX part is not clear. we don't want to print 'ie' the second time here?
+
+2.2 On subquery predicate removal
+---------------------------------
+Q: if we remove the subquery predicate permanently, who will run
+fix_fields() for it? For semi-joins we don't have the problem as we
+inject into ON expression (right? or not? we have sj_on_expr, too...
+(Investigation: we the the same Item* pointer both in WHERE and
+as sj_on_expr. fix_fields() is called for the WHERE part and that's
+how sj_on_expr gets fixed. This works as long as
+Item_func_eq::fix_fields() does not try to substitute itself with
+another item).
+A: ?
+
+2.3 Option #2: No TABLE_LIST, convert to JOIN_TAB right away
+------------------------------------------------------------
+JOIN_TABs only live for the duration of one PS re-execution, so we'll have to
+- make conversion fully undoable
+- perform it sufficiently late in the optimization process, at the point
+ where JOIN_TABs are already allocated.
+
+Note that if we don't position JTBM predicates in join's TABLE_LIST tree, then
+it will be impossible to handle JTBM queries inside/outside of outer joins.
+
+2.3 What is expected of the result of conversion
+------------------------------------------------
+Join [pre]optimization relies on each optimized entity to have a bit in
+table_map.
+
+TODO: where do we check if there will be enough bits for everyone? (at the
+ point where we assign them?)
+
+The bit stored in join_tab->table->map, and the apparent problem is that JTBM
+join_tabs do not naturally have TABLE* object.
+
+We could use the the one that will be used for Materialization, but that will
+stop working when we will have to include IN->EXISTS in the choice.
+
+Current approach: don't create a table. create a table_map element in JOIN_TAB
+instead. Evgen has probably done something like that already.
+
+3. Pre-optimization steps
+=========================
+JOIN_TABs are allocated in make_join_statistics(). This where the changes will
+be needed: for JOIN_TABs that correspond to JTBM-tables:
+
+- don't set tab->table, set tab->jtbm_select (or whatever)
+- run subquery's optimizer to get its output cardinality
+
+3.1 Constant detection
+----------------------
+What about subqueries that "are constant"?
+ const_item IN (SELECT uncorrelated) -> is constant, but not something
+ we would want to evaluate.
+ something IN (SELECT from_constant_join) -> is constant
+
+Do we need to mark their JOIN_TABs as constant?
+
+3.3 update_ref_and_keys
+-----------------------
+* Walk through JTBM elements and inject KEYUSE elements for their
+ IN-equalities.
+
+TODO: KEYUSE elements imply presense of KEYs! Which we don't have!
+
+3.4 JOIN_TAB sorting criteria
+-----------------------------
+Q: Where do we put JTBM's join_tab when pre-sorting records?
+A: it should sort as regular table.
+
+TODO: where do we remove the predicates from the WHERE?
+ - remove them like SJ-converter does
+ - remove them with optimizer (like remove_eq_conds does)
+
+4. Optimization
+===============
+Add a branch in best_access_path to account for
+- JTBM-Materialization
+- JTBM-Materialization-Scan.
+
+5. Execution
+============
+* We should be able to reuse item_subselect.cc code for lookups
+* But will have to use our own temptable scan code
+
+TODO: is it possible to have any unification with SJ-Materialization?
+
+User interface
+--------------
+Any @@optimizer_switch flags for all this?
+
-=-=(Igor - Wed, 10 Mar 2010, 22:02)=-=-
High Level Description modified.
--- /tmp/wklog.90.old.2007 2010-03-10 22:02:23.000000000 +0000
+++ /tmp/wklog.90.new.2007 2010-03-10 22:02:23.000000000 +0000
@@ -13,8 +13,8 @@
for each record R2 in big_table such that oe=R1
pass R2 to output
-Semi-join materialization supports such strategy with SJM-Scan strategy. This WL
-entry is about adding support for such strategies for non-semijoin subqueries.
+Semi-join materialization supports the inside-out strategy. This WL entry is
+about adding support for such strategies for non-semijoin subqueries.
Once WL#89 is done, there will be a cost-based choice between
-=-=(Igor - Wed, 10 Mar 2010, 21:52)=-=-
Status updated.
--- /tmp/wklog.90.old.882 2010-03-10 21:52:02.000000000 +0000
+++ /tmp/wklog.90.new.882 2010-03-10 21:52:02.000000000 +0000
@@ -1 +1 @@
-Un-Assigned
+Assigned
-=-=(Psergey - Sun, 28 Feb 2010, 15:37)=-=-
High Level Description modified.
--- /tmp/wklog.90.old.23524 2010-02-28 15:37:47.000000000 +0000
+++ /tmp/wklog.90.new.23524 2010-02-28 15:37:47.000000000 +0000
@@ -15,3 +15,7 @@
Semi-join materialization supports such strategy with SJM-Scan strategy. This WL
entry is about adding support for such strategies for non-semijoin subqueries.
+
+
+Once WL#89 is done, there will be a cost-based choice between
+Materialization+lookup, Materialization+scan, and IN->EXISTS+lookup strategies.
-=-=(Psergey - Sun, 28 Feb 2010, 15:22)=-=-
High-Level Specification modified.
--- /tmp/wklog.90.old.23033 2010-02-28 15:22:09.000000000 +0000
+++ /tmp/wklog.90.new.23033 2010-02-28 15:22:09.000000000 +0000
@@ -1 +1,33 @@
+Basic idea on how this could be achieved:
+
+Pre-optimization phase
+----------------------
+
+The rewrite
+~~~~~~~~~~~
+If we find a subquery predicate that is
+- not processed by current semi-join optimizations
+- is an AND-part of the WHERE/ON clause
+- can be executed with Materialization
+
+then
+- Remove the predicate from WHERE/ON clause
+- Add a special JOIN_TAB object instead.
+
+Plan options
+~~~~~~~~~~~~
+- Use the IN-equality to create KEYUSE elements.
+
+Optimization
+------------
+- Pre-optimize the subquery so we know materialization cost
+- Whenever best_access_path() encounters the "special JOIN_TAB" it should
+ consider two strategies:
+ A. Materialization and making lookups in the materialized table (if applicable)
+ B. Materialization and then scanning the materialized table.
+
+
+EXPLAIN
+-------
+TODO how this will look in EXPLAIN output?
-=-=(Psergey - Sun, 28 Feb 2010, 14:56)=-=-
Dependency created: 91 now depends on 90
-=-=(Psergey - Sun, 28 Feb 2010, 14:54)=-=-
Dependency deleted: 94 no longer depends on 90
-=-=(Psergey - Sun, 28 Feb 2010, 14:47)=-=-
Title modified.
--- /tmp/wklog.90.old.21903 2010-02-28 14:47:54.000000000 +0000
+++ /tmp/wklog.90.new.21903 2010-02-28 14:47:54.000000000 +0000
@@ -1 +1 @@
- Subqueries: Inside-out execution for non-semijoin materialized subqueries that are AND-parts of the WHERE
+Subqueries: Inside-out execution for non-semijoin materialized subqueries that are AND-parts of the WHERE
-=-=(Psergey - Sun, 28 Feb 2010, 14:47)=-=-
High Level Description modified.
--- /tmp/wklog.90.old.21880 2010-02-28 14:47:28.000000000 +0000
+++ /tmp/wklog.90.new.21880 2010-02-28 14:47:28.000000000 +0000
@@ -1,10 +1,17 @@
-For uncorrelated IN subqueries that can't be converted to semi-joins it is
-necessary to make a cost-based choice between IN->EXISTS and Materialization
-strategies.
+Consider the following case:
-Both strategies handle two cases:
-1. A simple case w/o NULLs handling
-2. Handling NULLs.
+SELECT * FROM big_table
+WHERE oe IN (SELECT ie FROM table_with_few_groups
+ WHERE ...
+ GROUP BY group_col) AND ...
-This WL is about making cost-based decision for #1.
+Here the best way to execute the query is:
+ Materialize the subquery;
+ # now run the join:
+ for each record R1 in materialized table
+ for each record R2 in big_table such that oe=R1
+ pass R2 to output
+
+Semi-join materialization supports such strategy with SJM-Scan strategy. This WL
+entry is about adding support for such strategies for non-semijoin subqueries.
-=-=(Psergey - Sun, 28 Feb 2010, 14:47)=-=-
Title modified.
--- /tmp/wklog.90.old.21859 2010-02-28 14:47:02.000000000 +0000
+++ /tmp/wklog.90.new.21859 2010-02-28 14:47:02.000000000 +0000
@@ -1 +1 @@
-Subqueries: cost-based choice between Materialization and IN->EXISTS transformation
+ Subqueries: Inside-out execution for non-semijoin materialized subqueries that are AND-parts of the WHERE
------------------------------------------------------------
-=-=(View All Progress Notes, 11 total)=-=-
http://askmonty.org/worklog/index.pl?tid=90&nolimit=1
DESCRIPTION:
Consider the following case:
SELECT * FROM big_table
WHERE oe IN (SELECT ie FROM table_with_few_groups
WHERE ...
GROUP BY group_col) AND ...
Here the best way to execute the query is:
Materialize the subquery;
# now run the join:
for each record R1 in materialized table
for each record R2 in big_table such that oe=R1
pass R2 to output
Semi-join materialization supports the inside-out strategy. This WL entry is
about adding support for such strategies for non-semijoin subqueries.
Once WL#89 is done, there will be a cost-based choice between
Materialization+lookup, Materialization+scan, and IN->EXISTS+lookup strategies.
HIGH-LEVEL SPECIFICATION:
Basic idea on how this could be achieved:
Pre-optimization phase
----------------------
The rewrite
~~~~~~~~~~~
If we find a subquery predicate that is
- not processed by current semi-join optimizations
- is an AND-part of the WHERE/ON clause
- can be executed with Materialization
then
- Remove the predicate from WHERE/ON clause
- Add a special JOIN_TAB object instead.
Plan options
~~~~~~~~~~~~
- Use the IN-equality to create KEYUSE elements.
Optimization
------------
- Pre-optimize the subquery so we know materialization cost
- Whenever best_access_path() encounters the "special JOIN_TAB" it should
consider two strategies:
A. Materialization and making lookups in the materialized table (if applicable)
B. Materialization and then scanning the materialized table.
EXPLAIN
-------
TODO how this will look in EXPLAIN output?
LOW-LEVEL DESIGN:
<contents>
1. Applicability check
2. Representation
2.1 Option #1: Convert to TABLE_LIST
2.2 On subquery predicate removal
2.3 Option #2: No TABLE_LIST, convert to JOIN_TAB right away
2.3 What is expected of the result of conversion
3. Pre-optimization steps
3.1 Constant detection
3.3 update_ref_and_keys
3.4 JOIN_TAB sorting criteria
4. Optimization
5. Execution
User interface.
</contents>
We'll call the new execution strategy "jtbm-materialization", for the lack of
better name.
1. Applicability check
======================
The criteria for checking whether a subquery can be processed with
jtbm-materialization can be checked at JOIN::prepare stage (like it
happens with semi-join check)
2. Representation
=================
2.1 Option #1: Convert to TABLE_LIST
------------------------------------
Make it work like semi-join nests: each jtbm-predicate is converted into a
TABLE_LIST object. This will make it
- uniform with semi-joins (we've stepped on all rakes there)
- allow to process JTBM-subqueries in ON expressions
simplify_joins() will handle jtbm TABLE_LISTs as some kinds of opaque base
tables.
for EXPLAIN EXTENDED, it would be natural to print something semi-join like,
i.e. for
SELECT ... FROM ot WHERE oe IN (SELECT ie FROM materialized-non-sj-select)
we'll print
SELECT ... FROM ot SJ (SELECT ie FROM materialized-non-sj-select) ON oe=XX
the XX part is not clear. we don't want to print 'ie' the second time here?
2.2 On subquery predicate removal
---------------------------------
Q: if we remove the subquery predicate permanently, who will run
fix_fields() for it? For semi-joins we don't have the problem as we
inject into ON expression (right? or not? we have sj_on_expr, too...
(Investigation: we the the same Item* pointer both in WHERE and
as sj_on_expr. fix_fields() is called for the WHERE part and that's
how sj_on_expr gets fixed. This works as long as
Item_func_eq::fix_fields() does not try to substitute itself with
another item).
A: ?
2.3 Option #2: No TABLE_LIST, convert to JOIN_TAB right away
------------------------------------------------------------
JOIN_TABs only live for the duration of one PS re-execution, so we'll have to
- make conversion fully undoable
- perform it sufficiently late in the optimization process, at the point
where JOIN_TABs are already allocated.
Note that if we don't position JTBM predicates in join's TABLE_LIST tree, then
it will be impossible to handle JTBM queries inside/outside of outer joins.
2.3 What is expected of the result of conversion
------------------------------------------------
Join [pre]optimization relies on each optimized entity to have a bit in
table_map.
TODO: where do we check if there will be enough bits for everyone? (at the
point where we assign them?)
The bit stored in join_tab->table->map, and the apparent problem is that JTBM
join_tabs do not naturally have TABLE* object.
We could use the the one that will be used for Materialization, but that will
stop working when we will have to include IN->EXISTS in the choice.
Current approach: don't create a table. create a table_map element in JOIN_TAB
instead. Evgen has probably done something like that already.
3. Pre-optimization steps
=========================
JOIN_TABs are allocated in make_join_statistics(). This where the changes will
be needed: for JOIN_TABs that correspond to JTBM-tables:
- don't set tab->table, set tab->jtbm_select (or whatever)
- run subquery's optimizer to get its output cardinality
3.1 Constant detection
----------------------
What about subqueries that "are constant"?
const_item IN (SELECT uncorrelated) -> is constant, but not something
we would want to evaluate.
something IN (SELECT from_constant_join) -> is constant
Do we need to mark their JOIN_TABs as constant?
3.3 update_ref_and_keys
-----------------------
* Walk through JTBM elements and inject KEYUSE elements for their
IN-equalities.
TODO: KEYUSE elements imply presense of KEYs! Which we don't have!
3.4 JOIN_TAB sorting criteria
-----------------------------
Q: Where do we put JTBM's join_tab when pre-sorting records?
A: it should sort as regular table.
TODO: where do we remove the predicates from the WHERE?
- remove them like SJ-converter does
- remove them with optimizer (like remove_eq_conds does)
4. Optimization
===============
Add a branch in best_access_path to account for
- JTBM-Materialization
- JTBM-Materialization-Scan.
5. Execution
============
* We should be able to reuse item_subselect.cc code for lookups
* But will have to use our own temptable scan code
TODO: is it possible to have any unification with SJ-Materialization?
User interface
--------------
Any @@optimizer_switch flags for all this?
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v3.5.9)
1
0
[Maria-developers] Updated (by Psergey): Subqueries: Inside-out execution for non-semijoin materialized subqueries that are AND-parts of the WHERE (90)
by worklog-noreply@askmonty.org 24 Mar '10
by worklog-noreply@askmonty.org 24 Mar '10
24 Mar '10
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: Subqueries: Inside-out execution for non-semijoin materialized
subqueries that are AND-parts of the WHERE
CREATION DATE..: Sun, 28 Feb 2010, 13:45
SUPERVISOR.....: Monty
IMPLEMENTOR....: Psergey
COPIES TO......: Igor, Psergey, Timour
CATEGORY.......: Server-RawIdeaBin
TASK ID........: 90 (http://askmonty.org/worklog/?tid=90)
VERSION........: Server-5.3
STATUS.........: Assigned
PRIORITY.......: 60
WORKED HOURS...: 0
ESTIMATE.......: -1 (hours remain)
ORIG. ESTIMATE.: 0
PROGRESS NOTES:
-=-=(Psergey - Wed, 24 Mar 2010, 14:42)=-=-
Low Level Design modified.
--- /tmp/wklog.90.old.19182 2010-03-24 14:42:54.000000000 +0000
+++ /tmp/wklog.90.new.19182 2010-03-24 14:42:54.000000000 +0000
@@ -1 +1,140 @@
+<contents>
+1. Applicability check
+2. Representation
+2.1 Option #1: Convert to TABLE_LIST
+2.2 On subquery predicate removal
+2.3 Option #2: No TABLE_LIST, convert to JOIN_TAB right away
+2.3 What is expected of the result of conversion
+3. Pre-optimization steps
+3.1 Constant detection
+3.3 update_ref_and_keys
+3.4 JOIN_TAB sorting criteria
+4. Optimization
+5. Execution
+User interface.
+</contents>
+
+We'll call the new execution strategy "jtbm-materialization", for the lack of
+better name.
+
+1. Applicability check
+======================
+The criteria for checking whether a subquery can be processed with
+jtbm-materialization can be checked at JOIN::prepare stage (like it
+happens with semi-join check)
+
+2. Representation
+=================
+
+2.1 Option #1: Convert to TABLE_LIST
+------------------------------------
+Make it work like semi-join nests: each jtbm-predicate is converted into a
+TABLE_LIST object. This will make it
+
+ - uniform with semi-joins (we've stepped on all rakes there)
+ - allow to process JTBM-subqueries in ON expressions
+
+simplify_joins() will handle jtbm TABLE_LISTs as some kinds of opaque base
+tables.
+
+for EXPLAIN EXTENDED, it would be natural to print something semi-join like,
+i.e. for
+
+ SELECT ... FROM ot WHERE oe IN (SELECT ie FROM materialized-non-sj-select)
+
+we'll print
+
+ SELECT ... FROM ot SJ (SELECT ie FROM materialized-non-sj-select) ON oe=XX
+
+the XX part is not clear. we don't want to print 'ie' the second time here?
+
+2.2 On subquery predicate removal
+---------------------------------
+Q: if we remove the subquery predicate permanently, who will run
+fix_fields() for it? For semi-joins we don't have the problem as we
+inject into ON expression (right? or not? we have sj_on_expr, too...
+(Investigation: we the the same Item* pointer both in WHERE and
+as sj_on_expr. fix_fields() is called for the WHERE part and that's
+how sj_on_expr gets fixed. This works as long as
+Item_func_eq::fix_fields() does not try to substitute itself with
+another item).
+A: ?
+
+2.3 Option #2: No TABLE_LIST, convert to JOIN_TAB right away
+------------------------------------------------------------
+JOIN_TABs only live for the duration of one PS re-execution, so we'll have to
+- make conversion fully undoable
+- perform it sufficiently late in the optimization process, at the point
+ where JOIN_TABs are already allocated.
+
+Note that if we don't position JTBM predicates in join's TABLE_LIST tree, then
+it will be impossible to handle JTBM queries inside/outside of outer joins.
+
+2.3 What is expected of the result of conversion
+------------------------------------------------
+Join [pre]optimization relies on each optimized entity to have a bit in
+table_map.
+
+TODO: where do we check if there will be enough bits for everyone? (at the
+ point where we assign them?)
+
+The bit stored in join_tab->table->map, and the apparent problem is that JTBM
+join_tabs do not naturally have TABLE* object.
+
+We could use the the one that will be used for Materialization, but that will
+stop working when we will have to include IN->EXISTS in the choice.
+
+Current approach: don't create a table. create a table_map element in JOIN_TAB
+instead. Evgen has probably done something like that already.
+
+3. Pre-optimization steps
+=========================
+JOIN_TABs are allocated in make_join_statistics(). This where the changes will
+be needed: for JOIN_TABs that correspond to JTBM-tables:
+
+- don't set tab->table, set tab->jtbm_select (or whatever)
+- run subquery's optimizer to get its output cardinality
+
+3.1 Constant detection
+----------------------
+What about subqueries that "are constant"?
+ const_item IN (SELECT uncorrelated) -> is constant, but not something
+ we would want to evaluate.
+ something IN (SELECT from_constant_join) -> is constant
+
+Do we need to mark their JOIN_TABs as constant?
+
+3.3 update_ref_and_keys
+-----------------------
+* Walk through JTBM elements and inject KEYUSE elements for their
+ IN-equalities.
+
+TODO: KEYUSE elements imply presense of KEYs! Which we don't have!
+
+3.4 JOIN_TAB sorting criteria
+-----------------------------
+Q: Where do we put JTBM's join_tab when pre-sorting records?
+A: it should sort as regular table.
+
+TODO: where do we remove the predicates from the WHERE?
+ - remove them like SJ-converter does
+ - remove them with optimizer (like remove_eq_conds does)
+
+4. Optimization
+===============
+Add a branch in best_access_path to account for
+- JTBM-Materialization
+- JTBM-Materialization-Scan.
+
+5. Execution
+============
+* We should be able to reuse item_subselect.cc code for lookups
+* But will have to use our own temptable scan code
+
+TODO: is it possible to have any unification with SJ-Materialization?
+
+User interface
+--------------
+Any @@optimizer_switch flags for all this?
+
-=-=(Igor - Wed, 10 Mar 2010, 22:02)=-=-
High Level Description modified.
--- /tmp/wklog.90.old.2007 2010-03-10 22:02:23.000000000 +0000
+++ /tmp/wklog.90.new.2007 2010-03-10 22:02:23.000000000 +0000
@@ -13,8 +13,8 @@
for each record R2 in big_table such that oe=R1
pass R2 to output
-Semi-join materialization supports such strategy with SJM-Scan strategy. This WL
-entry is about adding support for such strategies for non-semijoin subqueries.
+Semi-join materialization supports the inside-out strategy. This WL entry is
+about adding support for such strategies for non-semijoin subqueries.
Once WL#89 is done, there will be a cost-based choice between
-=-=(Igor - Wed, 10 Mar 2010, 21:52)=-=-
Status updated.
--- /tmp/wklog.90.old.882 2010-03-10 21:52:02.000000000 +0000
+++ /tmp/wklog.90.new.882 2010-03-10 21:52:02.000000000 +0000
@@ -1 +1 @@
-Un-Assigned
+Assigned
-=-=(Psergey - Sun, 28 Feb 2010, 15:37)=-=-
High Level Description modified.
--- /tmp/wklog.90.old.23524 2010-02-28 15:37:47.000000000 +0000
+++ /tmp/wklog.90.new.23524 2010-02-28 15:37:47.000000000 +0000
@@ -15,3 +15,7 @@
Semi-join materialization supports such strategy with SJM-Scan strategy. This WL
entry is about adding support for such strategies for non-semijoin subqueries.
+
+
+Once WL#89 is done, there will be a cost-based choice between
+Materialization+lookup, Materialization+scan, and IN->EXISTS+lookup strategies.
-=-=(Psergey - Sun, 28 Feb 2010, 15:22)=-=-
High-Level Specification modified.
--- /tmp/wklog.90.old.23033 2010-02-28 15:22:09.000000000 +0000
+++ /tmp/wklog.90.new.23033 2010-02-28 15:22:09.000000000 +0000
@@ -1 +1,33 @@
+Basic idea on how this could be achieved:
+
+Pre-optimization phase
+----------------------
+
+The rewrite
+~~~~~~~~~~~
+If we find a subquery predicate that is
+- not processed by current semi-join optimizations
+- is an AND-part of the WHERE/ON clause
+- can be executed with Materialization
+
+then
+- Remove the predicate from WHERE/ON clause
+- Add a special JOIN_TAB object instead.
+
+Plan options
+~~~~~~~~~~~~
+- Use the IN-equality to create KEYUSE elements.
+
+Optimization
+------------
+- Pre-optimize the subquery so we know materialization cost
+- Whenever best_access_path() encounters the "special JOIN_TAB" it should
+ consider two strategies:
+ A. Materialization and making lookups in the materialized table (if applicable)
+ B. Materialization and then scanning the materialized table.
+
+
+EXPLAIN
+-------
+TODO how this will look in EXPLAIN output?
-=-=(Psergey - Sun, 28 Feb 2010, 14:56)=-=-
Dependency created: 91 now depends on 90
-=-=(Psergey - Sun, 28 Feb 2010, 14:54)=-=-
Dependency deleted: 94 no longer depends on 90
-=-=(Psergey - Sun, 28 Feb 2010, 14:47)=-=-
Title modified.
--- /tmp/wklog.90.old.21903 2010-02-28 14:47:54.000000000 +0000
+++ /tmp/wklog.90.new.21903 2010-02-28 14:47:54.000000000 +0000
@@ -1 +1 @@
- Subqueries: Inside-out execution for non-semijoin materialized subqueries that are AND-parts of the WHERE
+Subqueries: Inside-out execution for non-semijoin materialized subqueries that are AND-parts of the WHERE
-=-=(Psergey - Sun, 28 Feb 2010, 14:47)=-=-
High Level Description modified.
--- /tmp/wklog.90.old.21880 2010-02-28 14:47:28.000000000 +0000
+++ /tmp/wklog.90.new.21880 2010-02-28 14:47:28.000000000 +0000
@@ -1,10 +1,17 @@
-For uncorrelated IN subqueries that can't be converted to semi-joins it is
-necessary to make a cost-based choice between IN->EXISTS and Materialization
-strategies.
+Consider the following case:
-Both strategies handle two cases:
-1. A simple case w/o NULLs handling
-2. Handling NULLs.
+SELECT * FROM big_table
+WHERE oe IN (SELECT ie FROM table_with_few_groups
+ WHERE ...
+ GROUP BY group_col) AND ...
-This WL is about making cost-based decision for #1.
+Here the best way to execute the query is:
+ Materialize the subquery;
+ # now run the join:
+ for each record R1 in materialized table
+ for each record R2 in big_table such that oe=R1
+ pass R2 to output
+
+Semi-join materialization supports such strategy with SJM-Scan strategy. This WL
+entry is about adding support for such strategies for non-semijoin subqueries.
-=-=(Psergey - Sun, 28 Feb 2010, 14:47)=-=-
Title modified.
--- /tmp/wklog.90.old.21859 2010-02-28 14:47:02.000000000 +0000
+++ /tmp/wklog.90.new.21859 2010-02-28 14:47:02.000000000 +0000
@@ -1 +1 @@
-Subqueries: cost-based choice between Materialization and IN->EXISTS transformation
+ Subqueries: Inside-out execution for non-semijoin materialized subqueries that are AND-parts of the WHERE
------------------------------------------------------------
-=-=(View All Progress Notes, 11 total)=-=-
http://askmonty.org/worklog/index.pl?tid=90&nolimit=1
DESCRIPTION:
Consider the following case:
SELECT * FROM big_table
WHERE oe IN (SELECT ie FROM table_with_few_groups
WHERE ...
GROUP BY group_col) AND ...
Here the best way to execute the query is:
Materialize the subquery;
# now run the join:
for each record R1 in materialized table
for each record R2 in big_table such that oe=R1
pass R2 to output
Semi-join materialization supports the inside-out strategy. This WL entry is
about adding support for such strategies for non-semijoin subqueries.
Once WL#89 is done, there will be a cost-based choice between
Materialization+lookup, Materialization+scan, and IN->EXISTS+lookup strategies.
HIGH-LEVEL SPECIFICATION:
Basic idea on how this could be achieved:
Pre-optimization phase
----------------------
The rewrite
~~~~~~~~~~~
If we find a subquery predicate that is
- not processed by current semi-join optimizations
- is an AND-part of the WHERE/ON clause
- can be executed with Materialization
then
- Remove the predicate from WHERE/ON clause
- Add a special JOIN_TAB object instead.
Plan options
~~~~~~~~~~~~
- Use the IN-equality to create KEYUSE elements.
Optimization
------------
- Pre-optimize the subquery so we know materialization cost
- Whenever best_access_path() encounters the "special JOIN_TAB" it should
consider two strategies:
A. Materialization and making lookups in the materialized table (if applicable)
B. Materialization and then scanning the materialized table.
EXPLAIN
-------
TODO how this will look in EXPLAIN output?
LOW-LEVEL DESIGN:
<contents>
1. Applicability check
2. Representation
2.1 Option #1: Convert to TABLE_LIST
2.2 On subquery predicate removal
2.3 Option #2: No TABLE_LIST, convert to JOIN_TAB right away
2.3 What is expected of the result of conversion
3. Pre-optimization steps
3.1 Constant detection
3.3 update_ref_and_keys
3.4 JOIN_TAB sorting criteria
4. Optimization
5. Execution
User interface.
</contents>
We'll call the new execution strategy "jtbm-materialization", for the lack of
better name.
1. Applicability check
======================
The criteria for checking whether a subquery can be processed with
jtbm-materialization can be checked at JOIN::prepare stage (like it
happens with semi-join check)
2. Representation
=================
2.1 Option #1: Convert to TABLE_LIST
------------------------------------
Make it work like semi-join nests: each jtbm-predicate is converted into a
TABLE_LIST object. This will make it
- uniform with semi-joins (we've stepped on all rakes there)
- allow to process JTBM-subqueries in ON expressions
simplify_joins() will handle jtbm TABLE_LISTs as some kinds of opaque base
tables.
for EXPLAIN EXTENDED, it would be natural to print something semi-join like,
i.e. for
SELECT ... FROM ot WHERE oe IN (SELECT ie FROM materialized-non-sj-select)
we'll print
SELECT ... FROM ot SJ (SELECT ie FROM materialized-non-sj-select) ON oe=XX
the XX part is not clear. we don't want to print 'ie' the second time here?
2.2 On subquery predicate removal
---------------------------------
Q: if we remove the subquery predicate permanently, who will run
fix_fields() for it? For semi-joins we don't have the problem as we
inject into ON expression (right? or not? we have sj_on_expr, too...
(Investigation: we the the same Item* pointer both in WHERE and
as sj_on_expr. fix_fields() is called for the WHERE part and that's
how sj_on_expr gets fixed. This works as long as
Item_func_eq::fix_fields() does not try to substitute itself with
another item).
A: ?
2.3 Option #2: No TABLE_LIST, convert to JOIN_TAB right away
------------------------------------------------------------
JOIN_TABs only live for the duration of one PS re-execution, so we'll have to
- make conversion fully undoable
- perform it sufficiently late in the optimization process, at the point
where JOIN_TABs are already allocated.
Note that if we don't position JTBM predicates in join's TABLE_LIST tree, then
it will be impossible to handle JTBM queries inside/outside of outer joins.
2.3 What is expected of the result of conversion
------------------------------------------------
Join [pre]optimization relies on each optimized entity to have a bit in
table_map.
TODO: where do we check if there will be enough bits for everyone? (at the
point where we assign them?)
The bit stored in join_tab->table->map, and the apparent problem is that JTBM
join_tabs do not naturally have TABLE* object.
We could use the the one that will be used for Materialization, but that will
stop working when we will have to include IN->EXISTS in the choice.
Current approach: don't create a table. create a table_map element in JOIN_TAB
instead. Evgen has probably done something like that already.
3. Pre-optimization steps
=========================
JOIN_TABs are allocated in make_join_statistics(). This where the changes will
be needed: for JOIN_TABs that correspond to JTBM-tables:
- don't set tab->table, set tab->jtbm_select (or whatever)
- run subquery's optimizer to get its output cardinality
3.1 Constant detection
----------------------
What about subqueries that "are constant"?
const_item IN (SELECT uncorrelated) -> is constant, but not something
we would want to evaluate.
something IN (SELECT from_constant_join) -> is constant
Do we need to mark their JOIN_TABs as constant?
3.3 update_ref_and_keys
-----------------------
* Walk through JTBM elements and inject KEYUSE elements for their
IN-equalities.
TODO: KEYUSE elements imply presense of KEYs! Which we don't have!
3.4 JOIN_TAB sorting criteria
-----------------------------
Q: Where do we put JTBM's join_tab when pre-sorting records?
A: it should sort as regular table.
TODO: where do we remove the predicates from the WHERE?
- remove them like SJ-converter does
- remove them with optimizer (like remove_eq_conds does)
4. Optimization
===============
Add a branch in best_access_path to account for
- JTBM-Materialization
- JTBM-Materialization-Scan.
5. Execution
============
* We should be able to reuse item_subselect.cc code for lookups
* But will have to use our own temptable scan code
TODO: is it possible to have any unification with SJ-Materialization?
User interface
--------------
Any @@optimizer_switch flags for all this?
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v3.5.9)
1
0
[Maria-developers] Updated (by Psergey): Subqueries: Inside-out execution for non-semijoin materialized subqueries that are AND-parts of the WHERE (90)
by worklog-noreply@askmonty.org 24 Mar '10
by worklog-noreply@askmonty.org 24 Mar '10
24 Mar '10
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: Subqueries: Inside-out execution for non-semijoin materialized
subqueries that are AND-parts of the WHERE
CREATION DATE..: Sun, 28 Feb 2010, 13:45
SUPERVISOR.....: Monty
IMPLEMENTOR....: Psergey
COPIES TO......: Igor, Psergey, Timour
CATEGORY.......: Server-RawIdeaBin
TASK ID........: 90 (http://askmonty.org/worklog/?tid=90)
VERSION........: Server-5.3
STATUS.........: Assigned
PRIORITY.......: 60
WORKED HOURS...: 0
ESTIMATE.......: -1 (hours remain)
ORIG. ESTIMATE.: 0
PROGRESS NOTES:
-=-=(Psergey - Wed, 24 Mar 2010, 14:42)=-=-
Low Level Design modified.
--- /tmp/wklog.90.old.19182 2010-03-24 14:42:54.000000000 +0000
+++ /tmp/wklog.90.new.19182 2010-03-24 14:42:54.000000000 +0000
@@ -1 +1,140 @@
+<contents>
+1. Applicability check
+2. Representation
+2.1 Option #1: Convert to TABLE_LIST
+2.2 On subquery predicate removal
+2.3 Option #2: No TABLE_LIST, convert to JOIN_TAB right away
+2.3 What is expected of the result of conversion
+3. Pre-optimization steps
+3.1 Constant detection
+3.3 update_ref_and_keys
+3.4 JOIN_TAB sorting criteria
+4. Optimization
+5. Execution
+User interface.
+</contents>
+
+We'll call the new execution strategy "jtbm-materialization", for the lack of
+better name.
+
+1. Applicability check
+======================
+The criteria for checking whether a subquery can be processed with
+jtbm-materialization can be checked at JOIN::prepare stage (like it
+happens with semi-join check)
+
+2. Representation
+=================
+
+2.1 Option #1: Convert to TABLE_LIST
+------------------------------------
+Make it work like semi-join nests: each jtbm-predicate is converted into a
+TABLE_LIST object. This will make it
+
+ - uniform with semi-joins (we've stepped on all rakes there)
+ - allow to process JTBM-subqueries in ON expressions
+
+simplify_joins() will handle jtbm TABLE_LISTs as some kinds of opaque base
+tables.
+
+for EXPLAIN EXTENDED, it would be natural to print something semi-join like,
+i.e. for
+
+ SELECT ... FROM ot WHERE oe IN (SELECT ie FROM materialized-non-sj-select)
+
+we'll print
+
+ SELECT ... FROM ot SJ (SELECT ie FROM materialized-non-sj-select) ON oe=XX
+
+the XX part is not clear. we don't want to print 'ie' the second time here?
+
+2.2 On subquery predicate removal
+---------------------------------
+Q: if we remove the subquery predicate permanently, who will run
+fix_fields() for it? For semi-joins we don't have the problem as we
+inject into ON expression (right? or not? we have sj_on_expr, too...
+(Investigation: we the the same Item* pointer both in WHERE and
+as sj_on_expr. fix_fields() is called for the WHERE part and that's
+how sj_on_expr gets fixed. This works as long as
+Item_func_eq::fix_fields() does not try to substitute itself with
+another item).
+A: ?
+
+2.3 Option #2: No TABLE_LIST, convert to JOIN_TAB right away
+------------------------------------------------------------
+JOIN_TABs only live for the duration of one PS re-execution, so we'll have to
+- make conversion fully undoable
+- perform it sufficiently late in the optimization process, at the point
+ where JOIN_TABs are already allocated.
+
+Note that if we don't position JTBM predicates in join's TABLE_LIST tree, then
+it will be impossible to handle JTBM queries inside/outside of outer joins.
+
+2.3 What is expected of the result of conversion
+------------------------------------------------
+Join [pre]optimization relies on each optimized entity to have a bit in
+table_map.
+
+TODO: where do we check if there will be enough bits for everyone? (at the
+ point where we assign them?)
+
+The bit stored in join_tab->table->map, and the apparent problem is that JTBM
+join_tabs do not naturally have TABLE* object.
+
+We could use the the one that will be used for Materialization, but that will
+stop working when we will have to include IN->EXISTS in the choice.
+
+Current approach: don't create a table. create a table_map element in JOIN_TAB
+instead. Evgen has probably done something like that already.
+
+3. Pre-optimization steps
+=========================
+JOIN_TABs are allocated in make_join_statistics(). This where the changes will
+be needed: for JOIN_TABs that correspond to JTBM-tables:
+
+- don't set tab->table, set tab->jtbm_select (or whatever)
+- run subquery's optimizer to get its output cardinality
+
+3.1 Constant detection
+----------------------
+What about subqueries that "are constant"?
+ const_item IN (SELECT uncorrelated) -> is constant, but not something
+ we would want to evaluate.
+ something IN (SELECT from_constant_join) -> is constant
+
+Do we need to mark their JOIN_TABs as constant?
+
+3.3 update_ref_and_keys
+-----------------------
+* Walk through JTBM elements and inject KEYUSE elements for their
+ IN-equalities.
+
+TODO: KEYUSE elements imply presense of KEYs! Which we don't have!
+
+3.4 JOIN_TAB sorting criteria
+-----------------------------
+Q: Where do we put JTBM's join_tab when pre-sorting records?
+A: it should sort as regular table.
+
+TODO: where do we remove the predicates from the WHERE?
+ - remove them like SJ-converter does
+ - remove them with optimizer (like remove_eq_conds does)
+
+4. Optimization
+===============
+Add a branch in best_access_path to account for
+- JTBM-Materialization
+- JTBM-Materialization-Scan.
+
+5. Execution
+============
+* We should be able to reuse item_subselect.cc code for lookups
+* But will have to use our own temptable scan code
+
+TODO: is it possible to have any unification with SJ-Materialization?
+
+User interface
+--------------
+Any @@optimizer_switch flags for all this?
+
-=-=(Igor - Wed, 10 Mar 2010, 22:02)=-=-
High Level Description modified.
--- /tmp/wklog.90.old.2007 2010-03-10 22:02:23.000000000 +0000
+++ /tmp/wklog.90.new.2007 2010-03-10 22:02:23.000000000 +0000
@@ -13,8 +13,8 @@
for each record R2 in big_table such that oe=R1
pass R2 to output
-Semi-join materialization supports such strategy with SJM-Scan strategy. This WL
-entry is about adding support for such strategies for non-semijoin subqueries.
+Semi-join materialization supports the inside-out strategy. This WL entry is
+about adding support for such strategies for non-semijoin subqueries.
Once WL#89 is done, there will be a cost-based choice between
-=-=(Igor - Wed, 10 Mar 2010, 21:52)=-=-
Status updated.
--- /tmp/wklog.90.old.882 2010-03-10 21:52:02.000000000 +0000
+++ /tmp/wklog.90.new.882 2010-03-10 21:52:02.000000000 +0000
@@ -1 +1 @@
-Un-Assigned
+Assigned
-=-=(Psergey - Sun, 28 Feb 2010, 15:37)=-=-
High Level Description modified.
--- /tmp/wklog.90.old.23524 2010-02-28 15:37:47.000000000 +0000
+++ /tmp/wklog.90.new.23524 2010-02-28 15:37:47.000000000 +0000
@@ -15,3 +15,7 @@
Semi-join materialization supports such strategy with SJM-Scan strategy. This WL
entry is about adding support for such strategies for non-semijoin subqueries.
+
+
+Once WL#89 is done, there will be a cost-based choice between
+Materialization+lookup, Materialization+scan, and IN->EXISTS+lookup strategies.
-=-=(Psergey - Sun, 28 Feb 2010, 15:22)=-=-
High-Level Specification modified.
--- /tmp/wklog.90.old.23033 2010-02-28 15:22:09.000000000 +0000
+++ /tmp/wklog.90.new.23033 2010-02-28 15:22:09.000000000 +0000
@@ -1 +1,33 @@
+Basic idea on how this could be achieved:
+
+Pre-optimization phase
+----------------------
+
+The rewrite
+~~~~~~~~~~~
+If we find a subquery predicate that is
+- not processed by current semi-join optimizations
+- is an AND-part of the WHERE/ON clause
+- can be executed with Materialization
+
+then
+- Remove the predicate from WHERE/ON clause
+- Add a special JOIN_TAB object instead.
+
+Plan options
+~~~~~~~~~~~~
+- Use the IN-equality to create KEYUSE elements.
+
+Optimization
+------------
+- Pre-optimize the subquery so we know materialization cost
+- Whenever best_access_path() encounters the "special JOIN_TAB" it should
+ consider two strategies:
+ A. Materialization and making lookups in the materialized table (if applicable)
+ B. Materialization and then scanning the materialized table.
+
+
+EXPLAIN
+-------
+TODO how this will look in EXPLAIN output?
-=-=(Psergey - Sun, 28 Feb 2010, 14:56)=-=-
Dependency created: 91 now depends on 90
-=-=(Psergey - Sun, 28 Feb 2010, 14:54)=-=-
Dependency deleted: 94 no longer depends on 90
-=-=(Psergey - Sun, 28 Feb 2010, 14:47)=-=-
Title modified.
--- /tmp/wklog.90.old.21903 2010-02-28 14:47:54.000000000 +0000
+++ /tmp/wklog.90.new.21903 2010-02-28 14:47:54.000000000 +0000
@@ -1 +1 @@
- Subqueries: Inside-out execution for non-semijoin materialized subqueries that are AND-parts of the WHERE
+Subqueries: Inside-out execution for non-semijoin materialized subqueries that are AND-parts of the WHERE
-=-=(Psergey - Sun, 28 Feb 2010, 14:47)=-=-
High Level Description modified.
--- /tmp/wklog.90.old.21880 2010-02-28 14:47:28.000000000 +0000
+++ /tmp/wklog.90.new.21880 2010-02-28 14:47:28.000000000 +0000
@@ -1,10 +1,17 @@
-For uncorrelated IN subqueries that can't be converted to semi-joins it is
-necessary to make a cost-based choice between IN->EXISTS and Materialization
-strategies.
+Consider the following case:
-Both strategies handle two cases:
-1. A simple case w/o NULLs handling
-2. Handling NULLs.
+SELECT * FROM big_table
+WHERE oe IN (SELECT ie FROM table_with_few_groups
+ WHERE ...
+ GROUP BY group_col) AND ...
-This WL is about making cost-based decision for #1.
+Here the best way to execute the query is:
+ Materialize the subquery;
+ # now run the join:
+ for each record R1 in materialized table
+ for each record R2 in big_table such that oe=R1
+ pass R2 to output
+
+Semi-join materialization supports such strategy with SJM-Scan strategy. This WL
+entry is about adding support for such strategies for non-semijoin subqueries.
-=-=(Psergey - Sun, 28 Feb 2010, 14:47)=-=-
Title modified.
--- /tmp/wklog.90.old.21859 2010-02-28 14:47:02.000000000 +0000
+++ /tmp/wklog.90.new.21859 2010-02-28 14:47:02.000000000 +0000
@@ -1 +1 @@
-Subqueries: cost-based choice between Materialization and IN->EXISTS transformation
+ Subqueries: Inside-out execution for non-semijoin materialized subqueries that are AND-parts of the WHERE
------------------------------------------------------------
-=-=(View All Progress Notes, 11 total)=-=-
http://askmonty.org/worklog/index.pl?tid=90&nolimit=1
DESCRIPTION:
Consider the following case:
SELECT * FROM big_table
WHERE oe IN (SELECT ie FROM table_with_few_groups
WHERE ...
GROUP BY group_col) AND ...
Here the best way to execute the query is:
Materialize the subquery;
# now run the join:
for each record R1 in materialized table
for each record R2 in big_table such that oe=R1
pass R2 to output
Semi-join materialization supports the inside-out strategy. This WL entry is
about adding support for such strategies for non-semijoin subqueries.
Once WL#89 is done, there will be a cost-based choice between
Materialization+lookup, Materialization+scan, and IN->EXISTS+lookup strategies.
HIGH-LEVEL SPECIFICATION:
Basic idea on how this could be achieved:
Pre-optimization phase
----------------------
The rewrite
~~~~~~~~~~~
If we find a subquery predicate that is
- not processed by current semi-join optimizations
- is an AND-part of the WHERE/ON clause
- can be executed with Materialization
then
- Remove the predicate from WHERE/ON clause
- Add a special JOIN_TAB object instead.
Plan options
~~~~~~~~~~~~
- Use the IN-equality to create KEYUSE elements.
Optimization
------------
- Pre-optimize the subquery so we know materialization cost
- Whenever best_access_path() encounters the "special JOIN_TAB" it should
consider two strategies:
A. Materialization and making lookups in the materialized table (if applicable)
B. Materialization and then scanning the materialized table.
EXPLAIN
-------
TODO how this will look in EXPLAIN output?
LOW-LEVEL DESIGN:
<contents>
1. Applicability check
2. Representation
2.1 Option #1: Convert to TABLE_LIST
2.2 On subquery predicate removal
2.3 Option #2: No TABLE_LIST, convert to JOIN_TAB right away
2.3 What is expected of the result of conversion
3. Pre-optimization steps
3.1 Constant detection
3.3 update_ref_and_keys
3.4 JOIN_TAB sorting criteria
4. Optimization
5. Execution
User interface.
</contents>
We'll call the new execution strategy "jtbm-materialization", for the lack of
better name.
1. Applicability check
======================
The criteria for checking whether a subquery can be processed with
jtbm-materialization can be checked at JOIN::prepare stage (like it
happens with semi-join check)
2. Representation
=================
2.1 Option #1: Convert to TABLE_LIST
------------------------------------
Make it work like semi-join nests: each jtbm-predicate is converted into a
TABLE_LIST object. This will make it
- uniform with semi-joins (we've stepped on all rakes there)
- allow to process JTBM-subqueries in ON expressions
simplify_joins() will handle jtbm TABLE_LISTs as some kinds of opaque base
tables.
for EXPLAIN EXTENDED, it would be natural to print something semi-join like,
i.e. for
SELECT ... FROM ot WHERE oe IN (SELECT ie FROM materialized-non-sj-select)
we'll print
SELECT ... FROM ot SJ (SELECT ie FROM materialized-non-sj-select) ON oe=XX
the XX part is not clear. we don't want to print 'ie' the second time here?
2.2 On subquery predicate removal
---------------------------------
Q: if we remove the subquery predicate permanently, who will run
fix_fields() for it? For semi-joins we don't have the problem as we
inject into ON expression (right? or not? we have sj_on_expr, too...
(Investigation: we the the same Item* pointer both in WHERE and
as sj_on_expr. fix_fields() is called for the WHERE part and that's
how sj_on_expr gets fixed. This works as long as
Item_func_eq::fix_fields() does not try to substitute itself with
another item).
A: ?
2.3 Option #2: No TABLE_LIST, convert to JOIN_TAB right away
------------------------------------------------------------
JOIN_TABs only live for the duration of one PS re-execution, so we'll have to
- make conversion fully undoable
- perform it sufficiently late in the optimization process, at the point
where JOIN_TABs are already allocated.
Note that if we don't position JTBM predicates in join's TABLE_LIST tree, then
it will be impossible to handle JTBM queries inside/outside of outer joins.
2.3 What is expected of the result of conversion
------------------------------------------------
Join [pre]optimization relies on each optimized entity to have a bit in
table_map.
TODO: where do we check if there will be enough bits for everyone? (at the
point where we assign them?)
The bit stored in join_tab->table->map, and the apparent problem is that JTBM
join_tabs do not naturally have TABLE* object.
We could use the the one that will be used for Materialization, but that will
stop working when we will have to include IN->EXISTS in the choice.
Current approach: don't create a table. create a table_map element in JOIN_TAB
instead. Evgen has probably done something like that already.
3. Pre-optimization steps
=========================
JOIN_TABs are allocated in make_join_statistics(). This where the changes will
be needed: for JOIN_TABs that correspond to JTBM-tables:
- don't set tab->table, set tab->jtbm_select (or whatever)
- run subquery's optimizer to get its output cardinality
3.1 Constant detection
----------------------
What about subqueries that "are constant"?
const_item IN (SELECT uncorrelated) -> is constant, but not something
we would want to evaluate.
something IN (SELECT from_constant_join) -> is constant
Do we need to mark their JOIN_TABs as constant?
3.3 update_ref_and_keys
-----------------------
* Walk through JTBM elements and inject KEYUSE elements for their
IN-equalities.
TODO: KEYUSE elements imply presense of KEYs! Which we don't have!
3.4 JOIN_TAB sorting criteria
-----------------------------
Q: Where do we put JTBM's join_tab when pre-sorting records?
A: it should sort as regular table.
TODO: where do we remove the predicates from the WHERE?
- remove them like SJ-converter does
- remove them with optimizer (like remove_eq_conds does)
4. Optimization
===============
Add a branch in best_access_path to account for
- JTBM-Materialization
- JTBM-Materialization-Scan.
5. Execution
============
* We should be able to reuse item_subselect.cc code for lookups
* But will have to use our own temptable scan code
TODO: is it possible to have any unification with SJ-Materialization?
User interface
--------------
Any @@optimizer_switch flags for all this?
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v3.5.9)
1
0
[Maria-developers] Updated (by Psergey): Subqueries: Inside-out execution for non-semijoin materialized subqueries that are AND-parts of the WHERE (90)
by worklog-noreply@askmonty.org 24 Mar '10
by worklog-noreply@askmonty.org 24 Mar '10
24 Mar '10
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: Subqueries: Inside-out execution for non-semijoin materialized
subqueries that are AND-parts of the WHERE
CREATION DATE..: Sun, 28 Feb 2010, 13:45
SUPERVISOR.....: Monty
IMPLEMENTOR....: Psergey
COPIES TO......: Igor, Psergey, Timour
CATEGORY.......: Server-RawIdeaBin
TASK ID........: 90 (http://askmonty.org/worklog/?tid=90)
VERSION........: Server-5.3
STATUS.........: Assigned
PRIORITY.......: 60
WORKED HOURS...: 0
ESTIMATE.......: -1 (hours remain)
ORIG. ESTIMATE.: 0
PROGRESS NOTES:
-=-=(Psergey - Wed, 24 Mar 2010, 14:42)=-=-
Low Level Design modified.
--- /tmp/wklog.90.old.19182 2010-03-24 14:42:54.000000000 +0000
+++ /tmp/wklog.90.new.19182 2010-03-24 14:42:54.000000000 +0000
@@ -1 +1,140 @@
+<contents>
+1. Applicability check
+2. Representation
+2.1 Option #1: Convert to TABLE_LIST
+2.2 On subquery predicate removal
+2.3 Option #2: No TABLE_LIST, convert to JOIN_TAB right away
+2.3 What is expected of the result of conversion
+3. Pre-optimization steps
+3.1 Constant detection
+3.3 update_ref_and_keys
+3.4 JOIN_TAB sorting criteria
+4. Optimization
+5. Execution
+User interface.
+</contents>
+
+We'll call the new execution strategy "jtbm-materialization", for the lack of
+better name.
+
+1. Applicability check
+======================
+The criteria for checking whether a subquery can be processed with
+jtbm-materialization can be checked at JOIN::prepare stage (like it
+happens with semi-join check)
+
+2. Representation
+=================
+
+2.1 Option #1: Convert to TABLE_LIST
+------------------------------------
+Make it work like semi-join nests: each jtbm-predicate is converted into a
+TABLE_LIST object. This will make it
+
+ - uniform with semi-joins (we've stepped on all rakes there)
+ - allow to process JTBM-subqueries in ON expressions
+
+simplify_joins() will handle jtbm TABLE_LISTs as some kinds of opaque base
+tables.
+
+for EXPLAIN EXTENDED, it would be natural to print something semi-join like,
+i.e. for
+
+ SELECT ... FROM ot WHERE oe IN (SELECT ie FROM materialized-non-sj-select)
+
+we'll print
+
+ SELECT ... FROM ot SJ (SELECT ie FROM materialized-non-sj-select) ON oe=XX
+
+the XX part is not clear. we don't want to print 'ie' the second time here?
+
+2.2 On subquery predicate removal
+---------------------------------
+Q: if we remove the subquery predicate permanently, who will run
+fix_fields() for it? For semi-joins we don't have the problem as we
+inject into ON expression (right? or not? we have sj_on_expr, too...
+(Investigation: we the the same Item* pointer both in WHERE and
+as sj_on_expr. fix_fields() is called for the WHERE part and that's
+how sj_on_expr gets fixed. This works as long as
+Item_func_eq::fix_fields() does not try to substitute itself with
+another item).
+A: ?
+
+2.3 Option #2: No TABLE_LIST, convert to JOIN_TAB right away
+------------------------------------------------------------
+JOIN_TABs only live for the duration of one PS re-execution, so we'll have to
+- make conversion fully undoable
+- perform it sufficiently late in the optimization process, at the point
+ where JOIN_TABs are already allocated.
+
+Note that if we don't position JTBM predicates in join's TABLE_LIST tree, then
+it will be impossible to handle JTBM queries inside/outside of outer joins.
+
+2.3 What is expected of the result of conversion
+------------------------------------------------
+Join [pre]optimization relies on each optimized entity to have a bit in
+table_map.
+
+TODO: where do we check if there will be enough bits for everyone? (at the
+ point where we assign them?)
+
+The bit stored in join_tab->table->map, and the apparent problem is that JTBM
+join_tabs do not naturally have TABLE* object.
+
+We could use the the one that will be used for Materialization, but that will
+stop working when we will have to include IN->EXISTS in the choice.
+
+Current approach: don't create a table. create a table_map element in JOIN_TAB
+instead. Evgen has probably done something like that already.
+
+3. Pre-optimization steps
+=========================
+JOIN_TABs are allocated in make_join_statistics(). This where the changes will
+be needed: for JOIN_TABs that correspond to JTBM-tables:
+
+- don't set tab->table, set tab->jtbm_select (or whatever)
+- run subquery's optimizer to get its output cardinality
+
+3.1 Constant detection
+----------------------
+What about subqueries that "are constant"?
+ const_item IN (SELECT uncorrelated) -> is constant, but not something
+ we would want to evaluate.
+ something IN (SELECT from_constant_join) -> is constant
+
+Do we need to mark their JOIN_TABs as constant?
+
+3.3 update_ref_and_keys
+-----------------------
+* Walk through JTBM elements and inject KEYUSE elements for their
+ IN-equalities.
+
+TODO: KEYUSE elements imply presense of KEYs! Which we don't have!
+
+3.4 JOIN_TAB sorting criteria
+-----------------------------
+Q: Where do we put JTBM's join_tab when pre-sorting records?
+A: it should sort as regular table.
+
+TODO: where do we remove the predicates from the WHERE?
+ - remove them like SJ-converter does
+ - remove them with optimizer (like remove_eq_conds does)
+
+4. Optimization
+===============
+Add a branch in best_access_path to account for
+- JTBM-Materialization
+- JTBM-Materialization-Scan.
+
+5. Execution
+============
+* We should be able to reuse item_subselect.cc code for lookups
+* But will have to use our own temptable scan code
+
+TODO: is it possible to have any unification with SJ-Materialization?
+
+User interface
+--------------
+Any @@optimizer_switch flags for all this?
+
-=-=(Igor - Wed, 10 Mar 2010, 22:02)=-=-
High Level Description modified.
--- /tmp/wklog.90.old.2007 2010-03-10 22:02:23.000000000 +0000
+++ /tmp/wklog.90.new.2007 2010-03-10 22:02:23.000000000 +0000
@@ -13,8 +13,8 @@
for each record R2 in big_table such that oe=R1
pass R2 to output
-Semi-join materialization supports such strategy with SJM-Scan strategy. This WL
-entry is about adding support for such strategies for non-semijoin subqueries.
+Semi-join materialization supports the inside-out strategy. This WL entry is
+about adding support for such strategies for non-semijoin subqueries.
Once WL#89 is done, there will be a cost-based choice between
-=-=(Igor - Wed, 10 Mar 2010, 21:52)=-=-
Status updated.
--- /tmp/wklog.90.old.882 2010-03-10 21:52:02.000000000 +0000
+++ /tmp/wklog.90.new.882 2010-03-10 21:52:02.000000000 +0000
@@ -1 +1 @@
-Un-Assigned
+Assigned
-=-=(Psergey - Sun, 28 Feb 2010, 15:37)=-=-
High Level Description modified.
--- /tmp/wklog.90.old.23524 2010-02-28 15:37:47.000000000 +0000
+++ /tmp/wklog.90.new.23524 2010-02-28 15:37:47.000000000 +0000
@@ -15,3 +15,7 @@
Semi-join materialization supports such strategy with SJM-Scan strategy. This WL
entry is about adding support for such strategies for non-semijoin subqueries.
+
+
+Once WL#89 is done, there will be a cost-based choice between
+Materialization+lookup, Materialization+scan, and IN->EXISTS+lookup strategies.
-=-=(Psergey - Sun, 28 Feb 2010, 15:22)=-=-
High-Level Specification modified.
--- /tmp/wklog.90.old.23033 2010-02-28 15:22:09.000000000 +0000
+++ /tmp/wklog.90.new.23033 2010-02-28 15:22:09.000000000 +0000
@@ -1 +1,33 @@
+Basic idea on how this could be achieved:
+
+Pre-optimization phase
+----------------------
+
+The rewrite
+~~~~~~~~~~~
+If we find a subquery predicate that is
+- not processed by current semi-join optimizations
+- is an AND-part of the WHERE/ON clause
+- can be executed with Materialization
+
+then
+- Remove the predicate from WHERE/ON clause
+- Add a special JOIN_TAB object instead.
+
+Plan options
+~~~~~~~~~~~~
+- Use the IN-equality to create KEYUSE elements.
+
+Optimization
+------------
+- Pre-optimize the subquery so we know materialization cost
+- Whenever best_access_path() encounters the "special JOIN_TAB" it should
+ consider two strategies:
+ A. Materialization and making lookups in the materialized table (if applicable)
+ B. Materialization and then scanning the materialized table.
+
+
+EXPLAIN
+-------
+TODO how this will look in EXPLAIN output?
-=-=(Psergey - Sun, 28 Feb 2010, 14:56)=-=-
Dependency created: 91 now depends on 90
-=-=(Psergey - Sun, 28 Feb 2010, 14:54)=-=-
Dependency deleted: 94 no longer depends on 90
-=-=(Psergey - Sun, 28 Feb 2010, 14:47)=-=-
Title modified.
--- /tmp/wklog.90.old.21903 2010-02-28 14:47:54.000000000 +0000
+++ /tmp/wklog.90.new.21903 2010-02-28 14:47:54.000000000 +0000
@@ -1 +1 @@
- Subqueries: Inside-out execution for non-semijoin materialized subqueries that are AND-parts of the WHERE
+Subqueries: Inside-out execution for non-semijoin materialized subqueries that are AND-parts of the WHERE
-=-=(Psergey - Sun, 28 Feb 2010, 14:47)=-=-
High Level Description modified.
--- /tmp/wklog.90.old.21880 2010-02-28 14:47:28.000000000 +0000
+++ /tmp/wklog.90.new.21880 2010-02-28 14:47:28.000000000 +0000
@@ -1,10 +1,17 @@
-For uncorrelated IN subqueries that can't be converted to semi-joins it is
-necessary to make a cost-based choice between IN->EXISTS and Materialization
-strategies.
+Consider the following case:
-Both strategies handle two cases:
-1. A simple case w/o NULLs handling
-2. Handling NULLs.
+SELECT * FROM big_table
+WHERE oe IN (SELECT ie FROM table_with_few_groups
+ WHERE ...
+ GROUP BY group_col) AND ...
-This WL is about making cost-based decision for #1.
+Here the best way to execute the query is:
+ Materialize the subquery;
+ # now run the join:
+ for each record R1 in materialized table
+ for each record R2 in big_table such that oe=R1
+ pass R2 to output
+
+Semi-join materialization supports such strategy with SJM-Scan strategy. This WL
+entry is about adding support for such strategies for non-semijoin subqueries.
-=-=(Psergey - Sun, 28 Feb 2010, 14:47)=-=-
Title modified.
--- /tmp/wklog.90.old.21859 2010-02-28 14:47:02.000000000 +0000
+++ /tmp/wklog.90.new.21859 2010-02-28 14:47:02.000000000 +0000
@@ -1 +1 @@
-Subqueries: cost-based choice between Materialization and IN->EXISTS transformation
+ Subqueries: Inside-out execution for non-semijoin materialized subqueries that are AND-parts of the WHERE
------------------------------------------------------------
-=-=(View All Progress Notes, 11 total)=-=-
http://askmonty.org/worklog/index.pl?tid=90&nolimit=1
DESCRIPTION:
Consider the following case:
SELECT * FROM big_table
WHERE oe IN (SELECT ie FROM table_with_few_groups
WHERE ...
GROUP BY group_col) AND ...
Here the best way to execute the query is:
Materialize the subquery;
# now run the join:
for each record R1 in materialized table
for each record R2 in big_table such that oe=R1
pass R2 to output
Semi-join materialization supports the inside-out strategy. This WL entry is
about adding support for such strategies for non-semijoin subqueries.
Once WL#89 is done, there will be a cost-based choice between
Materialization+lookup, Materialization+scan, and IN->EXISTS+lookup strategies.
HIGH-LEVEL SPECIFICATION:
Basic idea on how this could be achieved:
Pre-optimization phase
----------------------
The rewrite
~~~~~~~~~~~
If we find a subquery predicate that is
- not processed by current semi-join optimizations
- is an AND-part of the WHERE/ON clause
- can be executed with Materialization
then
- Remove the predicate from WHERE/ON clause
- Add a special JOIN_TAB object instead.
Plan options
~~~~~~~~~~~~
- Use the IN-equality to create KEYUSE elements.
Optimization
------------
- Pre-optimize the subquery so we know materialization cost
- Whenever best_access_path() encounters the "special JOIN_TAB" it should
consider two strategies:
A. Materialization and making lookups in the materialized table (if applicable)
B. Materialization and then scanning the materialized table.
EXPLAIN
-------
TODO how this will look in EXPLAIN output?
LOW-LEVEL DESIGN:
<contents>
1. Applicability check
2. Representation
2.1 Option #1: Convert to TABLE_LIST
2.2 On subquery predicate removal
2.3 Option #2: No TABLE_LIST, convert to JOIN_TAB right away
2.3 What is expected of the result of conversion
3. Pre-optimization steps
3.1 Constant detection
3.3 update_ref_and_keys
3.4 JOIN_TAB sorting criteria
4. Optimization
5. Execution
User interface.
</contents>
We'll call the new execution strategy "jtbm-materialization", for the lack of
better name.
1. Applicability check
======================
The criteria for checking whether a subquery can be processed with
jtbm-materialization can be checked at JOIN::prepare stage (like it
happens with semi-join check)
2. Representation
=================
2.1 Option #1: Convert to TABLE_LIST
------------------------------------
Make it work like semi-join nests: each jtbm-predicate is converted into a
TABLE_LIST object. This will make it
- uniform with semi-joins (we've stepped on all rakes there)
- allow to process JTBM-subqueries in ON expressions
simplify_joins() will handle jtbm TABLE_LISTs as some kinds of opaque base
tables.
for EXPLAIN EXTENDED, it would be natural to print something semi-join like,
i.e. for
SELECT ... FROM ot WHERE oe IN (SELECT ie FROM materialized-non-sj-select)
we'll print
SELECT ... FROM ot SJ (SELECT ie FROM materialized-non-sj-select) ON oe=XX
the XX part is not clear. we don't want to print 'ie' the second time here?
2.2 On subquery predicate removal
---------------------------------
Q: if we remove the subquery predicate permanently, who will run
fix_fields() for it? For semi-joins we don't have the problem as we
inject into ON expression (right? or not? we have sj_on_expr, too...
(Investigation: we the the same Item* pointer both in WHERE and
as sj_on_expr. fix_fields() is called for the WHERE part and that's
how sj_on_expr gets fixed. This works as long as
Item_func_eq::fix_fields() does not try to substitute itself with
another item).
A: ?
2.3 Option #2: No TABLE_LIST, convert to JOIN_TAB right away
------------------------------------------------------------
JOIN_TABs only live for the duration of one PS re-execution, so we'll have to
- make conversion fully undoable
- perform it sufficiently late in the optimization process, at the point
where JOIN_TABs are already allocated.
Note that if we don't position JTBM predicates in join's TABLE_LIST tree, then
it will be impossible to handle JTBM queries inside/outside of outer joins.
2.3 What is expected of the result of conversion
------------------------------------------------
Join [pre]optimization relies on each optimized entity to have a bit in
table_map.
TODO: where do we check if there will be enough bits for everyone? (at the
point where we assign them?)
The bit stored in join_tab->table->map, and the apparent problem is that JTBM
join_tabs do not naturally have TABLE* object.
We could use the the one that will be used for Materialization, but that will
stop working when we will have to include IN->EXISTS in the choice.
Current approach: don't create a table. create a table_map element in JOIN_TAB
instead. Evgen has probably done something like that already.
3. Pre-optimization steps
=========================
JOIN_TABs are allocated in make_join_statistics(). This where the changes will
be needed: for JOIN_TABs that correspond to JTBM-tables:
- don't set tab->table, set tab->jtbm_select (or whatever)
- run subquery's optimizer to get its output cardinality
3.1 Constant detection
----------------------
What about subqueries that "are constant"?
const_item IN (SELECT uncorrelated) -> is constant, but not something
we would want to evaluate.
something IN (SELECT from_constant_join) -> is constant
Do we need to mark their JOIN_TABs as constant?
3.3 update_ref_and_keys
-----------------------
* Walk through JTBM elements and inject KEYUSE elements for their
IN-equalities.
TODO: KEYUSE elements imply presense of KEYs! Which we don't have!
3.4 JOIN_TAB sorting criteria
-----------------------------
Q: Where do we put JTBM's join_tab when pre-sorting records?
A: it should sort as regular table.
TODO: where do we remove the predicates from the WHERE?
- remove them like SJ-converter does
- remove them with optimizer (like remove_eq_conds does)
4. Optimization
===============
Add a branch in best_access_path to account for
- JTBM-Materialization
- JTBM-Materialization-Scan.
5. Execution
============
* We should be able to reuse item_subselect.cc code for lookups
* But will have to use our own temptable scan code
TODO: is it possible to have any unification with SJ-Materialization?
User interface
--------------
Any @@optimizer_switch flags for all this?
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v3.5.9)
1
0
[Maria-developers] Progress (by Knielsen): New replication APIs (107)
by worklog-noreply@askmonty.org 24 Mar '10
by worklog-noreply@askmonty.org 24 Mar '10
24 Mar '10
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: New replication APIs
CREATION DATE..: Mon, 15 Mar 2010, 13:55
SUPERVISOR.....: Knielsen
IMPLEMENTOR....: Knielsen
COPIES TO......: Sergei
CATEGORY.......: Server-Sprint
TASK ID........: 107 (http://askmonty.org/worklog/?tid=107)
VERSION........: Server-9.x
STATUS.........: Un-Assigned
PRIORITY.......: 60
WORKED HOURS...: 36
ESTIMATE.......: 0 (hours remain)
ORIG. ESTIMATE.: 0
PROGRESS NOTES:
-=-=(Knielsen - Wed, 24 Mar 2010, 10:39)=-=-
Design discussions
Worked 11 hours and estimate 0 hours remain (original estimate increased by 11 hours).
-=-=(Knielsen - Mon, 15 Mar 2010, 14:28)=-=-
Research into the problem, and discussions on phone/mailing list
Worked 25 hours and estimate 0 hours remain (original estimate increased by 25 hours).
-=-=(Guest - Mon, 15 Mar 2010, 14:18)=-=-
High-Level Specification modified.
--- /tmp/wklog.107.old.9086 2010-03-15 14:18:18.000000000 +0000
+++ /tmp/wklog.107.new.9086 2010-03-15 14:18:18.000000000 +0000
@@ -1 +1,43 @@
+Current ideas/status after discussions on the mailing list:
+
+ - Implement a set of plugin APIs and use them to move all of the existing
+ MySQL replication into a (set of) plugins.
+
+ - Design the APIs so that they can support full MySQL replication, but also
+ so that they do not hardcode assumptions about how this replication
+ implementation is done, and so that they will be suitable for other types of
+ replication (Tungsten, Galera, parallel replication, ...).
+
+ - APIs need to include the concept of a global transaction ID. Need to
+ determine the extent to which the semantics of such ID will be defined
+ by the API, and to which extend it will be defined by the plugin
+ implementations.
+
+ - APIs should properly support reliable crash-recovery with decent
+ performance (eg. not require multiple mandatory fsync()s per commit, and
+ not make group commit impossible).
+
+ - Would be nice if the API provided facilities for implementing good
+ consistency checking support (mainly checking master tables against slave
+ tables is hard here I think, but also applying wrong binlog data and
+ individual event checksums).
+
+
+Steps to make this more concrete:
+
+ - Investigate the current MySQL replication, and list all of the places where
+ a plugin implementation will need to connect/hook into the MySQL server.
+ * handler::{write,update,delete}_row()
+ * Statement execution
+ * Transaction start/commit
+ * Table open
+ * Query safe/not/safe for statement based replication
+ * Statement-based logging details (user variables, random seed, etc.)
+ * ...
+
+ - Use this list to make an initial sketch of the set of APIs we need.
+
+ - Use the list to determine the feasibility of this project and the level of
+ detail in the API needed to support a full replication implementation as a
+ plugin.
-=-=(Serg - Mon, 15 Mar 2010, 14:13)=-=-
Observers changed: Sergei
DESCRIPTION:
This is a top-level task for the project of designing a new set of replication
APIs for MariaDB.
This task is for the initial discussion of what to do and where to focus.
The project is started in this email thread:
https://lists.launchpad.net/maria-developers/msg01998.html
HIGH-LEVEL SPECIFICATION:
Current ideas/status after discussions on the mailing list:
- Implement a set of plugin APIs and use them to move all of the existing
MySQL replication into a (set of) plugins.
- Design the APIs so that they can support full MySQL replication, but also
so that they do not hardcode assumptions about how this replication
implementation is done, and so that they will be suitable for other types of
replication (Tungsten, Galera, parallel replication, ...).
- APIs need to include the concept of a global transaction ID. Need to
determine the extent to which the semantics of such ID will be defined
by the API, and to which extend it will be defined by the plugin
implementations.
- APIs should properly support reliable crash-recovery with decent
performance (eg. not require multiple mandatory fsync()s per commit, and
not make group commit impossible).
- Would be nice if the API provided facilities for implementing good
consistency checking support (mainly checking master tables against slave
tables is hard here I think, but also applying wrong binlog data and
individual event checksums).
Steps to make this more concrete:
- Investigate the current MySQL replication, and list all of the places where
a plugin implementation will need to connect/hook into the MySQL server.
* handler::{write,update,delete}_row()
* Statement execution
* Transaction start/commit
* Table open
* Query safe/not/safe for statement based replication
* Statement-based logging details (user variables, random seed, etc.)
* ...
- Use this list to make an initial sketch of the set of APIs we need.
- Use the list to determine the feasibility of this project and the level of
detail in the API needed to support a full replication implementation as a
plugin.
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v3.5.9)
1
0
[Maria-developers] Progress (by Knielsen): New replication APIs (107)
by worklog-noreply@askmonty.org 24 Mar '10
by worklog-noreply@askmonty.org 24 Mar '10
24 Mar '10
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: New replication APIs
CREATION DATE..: Mon, 15 Mar 2010, 13:55
SUPERVISOR.....: Knielsen
IMPLEMENTOR....: Knielsen
COPIES TO......: Sergei
CATEGORY.......: Server-Sprint
TASK ID........: 107 (http://askmonty.org/worklog/?tid=107)
VERSION........: Server-9.x
STATUS.........: Un-Assigned
PRIORITY.......: 60
WORKED HOURS...: 36
ESTIMATE.......: 0 (hours remain)
ORIG. ESTIMATE.: 0
PROGRESS NOTES:
-=-=(Knielsen - Wed, 24 Mar 2010, 10:39)=-=-
Design discussions
Worked 11 hours and estimate 0 hours remain (original estimate increased by 11 hours).
-=-=(Knielsen - Mon, 15 Mar 2010, 14:28)=-=-
Research into the problem, and discussions on phone/mailing list
Worked 25 hours and estimate 0 hours remain (original estimate increased by 25 hours).
-=-=(Guest - Mon, 15 Mar 2010, 14:18)=-=-
High-Level Specification modified.
--- /tmp/wklog.107.old.9086 2010-03-15 14:18:18.000000000 +0000
+++ /tmp/wklog.107.new.9086 2010-03-15 14:18:18.000000000 +0000
@@ -1 +1,43 @@
+Current ideas/status after discussions on the mailing list:
+
+ - Implement a set of plugin APIs and use them to move all of the existing
+ MySQL replication into a (set of) plugins.
+
+ - Design the APIs so that they can support full MySQL replication, but also
+ so that they do not hardcode assumptions about how this replication
+ implementation is done, and so that they will be suitable for other types of
+ replication (Tungsten, Galera, parallel replication, ...).
+
+ - APIs need to include the concept of a global transaction ID. Need to
+ determine the extent to which the semantics of such ID will be defined
+ by the API, and to which extend it will be defined by the plugin
+ implementations.
+
+ - APIs should properly support reliable crash-recovery with decent
+ performance (eg. not require multiple mandatory fsync()s per commit, and
+ not make group commit impossible).
+
+ - Would be nice if the API provided facilities for implementing good
+ consistency checking support (mainly checking master tables against slave
+ tables is hard here I think, but also applying wrong binlog data and
+ individual event checksums).
+
+
+Steps to make this more concrete:
+
+ - Investigate the current MySQL replication, and list all of the places where
+ a plugin implementation will need to connect/hook into the MySQL server.
+ * handler::{write,update,delete}_row()
+ * Statement execution
+ * Transaction start/commit
+ * Table open
+ * Query safe/not/safe for statement based replication
+ * Statement-based logging details (user variables, random seed, etc.)
+ * ...
+
+ - Use this list to make an initial sketch of the set of APIs we need.
+
+ - Use the list to determine the feasibility of this project and the level of
+ detail in the API needed to support a full replication implementation as a
+ plugin.
-=-=(Serg - Mon, 15 Mar 2010, 14:13)=-=-
Observers changed: Sergei
DESCRIPTION:
This is a top-level task for the project of designing a new set of replication
APIs for MariaDB.
This task is for the initial discussion of what to do and where to focus.
The project is started in this email thread:
https://lists.launchpad.net/maria-developers/msg01998.html
HIGH-LEVEL SPECIFICATION:
Current ideas/status after discussions on the mailing list:
- Implement a set of plugin APIs and use them to move all of the existing
MySQL replication into a (set of) plugins.
- Design the APIs so that they can support full MySQL replication, but also
so that they do not hardcode assumptions about how this replication
implementation is done, and so that they will be suitable for other types of
replication (Tungsten, Galera, parallel replication, ...).
- APIs need to include the concept of a global transaction ID. Need to
determine the extent to which the semantics of such ID will be defined
by the API, and to which extend it will be defined by the plugin
implementations.
- APIs should properly support reliable crash-recovery with decent
performance (eg. not require multiple mandatory fsync()s per commit, and
not make group commit impossible).
- Would be nice if the API provided facilities for implementing good
consistency checking support (mainly checking master tables against slave
tables is hard here I think, but also applying wrong binlog data and
individual event checksums).
Steps to make this more concrete:
- Investigate the current MySQL replication, and list all of the places where
a plugin implementation will need to connect/hook into the MySQL server.
* handler::{write,update,delete}_row()
* Statement execution
* Transaction start/commit
* Table open
* Query safe/not/safe for statement based replication
* Statement-based logging details (user variables, random seed, etc.)
* ...
- Use this list to make an initial sketch of the set of APIs we need.
- Use the list to determine the feasibility of this project and the level of
detail in the API needed to support a full replication implementation as a
plugin.
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v3.5.9)
1
0
[Maria-developers] Updated (by Knielsen): Table elimination (17)
by worklog-noreply@askmonty.org 24 Mar '10
by worklog-noreply@askmonty.org 24 Mar '10
24 Mar '10
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: Table elimination
CREATION DATE..: Sun, 10 May 2009, 19:57
SUPERVISOR.....: Monty
IMPLEMENTOR....:
COPIES TO......:
CATEGORY.......: Server-Sprint
TASK ID........: 17 (http://askmonty.org/worklog/?tid=17)
VERSION........: Server-9.x
STATUS.........: Assigned
PRIORITY.......: 60
WORKED HOURS...: 1
ESTIMATE.......: 3 (hours remain)
ORIG. ESTIMATE.: 3
PROGRESS NOTES:
-=-=(Guest - Wed, 24 Mar 2010, 05:58)=-=-
Status updated.
--- /tmp/wklog.17.old.13223 2010-03-24 05:58:26.000000000 +0000
+++ /tmp/wklog.17.new.13223 2010-03-24 05:58:26.000000000 +0000
@@ -1 +1 @@
-In-Progress
+Assigned
-=-=(Guest - Wed, 24 Mar 2010, 05:58)=-=-
Privacy level updated.
--- /tmp/wklog.17.old.13223 2010-03-24 05:58:26.000000000 +0000
+++ /tmp/wklog.17.new.13223 2010-03-24 05:58:26.000000000 +0000
@@ -1 +1 @@
-n
+y
-=-=(Guest - Wed, 28 Oct 2009, 02:01)=-=-
Version updated.
--- /tmp/wklog.17.old.24041 2009-10-28 02:01:58.000000000 +0200
+++ /tmp/wklog.17.new.24041 2009-10-28 02:01:58.000000000 +0200
@@ -1 +1 @@
-9.x
+Server-9.x
-=-=(Guest - Sun, 16 Aug 2009, 16:16)=-=-
Category updated.
--- /tmp/wklog.17.old.24882 2009-08-16 16:16:49.000000000 +0300
+++ /tmp/wklog.17.new.24882 2009-08-16 16:16:49.000000000 +0300
@@ -1 +1 @@
-Client-BackLog
+Server-Sprint
-=-=(Guest - Sun, 16 Aug 2009, 16:16)=-=-
Version updated.
--- /tmp/wklog.17.old.24882 2009-08-16 16:16:49.000000000 +0300
+++ /tmp/wklog.17.new.24882 2009-08-16 16:16:49.000000000 +0300
@@ -1 +1 @@
-Server-5.1
+9.x
-=-=(Guest - Wed, 29 Jul 2009, 21:41)=-=-
Low Level Design modified.
--- /tmp/wklog.17.old.26011 2009-07-29 21:41:04.000000000 +0300
+++ /tmp/wklog.17.new.26011 2009-07-29 21:41:04.000000000 +0300
@@ -2,163 +2,146 @@
~maria-captains/maria/maria-5.1-table-elimination tree.
<contents>
-1. Conditions for removal
-1.1 Quick check if there are candidates
-2. Removal operation properties
-3. Removal operation
-4. User interface
-5. Tests and benchmarks
-6. Todo, issues to resolve
-6.1 To resolve
-6.2 Resolved
-7. Additional issues
+1. Elimination criteria
+2. No outside references check
+2.1 Quick check if there are tables with no outside references
+3. One-match check
+3.1 Functional dependency source #1: Potential eq_ref access
+3.2 Functional dependency source #2: col2=func(col1)
+3.3 Functional dependency source #3: One or zero records in the table
+3.4 Functional dependency check implementation
+3.4.1 Equality collection: Option1
+3.4.2 Equality collection: Option2
+3.4.3 Functional dependency propagation - option 1
+3.4.4 Functional dependency propagation - option 2
+4. Removal operation properties
+5. Removal operation
+6. User interface
+6.1 @@optimizer_switch flag
+6.2 EXPLAIN [EXTENDED]
+7. Miscellaneous adjustments
+7.1 Fix used_tables() of aggregate functions
+7.2 Make subquery predicates collect their outer references
+8. Other concerns
+8.1 Relationship with outer->inner joins converter
+8.2 Relationship with prepared statements
+8.3 Relationship with constant table detection
+9. Tests and benchmarks
</contents>
It's not really about elimination of tables, it's about elimination of inner
sides of outer joins.
-1. Conditions for removal
--------------------------
-We can eliminate an inner side of outer join if:
-1. For each record combination of outer tables, it will always produce
- exactly one record.
-2. There are no references to columns of the inner tables anywhere else in
+1. Elimination criteria
+=======================
+We can eliminate inner side of an outer join nest if:
+
+1. There are no references to columns of the inner tables anywhere else in
the query.
+2. For each record combination of outer tables, it will always produce
+ exactly one matching record combination.
+
+Most of effort in this WL entry is checking these two conditions.
-#1 means that every table inside the outer join nest is:
- - is a constant table:
- = because it can be accessed via eq_ref(const) access, or
- = it is a zero-rows or one-row MyISAM-like table [MARK1]
- - has an eq_ref access method candidate.
-
-#2 means that WHERE clause, ON clauses of embedding outer joins, ORDER BY,
- GROUP BY and HAVING do not refer to the inner tables of the outer join
- nest.
-
-1.1 Quick check if there are candidates
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-Before we start to enumerate join nests, here is a quick way to check if
-there *can be* something to be removed:
+2. No outside references check
+==============================
+Criterion #1 means that the WHERE clause, ON clauses of embedding/subsequent
+outer joins, ORDER BY, GROUP BY and HAVING must have no references to inner
+tables of the outer join nest we're trying to remove.
+
+For multi-table UPDATE/DELETE we also must not remove tables that we're
+updating/deleting from or tables that are used in UPDATE's SET clause.
+
+2.1 Quick check if there are tables with no outside references
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+Before we start searching for outer join nests that could be eliminated,
+we'll do a quick and cheap check if there possibly could be something that
+could be eliminated:
- if ((tables used in select_list |
+ if (there are outer joins &&
+ (tables used in select_list |
tables used in group/order by UNION |
- tables used in where) != bitmap_of_all_tables)
+ tables used in where) != bitmap_of_all_join_tables)
{
attempt table elimination;
}
-2. Removal operation properties
--------------------------------
-* There is always one way to remove (no choice to remove either this or that)
-* It is always better to remove as much tables as possible (at least within
- our cost model).
-Thus, no need for any cost calculations/etc. It's an unconditional rewrite.
-3. Removal operation
---------------------
-* Remove the outer join nest's nested join structure (i.e. get the
- outer join's TABLE_LIST object $OJ and remove it from $OJ->embedding,
- $OJ->embedding->nested_join. Update table_map's of all ancestor nested
- joins). [MARK2]
+3. One-match check
+==================
+We can eliminate inner side of outer join if it will always generate exactly
+one matching record combination.
-* Move the tables and their JOIN_TABs to front like it is done with const
- tables, with exception that if eliminated outer join nest was within
- another outer join nest, that shouldn't prevent us from moving away the
- eliminated tables.
+By definition of OUTER JOIN, a NULL-complemented record combination will be
+generated when the inner side of outer join has not produced any matches.
-* Update join->table_count and all-join-tables bitmap.
+What remains to be checked is that there is no possiblity that inner side of
+the outer join could produce more than one matching record combination.
-* That's it. Nothing else?
+We'll refer to one-match property as "functional dependency":
-4. User interface
------------------
-* We'll add an @@optimizer switch flag for table elimination. Tentative
- name: 'table_elimination'.
- (Note ^^ utility of the above questioned ^, as table elimination can never
- be worse than no elimination. We're leaning towards not adding the flag)
-
-* EXPLAIN will not show the removed tables at all. This will allow to check
- if tables were removed, and also will behave nicely with anchor model and
- VIEWs: stuff that user doesn't care about just won't be there.
+- A outer join nest is functionally dependent [wrt outer tables] if it will
+ produce one matching record combination per each record combination of
+ outer tables
-5. Tests and benchmarks
------------------------
-Create a benchmark in sql-bench which checks if the DBMS has table
-elimination.
-[According to Monty] Run
- - queries that would use elimination
- - queries that are very similar to one above (so that they would have same
- QEP, execution cost, etc) but cannot use table elimination.
-then compare run times and make a conclusion about whether dbms supports table
-elimination.
+- A table is functionally dependent wrt certain set of dependency tables, if
+ record combination of dependency tables uniquely identifies zero or one
+ matching record in the table
-6. Todo, issues to resolve
---------------------------
+- Definitions of functional dependency of keys (=column tuples) and columns are
+ apparent.
-6.1 To resolve
-~~~~~~~~~~~~~~
-- Relationship with prepared statements.
- On one hand, it's natural to desire to make table elimination a
- once-per-statement operation, like outer->inner join conversion. We'll have
- to limit the applicability by removing [MARK1] as that can change during
- lifetime of the statement.
-
- The other option is to do table elimination every time. This will require to
- rework operation [MARK2] to be undoable.
-
- I'm leaning towards doing the former. With anchor modeling, it is unlikely
- that we'll meet outer joins which have N inner tables of which some are 1-row
- MyISAM tables that do not have primary key.
-
-6.2 Resolved
-~~~~~~~~~~~~
-* outer->inner join conversion is not a problem for table elimination.
- We make outer->inner conversions based on predicates in WHERE. If the WHERE
- referred to an inner table (requirement for OJ->IJ conversion) then table
- elimination would not be applicable anyway.
-
-* For Multi-table UPDATEs/DELETEs, need to also analyze the SET clause:
- - affected tables must not be eliminated
- - tables that are used on the right side of the SET x=y assignments must
- not be eliminated either.
+Our goal is to prove that the entire join nest is functionally-dependent.
-* Aggregate functions used to report that they depend on all tables, that is,
+Join nest is functionally dependent (on the otside tables) if each of its
+elements (those can be either base tables or join nests) is functionally
+dependent.
- item_agg_func->used_tables() == (1ULL << join->tables) - 1
+Functional dependency is transitive: if table A is f-dependent on the outer
+tables and table B is f.dependent on {A, outer_tables} then B is functionally
+dependent on the outer tables.
+
+Subsequent sections list cases when we can declare a table to be
+functionally-dependent.
+
+3.1 Functional dependency source #1: Potential eq_ref access
+------------------------------------------------------------
+This is the most practically-important case. Taking the example from the HLD
+of this WL entry:
+
+ select
+ A.colA
+ from
+ tableA A
+ left outer join
+ tableB B
+ on
+ B.id = A.id;
- always. Fixed it, now aggregate function reports it depends on
- tables that its arguments depend on. In particular, COUNT(*) reports
- that it depends on no tables (item_count_star->used_tables()==0).
- One consequence of that is that "item->used_tables()==0" is not
- equivalent to "item->const_item()==true" anymore (not sure if it's
- "anymore" or this has been already happening).
-
-* EXPLAIN EXTENDED warning text was generated after the JOIN object has
- been discarded. This didn't allow to use information about join plan
- when printing the warning. Fixed this by keeping the JOIN objects until
- we've printed the warning (have also an intent to remove the const
- tables from the join output).
-
-7. Additional issues
---------------------
-* We remove ON clauses within outer join nests. If these clauses contain
- subqueries, they probably should be gone from EXPLAIN output also?
- Yes. Current approach: when removing an outer join nest, walk the ON clause
- and mark subselects as eliminated. Then let EXPLAIN code check if the
- SELECT was eliminated before the printing (EXPLAIN is generated by doing
- a recursive descent, so the check will also cause children of eliminated
- selects not to be printed)
-
-* Table elimination is performed after constant table detection (but before
- the range analysis). Constant tables are technically different from
- eliminated ones (e.g. the former are shown in EXPLAIN and the latter aren't).
- Considering we've already done the join_read_const_table() call, is there any
- real difference between constant table and eliminated one? If there is, should
- we mark const tables also as eliminated?
- from user/EXPLAIN point of view: no. constant table is the one that we read
- one record from. eliminated table is the one that we don't acccess at all.
+and generalizing it: a table TBL is functionally-dependent if the ON
+expression allows to construct a potential eq_ref access to table TBL that
+uses only outer or functionally-dependent tables.
+
+In other words: table TBL will have one match if the ON expression can be
+converted into this form
+
+ TBL.unique_key=func(one_match_tables) AND .. remainder ...
+
+(with appropriate extension for multi-part keys), where
+
+ one_match_tables= {
+ tables that are not on the inner side of the outer join in question, and
+ functionally dependent tables
+ }
+
+Note that this will cover constant tables, except those that are constant because
+they have 0/1 record or are partitioned and have no used partitions.
+
+
+3.2 Functional dependency source #2: col2=func(col1)
+----------------------------------------------------
+This comes from the second example in the HLS:
-* What is described above will not be able to eliminate this outer join
create unique index idx on tableB (id, fromDate);
...
left outer join
@@ -169,32 +152,331 @@
B.fromDate = (select max(sub.fromDate)
from tableB sub where sub.id = A.id);
- This is because condition "B.fromDate= func(tableB)" cannot be used.
- Reason#1: update_ref_and_keys() does not consider such conditions to
- be of any use (and indeed they are not usable for ref access)
- so they are not put into KEYUSE array.
- Reason#2: even if they were put there, we would need to be able to tell
- between predicates like
- B.fromDate= func(B.id) // guarantees only one matching row as
- // B.id is already bound by B.id=A.id
- // hence B.fromDate becomes bound too.
- and
- "B.fromDate= func(B.*)" // Can potentially have many matching
- // records.
- We need to
- - Have update_ref_and_keys() create KEYUSE elements for such equalities
- - Have eliminate_tables() and friends make a more accurate check.
- The right check is to check whether all parts of a unique key are bound.
- If we have keypartX to be bound, then t.keypartY=func(keypartX) makes
- keypartY to be bound.
- The difficulty here is that correlated subquery predicate cannot tell what
- columns it depends on (it only remembers tables).
- Traversing the predicate is expensive and complicated.
- We're leaning towards making each subquery predicate have a List<Item> with
- items that
- - are in the current select
- - and it depends on.
- This list will be useful in certain other subquery optimizations as well,
- it is cheap to collect it in fix_fields() phase, so it will be collected
- for every subquery predicate.
+Here it is apparent that tableB can be eliminated. It is not possible to
+construct eq_ref access to tableB, though, because for the second part of the
+primary key (fromDate column) we only got a condition in this form:
+
+ B.fromDate= func(tableB)
+
+(we write "func(tableB)" because ref optimizer can only determine which tables
+the right part of the equality depends on).
+
+In general case, equality like this doesn't guarantee functional dependency.
+For example, if func() == { return fromDate;}, i.e the ON expression is
+
+ ... ON B.id = A.id and B.fromDate = B.fromDate
+
+then that would allow table B to have multiple matches per record of table A.
+
+In order to be able to distinguish between these two cases, we'll need to go
+down to column level:
+
+- A table is functionally dependent if it has a unique key that's functionally
+ dependent
+
+- A unique key is functionally dependent when all of its columns are
+ functionally dependent
+
+- A table column is functionally dependent if the ON clause allows to extract
+ an AND-part in this form:
+
+ tbl.column = f(functionally-dependent columns or columns of outer tables)
+
+3.3 Functional dependency source #3: One or zero records in the table
+---------------------------------------------------------------------
+A table with one or zero records cannot generate more than one matching
+record. This source is of lesser importance as one/zero-record tables are only
+MyISAM tables.
+
+3.4 Functional dependency check implementation
+----------------------------------------------
+As shown above, we need something similar to KEYUSE structures, but not
+exactly that (we need things that current ref optimizer considers unusable and
+don't need things that it considers usable).
+
+3.4.1 Equality collection: Option1
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+We could
+- extend KEYUSE structures to store all kinds of equalities we need
+- change update_ref_and_keys() and co. to collect equalities both for ref
+ access and for table elimination
+ = [possibly] Improve [eq_]ref access to be able to use equalities in
+ form keypart2=func(keypart1)
+- process the KEYUSE array both by table elimination and by ref access
+ optimizer.
+
++ This requires less effort.
+- Code will have to be changed all over sql_select.cc
+- update_ref_and_keys() and co. already do several unrelated things. Hooking
+ up table elimination will make it even worse.
+
+3.4.2 Equality collection: Option2
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+Alternatively, we could process the WHERE clause totally on our own.
++ Table elimination is standalone and easy to detach module.
+- Some code duplication with update_ref_and_keys() and co.
+
+Having got the equalities, we'll to propagate functional dependency property
+to unique keys, tables and, ultimately, join nests.
+
+3.4.3 Functional dependency propagation - option 1
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+Borrow the approach used in constant table detection code:
+
+ do
+ {
+ converted= FALSE;
+ for each table T in join nest
+ {
+ if (check_if_functionally_dependent(T))
+ converted= TRUE;
+ }
+ } while (converted == TRUE);
+
+ check_if_functionally_dependent(T)
+ {
+ if (T has eq_ref access based on func_dep_tables)
+ return TRUE;
+
+ Apply the same do-while loop-based approach to available equalities
+ T.column1=func(other columns)
+ to spread the set of functionally-dependent columns. The goal is to get
+ all columns of a certain unique key to be bound.
+ }
+
+
+3.4.4 Functional dependency propagation - option 2
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+Analyze the ON expression(s) and build a list of
+
+ tbl.field = expr(...)
+
+equalities. tbl here is a table that belongs to a join nest that could
+potentially be eliminated.
+
+besides those, add to the list
+ - An element for each unique key in the table that needs to be eliminated
+ - An element for each table that needs to be eliminated
+ - An element for each join nest that can be eliminated (i.e. has no
+ references from outside).
+
+Then, setup "reverse dependencies": each element should have pointers to
+elements that are functionally dependent on it:
+
+- "tbl.field=expr(...)" equality is functionally dependent on all fields that
+ are used in "expr(...)" (here we take into account only fields that belong
+ to tables that can potentially be eliminated).
+- a unique key is dependent on all of its components
+- a table is dependent on all of its unique keys
+- a join nest is dependent on all tables that it contains
+
+These pointers are stored in form of one bitmap, such that:
+
+ "X depends on Y" == test( bitmap[(X's number)*n_objects + (Y's number)] )
+
+Each object also stores a number of dependencies it needs to be satisfied
+before it itself is satisfied:
+
+- "tbl.field=expr(...)" needs all its underlying fields (if a field is
+ referenced many times it is counted only once)
+
+- a unique key needs all of its key parts
+
+- a table needs only one of its unique keys
+
+- a join nest needs all of its tables
+
+(TODO: so what do we do when we've marked a table as constant? We'll need to
+update the "field=expr(....)" elements that use fields of that table. And the
+problem is that we won't know how much to decrement from the counters of those
+elements.
+
+Solution#1: switch to table_map() based approach.
+Solution#2: introduce separate elements for each involved field.
+ field will depend on its table,
+ "field=expr" will depend on fields.
+)
+
+Besides the above, let each element have a pointer to another element, so that
+we can have a linked list of elements.
+
+After the above structures have been created, we start the main algorithm.
+
+The first step is to create a list of functionally-dependent elements. We walk
+across array of dependencies and mark those elements that are already bound
+(i.e. their dependencies are satisfied). At the moment those immediately-bound
+are only "field=expr" dependencies that don't refer to any columns that are
+not bound.
+
+The second step is the loop
+
+ while (bound_list is not empty)
+ {
+ Take the first bound element F off the list.
+ Use the bitmap to find out what other elements depended on it
+ for each such element E
+ {
+ if (E becomes bound after F is bound)
+ add E to the list;
+ }
+ }
+
+The last step is to walk through elements that represent the join nests. Those
+that are bound can be eliminated.
+
+4. Removal operation properties
+===============================
+* There is always one way to remove (no choice to remove either this or that)
+* It is always better to remove as much tables as possible (at least within
+ our cost model).
+Thus, no need for any cost calculations/etc. It's an unconditional rewrite.
+
+
+5. Removal operation
+====================
+(This depends a lot on whether we make table elimination a one-off rewrite or
+conditional)
+
+At the moment table elimination is re-done for each join re-execution, hence
+the removal operation is designed not to modify any statement's permanent
+members.
+
+* Remove the outer join nest's nested join structure (i.e. get the
+ outer join's TABLE_LIST object $OJ and remove it from $OJ->embedding,
+ $OJ->embedding->nested_join. Update table_map's of all ancestor nested
+ joins). [MARK2]
+
+* Move the tables and their JOIN_TABs to the front of join order, like it is
+ done with const tables, with exception that if eliminated outer join nest
+ was within another outer join nest, that shouldn't prevent us from moving
+ away the eliminated tables.
+
+* Update join->table_count and all-join-tables bitmap.
+ ^ TODO: not true anymore ^
+
+* That's it. Nothing else?
+
+6. User interface
+=================
+
+6.1 @@optimizer_switch flag
+---------------------------
+Argument againist adding the flag:
+* It is always better to perform table elimination than not to do it.
+
+Arguments for the flag:
+* It is always theoretically possible that the new code will cause unintended
+ slowdowns.
+* Having the flag is useful for QA and comparative benchmarking.
+
+Decision so far: add the flag under #ifdef. Make the flag be present in debug
+builds.
+
+6.2 EXPLAIN [EXTENDED]
+----------------------
+There are two possible options:
+1. Show eliminated tables, like we do with const tables.
+2. Do not show eliminated tables.
+
+We chose option 2, because:
+- the table is not accessed at all (besides locking it)
+- it is more natural for anchor model user - when he's querying an anchor-
+ and attributes view, he doesn't care about the unused attributes.
+
+EXPLAIN EXTENDED+SHOW WARNINGS won't show the removed table either.
+
+NOTE: Before this WL, the warning text was generated after all JOIN objects
+have been destroyed. This didn't allow to use information about join plan
+when printing the warning. We've fixed this by keeping the JOIN objects until
+the warning text has been generated.
+
+Table elimination removes inner sides of outer join, and logically the ON
+clause is also removed. If this clause has any subqueries, they will be
+also removed from EXPLAIN output.
+
+An exception to the above is that if we eliminate a derived table, it will
+still be shown in EXPLAIN output. This comes from the fact that the FROM
+subqueries are evaluated before table elimination is invoked.
+TODO: Is the above ok or still remove parts of FROM subqueries?
+
+7. Miscellaneous adjustments
+============================
+
+7.1 Fix used_tables() of aggregate functions
+--------------------------------------------
+Aggregate functions used to report that they depend on all tables, that is,
+
+ item_agg_func->used_tables() == (1ULL << join->tables) - 1
+
+always. Fixed it, now aggregate function reports that it depends on the
+tables that its arguments depend on. In particular, COUNT(*) reports that it
+depends on no tables (item_count_star->used_tables()==0). One consequence of
+that is that "item->used_tables()==0" is not equivalent to
+"item->const_item()==true" anymore (not sure if it's "anymore" or this has
+been already so for some items).
+
+7.2 Make subquery predicates collect their outer references
+-----------------------------------------------------------
+Per-column functional dependency analysis requires us to take a
+
+ tbl.field = func(...)
+
+equality and tell which columns of which tables are referred from func(...)
+expression. For scalar expressions, this is accomplished by Item::walk()-based
+traversal. It should be reasonably cheap (the only practical Item that can be
+expensive to traverse seems to be a special case of "col IN (const1,const2,
+...)". check if we traverse the long list for such items).
+
+For correlated subqueries, traversal can be expensive, it is cheaper to make
+each subquery item have a list of its outer references. The list can be
+collected at fix_fields() stage with very little extra cost, and then it could
+be used for other optimizations.
+
+
+8. Other concerns
+=================
+
+8.1 Relationship with outer->inner joins converter
+--------------------------------------------------
+One could suspect that outer->inner join conversion could get in the way
+of table elimination by changing outer joins (which could be eliminated)
+to inner (which we will not try to eliminate).
+This concern is not valid: we make outer->inner conversions based on
+predicates in WHERE. If the WHERE referred to an inner table (this is a
+requirement for the conversion) then table elimination would not be
+applicable anyway.
+
+8.2 Relationship with prepared statements
+-----------------------------------------
+On one hand, it's natural to desire to make table elimination a
+once-per-statement operation, like outer->inner join conversion. We'll have
+to limit the applicability by removing [MARK1] as that can change during
+lifetime of the statement.
+
+The other option is to do table elimination every time. This will require to
+rework operation [MARK2] to be undoable.
+
+
+8.3 Relationship with constant table detection
+----------------------------------------------
+Table elimination is performed after constant table detection (but before
+the range analysis). Constant tables are technically different from
+eliminated ones (e.g. the former are shown in EXPLAIN and the latter aren't).
+Considering we've already done the join_read_const_table() call, is there any
+real difference between constant table and eliminated one? If there is, should
+we mark const tables also as eliminated?
+from user/EXPLAIN point of view: no. constant table is the one that we read
+one record from. eliminated table is the one that we don't acccess at all.
+TODO
+
+9. Tests and benchmarks
+=======================
+Create a benchmark in sql-bench which checks if the DBMS has table
+elimination.
+[According to Monty] Run
+ - query Q1 that would use elimination
+ - query Q2 that is very similar to Q1 (so that they would have same
+ QEP, execution cost, etc) but cannot use table elimination.
+then compare run times and make a conclusion about whether the used dbms
+supports table elimination.
-=-=(Guest - Thu, 23 Jul 2009, 20:07)=-=-
Dependency created: 29 now depends on 17
-=-=(Monty - Thu, 23 Jul 2009, 09:19)=-=-
Version updated.
--- /tmp/wklog.17.old.24090 2009-07-23 09:19:32.000000000 +0300
+++ /tmp/wklog.17.new.24090 2009-07-23 09:19:32.000000000 +0300
@@ -1 +1 @@
-Server-9.x
+Server-5.1
-=-=(Guest - Mon, 20 Jul 2009, 14:28)=-=-
deukje weg
Worked 1 hour and estimate 3 hours remain (original estimate increased by 4 hours).
-=-=(Guest - Fri, 17 Jul 2009, 02:44)=-=-
Version updated.
--- /tmp/wklog.17.old.24138 2009-07-17 02:44:49.000000000 +0300
+++ /tmp/wklog.17.new.24138 2009-07-17 02:44:49.000000000 +0300
@@ -1 +1 @@
-9.x
+Server-9.x
------------------------------------------------------------
-=-=(View All Progress Notes, 31 total)=-=-
http://askmonty.org/worklog/index.pl?tid=17&nolimit=1
DESCRIPTION:
Eliminate not needed tables from SELECT queries..
This will speed up some views and automatically generated queries.
Example:
CREATE TABLE B (id int primary key);
select
A.colA
from
tableA A
left outer join
tableB B
on
B.id = A.id;
In this case we can remove table B and the join from the query.
HIGH-LEVEL SPECIFICATION:
Here is an extended explanation of table elimination.
Table elimination is a feature found in some modern query optimizers, of
which Microsoft SQL Server 2005/2008 seems to have the most advanced
implementation. Oracle 11g has also been confirmed to use table
elimination but not to the same extent.
Basically, what table elimination does, is to remove tables from the
execution plan when it is unnecessary to include them. This can, of
course, only happen if the right circumstances arise. Let us for example
look at the following query:
select
A.colA
from
tableA A
left outer join
tableB B
on
B.id = A.id;
When using A as the left table we ensure that the query will return at
least as many rows as there are in that table. For rows where the join
condition (B.id = A.id) is not met the selected column (A.colA) will
still contain it's original value. The not seen B.* row would contain all NULL:s.
However, the result set could actually contain more rows than what is
found in tableA if there are duplicates of the column B.id in tableB. If
A contains a row [1, "val1"] and B the rows [1, "other1a"],[1, "other1b"]
then two rows will match in the join condition. The only way to know
what the result will look like is to actually touch both tables during
execution.
Instead, let's say that tableB contains rows that make it possible to
place a unique constraint on the column B.id, for example and often the
case a primary key. In this situation we know that we will get exactly
as many rows as there are in tableA, since joining with tableB cannot
introduce any duplicates. If further, as in the example query, we do not
select any columns from tableB, touching that table during execution is
unnecessary. We can remove the whole join operation from the execution
plan.
Both SQL Server 2005/2008 and Oracle 11g will deploy table elimination
in the case described above. Let us look at a more advanced query, where
Oracle fails.
select
A.colA
from
tableA A
left outer join
tableB B
on
B.id = A.id
and
B.fromDate = (
select
max(sub.fromDate)
from
tableB sub
where
sub.id = A.id
);
In this example we have added another join condition, which ensures
that we only pick the matching row from tableB having the latest
fromDate. In this case tableB will contain duplicates of the column
B.id, so in order to ensure uniqueness the primary key has to contain
the fromDate column as well. In other words the primary key of tableB
is (B.id, B.fromDate).
Furthermore, since the subselect ensures that we only pick the latest
B.fromDate for a given B.id we know that at most one row will match
the join condition. We will again have the situation where joining
with tableB cannot affect the number of rows in the result set. Since
we do not select any columns from tableB, the whole join operation can
be eliminated from the execution plan.
SQL Server 2005/2008 will deploy table elimination in this situation as
well. We have not found a way to make Oracle 11g use it for this type of
query. Queries like these arise in two situations. Either when you have
denormalized model consisting of a fact table with several related
dimension tables, or when you have a highly normalized model where each
attribute is stored in its own table. The example with the subselect is
common whenever you store historized/versioned data.
LOW-LEVEL DESIGN:
The code (currently in development) is at lp:
~maria-captains/maria/maria-5.1-table-elimination tree.
<contents>
1. Elimination criteria
2. No outside references check
2.1 Quick check if there are tables with no outside references
3. One-match check
3.1 Functional dependency source #1: Potential eq_ref access
3.2 Functional dependency source #2: col2=func(col1)
3.3 Functional dependency source #3: One or zero records in the table
3.4 Functional dependency check implementation
3.4.1 Equality collection: Option1
3.4.2 Equality collection: Option2
3.4.3 Functional dependency propagation - option 1
3.4.4 Functional dependency propagation - option 2
4. Removal operation properties
5. Removal operation
6. User interface
6.1 @@optimizer_switch flag
6.2 EXPLAIN [EXTENDED]
7. Miscellaneous adjustments
7.1 Fix used_tables() of aggregate functions
7.2 Make subquery predicates collect their outer references
8. Other concerns
8.1 Relationship with outer->inner joins converter
8.2 Relationship with prepared statements
8.3 Relationship with constant table detection
9. Tests and benchmarks
</contents>
It's not really about elimination of tables, it's about elimination of inner
sides of outer joins.
1. Elimination criteria
=======================
We can eliminate inner side of an outer join nest if:
1. There are no references to columns of the inner tables anywhere else in
the query.
2. For each record combination of outer tables, it will always produce
exactly one matching record combination.
Most of effort in this WL entry is checking these two conditions.
2. No outside references check
==============================
Criterion #1 means that the WHERE clause, ON clauses of embedding/subsequent
outer joins, ORDER BY, GROUP BY and HAVING must have no references to inner
tables of the outer join nest we're trying to remove.
For multi-table UPDATE/DELETE we also must not remove tables that we're
updating/deleting from or tables that are used in UPDATE's SET clause.
2.1 Quick check if there are tables with no outside references
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Before we start searching for outer join nests that could be eliminated,
we'll do a quick and cheap check if there possibly could be something that
could be eliminated:
if (there are outer joins &&
(tables used in select_list |
tables used in group/order by UNION |
tables used in where) != bitmap_of_all_join_tables)
{
attempt table elimination;
}
3. One-match check
==================
We can eliminate inner side of outer join if it will always generate exactly
one matching record combination.
By definition of OUTER JOIN, a NULL-complemented record combination will be
generated when the inner side of outer join has not produced any matches.
What remains to be checked is that there is no possiblity that inner side of
the outer join could produce more than one matching record combination.
We'll refer to one-match property as "functional dependency":
- A outer join nest is functionally dependent [wrt outer tables] if it will
produce one matching record combination per each record combination of
outer tables
- A table is functionally dependent wrt certain set of dependency tables, if
record combination of dependency tables uniquely identifies zero or one
matching record in the table
- Definitions of functional dependency of keys (=column tuples) and columns are
apparent.
Our goal is to prove that the entire join nest is functionally-dependent.
Join nest is functionally dependent (on the otside tables) if each of its
elements (those can be either base tables or join nests) is functionally
dependent.
Functional dependency is transitive: if table A is f-dependent on the outer
tables and table B is f.dependent on {A, outer_tables} then B is functionally
dependent on the outer tables.
Subsequent sections list cases when we can declare a table to be
functionally-dependent.
3.1 Functional dependency source #1: Potential eq_ref access
------------------------------------------------------------
This is the most practically-important case. Taking the example from the HLD
of this WL entry:
select
A.colA
from
tableA A
left outer join
tableB B
on
B.id = A.id;
and generalizing it: a table TBL is functionally-dependent if the ON
expression allows to construct a potential eq_ref access to table TBL that
uses only outer or functionally-dependent tables.
In other words: table TBL will have one match if the ON expression can be
converted into this form
TBL.unique_key=func(one_match_tables) AND .. remainder ...
(with appropriate extension for multi-part keys), where
one_match_tables= {
tables that are not on the inner side of the outer join in question, and
functionally dependent tables
}
Note that this will cover constant tables, except those that are constant because
they have 0/1 record or are partitioned and have no used partitions.
3.2 Functional dependency source #2: col2=func(col1)
----------------------------------------------------
This comes from the second example in the HLS:
create unique index idx on tableB (id, fromDate);
...
left outer join
tableB B
on
B.id = A.id
and
B.fromDate = (select max(sub.fromDate)
from tableB sub where sub.id = A.id);
Here it is apparent that tableB can be eliminated. It is not possible to
construct eq_ref access to tableB, though, because for the second part of the
primary key (fromDate column) we only got a condition in this form:
B.fromDate= func(tableB)
(we write "func(tableB)" because ref optimizer can only determine which tables
the right part of the equality depends on).
In general case, equality like this doesn't guarantee functional dependency.
For example, if func() == { return fromDate;}, i.e the ON expression is
... ON B.id = A.id and B.fromDate = B.fromDate
then that would allow table B to have multiple matches per record of table A.
In order to be able to distinguish between these two cases, we'll need to go
down to column level:
- A table is functionally dependent if it has a unique key that's functionally
dependent
- A unique key is functionally dependent when all of its columns are
functionally dependent
- A table column is functionally dependent if the ON clause allows to extract
an AND-part in this form:
tbl.column = f(functionally-dependent columns or columns of outer tables)
3.3 Functional dependency source #3: One or zero records in the table
---------------------------------------------------------------------
A table with one or zero records cannot generate more than one matching
record. This source is of lesser importance as one/zero-record tables are only
MyISAM tables.
3.4 Functional dependency check implementation
----------------------------------------------
As shown above, we need something similar to KEYUSE structures, but not
exactly that (we need things that current ref optimizer considers unusable and
don't need things that it considers usable).
3.4.1 Equality collection: Option1
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
We could
- extend KEYUSE structures to store all kinds of equalities we need
- change update_ref_and_keys() and co. to collect equalities both for ref
access and for table elimination
= [possibly] Improve [eq_]ref access to be able to use equalities in
form keypart2=func(keypart1)
- process the KEYUSE array both by table elimination and by ref access
optimizer.
+ This requires less effort.
- Code will have to be changed all over sql_select.cc
- update_ref_and_keys() and co. already do several unrelated things. Hooking
up table elimination will make it even worse.
3.4.2 Equality collection: Option2
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Alternatively, we could process the WHERE clause totally on our own.
+ Table elimination is standalone and easy to detach module.
- Some code duplication with update_ref_and_keys() and co.
Having got the equalities, we'll to propagate functional dependency property
to unique keys, tables and, ultimately, join nests.
3.4.3 Functional dependency propagation - option 1
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Borrow the approach used in constant table detection code:
do
{
converted= FALSE;
for each table T in join nest
{
if (check_if_functionally_dependent(T))
converted= TRUE;
}
} while (converted == TRUE);
check_if_functionally_dependent(T)
{
if (T has eq_ref access based on func_dep_tables)
return TRUE;
Apply the same do-while loop-based approach to available equalities
T.column1=func(other columns)
to spread the set of functionally-dependent columns. The goal is to get
all columns of a certain unique key to be bound.
}
3.4.4 Functional dependency propagation - option 2
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Analyze the ON expression(s) and build a list of
tbl.field = expr(...)
equalities. tbl here is a table that belongs to a join nest that could
potentially be eliminated.
besides those, add to the list
- An element for each unique key in the table that needs to be eliminated
- An element for each table that needs to be eliminated
- An element for each join nest that can be eliminated (i.e. has no
references from outside).
Then, setup "reverse dependencies": each element should have pointers to
elements that are functionally dependent on it:
- "tbl.field=expr(...)" equality is functionally dependent on all fields that
are used in "expr(...)" (here we take into account only fields that belong
to tables that can potentially be eliminated).
- a unique key is dependent on all of its components
- a table is dependent on all of its unique keys
- a join nest is dependent on all tables that it contains
These pointers are stored in form of one bitmap, such that:
"X depends on Y" == test( bitmap[(X's number)*n_objects + (Y's number)] )
Each object also stores a number of dependencies it needs to be satisfied
before it itself is satisfied:
- "tbl.field=expr(...)" needs all its underlying fields (if a field is
referenced many times it is counted only once)
- a unique key needs all of its key parts
- a table needs only one of its unique keys
- a join nest needs all of its tables
(TODO: so what do we do when we've marked a table as constant? We'll need to
update the "field=expr(....)" elements that use fields of that table. And the
problem is that we won't know how much to decrement from the counters of those
elements.
Solution#1: switch to table_map() based approach.
Solution#2: introduce separate elements for each involved field.
field will depend on its table,
"field=expr" will depend on fields.
)
Besides the above, let each element have a pointer to another element, so that
we can have a linked list of elements.
After the above structures have been created, we start the main algorithm.
The first step is to create a list of functionally-dependent elements. We walk
across array of dependencies and mark those elements that are already bound
(i.e. their dependencies are satisfied). At the moment those immediately-bound
are only "field=expr" dependencies that don't refer to any columns that are
not bound.
The second step is the loop
while (bound_list is not empty)
{
Take the first bound element F off the list.
Use the bitmap to find out what other elements depended on it
for each such element E
{
if (E becomes bound after F is bound)
add E to the list;
}
}
The last step is to walk through elements that represent the join nests. Those
that are bound can be eliminated.
4. Removal operation properties
===============================
* There is always one way to remove (no choice to remove either this or that)
* It is always better to remove as much tables as possible (at least within
our cost model).
Thus, no need for any cost calculations/etc. It's an unconditional rewrite.
5. Removal operation
====================
(This depends a lot on whether we make table elimination a one-off rewrite or
conditional)
At the moment table elimination is re-done for each join re-execution, hence
the removal operation is designed not to modify any statement's permanent
members.
* Remove the outer join nest's nested join structure (i.e. get the
outer join's TABLE_LIST object $OJ and remove it from $OJ->embedding,
$OJ->embedding->nested_join. Update table_map's of all ancestor nested
joins). [MARK2]
* Move the tables and their JOIN_TABs to the front of join order, like it is
done with const tables, with exception that if eliminated outer join nest
was within another outer join nest, that shouldn't prevent us from moving
away the eliminated tables.
* Update join->table_count and all-join-tables bitmap.
^ TODO: not true anymore ^
* That's it. Nothing else?
6. User interface
=================
6.1 @@optimizer_switch flag
---------------------------
Argument againist adding the flag:
* It is always better to perform table elimination than not to do it.
Arguments for the flag:
* It is always theoretically possible that the new code will cause unintended
slowdowns.
* Having the flag is useful for QA and comparative benchmarking.
Decision so far: add the flag under #ifdef. Make the flag be present in debug
builds.
6.2 EXPLAIN [EXTENDED]
----------------------
There are two possible options:
1. Show eliminated tables, like we do with const tables.
2. Do not show eliminated tables.
We chose option 2, because:
- the table is not accessed at all (besides locking it)
- it is more natural for anchor model user - when he's querying an anchor-
and attributes view, he doesn't care about the unused attributes.
EXPLAIN EXTENDED+SHOW WARNINGS won't show the removed table either.
NOTE: Before this WL, the warning text was generated after all JOIN objects
have been destroyed. This didn't allow to use information about join plan
when printing the warning. We've fixed this by keeping the JOIN objects until
the warning text has been generated.
Table elimination removes inner sides of outer join, and logically the ON
clause is also removed. If this clause has any subqueries, they will be
also removed from EXPLAIN output.
An exception to the above is that if we eliminate a derived table, it will
still be shown in EXPLAIN output. This comes from the fact that the FROM
subqueries are evaluated before table elimination is invoked.
TODO: Is the above ok or still remove parts of FROM subqueries?
7. Miscellaneous adjustments
============================
7.1 Fix used_tables() of aggregate functions
--------------------------------------------
Aggregate functions used to report that they depend on all tables, that is,
item_agg_func->used_tables() == (1ULL << join->tables) - 1
always. Fixed it, now aggregate function reports that it depends on the
tables that its arguments depend on. In particular, COUNT(*) reports that it
depends on no tables (item_count_star->used_tables()==0). One consequence of
that is that "item->used_tables()==0" is not equivalent to
"item->const_item()==true" anymore (not sure if it's "anymore" or this has
been already so for some items).
7.2 Make subquery predicates collect their outer references
-----------------------------------------------------------
Per-column functional dependency analysis requires us to take a
tbl.field = func(...)
equality and tell which columns of which tables are referred from func(...)
expression. For scalar expressions, this is accomplished by Item::walk()-based
traversal. It should be reasonably cheap (the only practical Item that can be
expensive to traverse seems to be a special case of "col IN (const1,const2,
...)". check if we traverse the long list for such items).
For correlated subqueries, traversal can be expensive, it is cheaper to make
each subquery item have a list of its outer references. The list can be
collected at fix_fields() stage with very little extra cost, and then it could
be used for other optimizations.
8. Other concerns
=================
8.1 Relationship with outer->inner joins converter
--------------------------------------------------
One could suspect that outer->inner join conversion could get in the way
of table elimination by changing outer joins (which could be eliminated)
to inner (which we will not try to eliminate).
This concern is not valid: we make outer->inner conversions based on
predicates in WHERE. If the WHERE referred to an inner table (this is a
requirement for the conversion) then table elimination would not be
applicable anyway.
8.2 Relationship with prepared statements
-----------------------------------------
On one hand, it's natural to desire to make table elimination a
once-per-statement operation, like outer->inner join conversion. We'll have
to limit the applicability by removing [MARK1] as that can change during
lifetime of the statement.
The other option is to do table elimination every time. This will require to
rework operation [MARK2] to be undoable.
8.3 Relationship with constant table detection
----------------------------------------------
Table elimination is performed after constant table detection (but before
the range analysis). Constant tables are technically different from
eliminated ones (e.g. the former are shown in EXPLAIN and the latter aren't).
Considering we've already done the join_read_const_table() call, is there any
real difference between constant table and eliminated one? If there is, should
we mark const tables also as eliminated?
from user/EXPLAIN point of view: no. constant table is the one that we read
one record from. eliminated table is the one that we don't acccess at all.
TODO
9. Tests and benchmarks
=======================
Create a benchmark in sql-bench which checks if the DBMS has table
elimination.
[According to Monty] Run
- query Q1 that would use elimination
- query Q2 that is very similar to Q1 (so that they would have same
QEP, execution cost, etc) but cannot use table elimination.
then compare run times and make a conclusion about whether the used dbms
supports table elimination.
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v3.5.9)
1
0
[Maria-developers] Updated (by Guest): Table elimination (17)
by worklog-noreply@askmonty.org 24 Mar '10
by worklog-noreply@askmonty.org 24 Mar '10
24 Mar '10
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: Table elimination
CREATION DATE..: Sun, 10 May 2009, 19:57
SUPERVISOR.....: Monty
IMPLEMENTOR....: Knielsen
COPIES TO......:
CATEGORY.......: Server-Sprint
TASK ID........: 17 (http://askmonty.org/worklog/?tid=17)
VERSION........: Server-9.x
STATUS.........: Assigned
PRIORITY.......: 60
WORKED HOURS...: 1
ESTIMATE.......: 3 (hours remain)
ORIG. ESTIMATE.: 3
PROGRESS NOTES:
-=-=(Guest - Wed, 24 Mar 2010, 05:58)=-=-
Status updated.
--- /tmp/wklog.17.old.13223 2010-03-24 05:58:26.000000000 +0000
+++ /tmp/wklog.17.new.13223 2010-03-24 05:58:26.000000000 +0000
@@ -1 +1 @@
-In-Progress
+Assigned
-=-=(Guest - Wed, 24 Mar 2010, 05:58)=-=-
Privacy level updated.
--- /tmp/wklog.17.old.13223 2010-03-24 05:58:26.000000000 +0000
+++ /tmp/wklog.17.new.13223 2010-03-24 05:58:26.000000000 +0000
@@ -1 +1 @@
-n
+y
-=-=(Guest - Wed, 28 Oct 2009, 02:01)=-=-
Version updated.
--- /tmp/wklog.17.old.24041 2009-10-28 02:01:58.000000000 +0200
+++ /tmp/wklog.17.new.24041 2009-10-28 02:01:58.000000000 +0200
@@ -1 +1 @@
-9.x
+Server-9.x
-=-=(Guest - Sun, 16 Aug 2009, 16:16)=-=-
Category updated.
--- /tmp/wklog.17.old.24882 2009-08-16 16:16:49.000000000 +0300
+++ /tmp/wklog.17.new.24882 2009-08-16 16:16:49.000000000 +0300
@@ -1 +1 @@
-Client-BackLog
+Server-Sprint
-=-=(Guest - Sun, 16 Aug 2009, 16:16)=-=-
Version updated.
--- /tmp/wklog.17.old.24882 2009-08-16 16:16:49.000000000 +0300
+++ /tmp/wklog.17.new.24882 2009-08-16 16:16:49.000000000 +0300
@@ -1 +1 @@
-Server-5.1
+9.x
-=-=(Guest - Wed, 29 Jul 2009, 21:41)=-=-
Low Level Design modified.
--- /tmp/wklog.17.old.26011 2009-07-29 21:41:04.000000000 +0300
+++ /tmp/wklog.17.new.26011 2009-07-29 21:41:04.000000000 +0300
@@ -2,163 +2,146 @@
~maria-captains/maria/maria-5.1-table-elimination tree.
<contents>
-1. Conditions for removal
-1.1 Quick check if there are candidates
-2. Removal operation properties
-3. Removal operation
-4. User interface
-5. Tests and benchmarks
-6. Todo, issues to resolve
-6.1 To resolve
-6.2 Resolved
-7. Additional issues
+1. Elimination criteria
+2. No outside references check
+2.1 Quick check if there are tables with no outside references
+3. One-match check
+3.1 Functional dependency source #1: Potential eq_ref access
+3.2 Functional dependency source #2: col2=func(col1)
+3.3 Functional dependency source #3: One or zero records in the table
+3.4 Functional dependency check implementation
+3.4.1 Equality collection: Option1
+3.4.2 Equality collection: Option2
+3.4.3 Functional dependency propagation - option 1
+3.4.4 Functional dependency propagation - option 2
+4. Removal operation properties
+5. Removal operation
+6. User interface
+6.1 @@optimizer_switch flag
+6.2 EXPLAIN [EXTENDED]
+7. Miscellaneous adjustments
+7.1 Fix used_tables() of aggregate functions
+7.2 Make subquery predicates collect their outer references
+8. Other concerns
+8.1 Relationship with outer->inner joins converter
+8.2 Relationship with prepared statements
+8.3 Relationship with constant table detection
+9. Tests and benchmarks
</contents>
It's not really about elimination of tables, it's about elimination of inner
sides of outer joins.
-1. Conditions for removal
--------------------------
-We can eliminate an inner side of outer join if:
-1. For each record combination of outer tables, it will always produce
- exactly one record.
-2. There are no references to columns of the inner tables anywhere else in
+1. Elimination criteria
+=======================
+We can eliminate inner side of an outer join nest if:
+
+1. There are no references to columns of the inner tables anywhere else in
the query.
+2. For each record combination of outer tables, it will always produce
+ exactly one matching record combination.
+
+Most of effort in this WL entry is checking these two conditions.
-#1 means that every table inside the outer join nest is:
- - is a constant table:
- = because it can be accessed via eq_ref(const) access, or
- = it is a zero-rows or one-row MyISAM-like table [MARK1]
- - has an eq_ref access method candidate.
-
-#2 means that WHERE clause, ON clauses of embedding outer joins, ORDER BY,
- GROUP BY and HAVING do not refer to the inner tables of the outer join
- nest.
-
-1.1 Quick check if there are candidates
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-Before we start to enumerate join nests, here is a quick way to check if
-there *can be* something to be removed:
+2. No outside references check
+==============================
+Criterion #1 means that the WHERE clause, ON clauses of embedding/subsequent
+outer joins, ORDER BY, GROUP BY and HAVING must have no references to inner
+tables of the outer join nest we're trying to remove.
+
+For multi-table UPDATE/DELETE we also must not remove tables that we're
+updating/deleting from or tables that are used in UPDATE's SET clause.
+
+2.1 Quick check if there are tables with no outside references
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+Before we start searching for outer join nests that could be eliminated,
+we'll do a quick and cheap check if there possibly could be something that
+could be eliminated:
- if ((tables used in select_list |
+ if (there are outer joins &&
+ (tables used in select_list |
tables used in group/order by UNION |
- tables used in where) != bitmap_of_all_tables)
+ tables used in where) != bitmap_of_all_join_tables)
{
attempt table elimination;
}
-2. Removal operation properties
--------------------------------
-* There is always one way to remove (no choice to remove either this or that)
-* It is always better to remove as much tables as possible (at least within
- our cost model).
-Thus, no need for any cost calculations/etc. It's an unconditional rewrite.
-3. Removal operation
---------------------
-* Remove the outer join nest's nested join structure (i.e. get the
- outer join's TABLE_LIST object $OJ and remove it from $OJ->embedding,
- $OJ->embedding->nested_join. Update table_map's of all ancestor nested
- joins). [MARK2]
+3. One-match check
+==================
+We can eliminate inner side of outer join if it will always generate exactly
+one matching record combination.
-* Move the tables and their JOIN_TABs to front like it is done with const
- tables, with exception that if eliminated outer join nest was within
- another outer join nest, that shouldn't prevent us from moving away the
- eliminated tables.
+By definition of OUTER JOIN, a NULL-complemented record combination will be
+generated when the inner side of outer join has not produced any matches.
-* Update join->table_count and all-join-tables bitmap.
+What remains to be checked is that there is no possiblity that inner side of
+the outer join could produce more than one matching record combination.
-* That's it. Nothing else?
+We'll refer to one-match property as "functional dependency":
-4. User interface
------------------
-* We'll add an @@optimizer switch flag for table elimination. Tentative
- name: 'table_elimination'.
- (Note ^^ utility of the above questioned ^, as table elimination can never
- be worse than no elimination. We're leaning towards not adding the flag)
-
-* EXPLAIN will not show the removed tables at all. This will allow to check
- if tables were removed, and also will behave nicely with anchor model and
- VIEWs: stuff that user doesn't care about just won't be there.
+- A outer join nest is functionally dependent [wrt outer tables] if it will
+ produce one matching record combination per each record combination of
+ outer tables
-5. Tests and benchmarks
------------------------
-Create a benchmark in sql-bench which checks if the DBMS has table
-elimination.
-[According to Monty] Run
- - queries that would use elimination
- - queries that are very similar to one above (so that they would have same
- QEP, execution cost, etc) but cannot use table elimination.
-then compare run times and make a conclusion about whether dbms supports table
-elimination.
+- A table is functionally dependent wrt certain set of dependency tables, if
+ record combination of dependency tables uniquely identifies zero or one
+ matching record in the table
-6. Todo, issues to resolve
---------------------------
+- Definitions of functional dependency of keys (=column tuples) and columns are
+ apparent.
-6.1 To resolve
-~~~~~~~~~~~~~~
-- Relationship with prepared statements.
- On one hand, it's natural to desire to make table elimination a
- once-per-statement operation, like outer->inner join conversion. We'll have
- to limit the applicability by removing [MARK1] as that can change during
- lifetime of the statement.
-
- The other option is to do table elimination every time. This will require to
- rework operation [MARK2] to be undoable.
-
- I'm leaning towards doing the former. With anchor modeling, it is unlikely
- that we'll meet outer joins which have N inner tables of which some are 1-row
- MyISAM tables that do not have primary key.
-
-6.2 Resolved
-~~~~~~~~~~~~
-* outer->inner join conversion is not a problem for table elimination.
- We make outer->inner conversions based on predicates in WHERE. If the WHERE
- referred to an inner table (requirement for OJ->IJ conversion) then table
- elimination would not be applicable anyway.
-
-* For Multi-table UPDATEs/DELETEs, need to also analyze the SET clause:
- - affected tables must not be eliminated
- - tables that are used on the right side of the SET x=y assignments must
- not be eliminated either.
+Our goal is to prove that the entire join nest is functionally-dependent.
-* Aggregate functions used to report that they depend on all tables, that is,
+Join nest is functionally dependent (on the otside tables) if each of its
+elements (those can be either base tables or join nests) is functionally
+dependent.
- item_agg_func->used_tables() == (1ULL << join->tables) - 1
+Functional dependency is transitive: if table A is f-dependent on the outer
+tables and table B is f.dependent on {A, outer_tables} then B is functionally
+dependent on the outer tables.
+
+Subsequent sections list cases when we can declare a table to be
+functionally-dependent.
+
+3.1 Functional dependency source #1: Potential eq_ref access
+------------------------------------------------------------
+This is the most practically-important case. Taking the example from the HLD
+of this WL entry:
+
+ select
+ A.colA
+ from
+ tableA A
+ left outer join
+ tableB B
+ on
+ B.id = A.id;
- always. Fixed it, now aggregate function reports it depends on
- tables that its arguments depend on. In particular, COUNT(*) reports
- that it depends on no tables (item_count_star->used_tables()==0).
- One consequence of that is that "item->used_tables()==0" is not
- equivalent to "item->const_item()==true" anymore (not sure if it's
- "anymore" or this has been already happening).
-
-* EXPLAIN EXTENDED warning text was generated after the JOIN object has
- been discarded. This didn't allow to use information about join plan
- when printing the warning. Fixed this by keeping the JOIN objects until
- we've printed the warning (have also an intent to remove the const
- tables from the join output).
-
-7. Additional issues
---------------------
-* We remove ON clauses within outer join nests. If these clauses contain
- subqueries, they probably should be gone from EXPLAIN output also?
- Yes. Current approach: when removing an outer join nest, walk the ON clause
- and mark subselects as eliminated. Then let EXPLAIN code check if the
- SELECT was eliminated before the printing (EXPLAIN is generated by doing
- a recursive descent, so the check will also cause children of eliminated
- selects not to be printed)
-
-* Table elimination is performed after constant table detection (but before
- the range analysis). Constant tables are technically different from
- eliminated ones (e.g. the former are shown in EXPLAIN and the latter aren't).
- Considering we've already done the join_read_const_table() call, is there any
- real difference between constant table and eliminated one? If there is, should
- we mark const tables also as eliminated?
- from user/EXPLAIN point of view: no. constant table is the one that we read
- one record from. eliminated table is the one that we don't acccess at all.
+and generalizing it: a table TBL is functionally-dependent if the ON
+expression allows to construct a potential eq_ref access to table TBL that
+uses only outer or functionally-dependent tables.
+
+In other words: table TBL will have one match if the ON expression can be
+converted into this form
+
+ TBL.unique_key=func(one_match_tables) AND .. remainder ...
+
+(with appropriate extension for multi-part keys), where
+
+ one_match_tables= {
+ tables that are not on the inner side of the outer join in question, and
+ functionally dependent tables
+ }
+
+Note that this will cover constant tables, except those that are constant because
+they have 0/1 record or are partitioned and have no used partitions.
+
+
+3.2 Functional dependency source #2: col2=func(col1)
+----------------------------------------------------
+This comes from the second example in the HLS:
-* What is described above will not be able to eliminate this outer join
create unique index idx on tableB (id, fromDate);
...
left outer join
@@ -169,32 +152,331 @@
B.fromDate = (select max(sub.fromDate)
from tableB sub where sub.id = A.id);
- This is because condition "B.fromDate= func(tableB)" cannot be used.
- Reason#1: update_ref_and_keys() does not consider such conditions to
- be of any use (and indeed they are not usable for ref access)
- so they are not put into KEYUSE array.
- Reason#2: even if they were put there, we would need to be able to tell
- between predicates like
- B.fromDate= func(B.id) // guarantees only one matching row as
- // B.id is already bound by B.id=A.id
- // hence B.fromDate becomes bound too.
- and
- "B.fromDate= func(B.*)" // Can potentially have many matching
- // records.
- We need to
- - Have update_ref_and_keys() create KEYUSE elements for such equalities
- - Have eliminate_tables() and friends make a more accurate check.
- The right check is to check whether all parts of a unique key are bound.
- If we have keypartX to be bound, then t.keypartY=func(keypartX) makes
- keypartY to be bound.
- The difficulty here is that correlated subquery predicate cannot tell what
- columns it depends on (it only remembers tables).
- Traversing the predicate is expensive and complicated.
- We're leaning towards making each subquery predicate have a List<Item> with
- items that
- - are in the current select
- - and it depends on.
- This list will be useful in certain other subquery optimizations as well,
- it is cheap to collect it in fix_fields() phase, so it will be collected
- for every subquery predicate.
+Here it is apparent that tableB can be eliminated. It is not possible to
+construct eq_ref access to tableB, though, because for the second part of the
+primary key (fromDate column) we only got a condition in this form:
+
+ B.fromDate= func(tableB)
+
+(we write "func(tableB)" because ref optimizer can only determine which tables
+the right part of the equality depends on).
+
+In general case, equality like this doesn't guarantee functional dependency.
+For example, if func() == { return fromDate;}, i.e the ON expression is
+
+ ... ON B.id = A.id and B.fromDate = B.fromDate
+
+then that would allow table B to have multiple matches per record of table A.
+
+In order to be able to distinguish between these two cases, we'll need to go
+down to column level:
+
+- A table is functionally dependent if it has a unique key that's functionally
+ dependent
+
+- A unique key is functionally dependent when all of its columns are
+ functionally dependent
+
+- A table column is functionally dependent if the ON clause allows to extract
+ an AND-part in this form:
+
+ tbl.column = f(functionally-dependent columns or columns of outer tables)
+
+3.3 Functional dependency source #3: One or zero records in the table
+---------------------------------------------------------------------
+A table with one or zero records cannot generate more than one matching
+record. This source is of lesser importance as one/zero-record tables are only
+MyISAM tables.
+
+3.4 Functional dependency check implementation
+----------------------------------------------
+As shown above, we need something similar to KEYUSE structures, but not
+exactly that (we need things that current ref optimizer considers unusable and
+don't need things that it considers usable).
+
+3.4.1 Equality collection: Option1
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+We could
+- extend KEYUSE structures to store all kinds of equalities we need
+- change update_ref_and_keys() and co. to collect equalities both for ref
+ access and for table elimination
+ = [possibly] Improve [eq_]ref access to be able to use equalities in
+ form keypart2=func(keypart1)
+- process the KEYUSE array both by table elimination and by ref access
+ optimizer.
+
++ This requires less effort.
+- Code will have to be changed all over sql_select.cc
+- update_ref_and_keys() and co. already do several unrelated things. Hooking
+ up table elimination will make it even worse.
+
+3.4.2 Equality collection: Option2
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+Alternatively, we could process the WHERE clause totally on our own.
++ Table elimination is standalone and easy to detach module.
+- Some code duplication with update_ref_and_keys() and co.
+
+Having got the equalities, we'll to propagate functional dependency property
+to unique keys, tables and, ultimately, join nests.
+
+3.4.3 Functional dependency propagation - option 1
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+Borrow the approach used in constant table detection code:
+
+ do
+ {
+ converted= FALSE;
+ for each table T in join nest
+ {
+ if (check_if_functionally_dependent(T))
+ converted= TRUE;
+ }
+ } while (converted == TRUE);
+
+ check_if_functionally_dependent(T)
+ {
+ if (T has eq_ref access based on func_dep_tables)
+ return TRUE;
+
+ Apply the same do-while loop-based approach to available equalities
+ T.column1=func(other columns)
+ to spread the set of functionally-dependent columns. The goal is to get
+ all columns of a certain unique key to be bound.
+ }
+
+
+3.4.4 Functional dependency propagation - option 2
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+Analyze the ON expression(s) and build a list of
+
+ tbl.field = expr(...)
+
+equalities. tbl here is a table that belongs to a join nest that could
+potentially be eliminated.
+
+besides those, add to the list
+ - An element for each unique key in the table that needs to be eliminated
+ - An element for each table that needs to be eliminated
+ - An element for each join nest that can be eliminated (i.e. has no
+ references from outside).
+
+Then, setup "reverse dependencies": each element should have pointers to
+elements that are functionally dependent on it:
+
+- "tbl.field=expr(...)" equality is functionally dependent on all fields that
+ are used in "expr(...)" (here we take into account only fields that belong
+ to tables that can potentially be eliminated).
+- a unique key is dependent on all of its components
+- a table is dependent on all of its unique keys
+- a join nest is dependent on all tables that it contains
+
+These pointers are stored in form of one bitmap, such that:
+
+ "X depends on Y" == test( bitmap[(X's number)*n_objects + (Y's number)] )
+
+Each object also stores a number of dependencies it needs to be satisfied
+before it itself is satisfied:
+
+- "tbl.field=expr(...)" needs all its underlying fields (if a field is
+ referenced many times it is counted only once)
+
+- a unique key needs all of its key parts
+
+- a table needs only one of its unique keys
+
+- a join nest needs all of its tables
+
+(TODO: so what do we do when we've marked a table as constant? We'll need to
+update the "field=expr(....)" elements that use fields of that table. And the
+problem is that we won't know how much to decrement from the counters of those
+elements.
+
+Solution#1: switch to table_map() based approach.
+Solution#2: introduce separate elements for each involved field.
+ field will depend on its table,
+ "field=expr" will depend on fields.
+)
+
+Besides the above, let each element have a pointer to another element, so that
+we can have a linked list of elements.
+
+After the above structures have been created, we start the main algorithm.
+
+The first step is to create a list of functionally-dependent elements. We walk
+across array of dependencies and mark those elements that are already bound
+(i.e. their dependencies are satisfied). At the moment those immediately-bound
+are only "field=expr" dependencies that don't refer to any columns that are
+not bound.
+
+The second step is the loop
+
+ while (bound_list is not empty)
+ {
+ Take the first bound element F off the list.
+ Use the bitmap to find out what other elements depended on it
+ for each such element E
+ {
+ if (E becomes bound after F is bound)
+ add E to the list;
+ }
+ }
+
+The last step is to walk through elements that represent the join nests. Those
+that are bound can be eliminated.
+
+4. Removal operation properties
+===============================
+* There is always one way to remove (no choice to remove either this or that)
+* It is always better to remove as much tables as possible (at least within
+ our cost model).
+Thus, no need for any cost calculations/etc. It's an unconditional rewrite.
+
+
+5. Removal operation
+====================
+(This depends a lot on whether we make table elimination a one-off rewrite or
+conditional)
+
+At the moment table elimination is re-done for each join re-execution, hence
+the removal operation is designed not to modify any statement's permanent
+members.
+
+* Remove the outer join nest's nested join structure (i.e. get the
+ outer join's TABLE_LIST object $OJ and remove it from $OJ->embedding,
+ $OJ->embedding->nested_join. Update table_map's of all ancestor nested
+ joins). [MARK2]
+
+* Move the tables and their JOIN_TABs to the front of join order, like it is
+ done with const tables, with exception that if eliminated outer join nest
+ was within another outer join nest, that shouldn't prevent us from moving
+ away the eliminated tables.
+
+* Update join->table_count and all-join-tables bitmap.
+ ^ TODO: not true anymore ^
+
+* That's it. Nothing else?
+
+6. User interface
+=================
+
+6.1 @@optimizer_switch flag
+---------------------------
+Argument againist adding the flag:
+* It is always better to perform table elimination than not to do it.
+
+Arguments for the flag:
+* It is always theoretically possible that the new code will cause unintended
+ slowdowns.
+* Having the flag is useful for QA and comparative benchmarking.
+
+Decision so far: add the flag under #ifdef. Make the flag be present in debug
+builds.
+
+6.2 EXPLAIN [EXTENDED]
+----------------------
+There are two possible options:
+1. Show eliminated tables, like we do with const tables.
+2. Do not show eliminated tables.
+
+We chose option 2, because:
+- the table is not accessed at all (besides locking it)
+- it is more natural for anchor model user - when he's querying an anchor-
+ and attributes view, he doesn't care about the unused attributes.
+
+EXPLAIN EXTENDED+SHOW WARNINGS won't show the removed table either.
+
+NOTE: Before this WL, the warning text was generated after all JOIN objects
+have been destroyed. This didn't allow to use information about join plan
+when printing the warning. We've fixed this by keeping the JOIN objects until
+the warning text has been generated.
+
+Table elimination removes inner sides of outer join, and logically the ON
+clause is also removed. If this clause has any subqueries, they will be
+also removed from EXPLAIN output.
+
+An exception to the above is that if we eliminate a derived table, it will
+still be shown in EXPLAIN output. This comes from the fact that the FROM
+subqueries are evaluated before table elimination is invoked.
+TODO: Is the above ok or still remove parts of FROM subqueries?
+
+7. Miscellaneous adjustments
+============================
+
+7.1 Fix used_tables() of aggregate functions
+--------------------------------------------
+Aggregate functions used to report that they depend on all tables, that is,
+
+ item_agg_func->used_tables() == (1ULL << join->tables) - 1
+
+always. Fixed it, now aggregate function reports that it depends on the
+tables that its arguments depend on. In particular, COUNT(*) reports that it
+depends on no tables (item_count_star->used_tables()==0). One consequence of
+that is that "item->used_tables()==0" is not equivalent to
+"item->const_item()==true" anymore (not sure if it's "anymore" or this has
+been already so for some items).
+
+7.2 Make subquery predicates collect their outer references
+-----------------------------------------------------------
+Per-column functional dependency analysis requires us to take a
+
+ tbl.field = func(...)
+
+equality and tell which columns of which tables are referred from func(...)
+expression. For scalar expressions, this is accomplished by Item::walk()-based
+traversal. It should be reasonably cheap (the only practical Item that can be
+expensive to traverse seems to be a special case of "col IN (const1,const2,
+...)". check if we traverse the long list for such items).
+
+For correlated subqueries, traversal can be expensive, it is cheaper to make
+each subquery item have a list of its outer references. The list can be
+collected at fix_fields() stage with very little extra cost, and then it could
+be used for other optimizations.
+
+
+8. Other concerns
+=================
+
+8.1 Relationship with outer->inner joins converter
+--------------------------------------------------
+One could suspect that outer->inner join conversion could get in the way
+of table elimination by changing outer joins (which could be eliminated)
+to inner (which we will not try to eliminate).
+This concern is not valid: we make outer->inner conversions based on
+predicates in WHERE. If the WHERE referred to an inner table (this is a
+requirement for the conversion) then table elimination would not be
+applicable anyway.
+
+8.2 Relationship with prepared statements
+-----------------------------------------
+On one hand, it's natural to desire to make table elimination a
+once-per-statement operation, like outer->inner join conversion. We'll have
+to limit the applicability by removing [MARK1] as that can change during
+lifetime of the statement.
+
+The other option is to do table elimination every time. This will require to
+rework operation [MARK2] to be undoable.
+
+
+8.3 Relationship with constant table detection
+----------------------------------------------
+Table elimination is performed after constant table detection (but before
+the range analysis). Constant tables are technically different from
+eliminated ones (e.g. the former are shown in EXPLAIN and the latter aren't).
+Considering we've already done the join_read_const_table() call, is there any
+real difference between constant table and eliminated one? If there is, should
+we mark const tables also as eliminated?
+from user/EXPLAIN point of view: no. constant table is the one that we read
+one record from. eliminated table is the one that we don't acccess at all.
+TODO
+
+9. Tests and benchmarks
+=======================
+Create a benchmark in sql-bench which checks if the DBMS has table
+elimination.
+[According to Monty] Run
+ - query Q1 that would use elimination
+ - query Q2 that is very similar to Q1 (so that they would have same
+ QEP, execution cost, etc) but cannot use table elimination.
+then compare run times and make a conclusion about whether the used dbms
+supports table elimination.
-=-=(Guest - Thu, 23 Jul 2009, 20:07)=-=-
Dependency created: 29 now depends on 17
-=-=(Monty - Thu, 23 Jul 2009, 09:19)=-=-
Version updated.
--- /tmp/wklog.17.old.24090 2009-07-23 09:19:32.000000000 +0300
+++ /tmp/wklog.17.new.24090 2009-07-23 09:19:32.000000000 +0300
@@ -1 +1 @@
-Server-9.x
+Server-5.1
-=-=(Guest - Mon, 20 Jul 2009, 14:28)=-=-
deukje weg
Worked 1 hour and estimate 3 hours remain (original estimate increased by 4 hours).
-=-=(Guest - Fri, 17 Jul 2009, 02:44)=-=-
Version updated.
--- /tmp/wklog.17.old.24138 2009-07-17 02:44:49.000000000 +0300
+++ /tmp/wklog.17.new.24138 2009-07-17 02:44:49.000000000 +0300
@@ -1 +1 @@
-9.x
+Server-9.x
------------------------------------------------------------
-=-=(View All Progress Notes, 31 total)=-=-
http://askmonty.org/worklog/index.pl?tid=17&nolimit=1
DESCRIPTION:
Eliminate not needed tables from SELECT queries..
This will speed up some views and automatically generated queries.
Example:
CREATE TABLE B (id int primary key);
select
A.colA
from
tableA A
left outer join
tableB B
on
B.id = A.id;
In this case we can remove table B and the join from the query.
HIGH-LEVEL SPECIFICATION:
Here is an extended explanation of table elimination.
Table elimination is a feature found in some modern query optimizers, of
which Microsoft SQL Server 2005/2008 seems to have the most advanced
implementation. Oracle 11g has also been confirmed to use table
elimination but not to the same extent.
Basically, what table elimination does, is to remove tables from the
execution plan when it is unnecessary to include them. This can, of
course, only happen if the right circumstances arise. Let us for example
look at the following query:
select
A.colA
from
tableA A
left outer join
tableB B
on
B.id = A.id;
When using A as the left table we ensure that the query will return at
least as many rows as there are in that table. For rows where the join
condition (B.id = A.id) is not met the selected column (A.colA) will
still contain it's original value. The not seen B.* row would contain all NULL:s.
However, the result set could actually contain more rows than what is
found in tableA if there are duplicates of the column B.id in tableB. If
A contains a row [1, "val1"] and B the rows [1, "other1a"],[1, "other1b"]
then two rows will match in the join condition. The only way to know
what the result will look like is to actually touch both tables during
execution.
Instead, let's say that tableB contains rows that make it possible to
place a unique constraint on the column B.id, for example and often the
case a primary key. In this situation we know that we will get exactly
as many rows as there are in tableA, since joining with tableB cannot
introduce any duplicates. If further, as in the example query, we do not
select any columns from tableB, touching that table during execution is
unnecessary. We can remove the whole join operation from the execution
plan.
Both SQL Server 2005/2008 and Oracle 11g will deploy table elimination
in the case described above. Let us look at a more advanced query, where
Oracle fails.
select
A.colA
from
tableA A
left outer join
tableB B
on
B.id = A.id
and
B.fromDate = (
select
max(sub.fromDate)
from
tableB sub
where
sub.id = A.id
);
In this example we have added another join condition, which ensures
that we only pick the matching row from tableB having the latest
fromDate. In this case tableB will contain duplicates of the column
B.id, so in order to ensure uniqueness the primary key has to contain
the fromDate column as well. In other words the primary key of tableB
is (B.id, B.fromDate).
Furthermore, since the subselect ensures that we only pick the latest
B.fromDate for a given B.id we know that at most one row will match
the join condition. We will again have the situation where joining
with tableB cannot affect the number of rows in the result set. Since
we do not select any columns from tableB, the whole join operation can
be eliminated from the execution plan.
SQL Server 2005/2008 will deploy table elimination in this situation as
well. We have not found a way to make Oracle 11g use it for this type of
query. Queries like these arise in two situations. Either when you have
denormalized model consisting of a fact table with several related
dimension tables, or when you have a highly normalized model where each
attribute is stored in its own table. The example with the subselect is
common whenever you store historized/versioned data.
LOW-LEVEL DESIGN:
The code (currently in development) is at lp:
~maria-captains/maria/maria-5.1-table-elimination tree.
<contents>
1. Elimination criteria
2. No outside references check
2.1 Quick check if there are tables with no outside references
3. One-match check
3.1 Functional dependency source #1: Potential eq_ref access
3.2 Functional dependency source #2: col2=func(col1)
3.3 Functional dependency source #3: One or zero records in the table
3.4 Functional dependency check implementation
3.4.1 Equality collection: Option1
3.4.2 Equality collection: Option2
3.4.3 Functional dependency propagation - option 1
3.4.4 Functional dependency propagation - option 2
4. Removal operation properties
5. Removal operation
6. User interface
6.1 @@optimizer_switch flag
6.2 EXPLAIN [EXTENDED]
7. Miscellaneous adjustments
7.1 Fix used_tables() of aggregate functions
7.2 Make subquery predicates collect their outer references
8. Other concerns
8.1 Relationship with outer->inner joins converter
8.2 Relationship with prepared statements
8.3 Relationship with constant table detection
9. Tests and benchmarks
</contents>
It's not really about elimination of tables, it's about elimination of inner
sides of outer joins.
1. Elimination criteria
=======================
We can eliminate inner side of an outer join nest if:
1. There are no references to columns of the inner tables anywhere else in
the query.
2. For each record combination of outer tables, it will always produce
exactly one matching record combination.
Most of effort in this WL entry is checking these two conditions.
2. No outside references check
==============================
Criterion #1 means that the WHERE clause, ON clauses of embedding/subsequent
outer joins, ORDER BY, GROUP BY and HAVING must have no references to inner
tables of the outer join nest we're trying to remove.
For multi-table UPDATE/DELETE we also must not remove tables that we're
updating/deleting from or tables that are used in UPDATE's SET clause.
2.1 Quick check if there are tables with no outside references
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Before we start searching for outer join nests that could be eliminated,
we'll do a quick and cheap check if there possibly could be something that
could be eliminated:
if (there are outer joins &&
(tables used in select_list |
tables used in group/order by UNION |
tables used in where) != bitmap_of_all_join_tables)
{
attempt table elimination;
}
3. One-match check
==================
We can eliminate inner side of outer join if it will always generate exactly
one matching record combination.
By definition of OUTER JOIN, a NULL-complemented record combination will be
generated when the inner side of outer join has not produced any matches.
What remains to be checked is that there is no possiblity that inner side of
the outer join could produce more than one matching record combination.
We'll refer to one-match property as "functional dependency":
- A outer join nest is functionally dependent [wrt outer tables] if it will
produce one matching record combination per each record combination of
outer tables
- A table is functionally dependent wrt certain set of dependency tables, if
record combination of dependency tables uniquely identifies zero or one
matching record in the table
- Definitions of functional dependency of keys (=column tuples) and columns are
apparent.
Our goal is to prove that the entire join nest is functionally-dependent.
Join nest is functionally dependent (on the otside tables) if each of its
elements (those can be either base tables or join nests) is functionally
dependent.
Functional dependency is transitive: if table A is f-dependent on the outer
tables and table B is f.dependent on {A, outer_tables} then B is functionally
dependent on the outer tables.
Subsequent sections list cases when we can declare a table to be
functionally-dependent.
3.1 Functional dependency source #1: Potential eq_ref access
------------------------------------------------------------
This is the most practically-important case. Taking the example from the HLD
of this WL entry:
select
A.colA
from
tableA A
left outer join
tableB B
on
B.id = A.id;
and generalizing it: a table TBL is functionally-dependent if the ON
expression allows to construct a potential eq_ref access to table TBL that
uses only outer or functionally-dependent tables.
In other words: table TBL will have one match if the ON expression can be
converted into this form
TBL.unique_key=func(one_match_tables) AND .. remainder ...
(with appropriate extension for multi-part keys), where
one_match_tables= {
tables that are not on the inner side of the outer join in question, and
functionally dependent tables
}
Note that this will cover constant tables, except those that are constant because
they have 0/1 record or are partitioned and have no used partitions.
3.2 Functional dependency source #2: col2=func(col1)
----------------------------------------------------
This comes from the second example in the HLS:
create unique index idx on tableB (id, fromDate);
...
left outer join
tableB B
on
B.id = A.id
and
B.fromDate = (select max(sub.fromDate)
from tableB sub where sub.id = A.id);
Here it is apparent that tableB can be eliminated. It is not possible to
construct eq_ref access to tableB, though, because for the second part of the
primary key (fromDate column) we only got a condition in this form:
B.fromDate= func(tableB)
(we write "func(tableB)" because ref optimizer can only determine which tables
the right part of the equality depends on).
In general case, equality like this doesn't guarantee functional dependency.
For example, if func() == { return fromDate;}, i.e the ON expression is
... ON B.id = A.id and B.fromDate = B.fromDate
then that would allow table B to have multiple matches per record of table A.
In order to be able to distinguish between these two cases, we'll need to go
down to column level:
- A table is functionally dependent if it has a unique key that's functionally
dependent
- A unique key is functionally dependent when all of its columns are
functionally dependent
- A table column is functionally dependent if the ON clause allows to extract
an AND-part in this form:
tbl.column = f(functionally-dependent columns or columns of outer tables)
3.3 Functional dependency source #3: One or zero records in the table
---------------------------------------------------------------------
A table with one or zero records cannot generate more than one matching
record. This source is of lesser importance as one/zero-record tables are only
MyISAM tables.
3.4 Functional dependency check implementation
----------------------------------------------
As shown above, we need something similar to KEYUSE structures, but not
exactly that (we need things that current ref optimizer considers unusable and
don't need things that it considers usable).
3.4.1 Equality collection: Option1
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
We could
- extend KEYUSE structures to store all kinds of equalities we need
- change update_ref_and_keys() and co. to collect equalities both for ref
access and for table elimination
= [possibly] Improve [eq_]ref access to be able to use equalities in
form keypart2=func(keypart1)
- process the KEYUSE array both by table elimination and by ref access
optimizer.
+ This requires less effort.
- Code will have to be changed all over sql_select.cc
- update_ref_and_keys() and co. already do several unrelated things. Hooking
up table elimination will make it even worse.
3.4.2 Equality collection: Option2
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Alternatively, we could process the WHERE clause totally on our own.
+ Table elimination is standalone and easy to detach module.
- Some code duplication with update_ref_and_keys() and co.
Having got the equalities, we'll to propagate functional dependency property
to unique keys, tables and, ultimately, join nests.
3.4.3 Functional dependency propagation - option 1
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Borrow the approach used in constant table detection code:
do
{
converted= FALSE;
for each table T in join nest
{
if (check_if_functionally_dependent(T))
converted= TRUE;
}
} while (converted == TRUE);
check_if_functionally_dependent(T)
{
if (T has eq_ref access based on func_dep_tables)
return TRUE;
Apply the same do-while loop-based approach to available equalities
T.column1=func(other columns)
to spread the set of functionally-dependent columns. The goal is to get
all columns of a certain unique key to be bound.
}
3.4.4 Functional dependency propagation - option 2
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Analyze the ON expression(s) and build a list of
tbl.field = expr(...)
equalities. tbl here is a table that belongs to a join nest that could
potentially be eliminated.
besides those, add to the list
- An element for each unique key in the table that needs to be eliminated
- An element for each table that needs to be eliminated
- An element for each join nest that can be eliminated (i.e. has no
references from outside).
Then, setup "reverse dependencies": each element should have pointers to
elements that are functionally dependent on it:
- "tbl.field=expr(...)" equality is functionally dependent on all fields that
are used in "expr(...)" (here we take into account only fields that belong
to tables that can potentially be eliminated).
- a unique key is dependent on all of its components
- a table is dependent on all of its unique keys
- a join nest is dependent on all tables that it contains
These pointers are stored in form of one bitmap, such that:
"X depends on Y" == test( bitmap[(X's number)*n_objects + (Y's number)] )
Each object also stores a number of dependencies it needs to be satisfied
before it itself is satisfied:
- "tbl.field=expr(...)" needs all its underlying fields (if a field is
referenced many times it is counted only once)
- a unique key needs all of its key parts
- a table needs only one of its unique keys
- a join nest needs all of its tables
(TODO: so what do we do when we've marked a table as constant? We'll need to
update the "field=expr(....)" elements that use fields of that table. And the
problem is that we won't know how much to decrement from the counters of those
elements.
Solution#1: switch to table_map() based approach.
Solution#2: introduce separate elements for each involved field.
field will depend on its table,
"field=expr" will depend on fields.
)
Besides the above, let each element have a pointer to another element, so that
we can have a linked list of elements.
After the above structures have been created, we start the main algorithm.
The first step is to create a list of functionally-dependent elements. We walk
across array of dependencies and mark those elements that are already bound
(i.e. their dependencies are satisfied). At the moment those immediately-bound
are only "field=expr" dependencies that don't refer to any columns that are
not bound.
The second step is the loop
while (bound_list is not empty)
{
Take the first bound element F off the list.
Use the bitmap to find out what other elements depended on it
for each such element E
{
if (E becomes bound after F is bound)
add E to the list;
}
}
The last step is to walk through elements that represent the join nests. Those
that are bound can be eliminated.
4. Removal operation properties
===============================
* There is always one way to remove (no choice to remove either this or that)
* It is always better to remove as much tables as possible (at least within
our cost model).
Thus, no need for any cost calculations/etc. It's an unconditional rewrite.
5. Removal operation
====================
(This depends a lot on whether we make table elimination a one-off rewrite or
conditional)
At the moment table elimination is re-done for each join re-execution, hence
the removal operation is designed not to modify any statement's permanent
members.
* Remove the outer join nest's nested join structure (i.e. get the
outer join's TABLE_LIST object $OJ and remove it from $OJ->embedding,
$OJ->embedding->nested_join. Update table_map's of all ancestor nested
joins). [MARK2]
* Move the tables and their JOIN_TABs to the front of join order, like it is
done with const tables, with exception that if eliminated outer join nest
was within another outer join nest, that shouldn't prevent us from moving
away the eliminated tables.
* Update join->table_count and all-join-tables bitmap.
^ TODO: not true anymore ^
* That's it. Nothing else?
6. User interface
=================
6.1 @@optimizer_switch flag
---------------------------
Argument againist adding the flag:
* It is always better to perform table elimination than not to do it.
Arguments for the flag:
* It is always theoretically possible that the new code will cause unintended
slowdowns.
* Having the flag is useful for QA and comparative benchmarking.
Decision so far: add the flag under #ifdef. Make the flag be present in debug
builds.
6.2 EXPLAIN [EXTENDED]
----------------------
There are two possible options:
1. Show eliminated tables, like we do with const tables.
2. Do not show eliminated tables.
We chose option 2, because:
- the table is not accessed at all (besides locking it)
- it is more natural for anchor model user - when he's querying an anchor-
and attributes view, he doesn't care about the unused attributes.
EXPLAIN EXTENDED+SHOW WARNINGS won't show the removed table either.
NOTE: Before this WL, the warning text was generated after all JOIN objects
have been destroyed. This didn't allow to use information about join plan
when printing the warning. We've fixed this by keeping the JOIN objects until
the warning text has been generated.
Table elimination removes inner sides of outer join, and logically the ON
clause is also removed. If this clause has any subqueries, they will be
also removed from EXPLAIN output.
An exception to the above is that if we eliminate a derived table, it will
still be shown in EXPLAIN output. This comes from the fact that the FROM
subqueries are evaluated before table elimination is invoked.
TODO: Is the above ok or still remove parts of FROM subqueries?
7. Miscellaneous adjustments
============================
7.1 Fix used_tables() of aggregate functions
--------------------------------------------
Aggregate functions used to report that they depend on all tables, that is,
item_agg_func->used_tables() == (1ULL << join->tables) - 1
always. Fixed it, now aggregate function reports that it depends on the
tables that its arguments depend on. In particular, COUNT(*) reports that it
depends on no tables (item_count_star->used_tables()==0). One consequence of
that is that "item->used_tables()==0" is not equivalent to
"item->const_item()==true" anymore (not sure if it's "anymore" or this has
been already so for some items).
7.2 Make subquery predicates collect their outer references
-----------------------------------------------------------
Per-column functional dependency analysis requires us to take a
tbl.field = func(...)
equality and tell which columns of which tables are referred from func(...)
expression. For scalar expressions, this is accomplished by Item::walk()-based
traversal. It should be reasonably cheap (the only practical Item that can be
expensive to traverse seems to be a special case of "col IN (const1,const2,
...)". check if we traverse the long list for such items).
For correlated subqueries, traversal can be expensive, it is cheaper to make
each subquery item have a list of its outer references. The list can be
collected at fix_fields() stage with very little extra cost, and then it could
be used for other optimizations.
8. Other concerns
=================
8.1 Relationship with outer->inner joins converter
--------------------------------------------------
One could suspect that outer->inner join conversion could get in the way
of table elimination by changing outer joins (which could be eliminated)
to inner (which we will not try to eliminate).
This concern is not valid: we make outer->inner conversions based on
predicates in WHERE. If the WHERE referred to an inner table (this is a
requirement for the conversion) then table elimination would not be
applicable anyway.
8.2 Relationship with prepared statements
-----------------------------------------
On one hand, it's natural to desire to make table elimination a
once-per-statement operation, like outer->inner join conversion. We'll have
to limit the applicability by removing [MARK1] as that can change during
lifetime of the statement.
The other option is to do table elimination every time. This will require to
rework operation [MARK2] to be undoable.
8.3 Relationship with constant table detection
----------------------------------------------
Table elimination is performed after constant table detection (but before
the range analysis). Constant tables are technically different from
eliminated ones (e.g. the former are shown in EXPLAIN and the latter aren't).
Considering we've already done the join_read_const_table() call, is there any
real difference between constant table and eliminated one? If there is, should
we mark const tables also as eliminated?
from user/EXPLAIN point of view: no. constant table is the one that we read
one record from. eliminated table is the one that we don't acccess at all.
TODO
9. Tests and benchmarks
=======================
Create a benchmark in sql-bench which checks if the DBMS has table
elimination.
[According to Monty] Run
- query Q1 that would use elimination
- query Q2 that is very similar to Q1 (so that they would have same
QEP, execution cost, etc) but cannot use table elimination.
then compare run times and make a conclusion about whether the used dbms
supports table elimination.
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v3.5.9)
1
0