----------------------------------------------------------------------- WORKLOG TASK -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=- TASK...........: Subqueries: Inside-out execution for non-semijoin materialized subqueries that are AND-parts of the WHERE CREATION DATE..: Sun, 28 Feb 2010, 13:45 SUPERVISOR.....: Monty IMPLEMENTOR....: Psergey COPIES TO......: Igor, Psergey, Timour CATEGORY.......: Server-RawIdeaBin TASK ID........: 90 (http://askmonty.org/worklog/?tid=90) VERSION........: Server-5.3 STATUS.........: Assigned PRIORITY.......: 60 WORKED HOURS...: 0 ESTIMATE.......: -1 (hours remain) ORIG. ESTIMATE.: 0 PROGRESS NOTES: -=-=(Psergey - Wed, 24 Mar 2010, 14:42)=-=- Low Level Design modified. --- /tmp/wklog.90.old.19182 2010-03-24 14:42:54.000000000 +0000 +++ /tmp/wklog.90.new.19182 2010-03-24 14:42:54.000000000 +0000 @@ -1 +1,140 @@ +<contents> +1. Applicability check +2. Representation +2.1 Option #1: Convert to TABLE_LIST +2.2 On subquery predicate removal +2.3 Option #2: No TABLE_LIST, convert to JOIN_TAB right away +2.3 What is expected of the result of conversion +3. Pre-optimization steps +3.1 Constant detection +3.3 update_ref_and_keys +3.4 JOIN_TAB sorting criteria +4. Optimization +5. Execution +User interface. +</contents> + +We'll call the new execution strategy "jtbm-materialization", for the lack of +better name. + +1. Applicability check +====================== +The criteria for checking whether a subquery can be processed with +jtbm-materialization can be checked at JOIN::prepare stage (like it +happens with semi-join check) + +2. Representation +================= + +2.1 Option #1: Convert to TABLE_LIST +------------------------------------ +Make it work like semi-join nests: each jtbm-predicate is converted into a +TABLE_LIST object. This will make it + + - uniform with semi-joins (we've stepped on all rakes there) + - allow to process JTBM-subqueries in ON expressions + +simplify_joins() will handle jtbm TABLE_LISTs as some kinds of opaque base +tables. + +for EXPLAIN EXTENDED, it would be natural to print something semi-join like, +i.e. for + + SELECT ... FROM ot WHERE oe IN (SELECT ie FROM materialized-non-sj-select) + +we'll print + + SELECT ... FROM ot SJ (SELECT ie FROM materialized-non-sj-select) ON oe=XX + +the XX part is not clear. we don't want to print 'ie' the second time here? + +2.2 On subquery predicate removal +--------------------------------- +Q: if we remove the subquery predicate permanently, who will run +fix_fields() for it? For semi-joins we don't have the problem as we +inject into ON expression (right? or not? we have sj_on_expr, too... +(Investigation: we the the same Item* pointer both in WHERE and +as sj_on_expr. fix_fields() is called for the WHERE part and that's +how sj_on_expr gets fixed. This works as long as +Item_func_eq::fix_fields() does not try to substitute itself with +another item). +A: ? + +2.3 Option #2: No TABLE_LIST, convert to JOIN_TAB right away +------------------------------------------------------------ +JOIN_TABs only live for the duration of one PS re-execution, so we'll have to +- make conversion fully undoable +- perform it sufficiently late in the optimization process, at the point + where JOIN_TABs are already allocated. + +Note that if we don't position JTBM predicates in join's TABLE_LIST tree, then +it will be impossible to handle JTBM queries inside/outside of outer joins. + +2.3 What is expected of the result of conversion +------------------------------------------------ +Join [pre]optimization relies on each optimized entity to have a bit in +table_map. + +TODO: where do we check if there will be enough bits for everyone? (at the + point where we assign them?) + +The bit stored in join_tab->table->map, and the apparent problem is that JTBM +join_tabs do not naturally have TABLE* object. + +We could use the the one that will be used for Materialization, but that will +stop working when we will have to include IN->EXISTS in the choice. + +Current approach: don't create a table. create a table_map element in JOIN_TAB +instead. Evgen has probably done something like that already. + +3. Pre-optimization steps +========================= +JOIN_TABs are allocated in make_join_statistics(). This where the changes will +be needed: for JOIN_TABs that correspond to JTBM-tables: + +- don't set tab->table, set tab->jtbm_select (or whatever) +- run subquery's optimizer to get its output cardinality + +3.1 Constant detection +---------------------- +What about subqueries that "are constant"? + const_item IN (SELECT uncorrelated) -> is constant, but not something + we would want to evaluate. + something IN (SELECT from_constant_join) -> is constant + +Do we need to mark their JOIN_TABs as constant? + +3.3 update_ref_and_keys +----------------------- +* Walk through JTBM elements and inject KEYUSE elements for their + IN-equalities. + +TODO: KEYUSE elements imply presense of KEYs! Which we don't have! + +3.4 JOIN_TAB sorting criteria +----------------------------- +Q: Where do we put JTBM's join_tab when pre-sorting records? +A: it should sort as regular table. + +TODO: where do we remove the predicates from the WHERE? + - remove them like SJ-converter does + - remove them with optimizer (like remove_eq_conds does) + +4. Optimization +=============== +Add a branch in best_access_path to account for +- JTBM-Materialization +- JTBM-Materialization-Scan. + +5. Execution +============ +* We should be able to reuse item_subselect.cc code for lookups +* But will have to use our own temptable scan code + +TODO: is it possible to have any unification with SJ-Materialization? + +User interface +-------------- +Any @@optimizer_switch flags for all this? + -=-=(Igor - Wed, 10 Mar 2010, 22:02)=-=- High Level Description modified. --- /tmp/wklog.90.old.2007 2010-03-10 22:02:23.000000000 +0000 +++ /tmp/wklog.90.new.2007 2010-03-10 22:02:23.000000000 +0000 @@ -13,8 +13,8 @@ for each record R2 in big_table such that oe=R1 pass R2 to output -Semi-join materialization supports such strategy with SJM-Scan strategy. This WL -entry is about adding support for such strategies for non-semijoin subqueries. +Semi-join materialization supports the inside-out strategy. This WL entry is +about adding support for such strategies for non-semijoin subqueries. Once WL#89 is done, there will be a cost-based choice between -=-=(Igor - Wed, 10 Mar 2010, 21:52)=-=- Status updated. --- /tmp/wklog.90.old.882 2010-03-10 21:52:02.000000000 +0000 +++ /tmp/wklog.90.new.882 2010-03-10 21:52:02.000000000 +0000 @@ -1 +1 @@ -Un-Assigned +Assigned -=-=(Psergey - Sun, 28 Feb 2010, 15:37)=-=- High Level Description modified. --- /tmp/wklog.90.old.23524 2010-02-28 15:37:47.000000000 +0000 +++ /tmp/wklog.90.new.23524 2010-02-28 15:37:47.000000000 +0000 @@ -15,3 +15,7 @@ Semi-join materialization supports such strategy with SJM-Scan strategy. This WL entry is about adding support for such strategies for non-semijoin subqueries. + + +Once WL#89 is done, there will be a cost-based choice between +Materialization+lookup, Materialization+scan, and IN->EXISTS+lookup strategies. -=-=(Psergey - Sun, 28 Feb 2010, 15:22)=-=- High-Level Specification modified. --- /tmp/wklog.90.old.23033 2010-02-28 15:22:09.000000000 +0000 +++ /tmp/wklog.90.new.23033 2010-02-28 15:22:09.000000000 +0000 @@ -1 +1,33 @@ +Basic idea on how this could be achieved: + +Pre-optimization phase +---------------------- + +The rewrite +~~~~~~~~~~~ +If we find a subquery predicate that is +- not processed by current semi-join optimizations +- is an AND-part of the WHERE/ON clause +- can be executed with Materialization + +then +- Remove the predicate from WHERE/ON clause +- Add a special JOIN_TAB object instead. + +Plan options +~~~~~~~~~~~~ +- Use the IN-equality to create KEYUSE elements. + +Optimization +------------ +- Pre-optimize the subquery so we know materialization cost +- Whenever best_access_path() encounters the "special JOIN_TAB" it should + consider two strategies: + A. Materialization and making lookups in the materialized table (if applicable) + B. Materialization and then scanning the materialized table. + + +EXPLAIN +------- +TODO how this will look in EXPLAIN output? -=-=(Psergey - Sun, 28 Feb 2010, 14:56)=-=- Dependency created: 91 now depends on 90 -=-=(Psergey - Sun, 28 Feb 2010, 14:54)=-=- Dependency deleted: 94 no longer depends on 90 -=-=(Psergey - Sun, 28 Feb 2010, 14:47)=-=- Title modified. --- /tmp/wklog.90.old.21903 2010-02-28 14:47:54.000000000 +0000 +++ /tmp/wklog.90.new.21903 2010-02-28 14:47:54.000000000 +0000 @@ -1 +1 @@ - Subqueries: Inside-out execution for non-semijoin materialized subqueries that are AND-parts of the WHERE +Subqueries: Inside-out execution for non-semijoin materialized subqueries that are AND-parts of the WHERE -=-=(Psergey - Sun, 28 Feb 2010, 14:47)=-=- High Level Description modified. --- /tmp/wklog.90.old.21880 2010-02-28 14:47:28.000000000 +0000 +++ /tmp/wklog.90.new.21880 2010-02-28 14:47:28.000000000 +0000 @@ -1,10 +1,17 @@ -For uncorrelated IN subqueries that can't be converted to semi-joins it is -necessary to make a cost-based choice between IN->EXISTS and Materialization -strategies. +Consider the following case: -Both strategies handle two cases: -1. A simple case w/o NULLs handling -2. Handling NULLs. +SELECT * FROM big_table +WHERE oe IN (SELECT ie FROM table_with_few_groups + WHERE ... + GROUP BY group_col) AND ... -This WL is about making cost-based decision for #1. +Here the best way to execute the query is: + Materialize the subquery; + # now run the join: + for each record R1 in materialized table + for each record R2 in big_table such that oe=R1 + pass R2 to output + +Semi-join materialization supports such strategy with SJM-Scan strategy. This WL +entry is about adding support for such strategies for non-semijoin subqueries. -=-=(Psergey - Sun, 28 Feb 2010, 14:47)=-=- Title modified. --- /tmp/wklog.90.old.21859 2010-02-28 14:47:02.000000000 +0000 +++ /tmp/wklog.90.new.21859 2010-02-28 14:47:02.000000000 +0000 @@ -1 +1 @@ -Subqueries: cost-based choice between Materialization and IN->EXISTS transformation + Subqueries: Inside-out execution for non-semijoin materialized subqueries that are AND-parts of the WHERE ------------------------------------------------------------ -=-=(View All Progress Notes, 11 total)=-=- http://askmonty.org/worklog/index.pl?tid=90&nolimit=1 DESCRIPTION: Consider the following case: SELECT * FROM big_table WHERE oe IN (SELECT ie FROM table_with_few_groups WHERE ... GROUP BY group_col) AND ... Here the best way to execute the query is: Materialize the subquery; # now run the join: for each record R1 in materialized table for each record R2 in big_table such that oe=R1 pass R2 to output Semi-join materialization supports the inside-out strategy. This WL entry is about adding support for such strategies for non-semijoin subqueries. Once WL#89 is done, there will be a cost-based choice between Materialization+lookup, Materialization+scan, and IN->EXISTS+lookup strategies. HIGH-LEVEL SPECIFICATION: Basic idea on how this could be achieved: Pre-optimization phase ---------------------- The rewrite ~~~~~~~~~~~ If we find a subquery predicate that is - not processed by current semi-join optimizations - is an AND-part of the WHERE/ON clause - can be executed with Materialization then - Remove the predicate from WHERE/ON clause - Add a special JOIN_TAB object instead. Plan options ~~~~~~~~~~~~ - Use the IN-equality to create KEYUSE elements. Optimization ------------ - Pre-optimize the subquery so we know materialization cost - Whenever best_access_path() encounters the "special JOIN_TAB" it should consider two strategies: A. Materialization and making lookups in the materialized table (if applicable) B. Materialization and then scanning the materialized table. EXPLAIN ------- TODO how this will look in EXPLAIN output? LOW-LEVEL DESIGN: <contents> 1. Applicability check 2. Representation 2.1 Option #1: Convert to TABLE_LIST 2.2 On subquery predicate removal 2.3 Option #2: No TABLE_LIST, convert to JOIN_TAB right away 2.3 What is expected of the result of conversion 3. Pre-optimization steps 3.1 Constant detection 3.3 update_ref_and_keys 3.4 JOIN_TAB sorting criteria 4. Optimization 5. Execution User interface. </contents> We'll call the new execution strategy "jtbm-materialization", for the lack of better name. 1. Applicability check ====================== The criteria for checking whether a subquery can be processed with jtbm-materialization can be checked at JOIN::prepare stage (like it happens with semi-join check) 2. Representation ================= 2.1 Option #1: Convert to TABLE_LIST ------------------------------------ Make it work like semi-join nests: each jtbm-predicate is converted into a TABLE_LIST object. This will make it - uniform with semi-joins (we've stepped on all rakes there) - allow to process JTBM-subqueries in ON expressions simplify_joins() will handle jtbm TABLE_LISTs as some kinds of opaque base tables. for EXPLAIN EXTENDED, it would be natural to print something semi-join like, i.e. for SELECT ... FROM ot WHERE oe IN (SELECT ie FROM materialized-non-sj-select) we'll print SELECT ... FROM ot SJ (SELECT ie FROM materialized-non-sj-select) ON oe=XX the XX part is not clear. we don't want to print 'ie' the second time here? 2.2 On subquery predicate removal --------------------------------- Q: if we remove the subquery predicate permanently, who will run fix_fields() for it? For semi-joins we don't have the problem as we inject into ON expression (right? or not? we have sj_on_expr, too... (Investigation: we the the same Item* pointer both in WHERE and as sj_on_expr. fix_fields() is called for the WHERE part and that's how sj_on_expr gets fixed. This works as long as Item_func_eq::fix_fields() does not try to substitute itself with another item). A: ? 2.3 Option #2: No TABLE_LIST, convert to JOIN_TAB right away ------------------------------------------------------------ JOIN_TABs only live for the duration of one PS re-execution, so we'll have to - make conversion fully undoable - perform it sufficiently late in the optimization process, at the point where JOIN_TABs are already allocated. Note that if we don't position JTBM predicates in join's TABLE_LIST tree, then it will be impossible to handle JTBM queries inside/outside of outer joins. 2.3 What is expected of the result of conversion ------------------------------------------------ Join [pre]optimization relies on each optimized entity to have a bit in table_map. TODO: where do we check if there will be enough bits for everyone? (at the point where we assign them?) The bit stored in join_tab->table->map, and the apparent problem is that JTBM join_tabs do not naturally have TABLE* object. We could use the the one that will be used for Materialization, but that will stop working when we will have to include IN->EXISTS in the choice. Current approach: don't create a table. create a table_map element in JOIN_TAB instead. Evgen has probably done something like that already. 3. Pre-optimization steps ========================= JOIN_TABs are allocated in make_join_statistics(). This where the changes will be needed: for JOIN_TABs that correspond to JTBM-tables: - don't set tab->table, set tab->jtbm_select (or whatever) - run subquery's optimizer to get its output cardinality 3.1 Constant detection ---------------------- What about subqueries that "are constant"? const_item IN (SELECT uncorrelated) -> is constant, but not something we would want to evaluate. something IN (SELECT from_constant_join) -> is constant Do we need to mark their JOIN_TABs as constant? 3.3 update_ref_and_keys ----------------------- * Walk through JTBM elements and inject KEYUSE elements for their IN-equalities. TODO: KEYUSE elements imply presense of KEYs! Which we don't have! 3.4 JOIN_TAB sorting criteria ----------------------------- Q: Where do we put JTBM's join_tab when pre-sorting records? A: it should sort as regular table. TODO: where do we remove the predicates from the WHERE? - remove them like SJ-converter does - remove them with optimizer (like remove_eq_conds does) 4. Optimization =============== Add a branch in best_access_path to account for - JTBM-Materialization - JTBM-Materialization-Scan. 5. Execution ============ * We should be able to reuse item_subselect.cc code for lookups * But will have to use our own temptable scan code TODO: is it possible to have any unification with SJ-Materialization? User interface -------------- Any @@optimizer_switch flags for all this? ESTIMATED WORK TIME ESTIMATED COMPLETION DATE ----------------------------------------------------------------------- WorkLog (v3.5.9)