[Maria-developers] Review input for: MDEV-22534 Decorrelate IN subquery
Hi Yuchen, Please find below review input for
commit ba9cf8efeb0b1e1c132d565adb79c98156783f15 Author: Yuchen Pei <yuchen.pei@mariadb.com> Date: Fri May 19 20:06:31 2023 +1000
MDEV-22534 Decorrelate IN subquery
First, general questions: Q1: It seems there is some overlap between Item_in_subselect::exists2in_processor() and Item_exists_subselect::exists2in_processor() Did you consider factoring out common code rather than copying it? (if yes, I'd like to know why it couldn't be done). Q2: Where is the coe that removes the correlated equalities from the subquery's WHERE? (I assume Item_exists_subselect code does it?)
diff --git a/sql/item_subselect.cc b/sql/item_subselect.cc index 9e6c205ca76..8d893794da3 100644 --- a/sql/item_subselect.cc +++ b/sql/item_subselect.cc @@ -3099,6 +3099,132 @@ static bool find_inner_outer_equalities(Item **conds, return TRUE; }
+/* Checks whether item tree intersects with the free list */ +static bool intersects_free_list(Item *item, THD *thd) +{ + for (const Item *to_find= thd->free_list; to_find; to_find= to_find->next) + if (item->walk(&Item::find_item_processor, 1, (void *) to_find)) + return true; + return false; +} +
Please try moving this function into a separate file. First name that came to my mind: opt_subq_decorrelate.cc Feel free to suggest better names. The idea is that we should aim for better structure. (This request is only valid if the factoring out suggested above doesn't work out).
+/* De-correlate where conditions in an IN subquery */
This needs a bigger comment describing what the rewrite does.
+bool Item_in_subselect::exists2in_processor(void *opt_arg) +{ + THD *thd= (THD *)opt_arg; + SELECT_LEX *first_select= unit->first_select(); + JOIN *join= first_select->join; + bool will_be_correlated; + Dynamic_array<EQ_FIELD_OUTER> eqs(PSI_INSTRUMENT_MEM, 5, 5); + List<Item> outer; + Query_arena *arena= NULL, backup; + int res= FALSE; + DBUG_ENTER("Item_in_subselect::exists2in_processor"); +
Please move comments out of the if statement and number the conditions. For examples, see best_access_path(), large if-statement starting with "Don't test table scan if it can't be better." or "EXISTS-to-IN coversion and ORDER BY ... LIMIT clause" in Item_exists_subselect::exists2in_processor().
+ if (!optimizer_flag(thd, OPTIMIZER_SWITCH_EXISTS_TO_IN) || + /* proceed only if I'm a toplevel IN or a toplevel NOT IN */ + !(is_top_level_item() || + (upper_not && upper_not->is_top_level_item())) ||
what is the reason behind this condition? If we're able to handle top-level NOT IN, why can't we handle subquery in any context? Can we really handle the NOT IN? (I suspect not, it will suffer from the semantics of IN subqueries with NULLs.. I can't find the name for this, it's described e.g. here: https://petrunia.net/2006/11/16/inany-subqueries-null-woes/ )
+ first_select->is_part_of_union() || /* skip if part of a union */ + first_select->group_list.elements || /* skip if with group by */ + first_select->with_sum_func || /* skip if aggregation */ + join->having || /* skip if with having */ + !join->conds || /* skip if no conds */ + /* skip if left expr is a single nonscalar subselect */ + (left_expr->type() == Item::SUBSELECT_ITEM && + !left_expr->type_handler()->is_scalar_type()))
The above seems to be some exotic SQL. Is it something like (select col1,lol2 from t1) IN (select col3,col4 from t2 where ..) and does MariaDB really support this? Please add a comment about this.
+ DBUG_RETURN(FALSE); + /* iterate over conditions, and check whether they can be moved out. */ + if (find_inner_outer_equalities(&join->conds, eqs)) + DBUG_RETURN(FALSE); + /* If we are in the execution of a prepared statement, check for + intersection with the free list to avoid segfault. Note that the + check for prepared statement execution is necessary, otherwise it + will likely always find intersection thus abort the + transformation. + + fixme: this only works when HAVE_PSI_STATEMENT_INTERFACE is + defined. It causes CI's like amd64-debian-10-debug-embedded to + fail. Are there other ways to find out we are in the execution of a + prepared statement? */
I'm looking at the code of activate_stmt_arena_if_needed(). It seems "stmt_arena->is_conventional()" is the check you're looking for.
+ if (thd->m_statement_state.m_parent_prepared_stmt) + { + for (uint i= 0; i < (uint) eqs.elements(); i++) + { + if (intersects_free_list(*eqs.at(i).eq_ref, thd)) + DBUG_RETURN(FALSE);
This check doesn't look good. As far as I understand, it is a temporary measure until related re-execution bug(s) are fixed by Igor (and/or Sanja)?
+ } + }
+ /* Determines whether the result will be correlated */ + { + List<Item> unused; + Collect_deps_prm prm= {&unused, // parameters + unit->first_select()->nest_level_base, // nest_level_base + 0, // count + unit->first_select()->nest_level, // nest_level + FALSE // collect + }; + walk(&Item::collect_outer_ref_processor, TRUE, &prm); + DBUG_ASSERT(prm.count > 0); + DBUG_ASSERT(prm.count >= (uint)eqs.elements()); + will_be_correlated= prm.count > (uint)eqs.elements(); See the example pasted in the MDEV: https://jira.mariadb.org/browse/MDEV-22534?focusedCommentId=259705&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-259705
Somehow, I get here will_be_correlated==FALSE but Materialization is still not considered for the subquery?
+ } + /* Back up the free list */ + arena= thd->activate_stmt_arena_if_needed(&backup); + /* Construct the items for left_expr */ + if (left_expr->type() == Item::ROW_ITEM) + for (uint i= 0; i < left_expr->cols(); i++) + outer.push_back(left_expr->element_index(i)); + else + outer.push_back(left_expr); + const uint offset= first_select->item_list.elements; + /* Move items to outer and select item list */ + for (uint i= 0; i < (uint)eqs.elements(); i++) + { + Item **eq_ref= eqs.at(i).eq_ref; + Item_ident *local_field= eqs.at(i).local_field; + Item *outer_exp= eqs.at(i).outer_exp; + first_select->item_list.push_back(local_field, thd->mem_root); + first_select->ref_pointer_array[offset + i]= (Item *)local_field;
How do we know that ref_pointer_array has enough room for the new elements? (We can discuss this with Sanja, he's an expert on the topic). I see this check in Item_exists_subselect::exists2in_processor if ((uint)eqs.elements() > (first_select->item_list.elements + first_select->select_n_reserved)) Is it how Item_exists_subselect handles this issue?
+ outer.push_back(outer_exp); + *eq_ref= new (thd->mem_root) Item_int(thd, 1); + if((*eq_ref)->fix_fields(thd, (Item **)eq_ref)) + { + res= TRUE; + goto out; + } + } + /* Update the left_expr */ + left_expr= new (thd->mem_root) Item_row(thd, outer); + if (left_expr->fix_fields(thd, &left_expr)) + { + res= TRUE; + goto out; + } + left_expr_orig= left_expr; + is_correlated= will_be_correlated; + /* Update any Item_in_optimizer wrapper if exists */ + if (optimizer) + { + optimizer->reset_cache(); + if (optimizer->fix_left(thd)) + { + res= TRUE; + goto out; + } + } + { + OPT_TRACE_TRANSFORM(thd, trace_wrapper, trace_transform, + get_select_lex()->select_number, "IN (SELECT)", + "decorrelation"); + } +out: + /* Restores the free list */ + if (arena) + thd->restore_active_arena(arena, &backup); + DBUG_RETURN(res); +}
On Mon, Dec 20, 2021 at 09:29:47PM +0300, Sergey Petrunia wrote:
Hello Igor,
I'm trying examples with this patch:
create table t1 (a int not null, key(a)); insert into t1 select seq from seq_1_to_1000;
First, I debug this example:
explain select * from t1 where a <10 or a>=10;
The execution enters key_or(), passes through the new code this patch has added and key_or() returns NULL. Good.
But if I modify the example slightly:
explain select * from t1 where a <20 or a>=10 ;
Here, key_or() still returns a SEL_ARG object that represents an infinite range:
(gdb) fini Run till exit from #0 key_or (...) 0x0000555555f1fca2 in tree_or (...) Value returned is $24 = (SEL_ARG *) 0x7fff70064d88 (gdb) p dbug_print_sel_arg($24) $26 = 0x7fff70024390 "SEL_ARG(-inf<=a<=+inf)"
Should this case be fixed as well?
On Fri, Dec 17, 2021 at 02:11:40PM -0800, IgorBabaev wrote:
revision-id: 677643a80986107491b7886441f2828384f0494b (mariadb-10.2.31-1286-g677643a) parent(s): 8bb55633699612279744c055e22eeca8d4058273 author: Igor Babaev committer: Igor Babaev timestamp: 2021-12-17 14:11:39 -0800 message:
MDEV-27262 Unexpected index intersection with full index scan for an index
If when extracting a range condition foran index from the WHERE condition
I think the comment wording could be improved also.
Range Optimizer sees that the range condition covers the whole index then such condition should be discarded because cannot it be used in any range scan. In some cases Range Optimizer really does it, but there remained some conditions for which it was not done. As a result the optimizer could produce index merge plans with the full index scan for one of the indexes participating in the index merge. This could be observed in one of the test cases from index_merge1.inc where a plan with index_merge_sort_union was produced and in the test case reported for this bug where a plan with index_merge_sort_intersect was produced. In both cases one of two index scans participating in index merge ran over the whole index. The patch slightly changes the original above mentioned test case from index_merge1.inc to be able to produce an intended plan employing index_merge_sort_union. The original query was left to show that index merge is not used for it anymore. It should be noted that for the plan with index_merge_sort_intersect could be chosen for execution only due to a defect in the InnoDB code that returns wrong estimates for the cardinality of big ranges.
This bug led to serious problems in 10.4+ where the optimization using Rowid filters is employed (see mdev-26446).
Approved by Oleksandr Byelkin <sanja@mariadb.com>
--- mysql-test/include/index_merge1.inc | 8 +++- mysql-test/r/index_merge_myisam.result | 12 +++-- mysql-test/r/range_innodb.result | 81 ++++++++++++++++++++++++++++++++++ mysql-test/t/range_innodb.test | 78 ++++++++++++++++++++++++++++++++ sql/opt_range.cc | 7 +++ 5 files changed, 181 insertions(+), 5 deletions(-)
diff --git a/mysql-test/include/index_merge1.inc b/mysql-test/include/index_merge1.inc index b168a76..440f1f7 100644 --- a/mysql-test/include/index_merge1.inc +++ b/mysql-test/include/index_merge1.inc @@ -150,15 +150,19 @@ explain select * from t0 where (((key3 <7 and key7 < 6) or key5 < 2) and (key5 < 5 or key6 < 6));
explain select * from t0 where - ((key3 <5 or key5 < 4) and (key1 < 4 or key2 < 4)) + ((key3 < 4 or key5 < 4) and (key1 < 4 or key2 < 4)) or ((key3 >=5 or key5 < 2) and (key5 < 5 or key6 < 6));
explain select * from t0 force index(i1, i2, i3, i4, i5, i6 ) where - ((key3 <5 or key5 < 4) and (key1 < 4 or key2 < 4)) + ((key3 < 4 or key5 < 4) and (key1 < 4 or key2 < 4)) or ((key3 >=5 or key5 < 2) and (key5 < 5 or key6 < 6));
+explain select * from t0 force index(i1, i2, i3, i4, i5, i6 ) where + ((key3 < 5 or key5 < 4) and (key1 < 4 or key2 < 4)) + or + ((key3 >=5 or key5 < 2) and (key5 < 5 or key6 < 6)); # 8. Verify that "order by" after index merge uses filesort select * from t0 where key1 < 5 or key8 < 4 order by key1;
diff --git a/mysql-test/r/index_merge_myisam.result b/mysql-test/r/index_merge_myisam.result index 5a23092..a096c34 100644 --- a/mysql-test/r/index_merge_myisam.result +++ b/mysql-test/r/index_merge_myisam.result @@ -173,17 +173,23 @@ or id select_type table type possible_keys key key_len ref rows Extra 1 SIMPLE t0 index_merge i1,i2,i3,i5,i6,i7 i3,i5 4,4 NULL 11 Using sort_union(i3,i5); Using where explain select * from t0 where -((key3 <5 or key5 < 4) and (key1 < 4 or key2 < 4)) +((key3 < 4 or key5 < 4) and (key1 < 4 or key2 < 4)) or ((key3 >=5 or key5 < 2) and (key5 < 5 or key6 < 6)); id select_type table type possible_keys key key_len ref rows Extra 1 SIMPLE t0 ALL i1,i2,i3,i5,i6 NULL NULL NULL 1024 Using where explain select * from t0 force index(i1, i2, i3, i4, i5, i6 ) where -((key3 <5 or key5 < 4) and (key1 < 4 or key2 < 4)) +((key3 < 4 or key5 < 4) and (key1 < 4 or key2 < 4)) or ((key3 >=5 or key5 < 2) and (key5 < 5 or key6 < 6)); id select_type table type possible_keys key key_len ref rows Extra -1 SIMPLE t0 index_merge i1,i2,i3,i5,i6 i3,i5 0,4 NULL 1024 Using sort_union(i3,i5); Using where +1 SIMPLE t0 index_merge i1,i2,i3,i5,i6 i3,i5 4,4 NULL 1024 Using sort_union(i3,i5); Using where +explain select * from t0 force index(i1, i2, i3, i4, i5, i6 ) where +((key3 < 5 or key5 < 4) and (key1 < 4 or key2 < 4)) +or +((key3 >=5 or key5 < 2) and (key5 < 5 or key6 < 6)); +id select_type table type possible_keys key key_len ref rows Extra +1 SIMPLE t0 ALL i1,i2,i3,i5,i6 NULL NULL NULL 1024 Using where select * from t0 where key1 < 5 or key8 < 4 order by key1; key1 key2 key3 key4 key5 key6 key7 key8 1 1 1 1 1 1 1 1023 diff --git a/mysql-test/r/range_innodb.result b/mysql-test/r/range_innodb.result index f2349f2..ccb6da3 100644 --- a/mysql-test/r/range_innodb.result +++ b/mysql-test/r/range_innodb.result @@ -108,3 +108,84 @@ DROP TABLE t0,t1; SET @@GLOBAL.debug_dbug = @saved_dbug; set @@optimizer_switch= @optimizer_switch_save; # End of 10.1 tests +# +# MDEV-27262: Index intersection with full scan over an index +# +CREATE TABLE t1 ( +id int(10) unsigned NOT NULL AUTO_INCREMENT, +p char(32) DEFAULT NULL, +es tinyint(3) unsigned NOT NULL DEFAULT 0, +er tinyint(3) unsigned NOT NULL DEFAULT 0, +x mediumint(8) unsigned NOT NULL DEFAULT 0, +PRIMARY KEY (id), +INDEX es (es), +INDEX x (x), +INDEX er (er,x), +INDEX p (p) +) ENGINE=InnoDB DEFAULT CHARSET=latin1; +insert into t1(es,er) select 0, 1 from seq_1_to_45; +insert into t1(es,er) select 0, 2 from seq_1_to_49; +insert into t1(es,er) select 0, 3 from seq_1_to_951; +insert into t1(es,er) select 0, 3 from seq_1_to_1054; +insert into t1(es,er) select 0, 6 from seq_1_to_25; +insert into t1(es,er) select 0, 11 from seq_1_to_1; +insert into t1(es,er) select 1, 1 from seq_1_to_45; +insert into t1(es,er) select 1, 2 from seq_1_to_16; +insert into t1(es,er) select 1, 3 from seq_1_to_511; +insert into t1(es,er) select 1, 4 from seq_1_to_687; +insert into t1(es,er) select 1, 6 from seq_1_to_50; +insert into t1(es,er) select 1, 7 from seq_1_to_4; +insert into t1(es,er) select 1, 11 from seq_1_to_1; +insert into t1(es,er) select 2, 1 from seq_1_to_82; +insert into t1(es,er) select 2, 2 from seq_1_to_82; +insert into t1(es,er) select 2, 3 from seq_1_to_1626; +insert into t1(es,er) select 2, 4 from seq_1_to_977; +insert into t1(es,er) select 2, 6 from seq_1_to_33; +insert into t1(es,er) select 2, 11 from seq_1_to_1; +insert into t1(es,er) select 3, 1 from seq_1_to_245; +insert into t1(es,er) select 3, 2 from seq_1_to_81; +insert into t1(es,er) select 3, 3 from seq_1_to_852; +insert into t1(es,er) select 3, 4 from seq_1_to_2243; +insert into t1(es,er) select 3, 6 from seq_1_to_44; +insert into t1(es,er) select 3, 11 from seq_1_to_1; +insert into t1(es,er) select 4, 1 from seq_1_to_91; +insert into t1(es,er) select 4, 2 from seq_1_to_83; +insert into t1(es,er) select 4, 3 from seq_1_to_297; +insert into t1(es,er) select 4, 4 from seq_1_to_2456; +insert into t1(es,er) select 4, 6 from seq_1_to_19; +insert into t1(es,er) select 4, 11 from seq_1_to_1; +update t1 set p='foobar'; +update t1 set x=0; +set @save_isp=@@innodb_stats_persistent; +set global innodb_stats_persistent= 1; +analyze table t1; +Table Op Msg_type Msg_text +test.t1 analyze status OK +set optimizer_switch='index_merge_sort_intersection=on'; +SELECT * FROM t1 +WHERE ((p = 'foo' AND er != 4) OR er = 4 ) AND (es >= 4) LIMIT 2; +id p es er x +14645 foobar 4 4 0 +14646 foobar 4 4 0 +EXPLAIN EXTENDED SELECT * FROM t1 +WHERE ((p = 'foo' AND er != 4) OR er = 4 ) AND (es >= 4) LIMIT 2; +id select_type table type possible_keys key key_len ref rows filtered Extra +1 SIMPLE t1 range es,er,p es 1 NULL # 100.00 Using index condition; Using where +Warnings: +Note 1003 select `test`.`t1`.`id` AS `id`,`test`.`t1`.`p` AS `p`,`test`.`t1`.`es` AS `es`,`test`.`t1`.`er` AS `er`,`test`.`t1`.`x` AS `x` from `test`.`t1` where (`test`.`t1`.`p` = 'foo' and `test`.`t1`.`er` <> 4 or `test`.`t1`.`er` = 4) and `test`.`t1`.`es` >= 4 limit 2 +set optimizer_switch='index_merge_sort_intersection=off'; +SELECT * FROM t1 +WHERE ((p = 'foo' AND er != 4) OR er = 4 ) AND (es >= 4) LIMIT 2; +id p es er x +14645 foobar 4 4 0 +14646 foobar 4 4 0 +EXPLAIN EXTENDED SELECT * FROM t1 +WHERE ((p = 'foo' AND er != 4) OR er = 4 ) AND (es >= 4) LIMIT 2; +id select_type table type possible_keys key key_len ref rows filtered Extra +1 SIMPLE t1 range es,er,p es 1 NULL # 100.00 Using index condition; Using where +Warnings: +Note 1003 select `test`.`t1`.`id` AS `id`,`test`.`t1`.`p` AS `p`,`test`.`t1`.`es` AS `es`,`test`.`t1`.`er` AS `er`,`test`.`t1`.`x` AS `x` from `test`.`t1` where (`test`.`t1`.`p` = 'foo' and `test`.`t1`.`er` <> 4 or `test`.`t1`.`er` = 4) and `test`.`t1`.`es` >= 4 limit 2 +set optimizer_switch='index_merge_sort_intersection=default'; +set global innodb_stats_persistent= @save_isp; +DROP TABLE t1; +# End of 10.2 tests diff --git a/mysql-test/t/range_innodb.test b/mysql-test/t/range_innodb.test index 428e5c2..7420e72 100644 --- a/mysql-test/t/range_innodb.test +++ b/mysql-test/t/range_innodb.test @@ -116,3 +116,81 @@ SET @@GLOBAL.debug_dbug = @saved_dbug; set @@optimizer_switch= @optimizer_switch_save;
--echo # End of 10.1 tests + +--echo # +--echo # MDEV-27262: Index intersection with full scan over an index +--echo # + +--source include/have_sequence.inc + +CREATE TABLE t1 ( + id int(10) unsigned NOT NULL AUTO_INCREMENT, + p char(32) DEFAULT NULL, + es tinyint(3) unsigned NOT NULL DEFAULT 0, + er tinyint(3) unsigned NOT NULL DEFAULT 0, + x mediumint(8) unsigned NOT NULL DEFAULT 0, + PRIMARY KEY (id), + INDEX es (es), + INDEX x (x), + INDEX er (er,x), + INDEX p (p) +) ENGINE=InnoDB DEFAULT CHARSET=latin1; + +insert into t1(es,er) select 0, 1 from seq_1_to_45; +insert into t1(es,er) select 0, 2 from seq_1_to_49; +insert into t1(es,er) select 0, 3 from seq_1_to_951; +insert into t1(es,er) select 0, 3 from seq_1_to_1054; +insert into t1(es,er) select 0, 6 from seq_1_to_25; +insert into t1(es,er) select 0, 11 from seq_1_to_1; +insert into t1(es,er) select 1, 1 from seq_1_to_45; +insert into t1(es,er) select 1, 2 from seq_1_to_16; +insert into t1(es,er) select 1, 3 from seq_1_to_511; +insert into t1(es,er) select 1, 4 from seq_1_to_687; +insert into t1(es,er) select 1, 6 from seq_1_to_50; +insert into t1(es,er) select 1, 7 from seq_1_to_4; +insert into t1(es,er) select 1, 11 from seq_1_to_1; +insert into t1(es,er) select 2, 1 from seq_1_to_82; +insert into t1(es,er) select 2, 2 from seq_1_to_82; +insert into t1(es,er) select 2, 3 from seq_1_to_1626; +insert into t1(es,er) select 2, 4 from seq_1_to_977; +insert into t1(es,er) select 2, 6 from seq_1_to_33; +insert into t1(es,er) select 2, 11 from seq_1_to_1; +insert into t1(es,er) select 3, 1 from seq_1_to_245; +insert into t1(es,er) select 3, 2 from seq_1_to_81; +insert into t1(es,er) select 3, 3 from seq_1_to_852; +insert into t1(es,er) select 3, 4 from seq_1_to_2243; +insert into t1(es,er) select 3, 6 from seq_1_to_44; +insert into t1(es,er) select 3, 11 from seq_1_to_1; +insert into t1(es,er) select 4, 1 from seq_1_to_91; +insert into t1(es,er) select 4, 2 from seq_1_to_83; +insert into t1(es,er) select 4, 3 from seq_1_to_297; +insert into t1(es,er) select 4, 4 from seq_1_to_2456; +insert into t1(es,er) select 4, 6 from seq_1_to_19; +insert into t1(es,er) select 4, 11 from seq_1_to_1; +update t1 set p='foobar'; +update t1 set x=0; +set @save_isp=@@innodb_stats_persistent; +set global innodb_stats_persistent= 1; +analyze table t1; + +let $q= +SELECT * FROM t1 + WHERE ((p = 'foo' AND er != 4) OR er = 4 ) AND (es >= 4) LIMIT 2; + +set optimizer_switch='index_merge_sort_intersection=on'; +eval $q; +--replace_column 9 # +eval EXPLAIN EXTENDED $q; + +set optimizer_switch='index_merge_sort_intersection=off'; +# execution of $q and explain for it led to an assertion failure in 10.4 +# (with the optimizer switch rowid_filter set to 'on') +eval $q; +--replace_column 9 # +eval EXPLAIN EXTENDED $q; +set optimizer_switch='index_merge_sort_intersection=default'; + +set global innodb_stats_persistent= @save_isp; +DROP TABLE t1; + +--echo # End of 10.2 tests diff --git a/sql/opt_range.cc b/sql/opt_range.cc index f3f1843..2e05b88 100644 --- a/sql/opt_range.cc +++ b/sql/opt_range.cc @@ -9413,6 +9413,13 @@ key_or(RANGE_OPT_PARAM *param, SEL_ARG *key1,SEL_ARG *key2) key2->copy_min(tmp); if (!(key1=key1->tree_delete(tmp))) { // Only one key in tree + if (key2->min_flag & NO_MIN_RANGE && + key2->max_flag & NO_MAX_RANGE) + { + if (key2->maybe_flag) + return new SEL_ARG(SEL_ARG::MAYBE_KEY); + return 0; // Always true OR + } key1=key2; key1->make_root(); key2=key2_next;
BR Sergei -- Sergei Petrunia, Software Developer MariaDB Corporation | Skype: sergefp | Blog: http://petrunia.net
-- BR Sergei -- Sergei Petrunia, Software Developer MariaDB Corporation | Skype: sergefp | Blog: http://petrunia.net
Hi Sergey, Thanks for your comments. Please find some quick responses to some of the comments while they are still fresh in my mind before the weekend comes, and more to follow next week. On Thu 2023-05-25 22:14:13 +0300, Sergey Petrunia wrote:
Hi Yuchen,
Please find below review input for
commit ba9cf8efeb0b1e1c132d565adb79c98156783f15 Author: Yuchen Pei <yuchen.pei@mariadb.com> Date: Fri May 19 20:06:31 2023 +1000
MDEV-22534 Decorrelate IN subquery
First, general questions:
Q1: It seems there is some overlap between Item_in_subselect::exists2in_processor() and Item_exists_subselect::exists2in_processor() Did you consider factoring out common code rather than copying it? (if yes, I'd like to know why it couldn't be done).
Indeed Item_in_subselect::exists2in_processor() is modeled after Item_exists_subselect::exists2in_processor(). The issue with the latter is that it is a large function that could have grown over time and I want to build Item_in_subselect::exists2in_processor() with only what is necessary. I will try to factor out common code. Also please take a look at the added test file in the patch - I tried to come up with tests for every line, including the skip condition involving upper_not mentioned below.
Q2: Where is the coe that removes the correlated equalities from the subquery's WHERE? (I assume Item_exists_subselect code does it?)
Like the processor for `exists`, I replaces the correlated equalities in `where` with int item 1 rather than removing them. Is it better to remove them?
diff --git a/sql/item_subselect.cc b/sql/item_subselect.cc index 9e6c205ca76..8d893794da3 100644 --- a/sql/item_subselect.cc +++ b/sql/item_subselect.cc @@ -3099,6 +3099,132 @@ static bool find_inner_outer_equalities(Item **conds, return TRUE; }
+/* Checks whether item tree intersects with the free list */ +static bool intersects_free_list(Item *item, THD *thd) +{ + for (const Item *to_find= thd->free_list; to_find; to_find= to_find->next) + if (item->walk(&Item::find_item_processor, 1, (void *) to_find)) + return true; + return false; +} +
Please try moving this function into a separate file. First name that came to my mind: opt_subq_decorrelate.cc Feel free to suggest better names. The idea is that we should aim for better structure. (This request is only valid if the factoring out suggested above doesn't work out).
Sanja commented on a similar commit to MDEV-31269 that this way of finding intersection is too costly. What do you think? Before moving this function, either we need to agree this is an acceptable function or figure out an acceptable way of finding free items in the item tree from the equalities. Failing both, then I suggest (as I did in a comment[1] on MDEV-22534), we mark this function as out of scope for the review of MDEV-22534 and as part of MDEV-31269 which may be fixed by MDEV-30073 (see a comment[2] in MDEV-31269). [1] https://jira.mariadb.org/browse/MDEV-22534?focusedCommentId=259406&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-259406 [2] https://jira.mariadb.org/browse/MDEV-31269?focusedCommentId=259516&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-259516
+/* De-correlate where conditions in an IN subquery */
This needs a bigger comment describing what the rewrite does.
+bool Item_in_subselect::exists2in_processor(void *opt_arg) +{ + THD *thd= (THD *)opt_arg; + SELECT_LEX *first_select= unit->first_select(); + JOIN *join= first_select->join; + bool will_be_correlated; + Dynamic_array<EQ_FIELD_OUTER> eqs(PSI_INSTRUMENT_MEM, 5, 5); + List<Item> outer; + Query_arena *arena= NULL, backup; + int res= FALSE; + DBUG_ENTER("Item_in_subselect::exists2in_processor"); +
Please move comments out of the if statement and number the conditions.
For examples, see best_access_path(), large if-statement starting with "Don't test table scan if it can't be better." or "EXISTS-to-IN coversion and ORDER BY ... LIMIT clause" in Item_exists_subselect::exists2in_processor().
+ if (!optimizer_flag(thd, OPTIMIZER_SWITCH_EXISTS_TO_IN) || + /* proceed only if I'm a toplevel IN or a toplevel NOT IN */ + !(is_top_level_item() || + (upper_not && upper_not->is_top_level_item())) ||
what is the reason behind this condition?
Without these conditions, the following testcase in `subslect_decorrelate_in.test` will fail (produce wrong results IIRC): --8<---------------cut here---------------start------------->8--- --echo # skip transformation when neither toplevel IN nor toplevel NOT IN; testcase adapated from main.subslect4 CREATE TABLE t1 ( a INT, b INT ); INSERT INTO t1 VALUES ( 1, NULL ), ( 2, NULL ); CREATE TABLE t2 ( c INT, d INT ); INSERT INTO t2 VALUES ( NULL, 3 ), ( NULL, 4 ); # not top level, upper_not is not top level SELECT * FROM t1 WHERE a NOT IN ( SELECT c FROM t2 where b = d ) IS NULL; select json_extract(trace, '$**.join_optimization.steps[*].transformation.to') from information_schema.OPTIMIZER_TRACE; # not top level, upper_not is null SELECT * FROM t1 WHERE a IN ( SELECT c FROM t2 where b = d ) IS NULL; select json_extract(trace, '$**.join_optimization.steps[*].transformation.to') from information_schema.OPTIMIZER_TRACE; # not top level, upper_not is not top level SELECT * FROM t1 WHERE a NOT IN ( SELECT c FROM t2 where b = d) IS unknown; select json_extract(trace, '$**.join_optimization.steps[*].transformation.to') from information_schema.OPTIMIZER_TRACE; # not top level, upper_not is null SELECT * FROM t1 WHERE a IN ( SELECT c FROM t2 where b = d ) IS unknown; select json_extract(trace, '$**.join_optimization.steps[*].transformation.to') from information_schema.OPTIMIZER_TRACE; drop table t1, t2; --8<---------------cut here---------------end--------------->8---
If we're able to handle top-level NOT IN, why can't we handle subquery in any context?
Can we really handle the NOT IN? (I suspect not, it will suffer from the semantics of IN subqueries with NULLs.. I can't find the name for this, it's described e.g. here: https://petrunia.net/2006/11/16/inany-subqueries-null-woes/ )
+ first_select->is_part_of_union() || /* skip if part of a union */ + first_select->group_list.elements || /* skip if with group by */ + first_select->with_sum_func || /* skip if aggregation */ + join->having || /* skip if with having */ + !join->conds || /* skip if no conds */ + /* skip if left expr is a single nonscalar subselect */ + (left_expr->type() == Item::SUBSELECT_ITEM && + !left_expr->type_handler()->is_scalar_type()))
The above seems to be some exotic SQL. Is it something like
(select col1,lol2 from t1) IN (select col3,col4 from t2 where ..)
and does MariaDB really support this? Please add a comment about this.
It is covered by the following testcase - I will add a comment: --8<---------------cut here---------------start------------->8--- --echo # check for subselect in left_expr; adapted from main.ps create table t1 (a int, b int, c int); create table t2 (x int, y int, z int); create table t3 as select * from t1; insert into t1 values (1,2,3),(4,5,6),(100,200,300),(400,500,600); insert into t2 values (1,2,3),(7,8,9),(100,200,300),(400,500,600); insert into t3 values (1,2,3),(11,12,13),(100,0,0),(400,500,600); # scalar subselect - ok let $query= select * from t1 where (select a from t3 where t3.c=t1.c) in (select x from t2 where t1.c= t2.z); eval explain $query; eval $query; # a row containing a subselect - ok let $query= select * from t1 where ((select a from t3 where t3.c=t1.c),b) in (select x, y from t2 where t1.c= t2.z); eval explain $query; eval $query; # a single non-scalar subselect - skip transformation let $query= select * from t1 where (select a,b from t3 where t3.c=t1.c) in (select x,y from t2 where t1.c= t2.z); eval explain $query; eval $query; drop table t1, t2, t3; --8<---------------cut here---------------end--------------->8---
+ DBUG_RETURN(FALSE); + /* iterate over conditions, and check whether they can be moved out. */ + if (find_inner_outer_equalities(&join->conds, eqs)) + DBUG_RETURN(FALSE); + /* If we are in the execution of a prepared statement, check for + intersection with the free list to avoid segfault. Note that the + check for prepared statement execution is necessary, otherwise it + will likely always find intersection thus abort the + transformation. + + fixme: this only works when HAVE_PSI_STATEMENT_INTERFACE is + defined. It causes CI's like amd64-debian-10-debug-embedded to + fail. Are there other ways to find out we are in the execution of a + prepared statement? */
I'm looking at the code of activate_stmt_arena_if_needed(). It seems "stmt_arena->is_conventional()" is the check you're looking for.
That's right. I have a more up-to-date patch for MDEV-31269 that uses this which I will adapt to this patch: https://github.com/MariaDB/server/commit/ffba2a85948
+ if (thd->m_statement_state.m_parent_prepared_stmt) + { + for (uint i= 0; i < (uint) eqs.elements(); i++) + { + if (intersects_free_list(*eqs.at(i).eq_ref, thd)) + DBUG_RETURN(FALSE);
This check doesn't look good. As far as I understand, it is a temporary measure until related re-execution bug(s) are fixed by Igor (and/or Sanja)?
sanja mentioned this is too costly. See my comment above about `intersects_free_list()`. I think an acceptable temporary measure would be nice, as MDEV-31269 is a segfault rather than wrong results. But I don't know yet how to make it less costly - let me know if you have any ideas. Otherwise I suggest this is out of scope for the review of this patch. Best, Yuchen
On Thu 2023-05-25 22:14:13 +0300, Sergey Petrunia wrote:
+ if (!optimizer_flag(thd, OPTIMIZER_SWITCH_EXISTS_TO_IN) || + /* proceed only if I'm a toplevel IN or a toplevel NOT IN */ + !(is_top_level_item() || + (upper_not && upper_not->is_top_level_item())) ||
what is the reason behind this condition? If we're able to handle top-level NOT IN, why can't we handle subquery in any context?
Can we really handle the NOT IN? (I suspect not, it will suffer from the semantics of IN subqueries with NULLs.. I can't find the name for this, it's described e.g. here: https://petrunia.net/2006/11/16/inany-subqueries-null-woes/ )
Thanks for the pointer. It seems to me from this post that the server follows a standard 3-value logic (3vl) where `null op foo` evals to the unknown value for comparison operator op. So indeed the transformation cannot handle NOT IN without some surgery. I studied the exists2in transformation more and 3vl and came to the conclusion that the fix there for the same situation could help with our case too, that is, replacing the equalities with `inner_field is not null`, and from outside AND the whole thing with `outer_expr is not null`. More specifically, consider the simple example of --8<---------------cut here---------------start------------->8--- create table t1 (a1 int, b1 int); create table t2 (a2 int, b2 int); insert into t1 values (0, null), (1, 1); insert into t2 values (0, 1), (1, null); select * from t1 where a1 in (select a2 from t2 where b1 = b2); --8<---------------cut here---------------end--------------->8--- Focusing on the where condition of the outer select, basically what we are trying to do is (\V denotes "OR over" i.e. ⋁) transforming the lhs to the rhs from the following identity: \V_{i: b1 = b2i} (a1 = a2i) = \V_i (a1 = a2i and b1 = b2i) This identity holds in classical logic, but not 3vl which adds an unknown (abbreviated u) to t and f. In mariadb, unknown seems to be an alias to null, but it has a different semantics so in the following I write u as an unknown logic value (where we can use comparison with other logical expressions) and null as a sql missing value. The 3vl version as in my broken implementation is \V_{i: (b1 = b2i) = t} (a1 = a2i) = \V_i (a1 = a2i and b1 = b2i) # WRONG! This identity holds when lhs is t or u, but not f. It breaks, i.e. lhs = f and rhs = u if and only if both of the following hold - there exists i s.t. (b1 = b2i) = u and (a1 = a2i) <> f - forall i s.t. (b1 = b2i) = t, (a1 = a2i) = f One can verify that the following modified version holds \V_{i: (b1 = b2i) = t} (a1 = a2i) = (\V_{i: b2i is not null} (a1 = a2i and b1 = b2i)) and (b1 is not null) Translating back to the query language, we have select * from t1 where (b1 is not null and (a1, b1) in (select a2, b2 from t2 where b2 is not null)); With this modification, theoretically we don't have to care about whether the IN subselect is top level, however, there are two issues: 1. When the subquery is indeed top level, do the extra terms lower performance? 2. When the subquery is not top level, how do we substitute `this` which is an Item_in_subselect with the Item_func_and? I suspect this is why the exists2in transformation only does the modification when there is an upper_not, and the transformation is skipped when the subquery is not toplevel EXISTS nor toplevel NOT EXISTS. More to follow. Best, Yuchen
Hi Sergei, Thanks again for your comments. I have added a comment on the jira ticket including the link to a patch that addresses your comments. I will take one more look at it tomorrow morning before sending it to you for review, but I thought maybe it makes sense to respond now (see below) given the timezone difference.
Hi Yuchen,
Please find below review input for
commit ba9cf8efeb0b1e1c132d565adb79c98156783f15 Author: Yuchen Pei <yuchen.pei@mariadb.com> Date: Fri May 19 20:06:31 2023 +1000
MDEV-22534 Decorrelate IN subquery
First, general questions:
Q1: It seems there is some overlap between Item_in_subselect::exists2in_processor() and Item_exists_subselect::exists2in_processor() Did you consider factoring out common code rather than copying it? (if yes, I'd like to know why it couldn't be done).
Done. I factored out the common code into three methods: - Item_exists_subselect::exists2in_prepare() - Item_exists_subselect::exists2in_create_or_update_in() - Item_exists_subselect::exists2in_and_is_not_nulls() In doing so I tried to not alter the logic of the exists2in transformation, while keeping the decorrelate-in transformation simple. For things done in exists2in but not in my decorrelate-in implementation, I add a substype() check.
Q2: Where is the coe that removes the correlated equalities from the subquery's WHERE? (I assume Item_exists_subselect code does it?)
Naively, the best place to remove the equalities would be in find_inner_outer_equalities(), when it iterates over the conds and one could simply call the remove() method on the iterator. But we can't just do that because after the call to find_inner_outer_equalities() we need to do more checks, as well as replacing the equalities with `IS NOT NULL`s in case of upper_not to ensure the 3vl value of the subquery remains the same (see my previous response about three value logic).
diff --git a/sql/item_subselect.cc b/sql/item_subselect.cc index 9e6c205ca76..8d893794da3 100644 --- a/sql/item_subselect.cc +++ b/sql/item_subselect.cc @@ -3099,6 +3099,132 @@ static bool find_inner_outer_equalities(Item **conds, return TRUE; }
+/* Checks whether item tree intersects with the free list */ +static bool intersects_free_list(Item *item, THD *thd) +{ + for (const Item *to_find= thd->free_list; to_find; to_find= to_find->next) + if (item->walk(&Item::find_item_processor, 1, (void *) to_find)) + return true; + return false; +} +
Please try moving this function into a separate file. First name that came to my mind: opt_subq_decorrelate.cc Feel free to suggest better names. The idea is that we should aim for better structure. (This request is only valid if the factoring out suggested above doesn't work out).
Replied previously. BTW why should this function be in a separate file? Is it because it is not a method of any of the Item_subselect classes?
+/* De-correlate where conditions in an IN subquery */
This needs a bigger comment describing what the rewrite does.
Done and replied previously
+bool Item_in_subselect::exists2in_processor(void *opt_arg) +{ + THD *thd= (THD *)opt_arg; + SELECT_LEX *first_select= unit->first_select(); + JOIN *join= first_select->join; + bool will_be_correlated; + Dynamic_array<EQ_FIELD_OUTER> eqs(PSI_INSTRUMENT_MEM, 5, 5); + List<Item> outer; + Query_arena *arena= NULL, backup; + int res= FALSE; + DBUG_ENTER("Item_in_subselect::exists2in_processor"); +
Please move comments out of the if statement and number the conditions.
For examples, see best_access_path(), large if-statement starting with "Don't test table scan if it can't be better." or "EXISTS-to-IN coversion and ORDER BY ... LIMIT clause" in Item_exists_subselect::exists2in_processor().
Done and replied previously
+ if (!optimizer_flag(thd, OPTIMIZER_SWITCH_EXISTS_TO_IN) || + /* proceed only if I'm a toplevel IN or a toplevel NOT IN */ + !(is_top_level_item() || + (upper_not && upper_not->is_top_level_item())) ||
what is the reason behind this condition? If we're able to handle top-level NOT IN, why can't we handle subquery in any context?
Can we really handle the NOT IN? (I suspect not, it will suffer from the semantics of IN subqueries with NULLs.. I can't find the name for this, it's described e.g. here: https://petrunia.net/2006/11/16/inany-subqueries-null-woes/ )
Done and replied previously
+ first_select->is_part_of_union() || /* skip if part of a union */ + first_select->group_list.elements || /* skip if with group by */ + first_select->with_sum_func || /* skip if aggregation */ + join->having || /* skip if with having */ + !join->conds || /* skip if no conds */ + /* skip if left expr is a single nonscalar subselect */ + (left_expr->type() == Item::SUBSELECT_ITEM && + !left_expr->type_handler()->is_scalar_type()))
The above seems to be some exotic SQL. Is it something like
(select col1,lol2 from t1) IN (select col3,col4 from t2 where ..)
and does MariaDB really support this? Please add a comment about this.
Done and replied previously
+ DBUG_RETURN(FALSE); + /* iterate over conditions, and check whether they can be moved out. */ + if (find_inner_outer_equalities(&join->conds, eqs)) + DBUG_RETURN(FALSE); + /* If we are in the execution of a prepared statement, check for + intersection with the free list to avoid segfault. Note that the + check for prepared statement execution is necessary, otherwise it + will likely always find intersection thus abort the + transformation. + + fixme: this only works when HAVE_PSI_STATEMENT_INTERFACE is + defined. It causes CI's like amd64-debian-10-debug-embedded to + fail. Are there other ways to find out we are in the execution of a + prepared statement? */
I'm looking at the code of activate_stmt_arena_if_needed(). It seems "stmt_arena->is_conventional()" is the check you're looking for.
Done and replied previously
+ if (thd->m_statement_state.m_parent_prepared_stmt) + { + for (uint i= 0; i < (uint) eqs.elements(); i++) + { + if (intersects_free_list(*eqs.at(i).eq_ref, thd)) + DBUG_RETURN(FALSE);
This check doesn't look good. As far as I understand, it is a temporary measure until related re-execution bug(s) are fixed by Igor (and/or Sanja)?
Replied previously
+ } + }
+ /* Determines whether the result will be correlated */ + { + List<Item> unused; + Collect_deps_prm prm= {&unused, // parameters + unit->first_select()->nest_level_base, // nest_level_base + 0, // count + unit->first_select()->nest_level, // nest_level + FALSE // collect + }; + walk(&Item::collect_outer_ref_processor, TRUE, &prm); + DBUG_ASSERT(prm.count > 0); + DBUG_ASSERT(prm.count >= (uint)eqs.elements()); + will_be_correlated= prm.count > (uint)eqs.elements(); See the example pasted in the MDEV: https://jira.mariadb.org/browse/MDEV-22534?focusedCommentId=259705&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-259705
Somehow, I get here will_be_correlated==FALSE but Materialization is still not considered for the subquery?
Done and replied in the ticket
+ } + /* Back up the free list */ + arena= thd->activate_stmt_arena_if_needed(&backup); + /* Construct the items for left_expr */ + if (left_expr->type() == Item::ROW_ITEM) + for (uint i= 0; i < left_expr->cols(); i++) + outer.push_back(left_expr->element_index(i)); + else + outer.push_back(left_expr); + const uint offset= first_select->item_list.elements; + /* Move items to outer and select item list */ + for (uint i= 0; i < (uint)eqs.elements(); i++) + { + Item **eq_ref= eqs.at(i).eq_ref; + Item_ident *local_field= eqs.at(i).local_field; + Item *outer_exp= eqs.at(i).outer_exp; + first_select->item_list.push_back(local_field, thd->mem_root); + first_select->ref_pointer_array[offset + i]= (Item *)local_field;
How do we know that ref_pointer_array has enough room for the new elements? (We can discuss this with Sanja, he's an expert on the topic).
I see this check in Item_exists_subselect::exists2in_processor
if ((uint)eqs.elements() > (first_select->item_list.elements + first_select->select_n_reserved))
Is it how Item_exists_subselect handles this issue?
Yes, I think so, done Best, Yuchen
participants (2)
-
Sergey Petrunia
-
Yuchen Pei