Hi Sergei, Thanks again for your comments. I have added a comment on the jira ticket including the link to a patch that addresses your comments. I will take one more look at it tomorrow morning before sending it to you for review, but I thought maybe it makes sense to respond now (see below) given the timezone difference.
Hi Yuchen,
Please find below review input for
commit ba9cf8efeb0b1e1c132d565adb79c98156783f15 Author: Yuchen Pei <yuchen.pei@mariadb.com> Date: Fri May 19 20:06:31 2023 +1000
MDEV-22534 Decorrelate IN subquery
First, general questions:
Q1: It seems there is some overlap between Item_in_subselect::exists2in_processor() and Item_exists_subselect::exists2in_processor() Did you consider factoring out common code rather than copying it? (if yes, I'd like to know why it couldn't be done).
Done. I factored out the common code into three methods: - Item_exists_subselect::exists2in_prepare() - Item_exists_subselect::exists2in_create_or_update_in() - Item_exists_subselect::exists2in_and_is_not_nulls() In doing so I tried to not alter the logic of the exists2in transformation, while keeping the decorrelate-in transformation simple. For things done in exists2in but not in my decorrelate-in implementation, I add a substype() check.
Q2: Where is the coe that removes the correlated equalities from the subquery's WHERE? (I assume Item_exists_subselect code does it?)
Naively, the best place to remove the equalities would be in find_inner_outer_equalities(), when it iterates over the conds and one could simply call the remove() method on the iterator. But we can't just do that because after the call to find_inner_outer_equalities() we need to do more checks, as well as replacing the equalities with `IS NOT NULL`s in case of upper_not to ensure the 3vl value of the subquery remains the same (see my previous response about three value logic).
diff --git a/sql/item_subselect.cc b/sql/item_subselect.cc index 9e6c205ca76..8d893794da3 100644 --- a/sql/item_subselect.cc +++ b/sql/item_subselect.cc @@ -3099,6 +3099,132 @@ static bool find_inner_outer_equalities(Item **conds, return TRUE; }
+/* Checks whether item tree intersects with the free list */ +static bool intersects_free_list(Item *item, THD *thd) +{ + for (const Item *to_find= thd->free_list; to_find; to_find= to_find->next) + if (item->walk(&Item::find_item_processor, 1, (void *) to_find)) + return true; + return false; +} +
Please try moving this function into a separate file. First name that came to my mind: opt_subq_decorrelate.cc Feel free to suggest better names. The idea is that we should aim for better structure. (This request is only valid if the factoring out suggested above doesn't work out).
Replied previously. BTW why should this function be in a separate file? Is it because it is not a method of any of the Item_subselect classes?
+/* De-correlate where conditions in an IN subquery */
This needs a bigger comment describing what the rewrite does.
Done and replied previously
+bool Item_in_subselect::exists2in_processor(void *opt_arg) +{ + THD *thd= (THD *)opt_arg; + SELECT_LEX *first_select= unit->first_select(); + JOIN *join= first_select->join; + bool will_be_correlated; + Dynamic_array<EQ_FIELD_OUTER> eqs(PSI_INSTRUMENT_MEM, 5, 5); + List<Item> outer; + Query_arena *arena= NULL, backup; + int res= FALSE; + DBUG_ENTER("Item_in_subselect::exists2in_processor"); +
Please move comments out of the if statement and number the conditions.
For examples, see best_access_path(), large if-statement starting with "Don't test table scan if it can't be better." or "EXISTS-to-IN coversion and ORDER BY ... LIMIT clause" in Item_exists_subselect::exists2in_processor().
Done and replied previously
+ if (!optimizer_flag(thd, OPTIMIZER_SWITCH_EXISTS_TO_IN) || + /* proceed only if I'm a toplevel IN or a toplevel NOT IN */ + !(is_top_level_item() || + (upper_not && upper_not->is_top_level_item())) ||
what is the reason behind this condition? If we're able to handle top-level NOT IN, why can't we handle subquery in any context?
Can we really handle the NOT IN? (I suspect not, it will suffer from the semantics of IN subqueries with NULLs.. I can't find the name for this, it's described e.g. here: https://petrunia.net/2006/11/16/inany-subqueries-null-woes/ )
Done and replied previously
+ first_select->is_part_of_union() || /* skip if part of a union */ + first_select->group_list.elements || /* skip if with group by */ + first_select->with_sum_func || /* skip if aggregation */ + join->having || /* skip if with having */ + !join->conds || /* skip if no conds */ + /* skip if left expr is a single nonscalar subselect */ + (left_expr->type() == Item::SUBSELECT_ITEM && + !left_expr->type_handler()->is_scalar_type()))
The above seems to be some exotic SQL. Is it something like
(select col1,lol2 from t1) IN (select col3,col4 from t2 where ..)
and does MariaDB really support this? Please add a comment about this.
Done and replied previously
+ DBUG_RETURN(FALSE); + /* iterate over conditions, and check whether they can be moved out. */ + if (find_inner_outer_equalities(&join->conds, eqs)) + DBUG_RETURN(FALSE); + /* If we are in the execution of a prepared statement, check for + intersection with the free list to avoid segfault. Note that the + check for prepared statement execution is necessary, otherwise it + will likely always find intersection thus abort the + transformation. + + fixme: this only works when HAVE_PSI_STATEMENT_INTERFACE is + defined. It causes CI's like amd64-debian-10-debug-embedded to + fail. Are there other ways to find out we are in the execution of a + prepared statement? */
I'm looking at the code of activate_stmt_arena_if_needed(). It seems "stmt_arena->is_conventional()" is the check you're looking for.
Done and replied previously
+ if (thd->m_statement_state.m_parent_prepared_stmt) + { + for (uint i= 0; i < (uint) eqs.elements(); i++) + { + if (intersects_free_list(*eqs.at(i).eq_ref, thd)) + DBUG_RETURN(FALSE);
This check doesn't look good. As far as I understand, it is a temporary measure until related re-execution bug(s) are fixed by Igor (and/or Sanja)?
Replied previously
+ } + }
+ /* Determines whether the result will be correlated */ + { + List<Item> unused; + Collect_deps_prm prm= {&unused, // parameters + unit->first_select()->nest_level_base, // nest_level_base + 0, // count + unit->first_select()->nest_level, // nest_level + FALSE // collect + }; + walk(&Item::collect_outer_ref_processor, TRUE, &prm); + DBUG_ASSERT(prm.count > 0); + DBUG_ASSERT(prm.count >= (uint)eqs.elements()); + will_be_correlated= prm.count > (uint)eqs.elements(); See the example pasted in the MDEV: https://jira.mariadb.org/browse/MDEV-22534?focusedCommentId=259705&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-259705
Somehow, I get here will_be_correlated==FALSE but Materialization is still not considered for the subquery?
Done and replied in the ticket
+ } + /* Back up the free list */ + arena= thd->activate_stmt_arena_if_needed(&backup); + /* Construct the items for left_expr */ + if (left_expr->type() == Item::ROW_ITEM) + for (uint i= 0; i < left_expr->cols(); i++) + outer.push_back(left_expr->element_index(i)); + else + outer.push_back(left_expr); + const uint offset= first_select->item_list.elements; + /* Move items to outer and select item list */ + for (uint i= 0; i < (uint)eqs.elements(); i++) + { + Item **eq_ref= eqs.at(i).eq_ref; + Item_ident *local_field= eqs.at(i).local_field; + Item *outer_exp= eqs.at(i).outer_exp; + first_select->item_list.push_back(local_field, thd->mem_root); + first_select->ref_pointer_array[offset + i]= (Item *)local_field;
How do we know that ref_pointer_array has enough room for the new elements? (We can discuss this with Sanja, he's an expert on the topic).
I see this check in Item_exists_subselect::exists2in_processor
if ((uint)eqs.elements() > (first_select->item_list.elements + first_select->select_n_reserved))
Is it how Item_exists_subselect handles this issue?
Yes, I think so, done Best, Yuchen