Re: [Maria-developers] MDEV-26278 Elimination: consider GROUP BY as unique key
Hi Oleg, Please find the review below.
commit 20609d374b3a82c46291ee5e87cfadd26de47817 Author: Oleg Smirnov <olernov@gmail.com> Date: Thu Feb 17 22:53:37 2022 +0700
MDEV-26278 Elimination: consider GROUP BY as unique key
Please provide a more verbose description. We should aim for it to provide sufficient info for one to understand what the optimization does and how that is achieved. Some general comments: The patch uses this formatting for multi-line comments:
+ /* GROUP BY expression is considered as a synthetic "unique key" + for the derived table. Add this key as a dependency */
While the coding style and actual MariaDB code would use this: /* GROUP BY expression is considered as a synthetic "unique key" for the derived table. Add this key as a dependency */ Please use this style for all comments. Also, the patch uses the term "synthetic unique key". I think the term is not very good: It can be confused with "synthetic primary key", also I read "synthetic" as something that actually exists, while in our case the key doesn't exist. Would using "Pseudo-key" be better?
diff --git a/sql/table.h b/sql/table.h index 497502b2d06..fd598556cc3 100644 --- a/sql/table.h +++ b/sql/table.h @@ -2872,6 +2872,9 @@ struct TABLE_LIST } }
+ void mark_derived_as_eliminated() { m_is_derived_eliminated= true; } + bool is_derived_eliminated() const { return m_is_derived_eliminated; } + So, TABLE_LIST objects are created non-eliminated and then they are eliminated, and that is forever. TABLE_LIST object has the same lifetime as the statement, that is, it will survive multiple executions of a prepared statement.
On the other hand, Table Elimination optimization is done in eliminate_tables() call which is invoked for every statement execution. This looks like a dangerous mismatch. I'll continue below
private: bool prep_check_option(THD *thd, uint8 check_opt_type); bool prep_where(THD *thd, Item **conds, bool no_where_clause); diff --git a/sql/sql_lex.cc b/sql/sql_lex.cc index 3bf6d9abf57..02ef5cd7ad7 100644 --- a/sql/sql_lex.cc +++ b/sql/sql_lex.cc @@ -11786,7 +11786,7 @@ bool SELECT_LEX_UNIT::explainable() const EXPLAIN/ANALYZE unit, when: (1) if it's a subquery - it's not part of eliminated WHERE/ON clause. (2) if it's a CTE - it's not hanging (needed for execution) - (3) if it's a derived - it's not merged + (3) if it's a derived - it's not merged or eliminated if it's not 1/2/3 - it's some weird internal thing, ignore it */ return item ? @@ -11795,6 +11795,7 @@ bool SELECT_LEX_UNIT::explainable() const derived && derived->derived_result && !with_element->is_hanging_recursive(): // (2) derived ? - derived->is_materialized_derived() : // (3) + derived->is_materialized_derived() && // (3) + !derived->is_derived_eliminated() : false; } I think, there's no need for m_is_derived_eliminated. You can have this check in is_derived_eliminated():
this->derived->table->map & this->outer_select()->join->eliminated_tables Please try doing this.
diff --git a/mysql-test/main/table_elim.result b/mysql-test/main/table_elim.result index deff0623370..842330f3713 100644 --- a/mysql-test/main/table_elim.result +++ b/mysql-test/main/table_elim.result @@ -704,3 +704,252 @@ LIMIT 1; PostID Voted 1 NULL DROP TABLE t1,t2; +# +# MDEV-26278: Table elimination does not work across derived tables +# +create table t1 (a int, b int); +insert into t1 select seq, seq+10 from seq_1_to_10; +create table t11 ( +a int not null, +b int, +key(a) +); +insert into t11 select A.seq, A.seq+B.seq +from +seq_1_to_100 A, +seq_1_to_1000 B;
Do you really need 100K rows table? I think 10K rows will be just as good.
diff --git a/sql/opt_table_elimination.cc b/sql/opt_table_elimination.cc index 8c4720bdec4..fe2155191ca 100644 --- a/sql/opt_table_elimination.cc +++ b/sql/opt_table_elimination.cc ... @@ -278,6 +286,8 @@ class Dep_value_field : public Dep_value Dep_module_key *key_dep; /* Otherwise, this and advance */ uint equality_no; + + Dep_module_synthetic_key *synth_key_dep;
Please add a comment for the above that we should try that one, too. @@ -447,6 +463,55 @@ const size_t Dep_module::iterator_size=
MY_MAX(Dep_module_expr::iterator_size, Dep_module_key::iterator_size);
+/* A synthetic unique key module for a derived table. + For example, a derived table + SELECT t11.a, count(*) from t1 LEFT JOIN t2 ON t1.id = t2.fk GROUP BY t11.a + has unique values in its first field (t11.a) due to GROUP BY expression + so this can be considered as a unique key for this derived table +*/ + +class Dep_module_synthetic_key : public Dep_module +{ +public: + Dep_module_synthetic_key(Dep_value_table *table_arg, + std::vector<field_index_t>&& field_indexes) + : table(table_arg), derived_table_field_indexes(field_indexes) + { + unbound_args= static_cast<uint>(field_indexes.size()); + } + + Dep_value_table * table;
Coding style: "*table". ...
@@ -1603,7 +1674,73 @@ Dep_value_table *Dep_analysis_context::create_table_value(TABLE *table) key_list= &(key_dep->next_table_key); } } - return table_deps[table->tablenr]= tbl_dep; + + auto select_unit= table_list->get_unit(); + SELECT_LEX *first_select= nullptr; + if (select_unit)
Please move the code that creates the pseudo-key into a separate function. ...
+ +int Dep_analysis_context::find_field_in_list(List<Item> &fields_list, + Item *field)
Please add a one-line comment describing what function does.
@@ -1786,6 +1941,18 @@ Dep_module* Dep_value_field::get_next_unbound_module(Dep_analysis_context *dac, } else di->key_dep= NULL; + + Dep_module_synthetic_key *synth_key_dep= di->synth_key_dep; + if (synth_key_dep && !synth_key_dep->is_applicable() && + std::find(synth_key_dep->derived_table_field_indexes.begin(), + synth_key_dep->derived_table_field_indexes.end(), + field->field_index) != std::end(synth_key_dep->derived_table_field_indexes))
Please introduce a method like bool Dep_module_synthetic_key::covers_field(int field_index); Will this also allow to make Dep_module_synthetic_key::derived_table_field_indexes a private member? BR Sergei -- Sergei Petrunia, Software Developer MariaDB Corporation | Skype: sergefp | Blog: http://petrunia.net
Hi Oleg, Please find review input below. The most important is wrong-query-result observation at the bottom. A general comment: The term "Pseudo unique keys" is better than "synthetic keys", but I'm concerned about how it will be read. Does "pseudo" attach to the "unique" (pseudo-unique?) or to the "key"? This sounds ambiguous to me but let's check with native speakers. Does "Unique pseudo-key" sound better? At least it's unambigous...
commit 75278ac03681b2cd8e6f5098b536edfa2f34a549 Author: Oleg Smirnov <olernov@gmail.com> Date: Thu Feb 17 22:53:37 2022 +0700
MDEV-26278 Add functionality to eliminate derived tables from the query
Elimination of unnecessary tables from SQL queries is already present in MariaDB. But it only works for regular tables and not for derived ones.
Imagine we have a view: CREATE VIEW v2c AS SELECT t11.a AS a, max(t12.col1) AS b FROM t11 LEFT JOIN t12 ON t12.pk=t11.b GROUP BY t11.a
A question: is it important that the view has an outer join in it? Or any join? (I think not.) If not, please remove it. It's better when the example does not have any unneeded details.
Due to "GROUP BY t11.a" the values of "a" are unique, and this fact can be treated as like table "v2c" has a unique key on field "a".
Suppose we have a SQL query: SELECT t1.* FROM t1 LEFT JOIN v2c ON t1.a=v2c.a
Since "v2c.a" is unique, then: 1. "v2c" is functionally dependent on t1. This means every record of "t1" will be either joined with a single record from "v2c" or NULL-complemented. 2. No fields of "v2c" are present on the SELECT list
These two facts allow to completely exclude (eliminate) the derived table "v2c" from the query.
...
diff --git a/sql/sql_lex.h b/sql/sql_lex.h index 79d48528574..fb1422b7482 100644 --- a/sql/sql_lex.h +++ b/sql/sql_lex.h @@ -1026,7 +1026,7 @@ class st_select_lex_unit: public st_select_lex_node { int save_union_explain_part2(Explain_query *output); unit_common_op common_op();
- bool explainable() const; + bool explainable();
Please keep the function "const". Yes, this will require making is_derived_eliminated() and outer_select() const, as well, which seems easy to do.
diff --git a/sql/opt_table_elimination.cc b/sql/opt_table_elimination.cc index 8c4720bdec4..7f52d4cba0b 100644 --- a/sql/opt_table_elimination.cc +++ b/sql/opt_table_elimination.cc @@ -134,6 +136,11 @@ - Nodes representing unique keys. Unique key has = incoming edges from key component value modules = outgoing edge to key's table module + - Nodes representing pseudo unique keys for derived tables. + Pseudo unique keys are composed as a result of GROUP BY expressions. + Like normal unique keys, they have: + = incoming edges from key component value modules + = outgoing edge to key's table module - Inner side of outer join module. Outer join module has = incoming edges from table value modules = No outgoing edges. Once we reach it, we know we can eliminate the ... @@ -447,6 +465,60 @@ const size_t Dep_module::iterator_size= MY_MAX(Dep_module_expr::iterator_size, Dep_module_key::iterator_size);
+/* + A pseudo unique key module for a derived table. + For example, a derived table + SELECT t11.a, count(*) from t1 LEFT JOIN t2 ON t1.id = t2.fk GROUP BY t11.a + has unique values in its first field (t11.a) due to GROUP BY expression + so this can be considered as a unique key for this derived table +*/ + +class Dep_module_pseudo_key : public Dep_module +{ +public: + Dep_module_pseudo_key(Dep_value_table *table_arg, + std::vector<field_index_t>&& field_indexes) + : table(table_arg), derived_table_field_indexes(field_indexes) + { + unbound_args= static_cast<uint>(field_indexes.size()); + } + + Dep_value_table *table; + + Iterator init_unbound_values_iter(char *buf) override; + + Dep_value *get_next_unbound_value(Dep_analysis_context *dac, + Iterator iter) override; + + bool covers_field(int field_index); + + static const size_t iterator_size; + +private: + /* + Vector of field numbers (indexes) in the derived table's SELECT list + which are included in the GROUP BY expression. + For example, pseudo unique key for SQL + "SELECT count(t12.pk) as cnt, t11.a as a FROM t11 LEFT JOIN t12 + ON t12.pk=t11.b GROUP BY t11.a, t12.b" Same comment as above: does this example need outer join, or table t12 at all? If not, please remove it.
+ will include one element: {1}, since "t11.a" is in the + GROUP BY list and is present in the SELECT list with index 1 + (numeration starts from 0). Field "t12.b" is not included since + though it's present in the GROUP BY list but is missing in the SELECT list + */ + std::vector<field_index_t> derived_table_field_indexes; + + class Value_iter + { + public: + Dep_value_table *table; + }; +}; + +const size_t Dep_module_pseudo_key::iterator_size= + ALIGN_SIZE(sizeof(Dep_module_pseudo_key::Value_iter));
This doesn't seem to be used anywhere. Add it to the Dep_module::iterator_size please: const size_t Dep_module::iterator_size= MY_MAX(Dep_module_expr::iterator_size, Dep_module_key::iterator_size); ...
@@ -1577,33 +1654,127 @@ void add_module_expr(Dep_analysis_context *ctx, Dep_module_expr **eq_mod, +/* + Create pseudo unique key and add it as a dependency + if the query is not a UNION (ALL) and has GROUP BY expression +*/ + +void Dep_analysis_context::create_pseudo_unique_key_if_needed( + TABLE_LIST *table_list, Dep_value_table *tbl_dep) +{ + auto select_unit= table_list->get_unit(); + SELECT_LEX *first_select= nullptr; + if (select_unit) + { + first_select= select_unit->first_select(); + + /* + Exclude UNION (ALL) queries from consideration by checking + next_select() == nullptr + */ + if (unlikely(select_unit->first_select()->next_select())) + first_select= nullptr; + } + + /* + GROUP BY expression is considered as a pseudo unique key + for the derived table. Add this pseudo key as a dependency + */ + if (first_select && first_select->group_list.elements > 0) + { + /* + Make sure GROUP BY elements contain only fields + and no functions or other expressions + */ + bool valid= TRUE; + std::vector<field_index_t> exposed_fields_indexes; + for (auto cur_group= first_select->group_list.first; + cur_group; + cur_group= cur_group->next) + { + auto elem= *(cur_group->item); + if (elem->type() != Item::FIELD_ITEM) + { + valid= FALSE; + break; + } + auto field_idx= find_field_in_list(first_select->join->fields_list, elem); + if (field_idx != -1) + { + /* + The field is found in the SELECT list, so it is exposed + outside the derived table + */ + exposed_fields_indexes.push_back(field_idx); + } I don't see where this function checks that all GROUP BY elements are in the select list.
If one uses only a subset, for example "SELECT a FROM t ... GROUP BY a,b" then the value of "a" is not unique. Example: create table t2 (a int, b int, c int); insert into t2 select A.seq, B.seq, 123 from seq_1_to_3 A, seq_1_to_3 B; explain select t1.* from t1 left join (select a, count(*) as cnt from t2 group by a,b) T on T.a=t1.a; select t1.* from t1 left join (select a, count(*) as cnt from t2 group by a,b) T on T.a=t1.a; (6 rows) set optimizer_switch='table_elimination=off'; select t1.* from t1 left join (select a, count(*) as cnt from t2 group by a,b) T on T.a=t1.a; (2 rows) I'm surprised that testcases didn't hit this... BR Sergei -- Sergei Petrunia, Software Developer MariaDB Corporation | Skype: sergefp | Blog: http://petrunia.net
participants (1)
-
Sergey Petrunia