Review of MDEV-30073 Wrong result on 2nd execution of PS for query
with NOT EXISTS
12 Jun '24
By the way, nice to see that this patch closes so many different MDEVs!
Please add a list of all the MDEVs closed by this patch in the commit message!
MDEV-23828: query with EXISTS over subquery using mergeable derived
table / view executed in PS/SP for the second time
MDEV-30396: 2nd execution of query with EXISTS subquery over view / DT
MDEV-31175: 2nd execution of SELECT with mergeable view / DT
and aggregate function
MDEV-33081: 2-nd execution of query with forced materialization
of a mergeable view because of too many base tables
MDEV-31269: 2nd execution of query with HAVING containing IN predicand
over subquery with EXISTS in WHERE clause
MDEV-31281: 2nd execution of query with IN subquery whose left operand
is an aggregate function
MDEV-31178: 2nd execution of query with conversion from
IN predicate to IN subquery applied
> index 446154e517b..d474a39e228 100644
> --- a/mysql-test/main/func_group.test
> +++ b/mysql-test/main/func_group.test
> @@ -1268,12 +1268,14 @@ set @@optimizer_switch='materialization=on,in_to_exists=off,semijoin=off';
> --echo # 1) Test that subquery materialization is setup for query with
> --echo # premature optimize() exit due to "Impossible WHERE"
> --echo #
> +--disable_ps2_protocol
> SELECT MIN(t2.pk)
> FROM t2 JOIN t1 ON t1.pk=t2.pk
> WHERE 'j'
> HAVING ('m') IN (
> SELECT v
> FROM t2);
> +--enable_ps2_protocol
Why this change?
Does this produce wrong results after your patch?
At least I cannot see anything in the query that would disallow double
execution.
If double execution does not work anymore for this one, please add a
note about this in
the commit message or in the test case.
> diff --git a/mysql-test/main/mysqltest_ps.test b/mysql-test/main/mysqltest_ps.test
> index c5a332f691f..853ce8bd38e 100644
> --- a/mysql-test/main/mysqltest_ps.test
> +++ b/mysql-test/main/mysqltest_ps.test
> @@ -32,3 +32,4 @@ select 1 + "2 a";
> create table t (a int primary key, b blob default '');
> select a, (2*a) AS a from t group by a;
> drop table t;
> +
>
Remove the change to the above. No need to touch this test case in this commit.
> diff --git a/mysql-test/main/order_by.test b/mysql-test/main/order_by.test
> index c1e0c318283..7ca9a8f8def 100644
> --- a/mysql-test/main/order_by.test
> +++ b/mysql-test/main/order_by.test
> @@ -90,7 +90,9 @@ drop table t1;
> create table t1 (i int);
> insert into t1 values(1),(2),(1),(2),(1),(2),(3);
> select distinct i from t1;
> +--disable_ps2_protocol
> select distinct i from t1 order by rand(5);
> +--enable_ps2_protocol
>
Why disable ps2_protocol here?
> diff --git a/mysql-test/main/type_date.test b/mysql-test/main/type_date.test
> index 3566e427d50..5f17a893d53 100644
> --- a/mysql-test/main/type_date.test
> +++ b/mysql-test/main/type_date.test
> @@ -601,7 +601,7 @@ SHOW CREATE TABLE t2;
> DROP TABLE t2;
> DROP VIEW v1;
> DROP TABLE t1;
> -
> +# --enable_ps2_protocol
The above looks wrong. Probably leftover from some testing.
Please revert the change to this file.
> diff --git a/mysql-test/main/type_newdecimal.test b/mysql-test/main/type_newdecimal.test
> index 366199e8aa8..7223b7b44f0 100644
> --- a/mysql-test/main/type_newdecimal.test
> +++ b/mysql-test/main/type_newdecimal.test
> @@ -1929,7 +1929,9 @@ drop table t1;
> SELECT AVG(DISTINCT 0.00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001);
> --enable_view_protocol
> CREATE TABLE t1 AS SELECT NULL AS v1;
> +--disable_ps2_protocol
> SELECT 1 FROM t1 GROUP BY v1 ORDER BY AVG ( from_unixtime ( '' ) ) ;
> +--enable_ps2_protocol
> DROP TABLE t1;
Why the change?
> diff --git a/mysql-test/main/win.test b/mysql-test/main/win.test
> index cba6b0fd7af..00c79425fc0 100644
> --- a/mysql-test/main/win.test
> +++ b/mysql-test/main/win.test
> @@ -2770,6 +2770,7 @@ drop table t1;
> CREATE TABLE t1 ( a char(25), b text);
> INSERT INTO t1 VALUES ('foo','bar');
>
> +--disable_ps2_protocol
> SELECT
> SUM(b) OVER (PARTITION BY a),
> ROW_NUMBER() OVER (PARTITION BY b)
> @@ -2777,6 +2778,7 @@ FROM t1
> GROUP BY
> LEFT((SYSDATE()), 'foo')
> WITH ROLLUP;
> +--enable_ps2_protocol
> drop table t1;
Why?
> diff --git a/mysql-test/suite/federated/federatedx_create_handlers.test b/mysql-test/suite/federated/federatedx_create_handlers.test
> index 8e504590282..b2884397333 100644
> --- a/mysql-test/suite/federated/federatedx_create_handlers.test
> +++ b/mysql-test/suite/federated/federatedx_create_handlers.test
> @@ -94,6 +94,7 @@ DEFAULT CHARSET=latin1;
> INSERT INTO federated.t3 VALUES
> ('yyy'), ('www'), ('yyy'), ('xxx'), ('www'), ('yyy'), ('www');
>
> +--sorted_result
> SELECT *
> FROM federated.t3, (SELECT * FROM federated.t1 WHERE id > 3) t
> WHERE federated.t3.name=t.name;
>
I assume this has nothing to do with the patch, but is just a random
failure you fixed?
If that is the case, please add a comment about this file change in
the commit message or move this change to another commit.
> --- a/sql/item.cc
> +++ b/sql/item.cc
> @@ -3027,7 +3027,11 @@ Item* Item_ref::build_clone(THD *thd)
> unlikely(!(copy->ref= (Item**) alloc_root(thd->mem_root,
> sizeof(Item*)))) ||
> unlikely(!(*copy->ref= (* ref)->build_clone(thd))))
> + {
> + if (copy)
> + copy->ref= 0;
> return 0;
> + }
> return copy;
> }
Ok code, but I would have preferred this a bit simpler approach:
Item* Item_ref::build_clone(THD *thd)
{
  Item_ref *copy= (Item_ref *) get_copy(thd);
  if (likely(copy))
  {
    if (unlikely(!(copy->ref= (Item**) alloc_root(thd->mem_root,
                                                  sizeof(Item*)))) ||
        unlikely(!(*copy->ref= (* ref)->build_clone(thd))))
    {
      copy->ref= 0;
      return 0;
    }
  }
  return copy;
}
> @@ -5823,22 +5827,48 @@ Item_field::fix_outer_field(THD *thd, Field **from_field, Item **reference)
> select->group_list.elements &&
> (place == SELECT_LIST || place == IN_HAVING))
> {
> - Item_outer_ref *rf;
> /*
> - If an outer field is resolved in a grouping select then it
> - is replaced for an Item_outer_ref object. Otherwise an
> - Item_field object is used.
> - The new Item_outer_ref object is saved in the inner_refs_list of
> - the outer select. Here it is only created. It can be fixed only
> - after the original field has been fixed and this is done in the
> - fix_inner_refs() function.
> - */
> - ;
> - if (!(rf= new (thd->mem_root) Item_outer_ref(thd, context, this)))
> - return -1;
> - thd->change_item_tree(reference, rf);
> - select->inner_refs_list.push_back(rf, thd->mem_root);
> - rf->in_sum_func= thd->lex->in_sum_func;
> + This Item_field represents a reference to a column in some
> + outer select. Wrap this Item_field item into an Item_outer_ref
> + object if we are at the first execution of the query or at
> + processing the prepare command for it. Make this wrapping
> + permanent for the first query execution so that it could be used
> + for next executions of the query.
> + Add the wrapper item to the list st_select_lex::inner_refs_list
> + of the select against which this Item_field has been resolved.
> + */
> + bool is_first_execution= thd->is_first_query_execution();
> + if (is_first_execution ||
> + thd->stmt_arena->is_stmt_prepare())
The above is the same as:
((stmt_arena->is_conventional() ||
  stmt_arena->state != STMT_EXECUTED) &&
 !stmt_arena->is_stmt_prepare()) ||
stmt_arena->is_stmt_prepare()
=>
((stmt_arena->state == STMT_CONVENTIONAL_EXECUTION ||
  stmt_arena->state != STMT_EXECUTED) &&
 stmt_arena->state != STMT_INITIALIZED) ||
stmt_arena->state == STMT_INITIALIZED
=>
(stmt_arena->state != Query_arena::STMT_EXECUTED &&
 stmt_arena->state != Query_arena::STMT_INITIALIZED) ||
stmt_arena->state == Query_arena::STMT_INITIALIZED
=>
stmt_arena->state != Query_arena::STMT_EXECUTED
Is it not better to use the above as it is shorter and easier to
understand, as there are fewer hidden abstractions?
Is STMT_EXECUTED safe to use?
We don't set state to STMT_EXECUTED in case of an error during execute.
If the error is temporary (depending on data, or out of space during
execution) then state will be left in STMT_INITIALIZED. Don't we have a
problem in this case for the next execution?
In 11.5 we can trivially simulate an error in almost any query by
setting the tmp disk quota to a very low value.
Could be fixed by adding a new state STMT_OPTIMIZED that tells us that
all optimization rewrites have been done; then we could check for
STMT_OPTIMIZED || STMT_EXECUTED, or force a re-prepare if the previous
execution returned an error (maybe we can set state to STMT_ERROR in
this case)?
Another option is to force a new prepare in case the last query ended
with an error.
> + /*
> + If an outer field is resolved in a grouping select then it
> + is replaced for an Item_outer_ref object. Otherwise an
> + Item_field object is used.
> + The new Item_outer_ref object is saved in the inner_refs_list of
> + the outer select. Here it is only created. It can be fixed only
> + after the original field has been fixed and this is done in the
> + fix_inner_refs() function.
> + */
> + if (!(rf= new (thd->mem_root) Item_outer_ref(thd, context, this)))
> + rc= -1;
> + select->inner_refs_list.push_back(rf, thd->mem_root);
> + if (is_first_execution)
> + *reference= rf;
> + else
> + thd->change_item_tree(reference, rf);
> + if (arena)
> + thd->restore_active_arena(arena, &backup);
> + if (rc)
> + return rc;
> + rf->in_sum_func= thd->lex->in_sum_func;
> + }
> }
I don't like that we use 'rf' when we know it is 0. We can fix that
with the following code change that also removes the 'rc' variable
(shorter):
{
  Query_arena *arena= 0, backup;
  Item_outer_ref *rf;
  bool is_first_execution;
  if ((is_first_execution= thd->is_first_query_execution()) &&
      !thd->is_conventional())
    arena= thd->activate_stmt_arena_if_needed(&backup);
  /*
    If an outer field is resolved in a grouping select then it
    is replaced for an Item_outer_ref object. Otherwise an
    Item_field object is used.
    The new Item_outer_ref object is saved in the inner_refs_list of
    the outer select. Here it is only created. It can be fixed only
    after the original field has been fixed and this is done in the
    fix_inner_refs() function.
  */
  if ((rf= new (thd->mem_root) Item_outer_ref(thd, context, this)))
  {
    select->inner_refs_list.push_back(rf, thd->mem_root);
    if (is_first_execution)
      *reference= rf;
    else
      thd->change_item_tree(reference, rf);
  }
  if (arena)
    thd->restore_active_arena(arena, &backup);
  if (!rf)                                      // Memory allocation failed
    return -1;
  rf->in_sum_func= thd->lex->in_sum_func;
}
Now the code has exactly the same number of jump instructions as your
code, but it is shorter, avoids using 'rf' when it is 0, and does not need
'rc'.
Note that compared to the original code, we are creating an arena for the
states:
STMT_INITIALIZED, STMT_INITIALIZED_FOR_SP= 1, STMT_PREPARED= 2, STMT_ERROR= -1
This is probably ok, just wanted to ensure you are aware of this.
Would it be enough/safer to just test for STMT_PREPARED to see if an
arena should be used? (Just wondering)
> /*
> A reference is resolved to a nest level that's outer or the same as
> @@ -5938,7 +5968,7 @@ Item_field::fix_outer_field(THD *thd, Field **from_field, Item **reference)
> else if (ref != not_found_item)
> {
> Item *save;
> - Item_ref *rf;
> + Item_ref *rf= NULL;;
Remove the extra ';'.
> /* Should have been checked in resolve_ref_in_select_and_group(). */
> DBUG_ASSERT(*ref && (*ref)->is_fixed());
> @@ -5950,35 +5980,64 @@ Item_field::fix_outer_field(THD *thd, Field **from_field, Item **reference)
<cut>
> + This Item_field represents a reference to a column in some outer select.
> + Wrap this Item_field item into an object of the Item_ref class or a class
> + derived from Item_ref we are at the first execution of the query or at
> + processing the prepare command for it. Make this wrapping permanent for
> + the first query execution so that it could be used for next executions
> + of the query. The type of the wrapper depends on the place where this
> + item field occurs or on type of the query. If it is in the the having clause
> + we use a wrapper of the Item_ref type. Otherwise we use a wrapper either
> + of Item_direct_ref type or of Item_outer_ref. The latter is used for
> + grouping queries. Such a wrapper is always added to the list inner_refs_list
> + for the select against which the outer reference has been resolved.
> */
> + bool is_first_execution= thd->is_first_query_execution();
> + if (is_first_execution ||
> + thd->stmt_arena->is_stmt_prepare())
> + {
Replace with
if (thd->stmt_arena->state != Query_arena::STMT_EXECUTED)
{
bool is_first_execution= thd->is_first_query_execution();
> + int rc= 0;
> + Query_arena *arena, backup;
> + arena= 0;
> + if (is_first_execution)
> + arena= thd->activate_stmt_arena_if_needed(&backup);
> + rf= (place == IN_HAVING ?
> + new (thd->mem_root)
> + Item_ref(thd, context, ref, table_name,
> + field_name, alias_name_used) :
> + (!select->group_list.elements ?
> + new (thd->mem_root)
> + Item_direct_ref(thd, context, ref, table_name,
> + field_name, alias_name_used) :
> + new (thd->mem_root)
> + Item_outer_ref(thd, context, ref, table_name,
> + field_name, alias_name_used)));
> + *ref= save;
> + if (!rf)
> + rc= -1;
> + if (!rc && place != IN_HAVING && select->group_list.elements)
> + {
> + outer_context->select_lex->inner_refs_list.push_back(
> + (Item_outer_ref*)rf, thd->mem_root);
> + ((Item_outer_ref*)rf)->in_sum_func= thd->lex->in_sum_func;
> + }
> + if (is_first_execution)
> + *reference= rf;
> + else
> + thd->change_item_tree(reference, rf);
> + if (arena)
> + thd->restore_active_arena(arena, &backup);
> + if (rc)
> + return rc;
Please do the same change as in the previous case, using an 'if' instead
of setting rc.
> We can not "move" aggregate function in the place where
> @@ -6003,22 +6062,46 @@ Item_field::fix_outer_field(THD *thd, Field **from_field, Item **reference)
> this, (Item_ident*)*reference, false);
> if (last_checked_context->select_lex->having_fix_field)
> {
> + This Item_field represents a reference to a column in having clause
> + of some outer select. Wrap this Item_field item into an Item_ref object
> + if we are at the first execution of the query or at processing the
> + prepare command for it. Make this wrapping permanent for the first query
> + execution so that it could be used for next executions of the query.
> */
> - DBUG_ASSERT(!rf->fixed); // Assured by Item_ref()
> - if (rf->fix_fields(thd, reference) || rf->check_cols(1))
> - return -1;
> + Item_ref *rf;
> + bool is_first_execution= thd->is_first_query_execution();
> + if (is_first_execution ||
> + thd->stmt_arena->is_stmt_prepare())
Fix test, as in other cases.
> + {
> + int rc= 0;
> + Query_arena *arena, backup;
> + arena= 0;
> + if (is_first_execution)
> + arena= thd->activate_stmt_arena_if_needed(&backup);
> + rf= new (thd->mem_root) Item_ref(thd, context,
> + (*from_field)->table->s->db,
> + Lex_cstring_strlen((*from_field)->
> + table->alias.c_ptr()),
> + field_name);
> + if (!rf)
> + rc= -1;
> + if (is_first_execution)
> + *reference= rf;
> + else
> + thd->change_item_tree(reference, rf);
> + if (arena)
> + thd->restore_active_arena(arena, &backup);
> + if (rc)
> + return rc;
> + /*
> + rf is Item_ref => never substitute other items (in this case)
> + during fix_fields() => we can use rf after fix_fields()
> + */
> + DBUG_ASSERT(!rf->fixed); // Assured by Item_ref()
> + if (rf->fix_fields(thd, reference) || rf->check_cols(1))
> + return -1;
> + }
Same here.
Change
if (!rf)
rc= -1;
...
if (arena)
thd->restore_active_arena(arena, &backup);
to
if (rc)
{
....
}
if (arena)
thd->restore_active_arena(arena, &backup);
> @@ -8321,6 +8404,9 @@ bool Item_ref::fix_fields(THD *thd, Item **reference)
> goto error;
> }
> + if ((*ref)->fix_fields_if_needed(thd, reference))
> + goto error;
> +
> set_properties();
Add a comment about when this is needed (i.e., when reference != 0 and it
is not fixed).
I tried to come up with a simple comment but failed :(
I assume this is at least true for the prepare phase of a prepared statement.
>
> if ((*ref)->check_cols(1))
> @@ -9282,8 +9368,21 @@ bool Item_direct_view_ref::fix_fields(THD *thd, Item **reference)
> bitmap_set_bit(fld->table->read_set, fld->field_index);
> }
> }
> - else if ((*ref)->fix_fields_if_needed(thd, ref))
> - return TRUE;
> + else
> + {
> + if (thd->is_noninitial_query_execution())
Replace
(thd->is_noninitial_query_execution())
with
thd->is_stmt_executed()
See the later comments at the definition of is_noninitial_query_execution().
> + {
> + bool rc= false;
> + bool save_wrapper= thd->lex->current_select->no_wrap_view_item;
> + thd->lex->current_select->no_wrap_view_item= TRUE;
> + rc= (*ref)->fix_fields_if_needed(thd, ref);
> + thd->lex->current_select->no_wrap_view_item= save_wrapper;
> + if (rc)
> + return TRUE;
> + }
> + else if ((*ref)->fix_fields_if_needed(thd, ref))
> + return TRUE;
> + }
Do we need no_wrap_view_item anymore (we probably do, but it is not obvious)?
The original code that introduced no_wrap_view_item used it to test
things in sql_base.cc, but that code does not exist anymore.
The only usage we have of it now is in item.cc:
if (!thd->lex->current_select->no_wrap_view_item &&
    thd->lex->in_sum_func &&
    thd->lex->in_sum_func->nest_level == select->nest_level)
  set_if_bigger(thd->lex->in_sum_func->max_arg_level,
                select->nest_level);
There is no code comment about what the above code is supposed to do.
Could you PLEASE add a comment about the purpose of the above code,
especially why the nest_level is not applicable to no_wrap_view_items?
I have verified that we need
!thd->lex->current_select->no_wrap_view_item here. We
get crashes in main.derived_view if we remove the test.
> diff --git a/sql/item_cmpfunc.cc b/sql/item_cmpfunc.cc
> index f8959cfd2bf..cc9c81608ca 100644
> --- a/sql/item_cmpfunc.cc
> +++ b/sql/item_cmpfunc.cc
> @@ -4588,7 +4588,9 @@ bool Item_func_in::value_list_convert_const_to_int(THD *thd)
> */
> if (arg[0]->type() != Item::NULL_ITEM &&
> !convert_const_to_int(thd, field_item, &arg[0]))
> - all_converted= false;
> + all_converted= false;
> + else
> + block_transform_into_subq= true;
> }
if (all_converted)
  m_comparator.set_handler(&type_handler_slonglong);
The above looks strange.
You are setting block_transform_into_subq to true in the case of
IN (1,1)
but not setting it for ("abb",NULL).
I assume the idea is that if convert_const_to_int() changes
anything, then we block conversion of the IN to subqueries, as our code
cannot convert back to use the original code in this case
(as the translation to use subqueries is permanent)?
Here is a test case that sets block_transform_into_subq to true:
create table t1 (a bigint);
insert into t1 select seq from seq_1_to_100;
select benchmark(1,1);
select * from t1 where a in ('1','2','3');
drop table t1;
Is there not a better way to do this?
- Make all translations of strings to integers permanent in the
statement arena.
- Keep the derived table around for the next execution (which would be a
nice speedup).
If nothing can be done to improve things, we should at least move
setting block_transform_into_subq to the end of the function and fix that we
only block the transformation for prepared statements:
> if (all_converted)
>   m_comparator.set_handler(&type_handler_slonglong);
->
if (all_converted)
{
  m_comparator.set_handler(&type_handler_slonglong);
  if (thd->is_first_query_execution())
    block_transform_into_subq= 1;
}
----
> diff --git a/sql/item_cmpfunc.h b/sql/item_cmpfunc.h
> index 465625f69bf..1404fa30d6b 100644
> --- a/sql/item_cmpfunc.h
> +++ b/sql/item_cmpfunc.h
> @@ -2401,6 +2401,7 @@ class Item_func_in :public Item_func_opt_neg,
> {
> return check_argument_types_like_args0();
> }
> + bool block_transform_into_subq;
Please add a comment about the purpose of this variable, when it
should be used and how it is used.
As far as I can see, this is set in the case where an IN() requires an
item conversion, in which case we have to block it for prepared
statements as we cannot roll it back.
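Something along these lines would already help (the wording is only a
suggestion, based on my reading above):
/*
  TRUE if convert_const_to_int() had to rewrite some argument of the
  IN list. The rewrite is permanent (it cannot be rolled back between
  executions of a prepared statement), so in this case the
  IN-to-subquery transformation must be blocked.
*/
bool block_transform_into_subq;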
> diff --git a/sql/item_subselect.cc b/sql/item_subselect.cc
> index ff84e6b7bf9..4dcd1f3ae9d 100644
> --- a/sql/item_subselect.cc
> +++ b/sql/item_subselect.cc
> @@ -104,6 +104,16 @@ void Item_subselect::init(st_select_lex *select_lex,
> unit->item= this;
> engine->change_result(this, result, FALSE);
> }
> + else if ((unit->item->substype() == ALL_SUBS ||
> + unit->item->substype() == ANY_SUBS) &&
> + (((Item_in_subselect*) unit->item)->
> + test_set_strategy(SUBS_MAXMIN_INJECTED) ||
> + ((Item_in_subselect*) unit->item)->
> + test_set_strategy(SUBS_MAXMIN_ENGINE)))
> + {
> + unit->item= this;
> + engine->change_result(this, result, FALSE);
> + }
> else
> {
> /*
Better to combine the test with the previous test, as the code inside
the {} is identical?
Note that I don't understand the above code 100%, so adding a comment
about the purpose of this code would be great.
Also change the function 'test_set_strategy()' to
'test_chosen_strategy(SUBS_MAXMIN_INJECTED | SUBS_MAXMIN_ENGINE)'.
This will make the code shorter and easier to read.
Later in the review there is a new definition for test_chosen_strategy().
This is very similar to test_strategy(), but also ensures that the
strategy is chosen.
> @@ -200,15 +211,6 @@ void Item_in_subselect::cleanup()
>
> void Item_allany_subselect::cleanup()
> {
> - /*
> - The MAX/MIN transformation through injection is reverted through the
> - change_item_tree() mechanism. Revert the select_lex object of the
> - query to its initial state.
> - */
> - for (SELECT_LEX *sl= unit->first_select();
> - sl; sl= sl->next_select())
> - if (test_set_strategy(SUBS_MAXMIN_INJECTED))
> - sl->with_sum_func= false;
> Item_in_subselect::cleanup();
> }
I assume the above is now removed because the MAX/MIN optimization changes
are now permanent in Item_allany_subselect::transform_into_max_min(JOIN *join).
Please add an explanation for the above change to the commit message
and also add a note that thanks to this change we don't need the
cleanup for the MAX/MIN optimization.
> @@ -497,6 +505,9 @@ void Item_subselect::recalc_used_tables(st_select_lex *new_parent,
> Ref_to_outside *upper;
> DBUG_ENTER("recalc_used_tables");
>
> + if (!unit->thd->is_first_query_execution())
> + DBUG_VOID_RETURN;
With prepared statements, the optimizer plan can change for each query.
How come we don't need to recalc_used_tables() just because we have
done one execution before?
> @@ -2143,11 +2155,11 @@ bool Item_allany_subselect::transform_into_max_min(JOIN *join)
> }
> if (upper_item)
> upper_item->set_sum_test(item);
> - thd->change_item_tree(&select_lex->ref_pointer_array[0], item);
> + select_lex->ref_pointer_array[0]= item;
> {
> List_iterator<Item> it(select_lex->item_list);
> it++;
> - thd->change_item_tree(it.ref(), item);
> + it.replace(item);
> }
Looks like we are now doing the MAX/MIN transformation permanently.
> +uint st_select_lex_unit::ub_eq_for_exists2_in()
> +{
> + if (derived && derived->is_materialized_derived())
> + return 0;
> + st_select_lex *sl= first_select();
> + if (sl->next_select())
> + return 0;
> + uint n= sl->select_n_eq;
> + for (st_select_lex_unit *unit= sl->first_inner_unit();
> + unit;
> + unit= unit->next_unit())
> + {
> + n+= unit->ub_eq_for_exists2_in();
> + }
> + return n;
> +}
Please use a clearer function name; 'ub' is not clear.
Please also add a function comment to explain the purpose of this function,
as I do not understand the purpose of it.
The function is used here:
bool Item_exists_subselect::fix_fields(THD *thd, Item **ref)
{
  DBUG_ENTER("Item_exists_subselect::fix_fields");
  st_select_lex *sel= unit->first_select();
  sel->select_n_reserved= unit->ub_eq_for_exists2_in();
But its purpose is not that clear, nor what it is trying to calculate.
For example, if you have a query:
SELECT (a=1) + (b=2) + (c=3)
select_n_eq will be 3.
In case of
SELECT (a = 1 or b = 1) or (a=2 or b=4)
select_n_eq will be 4.
I don't see how either value can be of any use for the optimizer.
Why not instead calculate the number of = operations in the top-level
Item_cond_and()? You could have a counter in Item_cond_and of
how many '=' comparisons there are inside it.
This would give you an exact number of = that may be needed, compared
to the current code.
Note that find_inner_outer_equalities() is only appending items from the
top-level AND, or alternatively 0 or 1 elements for other items.
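A rough sketch of the counting idea (hypothetical code, assuming the
usual Item_cond interface; the real patch may want to maintain the
counter incrementally instead):
static uint count_top_level_eq(Item *cond)
{
  if (cond->type() == Item::COND_ITEM &&
      ((Item_cond*) cond)->functype() == Item_func::COND_AND_FUNC)
  {
    uint n= 0;
    List_iterator_fast<Item> it(*((Item_cond*) cond)->argument_list());
    while (Item *item= it++)
      n+= (item->type() == Item::FUNC_ITEM &&
           ((Item_func*) item)->functype() == Item_func::EQ_FUNC);
    return n;
  }
  return (cond->type() == Item::FUNC_ITEM &&
          ((Item_func*) cond)->functype() == Item_func::EQ_FUNC);
}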
> +bool Item_singlerow_subselect::test_set_strategy(uchar strategy_arg)
> +{
> + DBUG_ASSERT(strategy_arg == SUBS_MAXMIN_INJECTED ||
> + strategy_arg == SUBS_MAXMIN_ENGINE);
> + return ((strategy & SUBS_STRATEGY_CHOSEN) &&
> + (strategy & ~SUBS_STRATEGY_CHOSEN) == strategy_arg);
> +}
Change to the following so that we can check all bits at once, like
in test_strategy():
bool Item_singlerow_subselect::test_chosen_strategy(uint strategy_arg)
{
  DBUG_ASSERT(strategy_arg &&
              !(strategy_arg & ~(SUBS_MAXMIN_INJECTED | SUBS_MAXMIN_ENGINE)));
  return ((strategy & SUBS_STRATEGY_CHOSEN) &&
          (strategy & strategy_arg));
}
> diff --git a/sql/item_subselect.h b/sql/item_subselect.h
> index d3572e1abe7..9e318e74bab 100644
> --- a/sql/item_subselect.h
> +++ b/sql/item_subselect.h
> @@ -302,6 +302,7 @@ class Item_singlerow_subselect :public Item_subselect
> {
> protected:
> Item_cache *value, **row;
> + uchar strategy;
Change to uint16 to ensure that we can add more bits later without having
to change a lot of code (thanks to alignment, the storage required
does not change if we use uint16).
> diff --git a/sql/item_sum.cc b/sql/item_sum.cc
> index 4cf403c1618..e1a32185ad1 100644
> --- a/sql/item_sum.cc
> +++ b/sql/item_sum.cc
> @@ -93,10 +93,15 @@ bool Item_sum::init_sum_func_check(THD *thd)
> thd->lex->in_sum_func= this;
> nest_level= thd->lex->current_select->nest_level;
> ref_by= 0;
> + if (!thd->is_noninitial_query_execution() ||
> + (thd->active_stmt_arena_to_use()->state ==
> + Query_arena::STMT_SP_QUERY_ARGUMENTS))
> + {
> aggr_level= -1;
> aggr_sel= NULL;
> max_arg_level= -1;
> max_sum_func_level= -1;
> + }
Please fix the indentation.
Please add a comment for the above test. Here is a suggestion:
/*
  Item_sum objects are persistent and do not have to be updated
  for pre-executed prepared statements (stmt_arena->state == STMT_EXECUTED)
  except when setting variables for stored procedures
  (state == Query_arena::STMT_SP_QUERY_ARGUMENTS).
*/
> outer_fields.empty();
> return FALSE;
> }
> @@ -409,7 +414,7 @@ bool Item_sum::register_sum_func(THD *thd, Item **ref)
> sl= sl->master_unit()->outer_select() )
> sl->master_unit()->item->get_with_sum_func_cache()->set_with_sum_func();
> }
> - if (aggr_sel)
> + if (aggr_sel && aggr_sel != thd->lex->current_select )
> thd->lex->current_select->mark_as_dependent(thd, aggr_sel, NULL);
Remove the extra space before && and before ).
Please add a comment about the circumstances under which this is true:
aggr_sel != thd->lex->current_select
I would like to understand why this was not needed before.
> +++ b/sql/opt_subselect.cc
> @@ -962,8 +962,13 @@ bool JOIN::transform_max_min_subquery()
> if (!subselect || (subselect->substype() != Item_subselect::ALL_SUBS &&
> subselect->substype() != Item_subselect::ANY_SUBS))
> DBUG_RETURN(0);
> - DBUG_RETURN(((Item_allany_subselect *) subselect)->
> - transform_into_max_min(this));
> + bool rc= false;
> + Query_arena *arena, backup;
> + arena= thd->activate_stmt_arena_if_needed(&backup);
> + rc= (((Item_allany_subselect *) subselect)->transform_into_max_min(this));
> + if (arena)
> + thd->restore_active_arena(arena, &backup);
> + DBUG_RETURN(rc);
> }
Please update the comment for this function to make it clear that the changes
to use min/max are now permanent.
>
> @@ -1889,7 +1894,7 @@ static bool convert_subq_to_sj(JOIN *parent_join, Item_in_subselect *subq_pred)
> with thd->change_item_tree
> */
> Item_func_eq *item_eq=
> - new (thd->mem_root) Item_func_eq(thd, left_exp_orig,
> + new (thd->mem_root) Item_func_eq(thd, left_exp,
> subq_lex->ref_pointer_array[0]);
> if (!item_eq)
> goto restore_tl_and_exit;
I assume this was a bug in the old code?
I checked the history and this line has changed back and forth between
left_exp_orig and left_exp over time.
The documentation for left_expr_orig is:
/*
  Important for PS/SP: left_expr_orig is the item that left_expr originally
  pointed at. That item is allocated on the statement arena, while
  left_expr could later be changed to something on the execution arena.
*/
Item *left_expr_orig;
As an experiment I added left_exp_orig back and ran the tests that you
had changed in your commit with valgrind. All tests passed.
Do you have any test that shows that left_exp is the right value here, or do
you have any other proof why left_exp is right?
> @@ -6535,6 +6540,10 @@ bool JOIN::choose_subquery_plan(table_map join_tables)
> if (is_in_subquery())
> {
> in_subs= unit->item->get_IN_subquery();
> + if (in_subs->test_set_strategy(SUBS_MAXMIN_INJECTED))
> + return false;
> + if (in_subs->test_set_strategy(SUBS_MAXMIN_ENGINE))
> + return false;
> if (in_subs->create_in_to_exists_cond(this))
> return true;
You can replace the above two changed lines with:
if (in_subs->test_chosen_strategy(SUBS_MAXMIN_INJECTED | SUBS_MAXMIN_ENGINE))
> diff --git a/sql/sql_array.h b/sql/sql_array.h
> index b6de1b18d78..b9fdef0623c 100644
> --- a/sql/sql_array.h
> +++ b/sql/sql_array.h
> @@ -39,7 +39,10 @@ template <typename Element_type> class Bounds_checked_array
>
> Bounds_checked_array(Element_type *el, size_t size_arg)
> : m_array(el), m_size(size_arg)
> - {}
> + {
> + if (!m_size)
> + m_array= NULL;
> + }
Strange that one would call it with el != 0 and m_size == 0.
Sounds like a bug in the caller. I would change the above to be a DBUG_ASSERT
instead.
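I.e. something like this (sketch):
Bounds_checked_array(Element_type *el, size_t size_arg)
  : m_array(el), m_size(size_arg)
{
  /* A caller passing el != 0 together with size == 0 is a bug */
  DBUG_ASSERT(m_size || !m_array);
}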
> void reset() { m_array= NULL; m_size= 0; }
>
> diff --git a/sql/sql_base.cc b/sql/sql_base.cc
> index f1fd0053a82..a6cdd218c44 100644
> --- a/sql/sql_base.cc
> +++ b/sql/sql_base.cc
> @@ -5783,10 +5783,7 @@ find_field_in_view(THD *thd, TABLE_LIST *table_list,
> {
> if (!my_strcasecmp(system_charset_info, field_it.name()->str, name))
> {
> - // in PS use own arena or data will be freed after prepare
> - if (register_tree_change &&
> - thd->stmt_arena->is_stmt_prepare_or_first_stmt_execute())
> - arena= thd->activate_stmt_arena_if_needed(&backup);
> + arena= thd->activate_stmt_arena_if_needed(&backup);
Why this change?
Will it not use the statement arena for a lot of cases where it did not
before, like for state == STMT_EXECUTED?
- This was discussed on slack, but we did not have time to reach a conclusion.
> @@ -5805,11 +5802,23 @@ find_field_in_view(THD *thd, TABLE_LIST *table_list,
> the replacing item.
> */
> if (*ref && !(*ref)->is_autogenerated_name())
> + {
> + arena=0;
> + if (thd->is_first_query_execution())
> + arena= thd->activate_stmt_arena_if_needed(&backup);
> item->set_name(thd, (*ref)->name);
> + if (arena)
> + thd->restore_active_arena(arena, &backup);
Item::set_name() never allocates any memory, so we don't need the arena here.
(I checked all the code on the sql level.)
> + }
> + if (item != *ref)
> + {
> + if (register_tree_change &&
> + !thd->is_first_query_execution())
The above is the same as:
register_tree_change &&
!(stmt_arena->state != STMT_INITIALIZED &&
  stmt_arena->state != Query_arena::STMT_EXECUTED)
=>
register_tree_change &&
(stmt_arena->state == STMT_INITIALIZED ||
 stmt_arena->state == STMT_EXECUTED)
Is the above correct and the intention of this code?
> + thd->change_item_tree(ref, item);
> + else
> + *ref= item;
> + }
You can remove the else before *ref. It is faster to update *ref than to
handle the jump.
I am still a bit concerned that we set *ref= item when
register_tree_change != 0, as our code comments say that if
register_tree_change is set, then *ref will be removed before the next
execution and any attempt to access it can result in a core dump.
> @@ -7639,10 +7648,14 @@ bool setup_fields(THD *thd, Ref_ptr_array ref_pointer_array,
> TODO: remove it when (if) we made one list for allfields and
> ref_pointer_array
> */
> - if (!ref_pointer_array.is_null())
> + if (thd->is_first_query_execution() ||
> + thd->stmt_arena->is_stmt_prepare())
The new test is the same as (see the earlier examples):
thd->stmt_arena->state != Query_arena::STMT_EXECUTED
Please replace the test with the above, or replace it with is_stmt_executed().
<cut>
> /*
> @@ -7682,17 +7695,6 @@ bool setup_fields(THD *thd, Ref_ptr_array ref_pointer_array,
> ref[0]= item;
> ref.pop_front();
> }
> - /*
> - split_sum_func() must be called for Window Function items, see
> - Item_window_func::split_sum_func.
> - */
> - if (sum_func_list &&
> - ((item->with_sum_func() && item->type() != Item::SUM_FUNC_ITEM) ||
> - item->with_window_func))
> - {
> - item->split_sum_func(thd, ref_pointer_array, *sum_func_list,
> - SPLIT_SUM_SELECT);
> - }
I assume this part is moved from setup_fields() to JOIN::prepare().
Any particular reason why this move was required?
Just curious. Still, it would be good to explain this in the commit message.
(One reason is that it will allow us to remove sum_func_list from the
arguments to setup_fields().)
I don't see any big problem with the code, as it is JOIN::prepare() that is
calling both setup_fields() and split_sum_func().
It will be a bit slower as we have to traverse the field_list twice, but
that is not a high cost.
Note that if we go with the above code change, you should also remove
sum_func_list from the arguments, as the above code was the only place it
was used.
Also, please add back the comment to the moved code:
/*
  split_sum_func() must be called for Window Function items, see
  Item_window_func::split_sum_func.
*/
> diff --git a/sql/sql_class.cc b/sql/sql_class.cc
> index 7913f1a40d2..19f1ec9728e 100644
> --- a/sql/sql_class.cc
> +++ b/sql/sql_class.cc
> @@ -3959,12 +3962,16 @@ void Query_arena::free_items()
> Item *next;
> DBUG_ENTER("Query_arena::free_items");
> /* This works because items are allocated on THD::mem_root */
> + for (next= free_list; next; next= next->next)
> + {
> + next->cleanup();
> + }
Why not call cleanup() in the following loop instead of looping over
the elements twice (as we did before)?
Are there some cases when we cannot do cleanup as part of delete_self()?
If yes, please add a comment in the code why we have to traverse the list
twice, as otherwise it is very likely some developer will try to optimize
the first loop away while looking at the code.
If this is the case, is this not a problem also for free_items()?
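The single-loop version I have in mind (sketch, assuming the existing
delete_self() loop below has the usual shape):
for (; free_list; free_list= next)
{
  next= free_list->next;
  free_list->cleanup();             /* done here instead of a first pass */
  free_list->delete_self();
}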
> diff --git a/sql/sql_class.h b/sql/sql_class.h
> index 7e5d9ac96e3..2f05b2dbd03 100644
> --- a/sql/sql_class.h
> +++ b/sql/sql_class.h
> @@ -2569,6 +2569,20 @@ class THD: public THD_count, /* this must be first */
> return static_cast<PSI_thread*>(my_atomic_loadptr((void*volatile*)&m_psi));
> }
>
> + inline bool is_first_query_execution()
> + {
> + return (stmt_arena->is_conventional() ||
> + stmt_arena->state != Query_arena::STMT_EXECUTED) &&
> + !stmt_arena->is_stmt_prepare();
(stmt_arena->is_conventional() ||
 stmt_arena->state != Query_arena::STMT_EXECUTED) &&
!stmt_arena->is_stmt_prepare()
=>
(stmt_arena->state == STMT_CONVENTIONAL_EXECUTION ||
 stmt_arena->state != STMT_EXECUTED) &&
stmt_arena->state != STMT_INITIALIZED
=>
stmt_arena->state != STMT_INITIALIZED &&
stmt_arena->state != STMT_EXECUTED
Simpler and easier to understand.
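I.e., following the simplification above:
inline bool is_first_query_execution()
{
  return stmt_arena->state != Query_arena::STMT_INITIALIZED &&
         stmt_arena->state != Query_arena::STMT_EXECUTED;
}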
> + }
> +
> + inline bool is_noninitial_query_execution()
> + {
> + return !stmt_arena->is_conventional() &&
> + stmt_arena->state == Query_arena::STMT_EXECUTED &&
> + !stmt_arena->is_stmt_prepare() ;
Please add a comment to explain this function and when to use it.
(The name of is_first_query_execution() is clear, but this one is not.)
Anyway:
!stmt_arena->is_conventional() &&
stmt_arena->state == Query_arena::STMT_EXECUTED &&
!stmt_arena->is_stmt_prepare()
=>
stmt_arena->state != STMT_CONVENTIONAL_EXECUTION &&
stmt_arena->state == STMT_EXECUTED &&
stmt_arena->state != STMT_INITIALIZED
=>
stmt_arena->state == STMT_EXECUTED
Better to rename it to is_stmt_executed() and move it to where we
define is_stmt_prepare() in class THD and class Query_arena.
> diff --git a/sql/sql_lex.cc b/sql/sql_lex.cc
> index b669925a263..d7b7e98a6e4 100644
> --- a/sql/sql_lex.cc
> +++ b/sql/sql_lex.cc
> @@ -3601,6 +3607,8 @@ ulong st_select_lex::get_table_join_options()
>
> uint st_select_lex::get_cardinality_of_ref_ptrs_slice(uint order_group_num_arg)
> {
> + if (card_of_ref_ptrs_slice)
> + return card_of_ref_ptrs_slice;
I assume this is just a cache and the code should work also without it?
If yes, please add a comment about that.
> if (!((options & SELECT_DISTINCT) && !group_list.elements))
> hidden_bit_fields= 0;
>
> @@ -3620,20 +3628,23 @@ uint st_select_lex::get_cardinality_of_ref_ptrs_slice(uint order_group_num_arg)
> order_group_num * 2 +
> hidden_bit_fields +
> fields_in_window_functions;
> + card_of_ref_ptrs_slice= n;
> return n;
> }
Does caching card_of_ref_ptrs_slice really help?
(The function is quite trivial.)
We call the function in one place, setup_ref_array().
It is called once per query (for selects in JOIN::prepare()).
I think having the variable and testing, updating and clearing it for
each normal query will cost us more than not having it.
This assumes we have more normal queries than we have prepared
statements.
I would suggest we remove the variable.
> -bool st_select_lex::setup_ref_array(THD *thd, uint order_group_num)
> +static
> +bool setup_ref_ptrs_array(THD *thd, Ref_ptr_array *ref_ptrs, uint n_elems)
> {
> + if (!ref_ptrs->is_null())
> + {
> + if (ref_ptrs->size() >= n_elems)
> + return false;
> + }
> Item **array= static_cast<Item**>(
> thd->active_stmt_arena_to_use()->alloc(sizeof(Item*) * n_elems));
> if (likely(array != NULL))
> + *ref_ptrs= Ref_ptr_array(array, n_elems);
> return array == NULL;
> }
I assume the change is that the new code will allocate a new array
if the old array was not big enough.
That is good. However, it would be good to know when this can happen.
For ref_pointer_array this should be impossible, as we cache the value
and it cannot change between two iterations.
Is this reallocation needed for the save_ref_ptrs array?
Please add a comment to explain in which case the old array was not
big enough.
It would also be good to have the code for setup_ref_ptrs_array(),
setup_ref_array(), save_ref_ptrs_after_persistent_rewrites() and
save_ref_ptrs_if_needed() close to each other, as these are very similar
and very related.
> +
> +bool st_select_lex::setup_ref_array(THD *thd, uint order_group_num)
> +{
> + /*
> + We have to create array in prepared statement memory if it is a
> + prepared statement
> + */
> + uint slice_card= !thd->is_noninitial_query_execution() ?
> + get_cardinality_of_ref_ptrs_slice(order_group_num) :
> + card_of_ref_ptrs_slice;
> + uint n_elems= slice_card * 5;
> + DBUG_ASSERT(n_elems % 5 == 0);
You can remove the DBUG_ASSERT. It is obvious that this is always
correct (for this code).
> + return setup_ref_ptrs_array(thd, &ref_pointer_array, n_elems);
> +}
With the cache, you don't have to test for
!thd->is_noninitial_query_execution(). This should be good enough:
uint slice_card= get_cardinality_of_ref_ptrs_slice(order_group_num);
With the cache, it will return at once after the first call.
As the cache is updated on the first run, the answer should always be the
same as in your original code.
> +bool st_select_lex::save_ref_ptrs_after_persistent_rewrites(THD *thd)
> +{
> + uint n_elems= card_of_ref_ptrs_slice;
> + if (setup_ref_ptrs_array(thd, &save_ref_ptrs, n_elems))
> + return true;
> + if (!ref_pointer_array.is_null())
> + join->copy_ref_ptr_array(save_ref_ptrs, join->ref_ptr_array_slice(0));
> + return false;
> +}
Please add to the function start:
/* Ensure that setup_ref_array() has been called */
DBUG_ASSERT(card_of_ref_ptrs_slice);
It looks like all calls to setup_ref_ptrs_array() are using the same
value for all future calls. If this is the case, why did you change
setup_ref_ptrs_array() to check if the number of elements has grown?
If this is not needed, remove the check and add an assert()
to ensure that the number of elements does not grow over time.
> +bool st_select_lex::save_ref_ptrs_if_needed(THD *thd)
> +{
> + if (thd->is_first_query_execution())
> + {
> + if (save_ref_ptrs_after_persistent_rewrites(thd))
> + return true;
> + }
> + return false;
> +}
The above if is true in case of
stmt_arena->state != STMT_INITIALIZED &&
stmt_arena->state != STMT_EXECUTED
We call save_ref_ptrs_if_needed() in JOIN::prepare(),
subselect_single_select_engine::prepare(), mysql_select() and
st_select_lex_unit::prepare_join().
I assume this is ok.
However, I did find some inconsistency in how this function is called.
JOIN::prepare() calls it on errors before returning (for most cases, but not
all).
subselect_single_select_engine::prepare() and mysql_select()
call it if join->prepare() returns an error, but not otherwise.
st_select_lex_unit::prepare_join() calls it always after prepare.
I would like to know when we MUST call save_ref_ptrs_if_needed().
Do we have to call it in case JOIN::prepare() returns 0?
It would be easier if things were done this way:
- JOIN::prepare() would always call it in case of error, if there is any
need to call it.
- We would not have to call it in
subselect_single_select_engine::prepare() or mysql_select().
- We would remove the call in st_select_lex_unit::prepare_join(), or
alternatively call it only if join->prepare() returns 0.
(Depending on what is correct here.)
> void st_select_lex_unit::print(String *str, enum_query_type query_type)
> {
> if (with_clause)
> @@ -4958,7 +5007,7 @@ bool st_select_lex::optimize_unflattened_subqueries(bool const_only)
> }
> if ((res= inner_join->optimize()))
> return TRUE;
> - if (!inner_join->cleaned)
> + if (inner_join->thd->is_first_query_execution())
> sl->update_used_tables();
> sl->update_correlated_cache();
> is_correlated_unit|= sl->is_correlated;
Fix indentation.
Why is cleaned not good enough here?
cleaned means that the join structure cannot be used anymore.
Why not do:
if (!inner_join->cleaned && inner_join->thd->is_first_query_execution())
Should be safer.
> diff --git a/sql/sql_lex.h b/sql/sql_lex.h
> index 3ab50d4aaa8..59029057a23 100644
> --- a/sql/sql_lex.h
> +++ b/sql/sql_lex.h
> @@ -1045,6 +1045,8 @@ class st_select_lex_unit: public st_select_lex_node {
>
> bool can_be_merged();
>
> + uint ub_eq_for_exists2_in();
> +
> friend class st_select_lex;
> };
>
> @@ -1176,6 +1178,7 @@ class st_select_lex: public st_select_lex_node
> uint curr_tvc_name;
> /* true <=> select has been created a TVC wrapper */
> bool is_tvc_wrapper;
> + uint fields_added_by_fix_inner_refs;
You can remove the above variable as it is not used anywhere.
It is set at:
sql_select.cc:1499: select_lex->fields_added_by_fix_inner_refs=
sql_lex.cc:3119: fields_added_by_fix_inner_refs= 0;
but not used anywhere.
> /*
> Needed to correctly generate 'PRIMARY' or 'SIMPLE' for select_type column
> of EXPLAIN
> @@ -1201,6 +1204,8 @@ class st_select_lex: public st_select_lex_node
>
> /// Array of pointers to top elements of all_fields list
> Ref_ptr_array ref_pointer_array;
> + Ref_ptr_array save_ref_ptrs;
> + uint card_of_ref_ptrs_slice;
Please add uint variables together with other uint variables.
Otherwise we may lose 4 bytes for each variable because of alignment.
>
> /*
> number of items in select_list and HAVING clause used to get number
> @@ -1220,6 +1225,7 @@ class st_select_lex: public st_select_lex_node
> uint order_group_num;
> /* reserved for exists 2 in */
> uint select_n_reserved;
> + uint select_n_eq;
Please add a description for the above variable.
(The name does not tell me what it is used for.)
> /*
> it counts the number of bit fields in the SELECT list. These are used when DISTINCT is
> converted to a GROUP BY involving BIT fields.
> @@ -1295,6 +1301,8 @@ class st_select_lex: public st_select_lex_node
> case of an error during prepare the PS is not created.
> */
> uint8 changed_elements; // see TOUCHED_SEL_*
> + uint8 save_uncacheable;
> + uint8 save_master_uncacheable;
The above two variables can be removed as these are only assigned to
but never used for anything.
> /* TODO: add foloowing first_* to bitmap above */
> bool first_natural_join_processing;
> bool first_cond_optimization;
> diff --git a/sql/sql_parse.cc b/sql/sql_parse.cc
> index 5db53b6f2b8..83569207a0c 100644
> --- a/sql/sql_parse.cc
> +++ b/sql/sql_parse.cc
> @@ -4956,11 +4956,24 @@ mysql_execute_command(THD *thd)
> if ((res= multi_delete_precheck(thd, all_tables)))
> break;
>
> + if (thd->stmt_arena->is_stmt_prepare())
> + {
> + if (add_item_to_list(thd, new (thd->mem_root) Item_null(thd)))
> + goto error;
> + }
> + else if (thd->is_first_query_execution())
> + {
The above tests for
stmt_arena->state != STMT_INITIALIZED &&
stmt_arena->state != STMT_EXECUTED
While .. != STMT_EXECUTED would be enough, as it is tested in the if above.
No big deal, just an observation.
> + if (select_lex->item_list.elements != 0)
> + select_lex->item_list.empty();
> + bool rc= false;
You don't have to set it to false here.
> + Query_arena *arena= 0, backup;
> + arena= thd->activate_stmt_arena_if_needed(&backup);
> + rc= add_item_to_list(thd, new (thd->stmt_arena->mem_root) Item_null(thd));
> + if (arena)
> + thd->restore_active_arena(arena, &backup);
> + if (rc)
> + goto error;
> + }
> diff --git a/sql/sql_select.cc b/sql/sql_select.cc
> index a7b84bbfe3b..27eb466b47c 100644
> --- a/sql/sql_select.cc
> +++ b/sql/sql_select.cc
> @@ -551,6 +551,19 @@ fix_inner_refs(THD *thd, List<Item> &all_fields, SELECT_LEX *select,
> the found_in_group_by field of the references from the list.
> */
> List_iterator_fast <Item_outer_ref> ref_it(select->inner_refs_list);
> +
> + if (thd->is_noninitial_query_execution())
> + {
> + while ((ref= ref_it++))
> + {
> + if (ref->found_in_select_list)
> + continue;
> + int el= all_fields.elements;
> + all_fields.push_front(ref_pointer_array[el], thd->mem_root);
> + }
> + return false;
> + }
Please move the comment before this loop to after the loop, as it refers
to the next loop.
Also please add a short comment about what the purpose of this loop is.
Will this not cause ref_pointer_array to grow for each
prepared statement call (after the first one)?
Looking at the other parts of your patch, I thought that the size and content
of ref_pointer_array[] would be fixed after the first call
(or saved in save_ref_ptrs).
Adding something to ref_pointer_array in case of STMT_EXECUTED sounds
like it is going against the above principles.
> + bool is_first_execution= thd->is_first_query_execution();
> + if (is_first_execution ||
> + thd->stmt_arena->is_stmt_prepare())
The above can be replaced with
if (thd->stmt_arena->state != STMT_EXECUTED)
It is a bit confusing that we have defined:
inline bool is_stmt_execute() const
{ return state == STMT_PREPARED || state == STMT_EXECUTED; }
as it would have been logical to have
inline bool is_stmt_execute() const
{ return state == STMT_EXECUTED; }
Adding a function is_stmt_executed()
would easily be confused with is_stmt_execute().
Maybe we should rename is_stmt_execute() to is_stmt_prepared_or_executed()
and add is_stmt_executed()?
> @@ -1336,18 +1367,13 @@ JOIN::prepare(TABLE_LIST *tables_init, COND *conds_init, uint og_num,
> DBUG_RETURN(-1);
> }
>
> - if (thd->lex->current_select->first_cond_optimization)
> - {
> - if ( conds && ! thd->lex->current_select->merged_into)
> - select_lex->select_n_reserved= conds->exists2in_reserved_items();
> - else
> - select_lex->select_n_reserved= 0;
> - }
I assume this is now what is calculated in Item_exists_subselect::fix_fields()
and not needed here anymore?
> @@ -1463,6 +1489,38 @@ JOIN::prepare(TABLE_LIST *tables_init, COND *conds_init, uint og_num,
> if (res)
> DBUG_RETURN(res);
>
> + if (thd->is_first_query_execution() ||
> + thd->lex->is_ps_or_view_context_analysis())
> + {
> + if (fix_inner_refs(thd, all_fields, select_lex, ref_ptrs))
> + DBUG_RETURN(-1);
Move the above outside the if, as the code is identical in the else
part.
> + if (thd->is_first_query_execution() &&
> + all_fields.elements > select_lex->item_list.elements)
> + select_lex->fields_added_by_fix_inner_refs=
> + select_lex->inner_refs_list.elements;
You can remove the above 4 lines as fields_added_by_fix_inner_refs is not
used anywhere.
> + }
> + else
> + {
> + if (fix_inner_refs(thd, all_fields, select_lex, ref_ptrs))
> + DBUG_RETURN(-1);
> + select_lex->uncacheable= select_lex->save_uncacheable;
> + select_lex->master_unit()->uncacheable= select_lex->save_master_uncacheable;
You can remove the above as neither variable is used.
> + }
> +
The above means you can replace the above query block with:
if (fix_inner_refs(thd, all_fields, select_lex, ref_ptrs))
DBUG_RETURN(-1);
> @@ -1624,6 +1681,7 @@ JOIN::prepare(TABLE_LIST *tables_init, COND *conds_init, uint og_num,
> err:
> delete procedure; /* purecov: inspected */
> procedure= 0;
> + select_lex->save_ref_ptrs_if_needed(thd);
> DBUG_RETURN(-1); /* purecov: inspected */
> }
Ok. Please check if there is any other case in JOIN::prepare() where we
return <> 0 and should call save_ref_ptrs_if_needed(thd).
By the way, I am not sure why we would call save_ref_ptrs_if_needed() at
all in case of errors. The state will in this case not be set to
STMT_EXECUTED and thus the state should not be used by the next query.
Please add a comment why we need the result in case of an error.
> // Update used tables after all handling derived table procedures
> - select_lex->update_used_tables();
> + if (select_lex->first_cond_optimization)
> + select_lex->update_used_tables();
Why first_cond_optimization instead of state != STMT_EXECUTED?
> - if (transform_max_min_subquery())
> + if (select_lex->first_cond_optimization && transform_max_min_subquery())
> DBUG_RETURN(1); /* purecov: inspected */
>
> if (select_lex->first_cond_optimization)
Please combine the above two ifs.
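E.g. (sketch; the '...' stands for the rest of the existing
first_cond_optimization-only code):
if (select_lex->first_cond_optimization)
{
  if (transform_max_min_subquery())
    DBUG_RETURN(1);                 /* purecov: inspected */
  ...
}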
> @@ -2042,6 +2100,11 @@ JOIN::optimize_inner()
<cut>
> + select_lex->save_uncacheable= select_lex->uncacheable;
> + select_lex->save_master_uncacheable= select_lex->master_unit()->uncacheable;
Remove as not used.
> @@ -4832,6 +4895,7 @@ mysql_select(THD *thd, TABLE_LIST *tables, List<Item> &fields, COND *conds,
> if ((err= join->prepare(tables, conds, og_num, order, false, group,
> having, proc_param, select_lex, unit)))
> {
> + select_lex->save_ref_ptrs_if_needed(thd);
Can probably be removed. See previous comment
> @@ -4858,6 +4922,7 @@ mysql_select(THD *thd, TABLE_LIST *tables, List<Item> &fields, COND *conds,
> if ((err= join->prepare(tables, conds, og_num, order, false, group, having,
> proc_param, select_lex, unit)))
> {
> + select_lex->save_ref_ptrs_if_needed(thd);
> goto err;
Can probably be removed. See previous comment
> @@ -25172,14 +25253,33 @@ find_order_in_list(THD *thd, Ref_ptr_array ref_pointer_array,
<cut>
> - from_field= find_field_in_tables(thd, (Item_ident*) order_item, tables,
> - NULL, &view_ref, IGNORE_ERRORS, FALSE,
> - FALSE);
> + if (!thd->is_noninitial_query_execution())
> + {
> + from_field= find_field_in_tables(thd, (Item_ident*) order_item, tables,
> + NULL, &view_ref, IGNORE_ERRORS, FALSE,
> + FALSE);
> + order->view_ref= view_ref;
> + order->from_field= from_field;
> + }
> + else
> + {
> + if (order->view_ref)
> + {
Fix indentation
> + view_ref= order->view_ref;
> + from_field= order->from_field;
> + }
> + else
> + {
Fix indentation
> + from_field= find_field_in_tables(thd, (Item_ident*) order_item,
> + tables, NULL, &view_ref,
> + IGNORE_ERRORS, FALSE, FALSE);
> + }
> + }
> if (!from_field)
> from_field= (Field*) not_found_field;
> }
> @@ -25232,6 +25332,12 @@ find_order_in_list(THD *thd, Ref_ptr_array ref_pointer_array,
> if (found_item != not_found_item)
> {
> order->item= &ref_pointer_array[all_fields.elements-1-counter];
> + if (!thd->is_first_query_execution() &&
> + !thd->lex->is_ps_or_view_context_analysis())
> + {
> + if ((*order->item)->fix_fields_if_needed_for_order_by(thd, order->item))
> + return TRUE;
> + }
The above code is new. Why do we need this when we did not need it before?
Please add a comment for the block above.
> order->in_field_list= 0;
> return FALSE;
> }
> diff --git a/sql/sql_union.cc b/sql/sql_union.cc
> index c3c4198439a..5fb31ccbe46 100644
> --- a/sql/sql_union.cc
> +++ b/sql/sql_union.cc
> @@ -1114,6 +1114,8 @@ bool st_select_lex_unit::prepare_join(THD *thd_arg, SELECT_LEX *sl,
> thd_arg->lex->proc_list.first),
> sl, this);
>
> + sl->save_ref_ptrs_if_needed(thd);
You may be able to remove the above, look at other comments.
> @@ -2784,6 +2786,8 @@ bool st_select_lex::cleanup()
>
> if (join)
> {
> + if (join->thd->stmt_arena->is_stmt_prepare())
> + inner_refs_list.empty();
> List_iterator<TABLE_LIST> ti(leaf_tables);
> TABLE_LIST *tbl;
> while ((tbl= ti++))
> @@ -2813,7 +2817,6 @@ bool st_select_lex::cleanup()
> continue;
> error= (bool) ((uint) error | (uint) lex_unit->cleanup());
> }
> - inner_refs_list.empty();
> exclude_from_table_unique_test= FALSE;
> hidden_bit_fields= 0;
> DBUG_RETURN(error);
The above patch looks a bit strange.
We are deleting the join in this function. Isn't the inner_refs_list
depending on the join struct in any way?
One effect of the above is that we are resetting inner_refs_list
only later, when executing the next query in
st_select_lex::init_select(), for anything else than STMT_INITIALIZED.
I tried reverting the fix and I did get errors.
I assume this is because we now only update inner_refs_list once and
permanently, during the first execution?
It would be good to add some kind of comment in the commit message
about this change.
Why only is_stmt_prepare and not also STMT_CONVENTIONAL_EXECUTION?
Regards,
Monty
Re: 175adef14ef: MDEV-32537 due to Linux, restrict thread name 15 characters, also in PS.
by Sergei Golubchik 11 Jun '24
Hi, Vladislav,
On Jun 08, Vladislav Vaintroub wrote:
> revision-id: 175adef14ef (mariadb-11.0.1-288-g175adef14ef)
> parent(s): 4ffb583d72f
> author: Vladislav Vaintroub
> committer: Vladislav Vaintroub
> timestamp: 2024-01-25 09:57:52 +0100
> message:
>
> MDEV-32537 due to Linux, restrict thread name 15 characters, also in PS.
>
> Rename some threads to work around this restriction,
> e.g "rpl_parallel_thread"->"rpl_parallel",
> "slave_background" -> "slave_bg" etc.
I know I suggested that, but I'm having second thoughts now, sorry.
This would break compatibility with performance schema tools, whatever
they might be. I don't know if there are any, though, so maybe it's not
a concern.
Regards,
Sergei
Chief Architect, MariaDB Server
and security(a)mariadb.org
Re: 4ffb583d72f: MDEV-32537 Validate my_thread_set_name parameter against performance schema names.
by Sergei Golubchik 11 Jun '24
Hi, Vladislav,
On Jun 08, Vladislav Vaintroub wrote:
> revision-id: 4ffb583d72f (mariadb-11.0.1-287-g4ffb583d72f)
> parent(s): 7d863d3388d
> author: Vladislav Vaintroub
> committer: Vladislav Vaintroub
> timestamp: 2024-01-20 11:26:21 +0100
> message:
>
> MDEV-32537 Validate my_thread_set_name parameter against performance schema names.
>
> In order to be consistent with PSI force thread name to be the same as
> last part of thread_class_name e.g thread with PSI name
> "thread/sql/main" must be called "main" . Due to Linux restriction on
> thread name length ( 15 chars), full name can't be used.
Does it have to be that way?
I thought that mysql_thread_create() could automatically name the
thread; why would we need to do it manually?
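Something in this direction is what I had in mind - a very rough sketch,
where the context struct and all the names are made up, and the ownership
of the context is elided:

  struct thread_start_ctx               /* hypothetical */
  {
    const char *name;            /* e.g. "main" from "thread/sql/main" */
    void *(*user_routine)(void *);
    void *user_arg;
  };

  static void *named_thread_start(void *arg)
  {
    struct thread_start_ctx *ctx= (struct thread_start_ctx *) arg;
    my_thread_set_name(ctx->name);  /* every thread gets named in one place */
    return ctx->user_routine(ctx->user_arg);
  }

mysql_thread_create() already knows the PSI key, so it could take the name
from the registered PSI_thread_info instead of every thread naming itself.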
Regards,
Sergei
Chief Architect, MariaDB Server
and security(a)mariadb.org
Hi!
Thanks for the suggestions. We'll take a look on our side at adding topics
and improving the repo metadata.
>
> On Tue, Mar 19, 2024 at 5:20 AM Otto Kekäläinen <otto(a)kekalainen.net>
> wrote:
>
>> Hi!
>>
>> I was browsing the GitHub topic 'mariadb' at
>> https://github.com/topics/mariadb and to my surprise MariaDB itself
>> isn't in that list at all. This seems to be due to
>> https://github.com/MariaDB/server not having any topics defined at
>> all.
>>
>> I recommend you add in the GitHub settings of
>> https://github.com/MariaDB/server at least the topic 'mariadb', and
>> while at it, also add link to mariadb.org.
>>
>> More ideas on how to improve the look and features of the GitHub page
>> can be found by browsing projects like https://github.com/pulsar-edit
>> and https://github.com/prisma which have all their GitHub account
>> metadata polished.
>>
>> - Otto
>>
>
Re: 7d863d3388d: MDEV-32537 Name threads to improve debugging experience and diagnostics.
by Sergei Golubchik 08 Jun '24
Hi, Vladislav,
I'd prefer you use my_thread_set_name() everywhere, even in linux- or
windows-only code. It's not performance critical, and it'd be much
easier to grep to thread names if they all are set by the same function.
Otherwise ok.
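E.g. for the two tpool hunks below, something like this (a sketch,
assuming tpool can call mysys functions here):

  static void getevent_thread_routine(aio_linux *aio)
  {
    my_thread_set_name("my_getevents");
    /* ... unchanged ... */
  }

  static void aio_completion_thread_proc(tpool_generic_win_aio *aio)
  {
    my_thread_set_name("aio_completion_thread_proc");
    aio->completion_thread_work();
  }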
On Jun 08, Vladislav Vaintroub wrote:
> revision-id: 7d863d3388d (mariadb-11.0.1-286-g7d863d3388d)
> parent(s): 8bf9f218556
> author: Vladislav Vaintroub
> committer: Vladislav Vaintroub
> timestamp: 2024-01-18 20:35:54 +0100
> message:
>
> MDEV-32537 Name threads to improve debugging experience and diagnostics.
>
> Use SetThreadDescription/pthread_setname_np to give threads a name.
>
> diff --git a/mysys/my_thr_init.c b/mysys/my_thr_init.c
> index 2e8decd7d06..4271935472e 100644
> --- a/mysys/my_thr_init.c
> +++ b/mysys/my_thr_init.c
> @@ -192,6 +192,35 @@ my_bool my_thread_global_init(void)
> return 0;
> }
>
> +#ifdef _WIN32
> +#define MAX_THREAD_NAME 256
> +#elif defined(__linux__)
> +#define MAX_THREAD_NAME 16
> +#elif defined(__FreeBSD__) || defined(__OpenBSD__)
> +#include <pthread_np.h>
> +#endif
> +
> +void my_thread_set_name(const char *name)
> +{
> +#ifdef _WIN32
> + wchar_t wname[MAX_THREAD_NAME];
> + wname[0]= 0;
> + MultiByteToWideChar(CP_UTF8, 0, name, -1, wname, MAX_THREAD_NAME);
> + SetThreadDescription(GetCurrentThread(), wname);
> +#elif defined __linux__
> + char shortname[MAX_THREAD_NAME];
> + snprintf(shortname, MAX_THREAD_NAME, "%s", name);
> + pthread_setname_np(pthread_self(), shortname);
> +#elif defined __NetBSD__
> + pthread_setname_np(pthread_self(), "%s", (void *) name);
> +#elif defined __FreeBSD__ || defined __OpenBSD__
> + pthread_set_name_np(pthread_self(), name);
> +#elif defined __APPLE__
> + pthread_setname_np(name);
> +#else
> + (void) name;
> +#endif
> +}
>
> /**
> End the mysys thread system. Called when ending the last thread
> diff --git a/tpool/aio_linux.cc b/tpool/aio_linux.cc
> index 507c6b9264f..f39b1e9b457 100644
> --- a/tpool/aio_linux.cc
> +++ b/tpool/aio_linux.cc
> @@ -93,6 +94,7 @@ class aio_linux final : public aio
>
> static void getevent_thread_routine(aio_linux *aio)
> {
> + pthread_setname_np(pthread_self(), "my_getevents");
> /*
> We collect events in small batches to hopefully reduce the
> number of system calls.
> diff --git a/tpool/aio_win.cc b/tpool/aio_win.cc
> index b44f705bd1e..930bab88777 100644
> --- a/tpool/aio_win.cc
> +++ b/tpool/aio_win.cc
> @@ -92,6 +92,7 @@ class tpool_generic_win_aio : public aio
>
> static void aio_completion_thread_proc(tpool_generic_win_aio* aio)
> {
> + SetThreadDescription(GetCurrentThread(), L"aio_completion_thread_proc");
> aio->completion_thread_work();
> }
>
>
Regards,
Sergei
Chief Architect, MariaDB Server
and security(a)mariadb.org
04 Jun '24
Kristian Nielsen via commits <commits(a)lists.mariadb.org> writes:
> From: Brandon Nesterenko <brandon.nesterenko(a)mariadb.com>
>
> This patch extends the command `SHOW REPLICA HOSTS` with three
> columns:
Hi Brandon, please find my review of MDEV-21322 below.
> This patch extends the command `SHOW REPLICA HOSTS` with three
> columns:
>
> 1) Gtid_State_Sent. This represents the latest GTIDs sent to the
> replica in each domain. It will always be populated, regardless of
> the semi-sync status (i.e. asynchronous connections will still
> update this column with the latest GTID state sent to the replica).
>
> 2) Gtid_State_Ack. For semi-synchronous connections (only), this
> column represents the last GTID in each domain that the replica has
> acknowledged.
Use "Pos" instead of "State", eg. Gtid_Pos_Sent and Gtid_Pos_Ack.
To be consistent with eg. gtid_binlog_pos, which has last GTID per
domain_id; vs. gtid_binlog_state, which has GTID per every (domain_id,
server_id) combination.
> @@ -2272,6 +2292,30 @@ send_event_to_slave(binlog_send_info *info, Log_event_type event_type,
> return "Failed to run hook 'after_send_event'";
> }
>
> + if (info->thd->slave_info)
> + {
> + strncpy(info->thd->slave_info->gtid_state_sent.log_file,
> + info->log_file_name + info->dirlen,
> + strlen(info->log_file_name) - info->dirlen);
> + info->thd->slave_info->gtid_state_sent.log_pos= pos;
> + }
Isn't there missing synchronisation here against concurrent access from
another thread?
I guess we need LOCK_thd_data around the strncpy().
Maybe we can avoid taking LOCK_thd_data except in the uncommon case where
the file name changes. It would be best to avoid a mutex lock/unlock for
_every_ event sent to a slave.
The position doesn't need locking I think, you can use an atomic
std::memory_order_relaxed mostly for documentation if you like.
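Something like this is what I have in mind (a rough sketch; it assumes
log_pos becomes a std::atomic<ulonglong> and that LOCK_thd_data is the
right mutex to pair with the readers):

  Slave_info *si= info->thd->slave_info;
  const char *file= info->log_file_name + info->dirlen;
  if (strcmp(si->gtid_state_sent.log_file, file)) /* rare: file changed */
  {
    mysql_mutex_lock(&info->thd->LOCK_thd_data);
    strmake(si->gtid_state_sent.log_file, file, FN_REFLEN-1);
    mysql_mutex_unlock(&info->thd->LOCK_thd_data);
  }
  si->gtid_state_sent.log_pos.store(pos, std::memory_order_relaxed);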
> @@ -125,8 +114,11 @@ int THD::register_slave(uchar *packet, size_t packet_length)
> if (check_access(this, PRIV_COM_REGISTER_SLAVE, any_db.str, NULL,NULL,0,0))
> return 1;
> if (!(si= (Slave_info*)my_malloc(key_memory_SLAVE_INFO, sizeof(Slave_info),
> - MYF(MY_WME))))
> + MYF(MY_WME|MY_ZEROFILL))))
> return 1;
> + memset(si->gtid_state_sent.log_file, '\0', FN_REFLEN);
> + memset(si->gtid_state_ack.log_file, '\0', FN_REFLEN);
I think it's good to add the MY_ZEROFILL here, but then the two memset() are
redundant and can be removed.
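I.e. just:

  if (!(si= (Slave_info*)my_malloc(key_memory_SLAVE_INFO, sizeof(Slave_info),
                                   MYF(MY_WME|MY_ZEROFILL))))
    return 1;
  /* no memset() needed: MY_ZEROFILL already zeroed the whole struct,
     including both log_file buffers */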
> @@ -1235,6 +1251,14 @@ int Repl_semi_sync_master::update_sync_header(THD* thd, unsigned char *packet,
> *need_sync= sync;
>
> l_end:
> + if (is_on())
> + {
> + thd->slave_info->sync_status=
> + sync ? thd->slave_info->sync_status=
> + Slave_info::SYNC_STATE_SEMI_SYNC_ACTIVE
> + : thd->slave_info->sync_status=
> + Slave_info::SYNC_STATE_SEMI_SYNC_STALE;
> + }
There's something weird here, a double assignment of
thd->slave_info->sync_status, probably you just meant:
thd->slave_info->sync_status= sync ?
Slave_info::SYNC_STATE_SEMI_SYNC_ACTIVE : Slave_info::SYNC_STATE_SEMI_SYNC_STALE;
More importantly, update_sync_header() is called just _before_ writing the
event on the socket to the connected slave. Is that really the right place
to do so? Shouldn't the status be updated only _after_ the event has been
sent?
Or does it not matter, because the actual TCP sending is asynchronous
anyway? In that case, we should add a comment explaining why updating the
status here is ok.
> --- a/mysql-test/suite/rpl/r/rpl_mixed_ddl_dml.result
> +++ b/mysql-test/suite/rpl/r/rpl_mixed_ddl_dml.result
> @@ -11,8 +11,8 @@ n
> 2002
> connection master;
> show slave hosts;
> -Server_id Host Port Master_id
> -2 127.0.0.1 9999 1
> +Server_id Host Port Master_id Gtid_State_Sent Gtid_State_Ack Sync_Status
> +2 127.0.0.1 9999 1 0-1-2 Asynchronous
Just a remark that this output could be non-deterministic unless the test
guarantees that the state will be stable at this point. So please check that
the test will produce stable output; or if you already did so, then great!
> +--echo # state, and Gtid_State_Sent/Ack should start, and only Gtid_State_Sent
Typo, s/should start/should start empty/ ?
> +--connection server_2
> +--source include/stop_slave.inc
> +set global rpl_semi_sync_slave_enabled = 1;
> +--source include/start_slave.inc
> +
> +--connection server_1
> +let $status_var= Rpl_semi_sync_master_clients;
> +let $status_var_value= 1;
> +source include/wait_for_status_var.inc;
> +
> +--replace_result $SLAVE_MYPORT SLAVE_PORT $SERVER_MYPORT_3 SLAVE2_PORT $DEFAULT_MASTER_PORT DEFAULT_PORT
> +SHOW REPLICA HOSTS;
Hm, won't this output be non-deterministic?
From a quick look in the code, Rpl_semi_sync_master_clients is incremented
from the start of mysql_binlog_send(), but the state is only set from
update_sync_header() when the first event is sent to the slave. So won't
there be a (small) window where the state has not yet switched to ACTIVE
when this SHOW SLAVE HOSTS runs?
So better wait for the state to become ACTIVE instead. Or for the expected
value in Gtid_Pos_Sent, where that is non-empty.
Also, better use SHOW SLAVE HOSTS here, for consistency with rest of test
suite.
> +--let $nextval= `SELECT max(a)+1 from t1`
> +--eval insert into t1 values ($nextval)
Is there a reason here to use separate --let and --eval? And not just:
INSERT INTO t1 SELECT max(a)+1 FROM t1;
> +--echo # Ensuring master gtid_binlog_pos matches Gtid_State_Ack
> +--let $gtid_ack= query_get_value(show slave hosts, Gtid_State_Ack, 1)
> +if (`SELECT strcmp("$master_gtid","$gtid_ack") != 0`)
> +{
> + --echo # Master gtid_binlog_pos: $master_gtid
> + --echo # Gtid_State_Ack: $gtid_ack
> + --die Master's gtid_binlog_pos should match Gtid_State_Ack, but doesn't
Again, sync_with_master_gtid.inc on the slave will not guarantee that the
master saw the ack yet, even though it will be a rare race when it doesn't
happen.
So you'd need to wait for the expected value here with wait_condition.inc
or similar instead, I think.
Probably the same in subsequent parts of the test case.
> +--echo #
> +--echo # 21322.5: When connecting a new slave (server_id 3) which initially has
> +--echo # semi-sync disabled, SHOW SLAVE HOSTS on the master should show its
> +--echo # Sync_Status as asynchronous (while server_id 2 is still semi-sync
> +--echo # active).
Sorry, it's a very long and complex test case, I didn't have the time today
to carefully read and check every detail in it after this point. Please
carefully check all the cases for possible races like described above.
Replication tests are notorious for having non-deterministic failures (as
I'm sure you've experienced first-hand), and it's important to try to avoid
that as much as possible.
A general remark on the test, the output of Gtid_Pos_Sent and Gtid_Pos_Ack
are full GTID positions (multiple domain_id). But there is no test of having
more than one domain in the output. I think you need to extend the test to
test at least one case where there are multiple domains. And also need to
test when the GTID state in the binlog has multiple server_id (though maybe
the test case already does this, I didn't check myself).
- Kristian.
Re: eba212af92d: MDEV-9101 Limit size of created disk temporary files and tables
by Sergei Golubchik 08 May '24
Hi, Monty,
On Apr 18, Michael Widenius wrote:
> commit eba212af92d
> Author: Michael Widenius <monty(a)mariadb.org>
> Date: Thu Mar 14 17:59:00 2024 +0100
>
> diff --git a/sql/json_table.cc b/sql/json_table.cc
> index 6713e796eb7..da5ab0c450a 100644
> --- a/sql/json_table.cc
> +++ b/sql/json_table.cc
> @@ -734,7 +734,8 @@ bool Create_json_table::finalize(THD *thd, TABLE *table,
>
> table->db_stat= HA_OPEN_KEYFILE;
> if (unlikely(table->file->ha_open(table, table->s->path.str, O_RDWR,
> - HA_OPEN_TMP_TABLE | HA_OPEN_INTERNAL_TABLE)))
> + HA_OPEN_TMP_TABLE | HA_OPEN_INTERNAL_TABLE |
> + HA_OPEN_SIZE_TRACKING)))
shouldn't HA_OPEN_INTERNAL_TABLE always imply HA_OPEN_SIZE_TRACKING
automatically?
> DBUG_RETURN(true);
>
> table->set_created();
> diff --git a/storage/maria/ma_statrec.c b/storage/maria/ma_statrec.c
> index d8a8b0a05d7..ff0c6c7a617 100644
> --- a/storage/maria/ma_statrec.c
> +++ b/storage/maria/ma_statrec.c
> @@ -45,6 +45,13 @@ my_bool _ma_write_static_record(MARIA_HA *info, const uchar *record)
> my_errno=HA_ERR_RECORD_FILE_FULL;
> return(2);
> }
> + if (info->s->tracked &&
are there temporary files that you don't want to track?
are there non-temporary files that you want to track?
> + _ma_update_tmp_file_size(&info->s->track_data,
> + MY_ALIGN(info->s->state.state.data_file_length+
> + info->s->base.pack_reclength,
> + MARIA_TRACK_INCREMENT_SIZE)))
> + return(2);
> +
> if (info->opt_flag & WRITE_CACHE_USED)
> { /* Cash in use */
> if (my_b_write(&info->rec_cache, record,
> diff --git a/storage/maria/ma_page.c b/storage/maria/ma_page.c
> index 5881456a69a..0877c966d1c 100644
> --- a/storage/maria/ma_page.c
> +++ b/storage/maria/ma_page.c
> @@ -417,6 +417,13 @@ my_off_t _ma_new(register MARIA_HA *info, int level,
> DBUG_RETURN(HA_OFFSET_ERROR);
> }
> share->state.state.key_file_length+= block_size;
> + if (info->s->tracked &&
> + _ma_update_tmp_file_size(&share->track_index,
> + share->state.state.key_file_length))
shouldn't you do it before share->state.state.key_file_length
is increased? Currently it seems that you're failing
the operation, but the file will stay increased.
The same pattern appears everywhere else.
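I.e. reserve the space first and grow the shared state only on success,
something like:

  if (info->s->tracked &&
      _ma_update_tmp_file_size(&share->track_index,
                               share->state.state.key_file_length +
                               block_size))
  {
    mysql_mutex_unlock(&share->intern_lock);
    DBUG_RETURN(HA_OFFSET_ERROR);
  }
  share->state.state.key_file_length+= block_size;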
> + {
> + mysql_mutex_unlock(&share->intern_lock);
> + DBUG_RETURN(HA_OFFSET_ERROR);
> + }
> /* Following is for not transactional tables */
> info->state->key_file_length= share->state.state.key_file_length;
> mysql_mutex_unlock(&share->intern_lock);
> diff --git a/storage/maria/ma_delete_all.c b/storage/maria/ma_delete_all.c
> index f355d0da3e8..ed2087d3091 100644
> --- a/storage/maria/ma_delete_all.c
> +++ b/storage/maria/ma_delete_all.c
> @@ -132,9 +132,16 @@ int maria_delete_all_rows(MARIA_HA *info)
> if (_ma_flush_table_files(info, MARIA_FLUSH_DATA|MARIA_FLUSH_INDEX,
> FLUSH_IGNORE_CHANGED, FLUSH_IGNORE_CHANGED) ||
> mysql_file_chsize(info->dfile.file, 0, 0, MYF(MY_WME)) ||
> - mysql_file_chsize(share->kfile.file, share->base.keystart, 0, MYF(MY_WME)))
> + mysql_file_chsize(share->kfile.file, share->base.keystart, 0,
> + MYF(MY_WME)))
should be > 0 here, shouldn't it? Both for kfile and dfile.
> goto err;
>
> + if (info->s->tracked)
> + {
> + _ma_update_tmp_file_size(&info->s->track_data, 0);
> + _ma_update_tmp_file_size(&info->s->track_index, share->base.keystart);
but this shouldn't update the size if mysql_file_chsize returned -1
> + }
> +
> if (_ma_initialize_data_file(share, info->dfile.file))
> goto err;
>
> diff --git a/sql/uniques.cc b/sql/uniques.cc
> index 36725e80a6b..44d93bb0f88 100644
> --- a/sql/uniques.cc
> +++ b/sql/uniques.cc
> @@ -720,7 +720,7 @@ bool Unique::merge(TABLE *table, uchar *buff, size_t buff_size,
> /* Open cached file for table records if it isn't open */
> if (! my_b_inited(outfile) &&
> open_cached_file(outfile, mysql_tmpdir, TEMP_PREFIX, DISK_CHUNK_SIZE,
> - MYF(MY_WME)))
> + MYF(MY_WME | MY_TRACK | MY_TRACK_WITH_LIMIT)))
as far as I can see, open_cached_file() is only used for temporary files.
It should apply MY_TRACK | MY_TRACK_WITH_LIMIT internally; that would be
less error-prone.
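I.e. inside open_cached_file() itself, something like:

  if (!init_io_cache(cache, -1, cache_size, WRITE_CACHE, 0L, 0,
                     MYF(cache_myflags | MY_NABP | MY_TRACK |
                         MY_TRACK_WITH_LIMIT)))
    DBUG_RETURN(0);

and then the callers would not need to pass (or forget) the flags at all.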
> return 1;
>
> bzero((char*) &sort_param,sizeof(sort_param));
> diff --git a/storage/maria/ma_write.c b/storage/maria/ma_write.c
> index 9a6859bfd4d..7404db84067 100644
> --- a/storage/maria/ma_write.c
> +++ b/storage/maria/ma_write.c
> @@ -386,6 +387,11 @@ int maria_write(MARIA_HA *info, const uchar *record)
> }
> }
> }
> + else if (my_errno == HA_ERR_LOCAL_TMP_SPACE_FULL ||
> + my_errno == HA_ERR_GLOBAL_TMP_SPACE_FULL)
> + {
> + filepos= HA_OFFSET_ERROR; /* Avoid write_record_abort() */
why avoid? HA_ERR_LOCAL_TMP_SPACE_FULL is like EDQUOT or ENOSPC.
Why should it be masked as HA_OFFSET_ERROR and not be a fatal error?
> + }
> else
> fatal_error= 1;
>
> diff --git a/storage/maria/ma_close.c b/storage/maria/ma_close.c
> index 7441e29a97b..47935d61b88 100644
> --- a/storage/maria/ma_close.c
> +++ b/storage/maria/ma_close.c
> @@ -14,7 +14,7 @@
> along with this program; if not, write to the Free Software
> Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1335 USA */
>
> -/* close a isam-database */
> +/* close an Aria table */
at last :) it's what, 27 years old?
> /*
> TODO:
> We need to have a separate mutex on the closed file to allow other threads
> diff --git a/sql/sql_window.cc b/sql/sql_window.cc
> index e0c7cef1a9e..4959c7833a1 100644
> --- a/sql/sql_window.cc
> +++ b/sql/sql_window.cc
> @@ -2922,15 +2922,16 @@ bool compute_window_func(THD *thd,
> if (unlikely(thd->is_error() || thd->is_killed()))
> break;
>
> -
> /* Return to current row after notifying cursors for each window
> function. */
> - tbl->file->ha_rnd_pos(tbl->record[0], rowid_buf);
> + if (tbl->file->ha_rnd_pos(tbl->record[0], rowid_buf))
> + return true;
how can it fail?
> }
>
> /* We now have computed values for each window function. They can now
> be saved in the current row. */
> - save_window_function_values(window_functions, tbl, rowid_buf);
> + if (save_window_function_values(window_functions, tbl, rowid_buf))
> + return true;
>
> rownum++;
> }
> diff --git a/mysql-test/main/status.test b/mysql-test/main/status.test
> index acf29f35b9e..b6be046a9ac 100644
> --- a/mysql-test/main/status.test
> +++ b/mysql-test/main/status.test
> @@ -244,9 +244,9 @@ let $rnd_next = `show global status like 'handler_read_rnd_next'`;
> let $tmp_table = `show global status like 'Created_tmp_tables'`;
> show status like 'com_show_status';
> show status like 'hand%write%';
> -show status like '%tmp%';
> +show status like '%_tmp%';
Doesn't look like a very meaningful change. Did you mean '%\_tmp%' ?
> show status like 'hand%write%';
> -show status like '%tmp%';
> +show status like '%_tmp%';
> show status like 'com_show_status';
> let $rnd_next2 = `show global status like 'handler_read_rnd_next'`;
> let $tmp_table2 = `show global status like 'Created_tmp_tables'`;
> diff --git a/mysys/mf_cache.c b/mysys/mf_cache.c
> index 2fec59f4e70..712187f6181 100644
> --- a/mysys/mf_cache.c
> +++ b/mysys/mf_cache.c
> @@ -43,7 +43,7 @@ my_bool open_cached_file(IO_CACHE *cache, const char* dir, const char *prefix,
> cache->file_name=0;
> cache->buffer=0; /* Mark that not open */
> if (!init_io_cache(cache, -1, cache_size, WRITE_CACHE, 0L, 0,
> - MYF(cache_myflags | MY_NABP)))
> + MYF(cache_myflags | MY_NABP | MY_TRACK)))
Why not MY_TRACK_WITH_LIMIT ?
> {
> DBUG_RETURN(0);
> }
> diff --git a/sql/sql_select.cc b/sql/sql_select.cc
> index 6f2797b6b39..324f7679e8f 100644
> --- a/sql/sql_select.cc
> +++ b/sql/sql_select.cc
> @@ -22275,7 +22275,8 @@ bool open_tmp_table(TABLE *table)
> int error;
> if (unlikely((error= table->file->ha_open(table, table->s->path.str, O_RDWR,
> HA_OPEN_TMP_TABLE |
> - HA_OPEN_INTERNAL_TABLE))))
> + HA_OPEN_INTERNAL_TABLE |
> + HA_OPEN_SIZE_TRACKING))))
shouldn't HA_OPEN_TMP_TABLE or HA_OPEN_INTERNAL_TABLE imply
HA_OPEN_SIZE_TRACKING?
> {
> table->file->print_error(error, MYF(0)); /* purecov: inspected */
> table->db_stat= 0;
> diff --git a/sql/handler.cc b/sql/handler.cc
> index fec473d97bb..a05b7476724 100644
> --- a/sql/handler.cc
> +++ b/sql/handler.cc
> @@ -443,9 +444,12 @@ int ha_init_errors(void)
> /* Zerofill it to avoid uninitialized gaps. */
remove the comment, if you no longer zerofill
> if (! (handler_errmsgs= (const char**) my_malloc(key_memory_handler_errmsgs,
> HA_ERR_ERRORS * sizeof(char*),
> - MYF(MY_WME | MY_ZEROFILL))))
> + MYF(MY_WME))))
> return 1;
>
> + /* Copy default handler error messages */
> + memcpy(handler_errmsgs, handler_error_messages, HA_ERR_ERRORS * sizeof(char*));
> +
> /* Set the dedicated error messages. */
> SETMSG(HA_ERR_KEY_NOT_FOUND, ER_DEFAULT(ER_KEY_NOT_FOUND));
> SETMSG(HA_ERR_FOUND_DUPP_KEY, ER_DEFAULT(ER_DUP_KEY));
> diff --git a/storage/maria/maria_def.h b/storage/maria/maria_def.h
> index 705562eb795..b256466f1f0 100644
> --- a/storage/maria/maria_def.h
> +++ b/storage/maria/maria_def.h
> @@ -778,6 +785,14 @@ typedef struct st_maria_share
> myf write_flag;
> enum data_file_type data_file_type;
> enum pagecache_page_type page_type; /* value depending transactional */
> +
> + /*
> + tracked will cause lost bytes (not aligned) but this is ok as it is always
> + used with tmp_file_tracking if set
I don't understand your explanation why "this is ok"
> + */
> + my_bool tracked; /* Tracked table (always internal) */
> + struct tmp_file_tracking track_data,track_index;
> +
> /**
> if Checkpoint looking at table; protected by close_lock or THR_LOCK_maria
> */
> diff --git a/include/my_sys.h b/include/my_sys.h
> index af509c7df45..0df12963ab9 100644
> --- a/include/my_sys.h
> +++ b/include/my_sys.h
> @@ -178,6 +180,17 @@ uchar *my_large_malloc(size_t *size, myf my_flags);
> void my_large_free(void *ptr, size_t size);
> void my_large_page_truncate(size_t *size);
>
> +/* Tracking tmp file usage */
> +
> +struct tmp_file_tracking
> +{
> + ulonglong last_position;
I had to read how "last_position" is used to realize that it is
in fact the previous file_size value. Why not call it previous_file_size?
Or recorded_file_size? Or old_file_size? It's not a "position".
> + ulonglong file_size;
> +};
> +
> +typedef int (*TMPFILE_SIZE_CB)(struct tmp_file_tracking *track, int no_error);
> +extern TMPFILE_SIZE_CB update_tmp_file_size;
> +
> #ifdef _WIN32
> extern BOOL my_obtain_privilege(LPCSTR lpPrivilege);
> #endif
> diff --git a/sql/sql_class.h b/sql/sql_class.h
> index c1efb49f6da..42eee9913ec 100644
> --- a/sql/sql_class.h
> +++ b/sql/sql_class.h
> @@ -1070,12 +1074,22 @@ typedef struct system_status_var
> */
>
> #define last_system_status_var questions
> -#define last_cleared_system_status_var local_memory_used
> +
> +/* Parameters to set_status_var_init() */
> +
> +#define STATUS_OFFSET(A) offsetof(STATUS_VAR,A)
> +/* Clear as part of flush */
> +#define clear_up_to_tmp_space_used STATUS_OFFSET(tmp_space_used)
> +/* Clear as part of startup */
> +#define clear_up_to_memory_used STATUS_OFFSET(local_memory_used)
> +/* Full initialization. Note that global_memory_used is updated early! */
> +#define clear_up_to_global_memory_used STATUS_OFFSET(global_memory_used)
> +#define last_restored_status_var clear_up_to_tmp_space_used
those are bad names, and the fact that you need to rename these defines
and change all the code that uses them kind of shows they're not good.
When there will be a new variable, clear_up_to_tmp_space_used can
become clear_up_to_new_variable and everything will need to
be changed again.
Better use names that don't depend on individual variables;
last_restored_status_var and last_cleared_system_status_var are good.
Try to split the variables logically. First there are additive
variables - the global value is the sum of all local values. They're
cleared, they're added up to the global value, etc.
Then there are non-additive ones, not cleared, not added - like
local_memory_used and tmp_space_used, for example. And global_memory_used
is special because it's initialized early, right?
So maybe defines based on these properties would be easier to understand,
because now I'm totally at a loss when to use clear_up_to_tmp_space_used,
when to use clear_up_to_memory_used, when clear_up_to_global_memory_used,
and when last_restored_status_var.
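Something in this direction (only an illustration, the exact grouping may
need adjusting):

  /* additive vars: summed into the global values, cleared on flush */
  #define additive_status_vars_end    STATUS_OFFSET(questions)
  /* cleared on flush, but not additive */
  #define flush_cleared_vars_end      STATUS_OFFSET(tmp_space_used)
  /* cleared only on startup */
  #define startup_cleared_vars_end    STATUS_OFFSET(local_memory_used)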
> +
>
> /** Number of contiguous global status variables */
> -constexpr int COUNT_GLOBAL_STATUS_VARS= int(offsetof(STATUS_VAR,
> - last_system_status_var) /
> - sizeof(ulong)) + 1;
> +constexpr int COUNT_GLOBAL_STATUS_VARS=
> + int(STATUS_OFFSET(last_system_status_var) /sizeof(ulong)) + 1;
>
> /*
> Global status variables
> diff --git a/sql/filesort.cc b/sql/filesort.cc
> index abcd4127a0b..df4508b635a 100644
> --- a/sql/filesort.cc
> +++ b/sql/filesort.cc
> @@ -1082,6 +1078,11 @@ static ha_rows find_all_keys(THD *thd, Sort_param *param, SQL_SELECT *select,
> DBUG_RETURN(num_records);
>
> err:
> + if (!quick_select)
> + {
> + (void) file->extra(HA_EXTRA_NO_CACHE);
> + file->ha_rnd_end();
> + }
why?
> sort_form->column_bitmaps_set(save_read_set, save_write_set);
> DBUG_RETURN(HA_POS_ERROR);
> } /* find_all_keys */
> diff --git a/sql/mysqld.cc b/sql/mysqld.cc
> index dc4a955aa79..4727b7b5413 100644
> --- a/sql/mysqld.cc
> +++ b/sql/mysqld.cc
> @@ -1947,6 +1950,10 @@ static void mysqld_exit(int exit_code)
> if (exit_code == 0 || opt_endinfo)
> SAFEMALLOC_REPORT_MEMORY(0);
> }
> + if (global_tmp_space_used)
> + fprintf(stderr, "Warning: tmp_space not freed: %llu\n",
> + (ulonglong) global_tmp_space_used);
This is confusing. Almost all (if not all) temp space is automatically
freed by the OS when the process exits, so this warning will only
scare the user about lost space when in fact no space is lost;
it's only that some space was not accounted for inside MariaDB.
I suggest either making this warning debug-only or rephrasing it like
"Note: %llu tmp_space was lost and will be freed automatically when the
process exits"
> +
> DBUG_LEAVE;
> #ifdef _WIN32
> my_report_svc_status(SERVICE_STOPPED, exit_code, 0);
> @@ -7571,6 +7636,7 @@ SHOW_VAR status_vars[]= {
> {"Tc_log_page_size", (char*) &tc_log_page_size, SHOW_LONG_NOFLUSH},
> {"Tc_log_page_waits", (char*) &tc_log_page_waits, SHOW_LONG},
> #endif
> + {"tmp_space_used", (char*) offsetof(STATUS_VAR, tmp_space_used), SHOW_LONGLONG_STATUS},
first letter - upper case
> #ifdef HAVE_POOL_OF_THREADS
> {"Threadpool_idle_threads", (char *) &show_threadpool_idle_threads, SHOW_SIMPLE_FUNC},
> {"Threadpool_threads", (char *) &show_threadpool_threads, SHOW_SIMPLE_FUNC},
> diff --git a/sql/log.cc b/sql/log.cc
> index 460cefea47b..a9bc3758dd5 100644
> --- a/sql/log.cc
> +++ b/sql/log.cc
> @@ -2388,6 +2384,8 @@ bool Event_log::check_write_error(THD *thd)
> case ER_TRANS_CACHE_FULL:
> case ER_STMT_CACHE_FULL:
> case ER_ERROR_ON_WRITE:
> + case EE_LOCAL_TMP_SPACE_FULL:
> + case EE_GLOBAL_TMP_SPACE_FULL:
not a good idea to mix EE_ OS-like errors and ER_ SQL layer errors
> case ER_BINLOG_LOGGING_IMPOSSIBLE:
> checked= TRUE;
> break;
> @@ -6650,7 +6651,8 @@ Event_log::flush_and_set_pending_rows_event(THD *thd, Rows_log_event* event,
> {
> set_write_error(thd, is_transactional);
> if (check_cache_error(thd, cache_data) &&
> - stmt_has_updated_non_trans_table(thd))
> + (stmt_has_updated_non_trans_table(thd) ||
> + !is_transactional))
why?
> cache_data->set_incident();
> delete pending;
> cache_data->set_pending(NULL);
> @@ -7868,8 +7877,16 @@ int Event_log::write_cache(THD *thd, binlog_cache_data *cache_data)
> DBUG_RETURN(res ? ER_ERROR_ON_WRITE : 0);
> }
>
> - if (reinit_io_cache(cache, READ_CACHE, 0, 0, 0))
> + /*
> + Allow flush of transaction logs to temporary go over the tmp space limit
> + as we do not want the commit to fail
> + */
> + cache->myflags&= ~MY_TRACK_WITH_LIMIT;
> + res= reinit_io_cache(cache, READ_CACHE, 0, 0, 0);
> + cache->myflags|= MY_TRACK_WITH_LIMIT;
you also remove MY_TRACK_WITH_LIMIT inside reinit_io_cache(),
so this is redundant. Or it's redundant inside reinit_io_cache();
either way, there is no point in doing it twice.
> + if (res)
> DBUG_RETURN(ER_ERROR_ON_WRITE);
> +
> /* Amount of remaining bytes in the IO_CACHE read buffer. */
> size_t log_file_pos;
> uchar header_buf[LOG_EVENT_HEADER_LEN];
> diff --git a/mysys/mf_iocache.c b/mysys/mf_iocache.c
> index 4ee1331bdb3..6214057f9e3 100644
> --- a/mysys/mf_iocache.c
> +++ b/mysys/mf_iocache.c
> @@ -73,6 +74,50 @@ int (*_my_b_encr_read)(IO_CACHE *info,uchar *Buffer,size_t Count)= 0;
> int (*_my_b_encr_write)(IO_CACHE *info,const uchar *Buffer,size_t Count)= 0;
>
>
> +static inline my_bool tmp_file_track(IO_CACHE *info, ulonglong file_size)
> +{
> + if ((info->myflags & (MY_TRACK | MY_TRACK_WITH_LIMIT)) &&
> + update_tmp_file_size)
> + {
> + if (info->tracking.file_size < file_size)
> + {
> + int error;
> + info->tracking.file_size= file_size;
> + if ((error= update_tmp_file_size(&info->tracking,
> + !(info->myflags &
> + MY_TRACK_WITH_LIMIT))))
> + {
> + if (info->myflags & MY_WME)
> + my_error(my_errno, MYF(0));
> + info->error= -1;
> + return 1;
> + }
> + }
> + }
> + return 0;
> +}
> +
> +my_bool io_cache_tmp_file_track(IO_CACHE *info, ulonglong file_size)
> +{
> + return tmp_file_track(info, file_size);
> +}
> +
> +
> +void end_tracking_io_cache(IO_CACHE *info)
what does "end tracking IO_CACHE" mean?
I read it as "end tracking of this IO_CACHE, it is no longer
tracked after this function call" - but it's not what it is doing.
so what how do you read this function name?
> +{
> + if ((info->myflags & (MY_TRACK | MY_TRACK_WITH_LIMIT)) &&
> + info->tracking.file_size)
> + {
> + info->tracking.file_size= 0;
> + update_tmp_file_size(&info->tracking, 1);
> + }
> +}
> +
> +void truncate_io_cache(IO_CACHE *info)
> +{
> + if (my_chsize(info->file, 0, 0, MYF(MY_WME)) <= 0)
> + end_tracking_io_cache(info);
Not quite. If my_chsize() returns -1, you should not do
info->tracking.file_size= 0, because the file size wasn't actually changed.
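I.e. (assuming my_chsize() returns 0 on success here):

  void truncate_io_cache(IO_CACHE *info)
  {
    /* forget the tracked size only if the file was actually truncated */
    if (my_chsize(info->file, 0, 0, MYF(MY_WME)) == 0)
      end_tracking_io_cache(info);
  }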
> +}
>
> static void
> init_functions(IO_CACHE* info)
Regards,
Sergei
Chief Architect, MariaDB Server
and security(a)mariadb.org
Re: 0d6dc52f43d: MDEV-30046 Refactor write_record and fix idempotent replication
by Sergei Golubchik 04 May '24
Hi, Nikita,
On Apr 29, Nikita Malyavin wrote:
> revision-id: 0d6dc52f43d (mariadb-10.5.24-175-g0d6dc52f43d)
> parent(s): 652e9ee405e
> author: Nikita Malyavin
> committer: Nikita Malyavin
> timestamp: 2024-04-27 21:08:21 +0200
> message:
>
> MDEV-30046 Refactor write_record and fix idempotent replication
>
> Idempotent write_row works same as REPLACE: if there is a duplicating
> record in the table, then it will be deleted and re-inserted, with the
> same update optimization.
>
> The code in Rows_log_event::write_row was basically copy-pasted from
> write_record.
>
> What's done:
> REPLACE operation was unified across replication and sql. It is now
> represented as a Write_record class that holds the whole state and allows
> re-using some resources in between the row writes.
>
> Replace, IODKU and single insert implementations are split across different
> methods, resulting in much cleaner code.
>
> The entry point is preserved as a single Write_record::write_record() call.
> The implementation to call is chosen on the constructor stage.
>
> This allowed several optimizations to be done:
> 1. The table key list is not iterated for every row. We find last unique key in
> the order of checking once and preserve it across the rows. See last_uniq_key().
> 2. ib_handler::referenced_by_foreign_key acquires a global lock. This call was
> done per row as well. Now all the table config that allows optimized replace is
> folded into a single boolean field can_optimize. All the fields to check are
> even stored in a single register on a 64-bit platform.
> 3. DUP_REPLACE and DUP_UPDATE cases now have one less level of indirection
> 4. modified_non_trans_tables is checked and set only when it's really needed.
> 5. Obsolete bitmap manipulations are removed.
>
> Also:
> * Unify replace initialization step across implementations:
> add prepare_for_replace and finalize_replace
> * alloca is removed in favor of mem_root allocation. This memory is reused
> across the rows.
> * An rpl-related callback is added to the replace branch, meaning that an extra
> check is made per row replace even for the common case. It can be avoided with
> templates if considered a problem.
>
> diff --git a/mysql-test/main/long_unique_bugs_replication.test b/mysql-test/main/long_unique_bugs_replication.test
> index 9c44d13e6a5..f2225cde101 100644
> --- a/mysql-test/main/long_unique_bugs_replication.test
> +++ b/mysql-test/main/long_unique_bugs_replication.test
> +
> +--echo #
> +--echo # End of 10.4 tests
> +--echo #
...
> --echo #
> --echo # End of 10.4 tests
> --echo #
maybe 10.5?
> diff --git a/sql/log_event_server.cc b/sql/log_event_server.cc
> index 38e8b062f18..b5810eeaf7b 100644
> --- a/sql/log_event_server.cc
> +++ b/sql/log_event_server.cc
> @@ -5705,10 +5708,10 @@ int Rows_log_event::do_apply_event(rpl_group_info *rgi)
> &m_cols_ai : &m_cols);
> bitmap_intersect(table->write_set, after_image);
>
> - this->slave_exec_mode= slave_exec_mode_options; // fix the mode
> -
> + COPY_INFO copy_info;
> + Write_record write_record;
> // Do event specific preparations
> - error= do_before_row_operations(rli);
> + error= do_before_row_operations(rgi, ©_info, &write_record);
You create write_record on the stack and save it inside this->m_write_record.
1. why not make it permanently a part of Rows_log_event?
2. if you keep it on the stack, please reset m_write_record= 0
before returning from ::do_apply_event(). And maybe even do
m_write_record= &write_record here, not in ::do_before_row_operations();
it'll be a bit easier to see how a local variable is stored and then freed,
and that it doesn't actually leave the scope.
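I.e. roughly:

  int Rows_log_event::do_apply_event(rpl_group_info *rgi)
  {
    /* ... */
    COPY_INFO copy_info;
    Write_record write_record;
    m_write_record= &write_record; /* stored and cleared in the same scope */
    error= do_before_row_operations(rgi, &copy_info);
    /* ... row processing ... */
    m_write_record= NULL;          /* reset before the stack object dies */
    /* ... */
  }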
>
> /*
> Bug#56662 Assertion failed: next_insert_id == 0, file handler.cc
> @@ -7132,6 +7148,40 @@ Write_rows_log_event::do_before_row_operations(const Slave_reporting_capability
> m_table->mark_auto_increment_column(true);
> }
>
> + if (slave_exec_mode == SLAVE_EXEC_MODE_IDEMPOTENT &&
> + (m_table->file->ha_table_flags() & HA_DUPLICATE_POS ||
> + m_table->s->long_unique_table))
> + error= m_table->file->ha_rnd_init_with_error(0);
I'd put this inside if (slave_exec_mode == SLAVE_EXEC_MODE_IDEMPOTENT)
about ~40 lines above the place where you put it now.
> +
> + if (!error)
> + {
> + bzero(copy_info, sizeof *copy_info);
> + copy_info->handle_duplicates=
> + slave_exec_mode == SLAVE_EXEC_MODE_IDEMPOTENT ?
> + DUP_REPLACE : DUP_ERROR;
simply copy_info->handle_duplicates= thd->lex->duplicates;
> + copy_info->table_list= m_table->pos_in_table_list;
> +
> + int (*callback)(void *, void*)= NULL;
> + if (!get_flags(COMPLETE_ROWS_F))
> + {
> + /*
> + If row is incomplete we will use the record found to fill
> + missing columns.
> + */
> + callback= [](void *e, void* r)->int {
> + auto rgi= static_cast<rpl_group_info*>(r);
> + auto event= static_cast<Write_rows_log_event*>(e);
> + return event->incomplete_record_callback(rgi);
> + };
> + }
> + new (write_record) Write_record(thd, m_table, copy_info,
> + m_vers_from_plain &&
> + m_table->versioned(VERS_TIMESTAMP),
> + m_table->triggers && do_invoke_trigger(),
> + NULL, callback, this, rgi);
please use write_record->init(...) here, and in all similar places below;
I won't comment on them separately.
> + m_write_record= write_record;
> + }
> +
> return error;
> }
>
> @@ -7173,7 +7223,15 @@ Write_rows_log_event::do_after_row_operations(const Slave_reporting_capability *
> {
> m_table->file->print_error(local_error, MYF(0));
> }
> - return error? error : local_error;
> + int rnd_error= 0;
> + if (m_table->file->inited)
> + {
> + DBUG_ASSERT(slave_exec_mode == SLAVE_EXEC_MODE_IDEMPOTENT);
> + DBUG_ASSERT(m_table->file->ha_table_flags() & HA_DUPLICATE_POS ||
> + m_table->s->long_unique_table);
DBUG_ASSERT(m_table->file->inited == handler::RND);
> + rnd_error= m_table->file->ha_rnd_end();
> + }
> + return error? error : local_error ? local_error : rnd_error;
> }
>
> bool Rows_log_event::process_triggers(trg_event_type event,
> diff --git a/sql/sql_insert.cc b/sql/sql_insert.cc
> index ff71361493a..e410efda121 100644
> --- a/sql/sql_insert.cc
> +++ b/sql/sql_insert.cc
> @@ -677,6 +678,30 @@ Field **TABLE::field_to_fill()
> return triggers && triggers->nullable_fields() ? triggers->nullable_fields() : field;
> }
>
> +int prepare_for_replace(TABLE *table, enum_duplicates handle_duplicates,
> + bool ignore)
> +{
> + bool create_lookup_handler= handle_duplicates != DUP_ERROR;
why do you initialize create_lookup_handler here
and then immediately repeat it two lines below? :)
less confusing would be
bool create_lookup_handler= handle_duplicates != DUP_ERROR || ignore;
if (create_lookup_handler)
{
> + if (ignore || handle_duplicates != DUP_ERROR)
> + {
> + create_lookup_handler= true;
> + table->file->extra(HA_EXTRA_IGNORE_DUP_KEY);
> + if (table->file->ha_table_flags() & HA_DUPLICATE_POS ||
> + table->s->long_unique_table)
> + {
> + if (table->file->ha_rnd_init_with_error(false))
> + return 1;
> + }
> + }
> + return table->file->prepare_for_insert(create_lookup_handler);
> +}
> +
> +int finalize_replace(TABLE *table, enum_duplicates handle_duplicates,
> + bool ignore)
> +{
DBUG_ASSERT(m_table->file->inited != handler::INDEX);
> + return table->file->inited ? table->file->ha_rnd_end() : 0;
> +}
> +
>
> /**
> INSERT statement implementation
> @@ -1728,31 +1745,74 @@ int mysql_prepare_insert(THD *thd, TABLE_LIST *table_list,
> }
>
>
> /* Check if there is more uniq keys after field */
this comment isn't correct anymore, is it?
on the other hand, the function is small and the name is clear,
I think you can simply delete the comment, if you'd like.
> +static ushort get_last_unique_key(TABLE *table, KEY *key_info)
> +{
> + for (uint key= table->s->keys; key; --key)
> + if (key_info[key-1].flags & HA_NOSAME)
> + return key-1;
> + return MAX_KEY;
> +}
>
> +/**
> + An update optimization can be done in REPLACE only when certain criteria are met.
> + In particular, if the "error key" was the last one to check. If not, there can
> + be other keys that can fail the uniqueness check after we will delete the row.
> +
> + The problem arises with the order of the key checking. It is divided in two
> + parts:
> + 1. Higher-level handler::ha_write_row goes first. It checks for long unique
> + and period overlaps (the latter is not supported by REPLACE yet).
better "(when the latter will be supported by REPLACE)"
this way the comment will be less wrong and misleading when we'll finally
implement this support and but forget to update the comment :)
> + 2. Engine usual unique index check.
> + However, the order of the keys in TABLE_SHARE::key_info is:
> + * first, engine unique keys
> + * then long unique, which is not seen by engine
> + WITHOUT OVERLAPS key has no particular order
> + see sort_keys in sql_table.cc
> +
> + So we need to wrap this order.
> + @return last unique key to check for duplication, or MAX_KEY if none.
> + */
> -static int last_uniq_key(TABLE *table, const KEY *key, uint keynr)
> +ushort Write_record::get_last_unique_key() const
> {
> + if (info->handle_duplicates != DUP_REPLACE)
> + return MAX_KEY; // Not used.
> + if (!table->s->key_info)
> + return MAX_KEY;
> /*
> When an underlying storage engine informs that the unique key
> conflicts are not reported in the ascending order by setting
> the HA_DUPLICATE_KEY_NOT_IN_ORDER flag, we cannot rely on this
> information to determine the last key conflict.
>
> The information about the last key conflict will be used to
> do a replace of the new row on the conflicting row, rather
> than doing a delete (of old row) + insert (of new row).
>
> Hence check for this flag and disable replacing the last row
> - by returning 0 always. Returning 0 will result in doing
> + by returning MAX_KEY always. Returning MAX_KEY will result in doing
> a delete + insert always.
> */
> if (table->file->ha_table_flags() & HA_DUPLICATE_KEY_NOT_IN_ORDER)
> - return 0;
> + return MAX_KEY;
>
> - while (++keynr < table->s->keys)
> - if (key[keynr].flags & HA_NOSAME)
> - return 0;
> - return 1;
> + /*
> + Note, that TABLE_SHARE and TABLE see long uniques differently:
> + - TABLE_SHARE sees as HA_KEY_ALG_LONG_HASH and HA_NOSAME
> + - TABLE sees as usual non-unique indexes
> + */
> + return
> + table->key_info[0].flags & HA_NOSAME ?
> + /*
> + We have an in-engine unique, which should be checked last.
> + Go through the TABLE list, which knows nothing about long uniques.
> + */
> + ::get_last_unique_key(table, table->key_info) :
> + /*
> + We can only have long uniques.
> + Go through the TABLE_SHARE list, where long uniques can be found.
> + */
> + ::get_last_unique_key(table, table->s->key_info);
> }
>
>
> @@ -1786,449 +1846,511 @@ int vers_insert_history_row(TABLE *table)
> }
>
> /*
> - Write a record to table with optional deleting of conflicting records,
> - invoke proper triggers if needed.
> -
> - SYNOPSIS
> - write_record()
> - thd - thread context
> - table - table to which record should be written
> - info - COPY_INFO structure describing handling of duplicates
> - and which is used for counting number of records inserted
> - and deleted.
> - sink - result sink for the RETURNING clause
> -
> - NOTE
> - Once this record will be written to table after insert trigger will
> - be invoked. If instead of inserting new record we will update old one
> - then both on update triggers will work instead. Similarly both on
> - delete triggers will be invoked if we will delete conflicting records.
> -
> - Sets thd->transaction.stmt.modified_non_trans_table to TRUE if table which
> - is updated didn't have transactions.
> -
> - RETURN VALUE
> - 0 - success
> - non-0 - error
> -
> -int write_record(THD *thd, TABLE *table, COPY_INFO *info, select_result *sink)
> -{
> - int error, trg_error= 0;
> - char *key=0;
> - MY_BITMAP *save_read_set, *save_write_set;
> - ulonglong prev_insert_id= table->file->next_insert_id;
> - ulonglong insert_id_for_cur_row= 0;
> - ulonglong prev_insert_id_for_cur_row= 0;
> - DBUG_ENTER("write_record");
> + Retrieve the old (duplicated) row into record[1] to be able to
> + either update or delete the offending record. We either:
> +
> + - use ha_rnd_pos() with a row-id (available as dup_ref) to the
> + offending row, if that is possible (MyISAM and Blackhole, also long unique
> + and application-time periods), or else
> +
> + - use ha_index_read_idx_map() with the key that is duplicated, to
> + retrieve the offending row.
> */
> +int Write_record::locate_dup_record()
> +{
> + handler *h= table->file;
> + int error= 0;
> + if (h->ha_table_flags() & HA_DUPLICATE_POS || h->lookup_errkey != (uint)-1)
> + {
> + DBUG_PRINT("info", ("Locating offending record using rnd_pos()"));
> +
> + error= h->ha_rnd_pos(table->record[1], h->dup_ref);
> + if (unlikely(error))
> + {
> + DBUG_PRINT("info", ("rnd_pos() returns error %d",error));
> + h->print_error(error, MYF(0));
> + }
> + }
> + else
> + {
> + DBUG_PRINT("info",
> + ("Locating offending record using ha_index_read_idx_map"));
>
> + if (h->lookup_handler)
> + h= h->lookup_handler;
Why?
> +
> + error= h->extra(HA_EXTRA_FLUSH_CACHE);
> + if (unlikely(error))
> + {
> + DBUG_PRINT("info",("Error when setting HA_EXTRA_FLUSH_CACHE"));
> + return my_errno;
> + }
>
> - info->records++;
> - save_read_set= table->read_set;
> - save_write_set= table->write_set;
> + if (unlikely(key == NULL))
> + {
> + key= static_cast<uchar*>(thd->alloc(table->s->max_unique_length));
> + if (key == NULL)
> + {
> + DBUG_PRINT("info",("Can't allocate key buffer"));
> + return ENOMEM;
> + }
> + }
>
> - DBUG_EXECUTE_IF("rpl_write_record_small_sleep_gtid_100_200",
> + key_copy(key, table->record[0], table->key_info + key_nr, 0);
> +
> + error= h->ha_index_read_idx_map(table->record[1], key_nr, key,
> + HA_WHOLE_KEY, HA_READ_KEY_EXACT);
why did you revert commit 5e345281e3599c79 here?
> + if (unlikely(error))
> {
> - if (thd->rgi_slave && (thd->rgi_slave->current_gtid.seq_no == 100 ||
> - thd->rgi_slave->current_gtid.seq_no == 200))
> - my_sleep(20000);
> - });
> - if (info->handle_duplicates == DUP_REPLACE ||
> - info->handle_duplicates == DUP_UPDATE)
> + DBUG_PRINT("info", ("index_read_idx() returns %d", error));
> + h->print_error(error, MYF(0));
> + }
> + }
> +
> + if (unlikely(error))
> + return error;
> +
> + if (table->vfield)
> + {
> + /*
> + We have not yet called update_virtual_fields()
> + in handler methods for the just read row in record[1].
> + */
> + table->move_fields(table->field, table->record[1], table->record[0]);
> + error= table->update_virtual_fields(table->file, VCOL_UPDATE_FOR_REPLACE);
> + table->move_fields(table->field, table->record[0], table->record[1]);
> + }
> + return error;
> +}
> +
> +/**
> + REPLACE is defined as either INSERT or DELETE + INSERT. If
> + possible, we can replace it with an UPDATE, but that will not
> + work on InnoDB if FOREIGN KEY checks are necessary.
> +
> + Suppose that we got the duplicate key to be a key that is not
> + the last unique key for the table and we perform an update:
> + then there might be another key for which the unique check will
> + fail, so we're better off just deleting the row and trying to insert again.
> +
> + Additionally we don't use ha_update_row in following cases:
> + * when triggers should be invoked, as it may spoil the intermediate NEW_ROW
> + representation
> + * when we have referenced keys, as there can be different ON DELETE/ON UPDATE
> + actions
> +*/
> +int Write_record::replace_row(ha_rows *inserted, ha_rows *deleted)
> +{
> + int error;
> + while (unlikely(error=table->file->ha_write_row(table->record[0])))
> {
> - while (unlikely(error=table->file->ha_write_row(table->record[0])))
> + error= prepare_handle_duplicate(error);
> + if (unlikely(error))
> + return restore_on_error();
> +
> + if (incomplete_records_cb && (error= incomplete_records_cb(arg1, arg2)))
This callback is strange. I'd rather inherit Write_record_rpl
from Write_record and put rpl-specific logic there.
Also you could avoid a switch on INSERT/REPLACE/ODKU if you'd use
inheritance instead. But the switch is less strange than the callback :)
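Roughly (a sketch; the virtual hook and the member names are made up):

  class Write_record_rpl: public Write_record
  {
    Write_rows_log_event *m_event;
    rpl_group_info *m_rgi;

    /* assuming a virtual hook in Write_record for the duplicate path */
    int handle_incomplete_record() override
    {
      return m_event->incomplete_record_callback(m_rgi);
    }
  };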
> + return restore_on_error();
> +
> + if (can_optimize && key_nr == last_unique_key)
> {
> - uint key_nr;
> - /*
> - If we do more than one iteration of this loop, from the second one the
> - row will have an explicit value in the autoinc field, which was set at
> - the first call of handler::update_auto_increment(). So we must save
> - the autogenerated value to avoid thd->insert_id_for_cur_row to become
> - 0.
> - */
> - if (table->file->insert_id_for_cur_row > 0)
> - insert_id_for_cur_row= table->file->insert_id_for_cur_row;
> - else
> - table->file->insert_id_for_cur_row= insert_id_for_cur_row;
> - bool is_duplicate_key_error;
> - if (table->file->is_fatal_error(error, HA_CHECK_ALL))
> - goto err;
> - is_duplicate_key_error=
> - table->file->is_fatal_error(error, HA_CHECK_ALL & ~HA_CHECK_DUP);
> - if (!is_duplicate_key_error)
> - {
> - /*
> - We come here when we had an ignorable error which is not a duplicate
> - key error. In this we ignore error if ignore flag is set, otherwise
> - report error as usual. We will not do any duplicate key processing.
> - */
> - if (info->ignore)
> - {
> - table->file->print_error(error, MYF(ME_WARNING));
> - goto after_trg_or_ignored_err; /* Ignoring a not fatal error */
> - }
> - goto err;
> - }
> - if (unlikely((int) (key_nr = table->file->get_dup_key(error)) < 0))
> - {
> - error= HA_ERR_FOUND_DUPP_KEY; /* Database can't find key */
> - goto err;
> - }
> - DEBUG_SYNC(thd, "write_row_replace");
> + DBUG_PRINT("info", ("Updating row using ha_update_row()"));
> + if (table->versioned(VERS_TRX_ID))
> + table->vers_start_field()->store(0, false);
>
> - /* Read all columns for the row we are going to replace */
> - table->use_all_columns();
> - /*
> - Don't allow REPLACE to replace a row when a auto_increment column
> - was used. This ensures that we don't get a problem when the
> - whole range of the key has been used.
> - */
> - if (info->handle_duplicates == DUP_REPLACE && table->next_number_field &&
> - key_nr == table->s->next_number_index && insert_id_for_cur_row > 0)
> - goto err;
> - if (table->file->has_dup_ref())
> + error= table->file->ha_update_row(table->record[1], table->record[0]);
> +
> + if (likely(!error))
> + ++*deleted;
> + else if (error != HA_ERR_RECORD_IS_THE_SAME)
> + return on_ha_error(error);
> +
> + if (versioned)
> {
> - DBUG_ASSERT(table->file->inited == handler::RND);
> - if (table->file->ha_rnd_pos(table->record[1],table->file->dup_ref))
> - goto err;
> + store_record(table, record[2]);
> + error= vers_insert_history_row(table);
> + restore_record(table, record[2]);
> + if (unlikely(error))
> + return on_ha_error(error);
> }
> +
> + break;
> + }
> + else
> + {
> + DBUG_PRINT("info", ("Deleting offending row and trying to write"
> + " new one again"));
> +
> + auto *trg = table->triggers;
> + if (use_triggers && trg->process_triggers(table->in_use, TRG_EVENT_DELETE,
> + TRG_ACTION_BEFORE, TRUE))
> + return restore_on_error();
> + if (!versioned)
> + error = table->file->ha_delete_row(table->record[1]);
> else
> {
> - if (table->file->extra(HA_EXTRA_FLUSH_CACHE)) /* Not needed with NISAM */
> - {
> - error=my_errno;
> - goto err;
> - }
> -
> - if (!key)
> - {
> - if (!(key=(char*) my_safe_alloca(table->s->max_unique_length)))
> - {
> - error=ENOMEM;
> - goto err;
> - }
> - }
> - key_copy((uchar*) key,table->record[0],table->key_info+key_nr,0);
> - key_part_map keypart_map= (1 << table->key_info[key_nr].user_defined_key_parts) - 1;
> - if ((error= (table->file->ha_index_read_idx_map(table->record[1],
> - key_nr, (uchar*) key,
> - keypart_map,
> - HA_READ_KEY_EXACT))))
> - goto err;
> + store_record(table, record[2]);
> + restore_record(table, record[1]);
> + table->vers_update_end();
> + error = table->file->ha_update_row(table->record[1], table->record[0]);
> + restore_record(table, record[2]);
> }
> - if (table->vfield)
> +
> + if (unlikely(error))
> + return on_ha_error(error);
> + ++*deleted;
> +
> + if (use_triggers)
> {
> - /*
> - We have not yet called update_virtual_fields(VOL_UPDATE_FOR_READ)
> - in handler methods for the just read row in record[1].
> - */
> - table->move_fields(table->field, table->record[1], table->record[0]);
> - int verr = table->update_virtual_fields(table->file, VCOL_UPDATE_FOR_REPLACE);
> - table->move_fields(table->field, table->record[0], table->record[1]);
> - if (verr)
> - goto err;
> + error= trg->process_triggers(table->in_use, TRG_EVENT_DELETE,
> + TRG_ACTION_AFTER, TRUE);
> + if (unlikely(error))
> + return restore_on_error();
> }
> - if (info->handle_duplicates == DUP_UPDATE)
> - {
> - int res= 0;
> - /*
> - We don't check for other UNIQUE keys - the first row
> - that matches, is updated. If update causes a conflict again,
> - an error is returned
> - */
> - DBUG_ASSERT(table->insert_values != NULL);
> - store_record(table,insert_values);
> - restore_record(table,record[1]);
> - table->reset_default_fields();
> + }
> + }
>
> - /*
> - in INSERT ... ON DUPLICATE KEY UPDATE the set of modified fields can
> - change per row. Thus, we have to do reset_default_fields() per row.
> - Twice (before insert and before update).
> - */
> - DBUG_ASSERT(info->update_fields->elements ==
> - info->update_values->elements);
> - if (fill_record_n_invoke_before_triggers(thd, table,
> - *info->update_fields,
> - *info->update_values,
> - info->ignore,
> - TRG_EVENT_UPDATE))
> - goto before_trg_err;
> -
> - bool different_records= (!records_are_comparable(table) ||
> - compare_record(table));
> - /*
> - Default fields must be updated before checking view updateability.
> - This branch of INSERT is executed only when a UNIQUE key was violated
> - with the ON DUPLICATE KEY UPDATE option. In this case the INSERT
> - operation is transformed to an UPDATE, and the default fields must
> - be updated as if this is an UPDATE.
> - */
> - if (different_records && table->default_field)
> - table->evaluate_update_default_function();
> -
> - /* CHECK OPTION for VIEW ... ON DUPLICATE KEY UPDATE ... */
> - res= info->table_list->view_check_option(table->in_use, info->ignore);
> - if (res == VIEW_CHECK_SKIP)
> - goto after_trg_or_ignored_err;
> - if (res == VIEW_CHECK_ERROR)
> - goto before_trg_err;
> -
> - table->file->restore_auto_increment(prev_insert_id);
> - info->touched++;
> - if (different_records)
> - {
> - if (unlikely(error=table->file->ha_update_row(table->record[1],
> - table->record[0])) &&
> - error != HA_ERR_RECORD_IS_THE_SAME)
> - {
> - if (info->ignore &&
> - !table->file->is_fatal_error(error, HA_CHECK_ALL))
> - {
> - if (!(thd->variables.old_behavior &
> - OLD_MODE_NO_DUP_KEY_WARNINGS_WITH_IGNORE))
> - table->file->print_error(error, MYF(ME_WARNING));
> - goto after_trg_or_ignored_err;
> - }
> - goto err;
> - }
> + /*
> + If more than one iteration of the above loop is done, from
> + the second one the row being inserted will have an explicit
> + value in the autoinc field, which was set at the first call of
> + handler::update_auto_increment(). This value is saved to avoid
> + thd->insert_id_for_cur_row becoming 0. Use this saved autoinc
> + value.
> + */
> + if (table->file->insert_id_for_cur_row == 0)
> + table->file->insert_id_for_cur_row = insert_id_for_cur_row;
>
> - if (error != HA_ERR_RECORD_IS_THE_SAME)
> - {
> - info->updated++;
> - if (table->versioned() &&
> - table->vers_check_update(*info->update_fields))
> - {
> - if (table->versioned(VERS_TIMESTAMP))
> - {
> - store_record(table, record[2]);
> - if ((error= vers_insert_history_row(table)))
> - {
> - info->last_errno= error;
> - table->file->print_error(error, MYF(0));
> - trg_error= 1;
> - restore_record(table, record[2]);
> - goto after_trg_or_ignored_err;
> - }
> - restore_record(table, record[2]);
> - }
> - info->copied++;
> - }
> - }
> - else
> - error= 0;
> - /*
> - If ON DUP KEY UPDATE updates a row instead of inserting
> - one, it's like a regular UPDATE statement: it should not
> - affect the value of a next SELECT LAST_INSERT_ID() or
> - mysql_insert_id(). Except if LAST_INSERT_ID(#) was in the
> - INSERT query, which is handled separately by
> - THD::arg_of_last_insert_id_function.
> - */
> - prev_insert_id_for_cur_row= table->file->insert_id_for_cur_row;
> - insert_id_for_cur_row= table->file->insert_id_for_cur_row= 0;
> - trg_error= (table->triggers &&
> - table->triggers->process_triggers(thd, TRG_EVENT_UPDATE,
> - TRG_ACTION_AFTER, TRUE));
> - info->copied++;
> - }
> + return after_insert(inserted);
> +}
>
> - /*
> - Only update next_insert_id if the AUTO_INCREMENT value was explicitly
> - updated, so we don't update next_insert_id with the value from the
> - row being updated. Otherwise reset next_insert_id to what it was
> - before the duplicate key error, since that value is unused.
> - */
> - if (table->next_number_field_updated)
> - {
> - DBUG_ASSERT(table->next_number_field != NULL);
>
> - table->file->adjust_next_insert_id_after_explicit_value(table->next_number_field->val_int());
> - }
> - else if (prev_insert_id_for_cur_row)
> - {
> - table->file->restore_auto_increment(prev_insert_id_for_cur_row);
> - }
> - goto ok;
> +int Write_record::insert_on_duplicate_update(ha_rows *inserted,
> + ha_rows *updated)
> +{
> + int res= table->file->ha_write_row(table->record[0]);
> + if (likely(!res))
> + return after_insert(inserted);
> +
> + res=prepare_handle_duplicate(res);
> + if (unlikely(res))
> + return restore_on_error();
> +
> + ulonglong prev_insert_id_for_cur_row= 0;
> + /*
> + We don't check for other UNIQUE keys - the first row
> + that matches, is updated. If update causes a conflict again,
> + an error is returned
> + */
> + DBUG_ASSERT(table->insert_values != NULL);
> + store_record(table,insert_values);
> + restore_record(table,record[1]);
> + table->reset_default_fields();
> +
> + /*
> + in INSERT ... ON DUPLICATE KEY UPDATE the set of modified fields can
> + change per row. Thus, we have to do reset_default_fields() per row.
> + Twice (before insert and before update).
> + */
> + DBUG_ASSERT(info->update_fields->elements ==
> + info->update_values->elements);
> + if (fill_record_n_invoke_before_triggers(thd, table,
> + *info->update_fields,
> + *info->update_values,
> + info->ignore,
> + TRG_EVENT_UPDATE))
> + return restore_on_error();
> +
> + bool different_records= (!records_are_comparable(table) ||
> + compare_record(table));
> + /*
> + Default fields must be updated before checking view updateability.
> + This branch of INSERT is executed only when a UNIQUE key was violated
> + with the ON DUPLICATE KEY UPDATE option. In this case the INSERT
> + operation is transformed to an UPDATE, and the default fields must
> + be updated as if this is an UPDATE.
> + */
> + if (different_records && table->default_field)
> + table->evaluate_update_default_function();
> +
> + /* CHECK OPTION for VIEW ... ON DUPLICATE KEY UPDATE ... */
> + res= info->table_list->view_check_option(table->in_use, info->ignore);
> + if (unlikely(res))
> + {
> + if (res == VIEW_CHECK_ERROR)
> + return restore_on_error();
> + DBUG_ASSERT(res == VIEW_CHECK_SKIP);
> + return 0;
> + }
> +
> + table->file->restore_auto_increment(prev_insert_id);
> + info->touched++;
> + if (different_records)
> + {
> + int error= table->file->ha_update_row(table->record[1], table->record[0]);
> +
> + if (unlikely(error) && error != HA_ERR_RECORD_IS_THE_SAME)
> + {
> + if (info->ignore && !table->file->is_fatal_error(error, HA_CHECK_ALL))
Why not is_fatal_error(error)?
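I mean something like this (a sketch, reusing the wrapper you add further
down, so that the DUP_UPDATE path treats lock errors the same way as
single_insert() does):

  if (info->ignore && !is_fatal_error(error))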
> + {
> + if (!(thd->variables.old_behavior &
> + OLD_MODE_NO_DUP_KEY_WARNINGS_WITH_IGNORE))
> + table->file->print_error(error, MYF(ME_WARNING));
> + return 0;
> }
> - else /* DUP_REPLACE */
> + return on_ha_error(error);
> + }
> +
> + if (error != HA_ERR_RECORD_IS_THE_SAME)
> + {
> + ++*updated;
> + if (table->versioned() &&
> + table->vers_check_update(*info->update_fields))
> {
> - /*
> - The manual defines the REPLACE semantics that it is either
> - an INSERT or DELETE(s) + INSERT; FOREIGN KEY checks in
> - InnoDB do not function in the defined way if we allow MySQL
> - to convert the latter operation internally to an UPDATE.
> - We also should not perform this conversion if we have
> - timestamp field with ON UPDATE which is different from DEFAULT.
> - Another case when conversion should not be performed is when
> - we have ON DELETE trigger on table so user may notice that
> - we cheat here. Note that it is ok to do such conversion for
> - tables which have ON UPDATE but have no ON DELETE triggers,
> - we just should not expose this fact to users by invoking
> - ON UPDATE triggers.
> -
> - Note, TABLE_SHARE and TABLE see long uniques differently:
> - - TABLE_SHARE sees as HA_KEY_ALG_LONG_HASH and HA_NOSAME
> - - TABLE sees as usual non-unique indexes
> - */
> - bool is_long_unique= table->s->key_info &&
> - table->s->key_info[key_nr].algorithm ==
> - HA_KEY_ALG_LONG_HASH;
> - if ((is_long_unique ?
> - /*
> - We have a long unique. Test that there are no in-engine
> - uniques and the current long unique is the last long unique.
> - */
> - !(table->key_info[0].flags & HA_NOSAME) &&
> - last_uniq_key(table, table->s->key_info, key_nr) :
> - /*
> - We have a normal key - not a long unique.
> - Test is the current normal key is unique and
> - it is the last normal unique.
> - */
> - last_uniq_key(table, table->key_info, key_nr)) &&
> - !table->file->referenced_by_foreign_key() &&
> - (!table->triggers || !table->triggers->has_delete_triggers()))
> - {
> - if (table->versioned(VERS_TRX_ID))
> - {
> - bitmap_set_bit(table->write_set, table->vers_start_field()->field_index);
> - table->file->column_bitmaps_signal();
> - table->vers_start_field()->store(0, false);
> - }
> - if (unlikely(error= table->file->ha_update_row(table->record[1],
> - table->record[0])) &&
> - error != HA_ERR_RECORD_IS_THE_SAME)
> - goto err;
> - if (likely(!error))
> - {
> - info->deleted++;
> - if (!table->file->has_transactions())
> - thd->transaction->stmt.modified_non_trans_table= TRUE;
> - if (table->versioned(VERS_TIMESTAMP))
> - {
> - store_record(table, record[2]);
> - error= vers_insert_history_row(table);
> - restore_record(table, record[2]);
> - if (unlikely(error))
> - goto err;
> - }
> - }
> - else
> - error= 0; // error was HA_ERR_RECORD_IS_THE_SAME
> - /*
> - Since we pretend that we have done insert we should call
> - its after triggers.
> - */
> - goto after_trg_n_copied_inc;
> - }
> - else
> + if (versioned)
> {
> - if (table->triggers &&
> - table->triggers->process_triggers(thd, TRG_EVENT_DELETE,
> - TRG_ACTION_BEFORE, TRUE))
> - goto before_trg_err;
> -
> - if (!table->versioned(VERS_TIMESTAMP))
> - error= table->file->ha_delete_row(table->record[1]);
> - else
> - {
> - store_record(table, record[2]);
> - restore_record(table, record[1]);
> - table->vers_update_end();
> - error= table->file->ha_update_row(table->record[1],
> - table->record[0]);
> - restore_record(table, record[2]);
> - }
> + store_record(table, record[2]);
> + error= vers_insert_history_row(table);
> + restore_record(table, record[2]);
> if (unlikely(error))
> - goto err;
> - if (!table->versioned(VERS_TIMESTAMP))
> - info->deleted++;
> - else
> - info->updated++;
> - if (!table->file->has_transactions_and_rollback())
> - thd->transaction->stmt.modified_non_trans_table= TRUE;
> - if (table->triggers &&
> - table->triggers->process_triggers(thd, TRG_EVENT_DELETE,
> - TRG_ACTION_AFTER, TRUE))
> - {
> - trg_error= 1;
> - goto after_trg_or_ignored_err;
> - }
> - /* Let us attempt do write_row() once more */
> + return on_ha_error(error);
> }
> + ++*inserted;
> }
> }
> -
> - /*
> - If more than one iteration of the above while loop is done, from
> - the second one the row being inserted will have an explicit
> - value in the autoinc field, which was set at the first call of
> - handler::update_auto_increment(). This value is saved to avoid
> - thd->insert_id_for_cur_row becoming 0. Use this saved autoinc
> - value.
> - */
> - if (table->file->insert_id_for_cur_row == 0)
> - table->file->insert_id_for_cur_row= insert_id_for_cur_row;
> -
> /*
> - Restore column maps if they where replaced during an duplicate key
> - problem.
> + If ON DUP KEY UPDATE updates a row instead of inserting
> + one, it's like a regular UPDATE statement: it should not
> + affect the value of a next SELECT LAST_INSERT_ID() or
> + mysql_insert_id(). Except if LAST_INSERT_ID(#) was in the
> + INSERT query, which is handled separately by
> + THD::arg_of_last_insert_id_function.
> */
> - if (table->read_set != save_read_set ||
> - table->write_set != save_write_set)
> - table->column_bitmaps_set(save_read_set, save_write_set);
> + prev_insert_id_for_cur_row= table->file->insert_id_for_cur_row;
> + insert_id_for_cur_row= table->file->insert_id_for_cur_row= 0;
> +
> + ++*inserted; // Conforms to the older behavior
> +
> + if (use_triggers
> + && table->triggers->process_triggers(thd, TRG_EVENT_UPDATE,
> + TRG_ACTION_AFTER, TRUE))
> + return restore_on_error();
> }
> - else if (unlikely((error=table->file->ha_write_row(table->record[0]))))
> +
> + /*
> + Only update next_insert_id if the AUTO_INCREMENT value was explicitly
> + updated, so we don't update next_insert_id with the value from the
> + row being updated. Otherwise reset next_insert_id to what it was
> + before the duplicate key error, since that value is unused.
> + */
> + if (table->next_number_field_updated)
> + {
> + DBUG_ASSERT(table->next_number_field != NULL);
> +
> + table->file->adjust_next_insert_id_after_explicit_value(
> + table->next_number_field->val_int());
> + }
> + else if (prev_insert_id_for_cur_row)
> + {
> + table->file->restore_auto_increment(prev_insert_id_for_cur_row);
> + }
> +
> + return send_data();
> +}
> +
> +bool Write_record::is_fatal_error(int error)
> +{
> + return error == HA_ERR_LOCK_DEADLOCK ||
> + error == HA_ERR_LOCK_WAIT_TIMEOUT ||
These errors were checked only in the rpl code, not anywhere in sql_insert.cc.
I don't think it's wrong to check for them everywhere consistently,
but does it change the behavior? Can you create a test case for INSERT,
before your changes, that reaches this is_fatal_error() check with
HA_ERR_LOCK_DEADLOCK or HA_ERR_LOCK_WAIT_TIMEOUT? What will be the
behavior before and after your change?
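For reference, the check this wrapper replaces was just (sketch):

  /* old code: the engine alone decided what is fatal */
  table->file->is_fatal_error(error, HA_CHECK_ALL);

so if I read it right, the only new behavior is that HA_ERR_LOCK_DEADLOCK
and HA_ERR_LOCK_WAIT_TIMEOUT are now treated as fatal even when an engine's
is_fatal_error() would say otherwise.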
> + table->file->is_fatal_error(error, HA_CHECK_ALL);
> +}
> +
> +int Write_record::single_insert(ha_rows *inserted)
> +{
> + DBUG_EXECUTE_IF("rpl_write_record_small_sleep_gtid_100_200",
> + {
> + uint64 seq= thd->rgi_slave ? thd->rgi_slave->current_gtid.seq_no : 0;
> + if (seq == 100 || seq == 200)
> + my_sleep(20000);
> + });
> +
> + int error= table->file->ha_write_row(table->record[0]);
> + if (unlikely(error))
> {
> DEBUG_SYNC(thd, "write_row_noreplace");
> - if (!info->ignore ||
> - table->file->is_fatal_error(error, HA_CHECK_ALL))
> - goto err;
> + if (!info->ignore || is_fatal_error(error))
> + return on_ha_error(error);
> if (!(thd->variables.old_behavior &
> OLD_MODE_NO_DUP_KEY_WARNINGS_WITH_IGNORE))
> table->file->print_error(error, MYF(ME_WARNING));
> table->file->restore_auto_increment(prev_insert_id);
> - goto after_trg_or_ignored_err;
> + return 0;
> }
> + return after_insert(inserted);
> +}
>
> -after_trg_n_copied_inc:
> - info->copied++;
> - thd->record_first_successful_insert_id_in_cur_stmt(table->file->insert_id_for_cur_row);
> - trg_error= (table->triggers &&
> - table->triggers->process_triggers(thd, TRG_EVENT_INSERT,
> - TRG_ACTION_AFTER, TRUE));
> +/**
> + Write a record to table with optional deleting of conflicting records,
> + invoke proper triggers if needed.
> +
> + @note
> +    Once the record has been written to the table, the AFTER INSERT
> +    trigger is invoked. If instead of inserting a new record we update
> +    an old one, both UPDATE triggers fire instead. Similarly, both
> +    DELETE triggers are invoked if we delete conflicting records.
>
> -ok:
> +  Sets thd->transaction.stmt.modified_non_trans_table to TRUE if the
> +  updated table is not transactional.
> +
> + @return
> + 0 - success
> + non-0 - error
> +*/
> +int Write_record::write_record()
> +{
> + ha_rows inserted= 0, updated= 0;
> + prev_insert_id= table->file->next_insert_id;
> + info->records++;
> + ignored_error= false;
> + int res= 0;
> + switch(info->handle_duplicates)
> + {
> + case DUP_ERROR:
> + res= single_insert(&inserted);
> + break;
> + case DUP_UPDATE:
> + res= insert_on_duplicate_update(&inserted, &updated);
> + info->updated+= updated;
> + break;
> + case DUP_REPLACE:
> + res= replace_row(&inserted, &updated);
> + if (versioned)
> + info->updated+= updated;
> + else
> + info->deleted+= updated;
> + }
> +
> + info->copied+= inserted;
> + if (inserted || updated)
> + notify_non_trans_table_modified();
> + return res;
> +}
> +
> +
> +int Write_record::prepare_handle_duplicate(int error)
> +{
> /*
> - We send the row after writing it to the table so that the
> - correct values are sent to the client. Otherwise it won't show
> - autoinc values (generated inside the handler::ha_write()) and
> - values updated in ON DUPLICATE KEY UPDATE.
> +    If we do more than one iteration of the replace loop, from the second
> +    one the row will have an explicit value in the autoinc field, which
> +    was set at the first call of handler::update_auto_increment(). So we
> +    must save the autogenerated value to avoid thd->insert_id_for_cur_row
> +    becoming 0.
> */
> - if (sink && sink->send_data(thd->lex->returning()->item_list) < 0)
> - trg_error= 1;
> + if (table->file->insert_id_for_cur_row > 0)
> + insert_id_for_cur_row= table->file->insert_id_for_cur_row;
> + else
> + table->file->insert_id_for_cur_row= insert_id_for_cur_row;
> +
> + if (is_fatal_error(error))
> + return on_ha_error(error);
>
> -after_trg_or_ignored_err:
> - if (key)
> - my_safe_afree(key,table->s->max_unique_length);
> + bool is_duplicate_key_error=
> + table->file->is_fatal_error(error, HA_CHECK_ALL & ~HA_CHECK_DUP);
> + if (!is_duplicate_key_error)
> + {
> + /*
> + We come here when we had an ignorable error which is not a duplicate
> +      key error. In this case we ignore the error if the ignore flag is
> +      set, otherwise report the error as usual. We will not do any
> +      duplicate key processing.
> + */
> + if (info->ignore)
> + {
> + table->file->print_error(error, MYF(ME_WARNING));
> + ignored_error= true;
> + return 1; /* Ignoring a not fatal error */
> + }
> + return on_ha_error(error);
> + }
> +
> + key_nr= table->file->get_dup_key(error);
> +
> + if (unlikely(key_nr == std::numeric_limits<decltype(key_nr)>::max()))
Well, the handler returns (uint)-1, please compare with it, instead of
inventing numerous other ways of achieving the same.
And why did you decide to use ushort for the key number?
We pretty much never use ushort; the key number is uint everywhere.
I agree that ushort is enough size-wise, but let's rather be consistent.
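I.e. something like this, with the member declared as uint (a sketch):

  key_nr= table->file->get_dup_key(error);
  if (unlikely(key_nr == (uint) -1))
    return on_ha_error(HA_ERR_FOUND_DUPP_KEY); /* Database can't find key */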
> + return on_ha_error(HA_ERR_FOUND_DUPP_KEY); /* Database can't find key */
> +
> + DEBUG_SYNC(thd, "write_row_replace");
> +
> + /* Read all columns for the row we are going to replace */
> + table->use_all_columns();
> + /*
> + Don't allow REPLACE to replace a row when an auto_increment column
> + was used. This ensures that we don't get a problem when the
> + whole range of the key has been used.
> + */
> + if (info->handle_duplicates == DUP_REPLACE && table->next_number_field &&
> + key_nr == table->s->next_number_index && insert_id_for_cur_row > 0)
> + return on_ha_error(error);
> +
> + error= locate_dup_record();
> + if (error)
> + return on_ha_error(error);
> +
> + return 0;
> +}
> +
> +inline
> +void Write_record::notify_non_trans_table_modified()
> +{
> if (!table->file->has_transactions_and_rollback())
> thd->transaction->stmt.modified_non_trans_table= TRUE;
ok, as you like.
I'm personally not a great fan of one-liners that are used only once.
> - DBUG_RETURN(trg_error);
> +}
>
> -err:
> +inline
> +int Write_record::on_ha_error(int error)
> +{
> info->last_errno= error;
> table->file->print_error(error,MYF(0));
> -
> -before_trg_err:
> + DBUG_PRINT("info", ("Returning error %d", error));
> + return restore_on_error();
> +}
> +
> +inline
> +int Write_record::restore_on_error()
> +{
> table->file->restore_auto_increment(prev_insert_id);
> - if (key)
> - my_safe_afree(key, table->s->max_unique_length);
> - table->column_bitmaps_set(save_read_set, save_write_set);
> - DBUG_RETURN(1);
> + return ignored_error ? 0 : 1;
> +}
> +
> +inline
> +int Write_record::after_insert(ha_rows *inserted)
> +{
> + ++*inserted;
> + thd->record_first_successful_insert_id_in_cur_stmt(table->file->insert_id_for_cur_row);
> + return after_ins_trg() || send_data();
> +}
> +
> +inline
> +int Write_record::after_ins_trg()
> +{
> + int res= 0;
> + if (use_triggers)
> + res= table->triggers->process_triggers(thd, TRG_EVENT_INSERT,
> + TRG_ACTION_AFTER, TRUE);
> + return res;
> +}
> +
> +inline
> +int Write_record::send_data()
> +{
> + /*
> + We send the row after writing it to the table so that the
> + correct values are sent to the client. Otherwise it won't show
> + autoinc values (generated inside the handler::ha_write()) and
> + values updated in ON DUPLICATE KEY UPDATE.
> + */
> + return sink ? sink->send_data(thd->lex->returning()->item_list) < 0 : 0;
> }
>
>
> +
> /******************************************************************************
> Check that there aren't any null_fields
> ******************************************************************************/
> @@ -4816,18 +4934,8 @@ select_create::prepare(List<Item> &_values, SELECT_LEX_UNIT *u)
>
> restore_record(table,s->default_values); // Get empty record
> thd->cuted_fields=0;
> - bool create_lookup_handler= info.handle_duplicates != DUP_ERROR;
> - if (info.ignore || info.handle_duplicates != DUP_ERROR)
> - {
> - create_lookup_handler= true;
> - table->file->extra(HA_EXTRA_IGNORE_DUP_KEY);
> - if (table->file->ha_table_flags() & HA_DUPLICATE_POS)
> - {
> - if (table->file->ha_rnd_init_with_error(0))
> - DBUG_RETURN(1);
> - }
> - }
> - table->file->prepare_for_insert(create_lookup_handler);
> + if (prepare_for_replace(table, info.handle_duplicates, info.ignore))
ok.
strange that select_create::prepare doesn't call select_insert::prepare
> + DBUG_RETURN(1);
> if (info.handle_duplicates == DUP_REPLACE &&
> (!table->triggers || !table->triggers->has_delete_triggers()))
> table->file->extra(HA_EXTRA_WRITE_CAN_REPLACE);
> diff --git a/sql/sql_parse.cc b/sql/sql_parse.cc
> index 5db53b6f2b8..7ce4850f534 100644
> --- a/sql/sql_parse.cc
> +++ b/sql/sql_parse.cc
> @@ -4814,7 +4815,8 @@ mysql_execute_command(THD *thd)
> &lex->value_list,
> lex->duplicates,
> lex->ignore,
> - result)))
> + result,
> + &write)))
Why do you need Write_record here on the stack?
Wouldn't it be more logical to have it in select_insert?
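I mean something like (a sketch, the member name is mine):

  class select_insert: public select_result_interceptor
  {
    ...
    Write_record write;   /* owned here instead of being passed in */
  };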
> {
> if (lex->analyze_stmt)
> ((select_result_interceptor*)sel_result)->disable_my_ok_calls();
Regards,
Sergei
Chief Architect, MariaDB Server
and security(a)mariadb.org
Re: 652e9ee405e: MDEV-30046 wrong row targeted with "insert ... on duplicate" and "replace"
by Sergei Golubchik 29 Apr '24
Hi, Nikita,
Good fix.
Ok to push, thanks
On Apr 29, Nikita Malyavin wrote:
> revision-id: 652e9ee405e (mariadb-10.5.24-174-g652e9ee405e)
> parent(s): 3d417476256
> author: Nikita Malyavin
> committer: Nikita Malyavin
> timestamp: 2024-04-27 21:08:21 +0200
> message:
>
> MDEV-30046 wrong row targeted with "insert ... on duplicate" and "replace"
>
> When HA_DUPLICATE_POS is not supported, the row to replace was located by
> ha_index_read_idx_map, which navigates by the hash alone.
>
> Thus, given a hash collision, it may choose an incorrect row.
>
> handler::position would be correct and very convenient to use here.
> The code is updated to set dup_ref in the handler independently of engine
> capabilities whenever an extra lookup is made (for a long unique or
> something else, for example WITHOUT OVERLAPS).
>
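The pattern the fix relies on is roughly this (a sketch, not the exact
patch code):

  table->file->position(table->record[1]);  /* fills handler::ref */
  memcpy(table->file->dup_ref, table->file->ref, table->file->ref_length);
  ...
  table->file->ha_rnd_pos(table->record[1], table->file->dup_ref);

i.e. the conflicting row is re-read by its physical position rather than
through the hash index, so a hash collision cannot misdirect the lookup.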
Regards,
Sergei
Chief Architect, MariaDB Server
and security(a)mariadb.org
1
1
I'm developing a project which uses soundex matching (alongside other options more sophisticated than that). This is custom software with a MariaDB back end.
MariaDB's built-in soundex produces codes of apparently arbitrary length; the usual is 4 characters. I've come to the conclusion that the best way forward is to mimic MariaDB's soundex in my code.
I cannot find a description of the algorithm MariaDB uses, though: the only source code I can find is in mf_soundex.c, and that is explicitly limited to 4 characters. But I have installed triggers which generate codes such as A4152636125262.
Where is the soundex function found? Anyone know?
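For concreteness, my reading of the classic algorithm is the sketch below
(simplified: it skips the special H/W rules some variants have, and I'm
assuming MariaDB's built-in differs mainly in not truncating to 4
characters, which would explain the long codes):

  #include <ctype.h>

  /* Classic Soundex digit for a letter; '0' means "not coded" */
  static char soundex_digit(int c)
  {
    switch (toupper(c))
    {
    case 'B': case 'F': case 'P': case 'V': return '1';
    case 'C': case 'G': case 'J': case 'K':
    case 'Q': case 'S': case 'X': case 'Z': return '2';
    case 'D': case 'T': return '3';
    case 'L': return '4';
    case 'M': case 'N': return '5';
    case 'R': return '6';
    default:  return '0';
    }
  }

  /* out must hold at least strlen(in)+1 bytes (5 suffice if truncating) */
  void my_soundex(const char *in, char *out, int truncate_to_4)
  {
    char *p= out, last;
    while (*in && !isalpha((unsigned char) *in))
      in++;                                     /* skip leading non-letters */
    if (!*in)
    {
      *out= 0;
      return;
    }
    *p++= (char) toupper((unsigned char) *in);  /* keep the first letter */
    last= soundex_digit((unsigned char) *in++);
    for (; *in; in++)
    {
      char d;
      if (!isalpha((unsigned char) *in))
        continue;
      d= soundex_digit((unsigned char) *in);
      if (d != '0' && d != last)    /* drop vowels and adjacent duplicates */
        *p++= d;
      last= d;
    }
    if (truncate_to_4)
    {
      while (p < out + 4)
        *p++= '0';                  /* pad short codes with zeros */
      p= out + 4;                   /* cut long codes at 4 characters */
    }
    *p= 0;
  }

E.g. my_soundex("Robert", buf, 1) gives "R163", the textbook value.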