Hi, Aleksey! On Jun 29, Aleksey Midenkov wrote:
revision-id: 9bf4b92cbc5 (mariadb-10.5.2-168-g9bf4b92cbc5) parent(s): 478301d9b9a author: Aleksey Midenkov <midenok@gmail.com> committer: Aleksey Midenkov <midenok@gmail.com> timestamp: 2020-04-17 17:04:24 +0300 message:
MDEV-17554 Auto-create new partition for system versioned tables with history partitioned by INTERVAL/LIMIT
== Syntax change ==
Keyword AUTO_INCREMENT (or AUTO) enables partition auto-creation.
I still think that AUTO_INCREMENT here looks rather out-of-place, I don't know how to explain to users why AUTO_INCREMENT is allowed here. But if we'll have it, we'll have to support it here for a long time for compatibility reasons. "Because somebody might be already using it" I'd suggest to support only AUTO here. Or, if the parser can do it easily, a much more readable syntax would be create table t1 (x int) with system versioning auto partition by system_time interval 1 hour; this would be the best, easy to read, matches the documentation (where it could be called "auto partitioning"). If it won't cause any difficult parser problems, I'd prefer us to use that.
create or replace table t1 (x int) with system versioning partition by system_time interval 1 hour auto_increment;
create or replace table t1 (x int) with system versioning partition by system_time limit 1000 auto;
Or with explicit partitions:
create or replace table t1 (x int) with system versioning partition by system_time interval 1 hour auto (partition p0 history, partition pn current);
== Description ==
Before executing history-generating DML command add N history partitions, so that N would be sufficient for potentially generated history. N > 1 may be required when history is rotated by INTERVAL and timestamp was jumped to future further than interval value.
It is assumed that one DML command will not generate history rows more than specified LIMIT. Otherwise history would overflow generated partition, a warning would be printed, and user action would be required to rebuild partitions.
Auto-creation is implemented by synchronous fast_alter_partition_table() call from the thread of the executed DML command before the command itself (by the fallback and retry mechanism similar to Discovery feature, see Open_table_context).
Creating history partitions was made by the principle of minimal disruption of the main business process. I.e. when procedure of creation fails it will not (if possible) fail the original DML command. Warning info will display what happened and the last history partition may overflow. In such case partition rebuild is required to correctly display history info. User application may detect warning code ER_VERS_HIST_PART_FAILED and stop execution if that is preferred.
The name for newly added partitions are generated like default partition names with extension of MDEV-22155 (which avoids name clashes by extending assignment counter to next free-enough gap).
These DML commands trigger auto-creation:
* DELETE (including multi-delete, excluding DELETE HISTORY) * UPDATE (including multi-update) * REPLACE (including REPLACE .. SELECT)
INSERT ... ON DUPLICATE KEY UPDATE ?
diff --git a/mysql-test/suite/versioning/common.inc b/mysql-test/suite/versioning/common.inc index 355b571e5a0..b35a5138015 100644 --- a/mysql-test/suite/versioning/common.inc +++ b/mysql-test/suite/versioning/common.inc @@ -6,6 +6,7 @@ if (!$TEST_VERSIONING_SO) source include/have_innodb.inc;
set @@session.time_zone='+00:00'; +set @@global.time_zone='+00:00';
Why is that? I'd expect your auto-adding of partitions to happen in the context of a session, using session time zone. Is it for replication?
select ifnull(max(transaction_id), 0) into @start_trx_id from mysql.transaction_registry; set @test_start=now(6);
diff --git a/mysql-test/suite/versioning/r/delete_history.result b/mysql-test/suite/versioning/r/delete_history.result index cb865a835b3..90c9e4777bb 100644 --- a/mysql-test/suite/versioning/r/delete_history.result +++ b/mysql-test/suite/versioning/r/delete_history.result @@ -63,8 +63,6 @@ insert into t values (1); update t set a= 2; update t set a= 3; delete history from t; -Warnings: -Warning 4114 Versioned table `test`.`t`: last HISTORY partition (`p1`) is out of LIMIT, need more HISTORY partitions
good riddance, it didn't make much sense anyway
# The above warning is one command late (MDEV-20345) ^^^ select * from t for system_time all; a diff --git a/sql/ha_partition.h b/sql/ha_partition.h index 85cb736b5bd..5fb893b4f0e 100644 --- a/sql/ha_partition.h +++ b/sql/ha_partition.h @@ -1623,7 +1623,7 @@ class ha_partition :public handler
could you also, please, fix (in a separate commit) ha_partition::part_records() to take a proper partition_element* argument, not void* ?
for (; part_id < part_id_end; ++part_id) { handler *file= m_file[part_id]; - DBUG_ASSERT(bitmap_is_set(&(m_part_info->read_partitions), part_id)); + bitmap_set_bit(&(m_part_info->read_partitions), part_id); file->info(HA_STATUS_VARIABLE | HA_STATUS_NO_LOCK | HA_STATUS_OPEN); part_recs+= file->stats.records; } diff --git a/sql/partition_info.cc b/sql/partition_info.cc index e9dbf2b49c3..72312ee7ac4 100644 --- a/sql/partition_info.cc +++ b/sql/partition_info.cc @@ -814,10 +814,16 @@ bool partition_info::has_unique_name(partition_element *element) vers_info->interval Limit by fixed time interval vers_info->hist_part (out) Working history partition */ -void partition_info::vers_set_hist_part(THD *thd) +uint partition_info::vers_set_hist_part(THD *thd, bool auto_inc)
let's not all it auto_inc. it'll add noise to every grep for auto_inc issues. What about auto_add ?
{ + DBUG_ASSERT(!thd->lex->last_table()->vers_conditions.delete_history); + + uint create_count= 0; + auto_inc= auto_inc && vers_info->auto_inc; + if (vers_info->limit) { + DBUG_ASSERT(!vers_info->interval.is_set()); ha_partition *hp= (ha_partition*)(table->file); partition_element *next= NULL; List_iterator<partition_element> it(partitions); @@ -836,22 +842,26 @@ void partition_info::vers_set_hist_part(THD *thd) { if (next == vers_info->now_part) { - my_error(WARN_VERS_PART_FULL, MYF(ME_WARNING|ME_ERROR_LOG), - table->s->db.str, table->s->table_name.str, - vers_info->hist_part->partition_name, "LIMIT"); + if (auto_inc) + create_count= 1; + else + my_error(WARN_VERS_PART_FULL, MYF(ME_WARNING|ME_ERROR_LOG), + table->s->db.str, table->s->table_name.str, + vers_info->hist_part->partition_name, "LIMIT"); } else vers_info->hist_part= next; } - return; + // reserve at least one history partition + if (auto_inc && create_count == 0 && + vers_info->hist_part->id + 1 == vers_info->now_part->id) + create_count= 1;
Questionable. What does it solve?
} - - if (vers_info->interval.is_set()) + else if (vers_info->interval.is_set() && + vers_info->hist_part->range_value <= thd->query_start()) { - if (vers_info->hist_part->range_value > thd->query_start()) - return; - partition_element *next= NULL; + bool error= true; List_iterator<partition_element> it(partitions); while (next != vers_info->hist_part) next= it++; @@ -860,12 +870,200 @@ void partition_info::vers_set_hist_part(THD *thd) { vers_info->hist_part= next; if (next->range_value > thd->query_start()) - return; + { + error= false; + break;
but here you don't "reserve at least one history partition" Why INTERVAL is different?
+ } + } + if (error) + { + if (auto_inc) + { + DBUG_ASSERT(thd->query_start() >= vers_info->hist_part->range_value); + my_time_t diff= thd->query_start() - vers_info->hist_part->range_value; + if (diff > 0) + { + size_t delta= vers_info->interval.seconds(); + create_count= diff / delta + 1; + if (diff % delta) + create_count++; + } + else + create_count= 1; + } + else + { + my_error(WARN_VERS_PART_FULL, MYF(ME_WARNING|ME_ERROR_LOG), + table->s->db.str, table->s->table_name.str, + vers_info->hist_part->partition_name, "INTERVAL"); + } } - my_error(WARN_VERS_PART_FULL, MYF(ME_WARNING|ME_ERROR_LOG), - table->s->db.str, table->s->table_name.str, - vers_info->hist_part->partition_name, "INTERVAL"); } + + return create_count; +} + + +/** + @brief Run fast_alter_partition_table() to add new history partitions + for tables requiring them. +*/ +void vers_add_auto_parts(THD *thd, TABLE_LIST* tl, uint num_parts) +{ + HA_CREATE_INFO create_info; + Alter_info alter_info; + String query; + partition_info *save_part_info= thd->work_part_info; + Query_tables_list save_query_tables; + Reprepare_observer *save_reprepare_observer= thd->m_reprepare_observer; + Diagnostics_area new_stmt_da(thd->query_id, false, true); + Diagnostics_area *save_stmt_da= thd->get_stmt_da(); + bool save_no_write_to_binlog= thd->lex->no_write_to_binlog; + const CSET_STRING save_query= thd->query_string; + thd->m_reprepare_observer= NULL; + thd->lex->reset_n_backup_query_tables_list(&save_query_tables); + thd->in_sub_stmt|= SUB_STMT_AUTO_HIST; + thd->lex->no_write_to_binlog= !thd->is_current_stmt_binlog_format_row(); + TABLE *table= tl->table; + + DBUG_ASSERT(!thd->is_error()); + /* NB: we have to preserve m_affected_rows, m_row_count_func, m_last_insert_id, etc */ + thd->set_stmt_da(&new_stmt_da); + new_stmt_da.set_overwrite_status(true);
Is it needed? You've just started using a new Diagnostics_area, the status should be DA_EMPTY here, nothing to overwrite.
+ + { + DBUG_ASSERT(table->s->get_table_ref_type() == TABLE_REF_BASE_TABLE); + DBUG_ASSERT(table->versioned()); + DBUG_ASSERT(table->part_info); + DBUG_ASSERT(table->part_info->vers_info); + alter_info.reset(); + alter_info.partition_flags= ALTER_PARTITION_ADD|ALTER_PARTITION_AUTO_HIST; + create_info.init(); + create_info.alter_info= &alter_info; + Alter_table_ctx alter_ctx(thd, tl, 1, &table->s->db, &table->s->table_name); + + MDL_REQUEST_INIT(&tl->mdl_request, MDL_key::TABLE, tl->db.str, + tl->table_name.str, MDL_SHARED_NO_WRITE, MDL_TRANSACTION); + if (thd->mdl_context.acquire_lock(&tl->mdl_request, + thd->variables.lock_wait_timeout)) + goto exit; + table->mdl_ticket= tl->mdl_request.ticket; + + create_info.db_type= table->s->db_type(); + create_info.options|= HA_VERSIONED_TABLE; + DBUG_ASSERT(create_info.db_type); + + create_info.vers_info.set_start(table->s->vers_start_field()->field_name); + create_info.vers_info.set_end(table->s->vers_end_field()->field_name); + + partition_info *part_info= new partition_info(); + if (unlikely(!part_info)) + { + my_error(ER_OUT_OF_RESOURCES, MYF(0)); + goto exit; + } + part_info->use_default_num_partitions= false; + part_info->use_default_num_subpartitions= false; + part_info->num_parts= num_parts; + part_info->num_subparts= table->part_info->num_subparts; + part_info->subpart_type= table->part_info->subpart_type; + if (unlikely(part_info->vers_init_info(thd))) + { + my_error(ER_OUT_OF_RESOURCES, MYF(0)); + goto exit; + } + + // NB: set_ok_status() requires DA_EMPTY + thd->get_stmt_da()->reset_diagnostics_area();
Is it needed? You've just started using a new Diagnostics_area, the status should be DA_EMPTY here.
+ + thd->work_part_info= part_info; + if (part_info->set_up_defaults_for_partitioning(thd, table->file, NULL, + table->part_info->next_part_no(num_parts))) + { + push_warning(thd, Sql_condition::WARN_LEVEL_WARN, + ER_VERS_HIST_PART_FAILED, + "Auto-increment history partition: " + "setting up defaults failed"); + my_error(ER_VERS_HIST_PART_FAILED, MYF(ME_WARNING), + tl->db.str, tl->table_name.str); + goto exit; + } + bool partition_changed= false; + bool fast_alter_partition= false; + if (prep_alter_part_table(thd, table, &alter_info, &create_info, + &partition_changed, &fast_alter_partition)) + { + push_warning(thd, Sql_condition::WARN_LEVEL_WARN, ER_VERS_HIST_PART_FAILED, + "Auto-increment history partition: " + "alter partitition prepare failed"); + my_error(ER_VERS_HIST_PART_FAILED, MYF(ME_WARNING), + tl->db.str, tl->table_name.str); + goto exit; + } + if (!fast_alter_partition) + { + push_warning(thd, Sql_condition::WARN_LEVEL_WARN, ER_VERS_HIST_PART_FAILED, + "Auto-increment history partition: " + "fast alter partitition is not possible"); + my_error(ER_VERS_HIST_PART_FAILED, MYF(ME_WARNING), + tl->db.str, tl->table_name.str); + goto exit; + } + DBUG_ASSERT(partition_changed); + if (mysql_prepare_alter_table(thd, table, &create_info, &alter_info, + &alter_ctx)) + { + push_warning(thd, Sql_condition::WARN_LEVEL_WARN, ER_VERS_HIST_PART_FAILED, + "Auto-increment history partition: " + "alter prepare failed"); + my_error(ER_VERS_HIST_PART_FAILED, MYF(ME_WARNING), + tl->db.str, tl->table_name.str); + goto exit; + } + + // Forge query string for rpl logging + if (!thd->lex->no_write_to_binlog) + { + query.set(STRING_WITH_LEN("ALTER TABLE `"), &my_charset_latin1);
better use StringBuffer<...> query; because you're doing lots of reallocs now.
+ + if (query.append(table->s->db) || + query.append(STRING_WITH_LEN("`.`")) || + query.append(table->s->table_name) ||
this is wrong, the table and db names can have backticks inside. use append_identifier() instead.
+ query.append("` ADD PARTITION PARTITIONS ") || + query.append_ulonglong(part_info->num_parts) || + query.append(" AUTO")) + { + my_error(ER_OUT_OF_RESOURCES, MYF(ME_ERROR_LOG)); + goto exit; + } + CSET_STRING qs(query.c_ptr(), query.length(), &my_charset_latin1); + thd->set_query(qs); + }
Should it be binlogged at all? May be just leave it to the slave to auto-add the partition as needed? Looks a bit suspicious now, you log an ALTER TABLE possibly in the middle of a transaction. It will be replayed differently, not as auto-adding, in particular, it will commit. So, possibly different results on the slave, perhaps different gtid. Do clear it out with Andrei please. Or don't binlog auto-adding at all.
+ + if (fast_alter_partition_table(thd, table, &alter_info, &create_info, + tl, &table->s->db, &table->s->table_name)) + { + push_warning(thd, Sql_condition::WARN_LEVEL_WARN, ER_VERS_HIST_PART_FAILED, + "Auto-increment history partition: " + "alter partition table failed"); + my_error(ER_VERS_HIST_PART_FAILED, MYF(ME_WARNING), + tl->db.str, tl->table_name.str); + } + } + + if (!thd->transaction.stmt.is_empty()) + trans_commit_stmt(thd);
1. why would this operation register a transaction? 2. you amended a couple of checks like - if (thd->in_sub_stmt) + if (thd->in_sub_stmt & ~SUB_STMT_AUTO_HIST) but what will happen if it's, for example, an insert inside a trigger? So a real sub_stmt and SUB_STMT_AUTO_HIST?
+ +exit: + thd->work_part_info= save_part_info; + thd->m_reprepare_observer= save_reprepare_observer; + thd->lex->restore_backup_query_tables_list(&save_query_tables); + thd->in_sub_stmt&= ~SUB_STMT_AUTO_HIST; + if (!new_stmt_da.is_warning_info_empty()) + save_stmt_da->copy_sql_conditions_from_wi(thd, new_stmt_da.get_warning_info()); + thd->set_stmt_da(save_stmt_da); + thd->lex->no_write_to_binlog= save_no_write_to_binlog; + thd->set_query(save_query); }
diff --git a/sql/partition_info.h b/sql/partition_info.h index 7ae2d168068..ca68e61cbe2 100644 --- a/sql/partition_info.h +++ b/sql/partition_info.h @@ -72,9 +73,34 @@ struct Vers_part_info : public Sql_alloc my_time_t start; INTERVAL step; enum interval_type type; - bool is_set() { return type < INTERVAL_LAST; } + bool is_set() const { return type < INTERVAL_LAST; } + size_t seconds() const + { + if (step.second) + return step.second; + if (step.minute) + return step.minute * 60; + if (step.hour) + return step.hour * 3600; + if (step.day) + return step.day * 3600 * 24; + // comparison is used in rough estimates, it doesn't need to be calendar-correct
Are you sure the approximate value is enough here? You use it to estimate how many partitions to create. It's not a big deal if you create more than necessary (although it's not nice and we'll definitely get bug reports if that will happen). But you surely don't want to create less.
+ if (step.month) + return step.month * 3600 * 24 * 30; + DBUG_ASSERT(step.year); + return step.year * 86400 * 30 * 365; + } + bool lt(size_t secs) const + { + return seconds() < secs; + } + bool ge(size_t seconds) const + { + return !(this->lt(seconds)); + }
lt() and ge() don't seem to be used anywhere.
} interval; ulonglong limit; + bool auto_inc; partition_element *now_part; partition_element *hist_part; }; diff --git a/sql/sql_base.cc b/sql/sql_base.cc index b89be77f282..5583d70f8eb 100644 --- a/sql/sql_base.cc +++ b/sql/sql_base.cc @@ -1862,6 +1862,25 @@ bool open_table(THD *thd, TABLE_LIST *table_list, Open_table_context *ot_ctx) DBUG_PRINT("info",("Using locked table")); #ifdef WITH_PARTITION_STORAGE_ENGINE part_names_error= set_partitions_as_used(table_list, table); + if (table->part_info && + table->part_info->part_type == VERSIONING_PARTITION && + !table_list->vers_conditions.delete_history && + table_list->lock_type >= TL_WRITE_ALLOW_WRITE && + table_list->mdl_request.type == MDL_SHARED_WRITE) + { + switch (thd->lex->sql_command) + { + case SQLCOM_DELETE: + case SQLCOM_UPDATE: + case SQLCOM_REPLACE: + case SQLCOM_REPLACE_SELECT: + case SQLCOM_DELETE_MULTI: + case SQLCOM_UPDATE_MULTI: + /* Rotation is still needed under LOCK TABLES */ + table->part_info->vers_set_hist_part(thd, false);
ALTER TABLE works under LOCK TABLES, so this should be possible too, but let's have it as a separate task.
+ default:; + } + } #endif goto reset; } @@ -2104,6 +2123,37 @@ bool open_table(THD *thd, TABLE_LIST *table_list, Open_table_context *ot_ctx) tc_add_table(thd, table); }
+#ifdef WITH_PARTITION_STORAGE_ENGINE + if (table->part_info && + table->part_info->part_type == VERSIONING_PARTITION && + !table_list->vers_conditions.delete_history && + table_list->lock_type >= TL_WRITE_ALLOW_WRITE && + table_list->mdl_request.type >= MDL_SHARED_WRITE && + table_list->mdl_request.type < MDL_EXCLUSIVE) + { + switch (thd->lex->sql_command) + { + case SQLCOM_LOCK_TABLES: + case SQLCOM_DELETE: + case SQLCOM_UPDATE: + case SQLCOM_REPLACE: + case SQLCOM_REPLACE_SELECT: + case SQLCOM_DELETE_MULTI: + case SQLCOM_UPDATE_MULTI:
this is quite complex condition, better not to duplicate it. May be something like bool need_set_hist_part(TABLE_LIST *table_list, enum_sql_command sql_command) { if (table_list->table->part_info && table_list->table->part_info->part_type == VERSIONING_PARTITION && !table_list->vers_conditions.delete_history && table_list->lock_type >= TL_WRITE_ALLOW_WRITE && table_list->mdl_request.type >= MDL_SHARED_WRITE && table_list->mdl_request.type < MDL_EXCLUSIVE) { switch(sql_command) { case SQLCOM_LOCK_TABLES: case SQLCOM_DELETE: case SQLCOM_UPDATE: case SQLCOM_REPLACE: case SQLCOM_REPLACE_SELECT: case SQLCOM_DELETE_MULTI: case SQLCOM_UPDATE_MULTI: return true; } default:; } return false; }
+ ot_ctx->vers_create_count= table->part_info->vers_set_hist_part(thd, true); + if (ot_ctx->vers_create_count) + { + ot_ctx->request_backoff_action(Open_table_context::OT_ADD_HISTORY_PARTITION, + table_list); + MYSQL_UNBIND_TABLE(table->file); + tc_release_table(table);
There's no MYSQL_UNBIND_TABLE/tc_release_table after other request_backoff_action invocations. Why this one is special?
+ DBUG_RETURN(TRUE); + } + default:; + } + } +#endif + if (!(flags & MYSQL_OPEN_HAS_MDL_LOCK) && table->s->table_category < TABLE_CATEGORY_INFORMATION) { diff --git a/sql/sql_class.h b/sql/sql_class.h index 14516811262..145ac5c5f64 100644 --- a/sql/sql_class.h +++ b/sql/sql_class.h @@ -3346,6 +3347,10 @@ class THD: public THD_count, /* this must be first */
#ifdef WITH_PARTITION_STORAGE_ENGINE partition_info *work_part_info; + /** + List of tables requiring new history partition. + */ + List<TABLE_SHARE> vers_auto_part_tables;
this doesn't seem to be used anywhere
#endif
#ifndef EMBEDDED_LIBRARY diff --git a/sql/sql_yacc.yy b/sql/sql_yacc.yy index 7cc1faea79b..3f6c1793432 100644 --- a/sql/sql_yacc.yy +++ b/sql/sql_yacc.yy @@ -7521,14 +7529,17 @@ add_partition_rule:
add_part_extra: /* empty */ - | '(' part_def_list ')' + | '(' part_def_list ')' opt_vers_auto_inc { - LEX *lex= Lex; - lex->part_info->num_parts= lex->part_info->partitions.elements; + Lex->part_info->num_parts= Lex->part_info->partitions.elements; + if ($4) + Lex->alter_info.partition_flags|= ALTER_PARTITION_AUTO_HIST; } - | PARTITIONS_SYM real_ulong_num + | PARTITIONS_SYM real_ulong_num opt_vers_auto_inc { Lex->part_info->num_parts= $2; + if ($3) + Lex->alter_info.partition_flags|= ALTER_PARTITION_AUTO_HIST;
I'm confused. I thought that ALTER_PARTITION_AUTO_HIST is what you set in vers_add_auto_parts() to mark auto-adding of a new partition. But here you set it in the parser, when a user specifies AUTO in partition specifications. So it seems you're using this flag for two very different purposes. How do you distinguish between them? And why did you decide to do it this way?
} ;
diff --git a/sql/sql_partition.cc b/sql/sql_partition.cc index fbed614489d..25017ee5425 100644 --- a/sql/sql_partition.cc +++ b/sql/sql_partition.cc @@ -4825,6 +4829,13 @@ uint prep_alter_part_table(THD *thd, TABLE *table, Alter_info *alter_info, DBUG_RETURN(TRUE); }
+ if (alter_info->partition_flags & ALTER_PARTITION_AUTO_HIST && + (!table->part_info || !table->part_info->vers_info)) + { + my_error(ER_SYNTAX_ERROR, MYF(0));
can it even happen? Or should it be DBUG_ASSERT(0) here?
+ DBUG_RETURN(TRUE); + } + partition_info *alt_part_info= thd->lex->part_info; /* This variable is TRUE in very special case when we add only DEFAULT @@ -5312,7 +5323,9 @@ that are reorganised. now_part= el; } } - if (*fast_alter_table && tab_part_info->vers_info->interval.is_set()) + if (*fast_alter_table && + !(alter_info->partition_flags & ALTER_PARTITION_AUTO_HIST) && + tab_part_info->vers_info->interval.is_set())
this !ALTER_PARTITION_AUTO_HIST - do you mean not vers_add_auto_parts() or not sql statement that uses AUTO ?
{ partition_element *hist_part= tab_part_info->vers_info->hist_part; if (hist_part->range_value <= thd->query_start()) @@ -5347,7 +5360,8 @@ that are reorganised. */ if (!(alter_info->partition_flags & ALTER_PARTITION_TABLE_REORG)) { - if (!alt_part_info->use_default_partitions) + if (!alt_part_info->use_default_partitions && + !(alter_info->partition_flags & ALTER_PARTITION_AUTO_HIST))
Sorry, I don't understand that if() at all. What does it do?
{ DBUG_PRINT("info", ("part_info: %p", tab_part_info)); tab_part_info->use_default_partitions= FALSE;
Regards, Sergei VP of MariaDB Server Engineering and security@mariadb.org