Hello Sergei, Thanks for the review. Please see comments below: On 6/24/23 12:20 AM, Sergei Golubchik wrote:
Hi, Alexander,
Please, see comments below. Sorry, it took a while.
On Jun 23, Alexander Barkov wrote:
revision-id: 8d51c6d234b (mariadb-10.11.2-13-g8d51c6d234b) parent(s): ceb0e7f944b author: Alexander Barkov committer: Alexander Barkov timestamp: 2023-03-21 10:07:57 +0400 message:
MDEV-30164 System variable for default collations
diff --git a/mysql-test/main/ctype_collate_implicit.test
b/mysql-test/main/ctype_collate_implicit.test
--- /dev/null +++ b/mysql-test/main/ctype_collate_implicit.test @@ -0,0 +1,209 @@ +--source include/have_utf8.inc +--source include/have_utf8mb4.inc + +--echo # +--echo # MDEV-30164 System variable for default collations +--echo # + +SET @@character_set_collations= ' utf8mb3 = utf8mb3_bin , LATIN1 = LATIN1_BIN ';
also, please, test that parsing of commas is relaxed too. Like
set @@character_set_collations= ',,,utf8mb3 = utf8mb3_bin,,latin1 = latin1_bin,,,';
it makes it easier for the user to edit the value.
Done. <cut>
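The relaxed syntax is now accepted along these lines (a sketch, not copied verbatim from the test file; the normalized value in the comment is my expectation of what SELECT prints):

  SET @@character_set_collations= ',,,utf8mb3 = utf8mb3_bin,,latin1 = latin1_bin,,,';
  SELECT @@character_set_collations;
  # expected normalized output: utf8mb3=utf8mb3_bin,latin1=latin1_bin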
+++ b/mysql-test/suite/sys_vars/r/sysvars_server_notembedded.result @@ -502,6 +502,16 @@ NUMERIC_BLOCK_SIZE NULL ENUM_VALUE_LIST NULL READ_ONLY NO COMMAND_LINE_ARGUMENT NULL +VARIABLE_NAME CHARACTER_SET_COLLATIONS +VARIABLE_SCOPE SESSION +VARIABLE_TYPE VARCHAR +VARIABLE_COMMENT Default collations for character sets
Better to emphasize somehow that it's not a complete list of default collations, but adjustments to the compiled-in list of default collations. If this variable is empty, it doesn't mean that character sets have no default collations.
I'm not sure how to put this succinctly in the help text though
As discussed on slack, changed VARIABLE_COMMENT to: Overrides for character set default collations <cut>
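To illustrate the "overrides" wording with an example (a sketch, not taken from the test file; the collation names are what I expect given the 10.11 compiled-in defaults):

  SET @@character_set_collations='';
  CREATE TABLE t1 (a CHAR(10)) CHARACTER SET utf8mb4;
  # column "a" gets the compiled-in default, utf8mb4_general_ci
  DROP TABLE t1;

  SET @@character_set_collations='utf8mb4=utf8mb4_bin';
  CREATE TABLE t1 (a CHAR(10)) CHARACTER SET utf8mb4;
  # column "a" now gets the override, utf8mb4_bin
  DROP TABLE t1;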
diff --git a/sql/lex_charset.cc b/sql/lex_charset.cc --- a/sql/lex_charset.cc +++ b/sql/lex_charset.cc @@ -447,6 +451,17 @@ Lex_exact_charset_opt_extended_collate::find_default_collation() const }
+CHARSET_INFO * +Lex_exact_charset_opt_extended_collate:: + find_mapped_default_collation(Charset_collation_map_st::Used *used, + const Charset_collation_map_st &map) const +{ + CHARSET_INFO *cs= find_compiled_default_collation(); + if (!cs) + return nullptr; + return map.get_collation_for_charset(used, cs);
This seems a bit redundant. find_compiled_default_collation() will do get_charset_by_csname(m_ci, MY_CS_PRIMARY) and only then you'll be able to search, because you map default compiled collation id to use specified collation id.
That's true. But get_charset_by_csname() is called only if we earlier processed a sequence "CHARACTER SET cs [ COLLATE cl ]", the optional COLLATE clause was present, and now we need to find the default collation for "cs". Without the optional COLLATE, find_compiled_default_collation() returns immediately here, without calling get_charset_by_csname():

  /*
    CREATE TABLE t1 (a CHAR(10) COLLATE DEFAULT) CHARACTER SET utf8mb4;
    Nothing to do, we have the default collation already.
  */
  if (m_ci->state & MY_CS_PRIMARY)
    return m_ci;

And by the way, the situation when find_compiled_default_collation() really gets called is very rare in itself:

  CREATE TABLE t1 (a CHAR(10) COLLATE DEFAULT)
    CHARACTER SET utf8mb4 COLLATE utf8mb4_bin;

Here the parser needs to resolve DEFAULT on the column level after it has parsed the table level. But this syntax is very uncommon.
but conceptually you need to map a charset to a collation, not a collation to a collation. If you'd map, say, cs->cs_name.str (just the pointer, not the string value) or, say, cs->cset, then you'd be mapping a charset to a collation. and here you'd be able to do just
return map.get_collation_for_charset(used, m_ci)
with find_compiled_default_collation() being done inside get_collation_for_charset() and only when Elem_st wasn't found
cs->cset is a pointer to MY_CHARSET_HANDLER. It cannot be used here, because the same handler is shared between multiple character sets. cs->cs_name.str could work. However, this may not be reliable with user-defined collations. Taking into account that the redundant lookup happens only in this very rare case, I propose not to change anything here for now. Instead, we eventually need to fix the collation library itself, to avoid the lookup inside get_charset_by_csname(m_ci, MY_CS_PRIMARY). <cut>
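For illustration, the effect of find_mapped_default_collation() in this rare syntax should be roughly as follows (a sketch, not copied from the test file; the expected collations are my reading of the code, not a verified result):

  SET character_set_collations='utf8mb4=utf8mb4_unicode_ci';
  CREATE TABLE t1 (a CHAR(10) COLLATE DEFAULT)
    CHARACTER SET utf8mb4 COLLATE utf8mb4_bin;
  # column "a" is expected to get utf8mb4_unicode_ci (the mapped default),
  # while the explicit table-level collation stays utf8mb4_bin
  DROP TABLE t1;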
diff --git a/sql/lex_charset.h b/sql/lex_charset.h --- a/sql/lex_charset.h +++ b/sql/lex_charset.h @@ -544,6 +561,21 @@ struct Lex_exact_charset_extended_collation_attrs_st { return m_ci; } + CHARSET_INFO *charset_info(Charset_collation_map_st::Used *used, + const Charset_collation_map_st &map) const + { + switch (m_type) + { + case TYPE_CHARACTER_SET: + return map.get_collation_for_charset(used, m_ci);
Why do you need this special charset_info(used,map) that differs from charset_info() when m_type==TYPE_CHARACTER_SET, and is used in set_charset_collation_attrs() only?
The question should actually be "Why do we need the simple charset_info() returning m_ci without mapping?" :) The short answer: to work around Bison's token size limitation. Details: there were three different cases when either of the two variants of charset_info() was needed:

1. Inside Lex_exact_charset_extended_collation_attrs_st itself, in the methods merge_column_charset_clause_and_collate_clause() and merge_column_collate_clause_and_collate_clause(). Here it's not needed at all; I replaced charset_info() with just m_ci.

2. In the two variants of Lex_field_type_st::set(). Lex_field_type_st distinguishes all types of CHARSET_INFO listed in the enum Lex_exact_charset_extended_collation_attrs_st::Type: TYPE_EMPTY, TYPE_CHARACTER_SET, TYPE_COLLATE_EXACT, TYPE_CHARACTER_SET_COLLATE_EXACT, TYPE_COLLATE_CONTEXTUALLY_TYPED. Lex_field_type_st is actually a packed form of Lex_exact_charset_extended_collation_attrs_st. Normally Lex_field_type_st would just derive from Lex_exact_charset_extended_collation_attrs_st, but we need to satisfy Bison's token size limitation, hence the packed form. So it copies the Type value and copies the CHARSET_INFO value as is. This is why I needed the non-mapping version of charset_info().

3. In this method of Column_definition:

  void set_charset_collation_attrs(Charset_collation_map_st::Used *used,
                                   const Charset_collation_map_st &map,
                                   const Lex_column_charset_collation_attrs_st &lc)

Column_definition distinguishes two variants of CHARSET_INFO depending on whether the CONTEXT_COLLATION_FLAG bit is set in flags:
- A ready-to-use CHARSET_INFO pointer
- A contextual collation, which needs resolution later, when the context becomes known (e.g. from an upper level).
Here we use the version of charset_info() with the "map" argument to:
- map lc's "CHARSET cs" (which did not have a COLLATE) to its mapped form; the returned CHARSET_INFO is then ready and does not need any further resolution or mapping;
- preserve the lc.m_ci value otherwise (when it has a COLLATE part, with or without CHARSET).
Column_definition is marked with CONTEXT_COLLATION_FLAG in case it was a "COLLATE context" (without CHARSET), so it will resolve the CHARSET_INFO pointer when the CHARSET becomes known.

I propose to rename the simple version of charset_info() to charset_info_not_mapped(). For better encapsulation, we can also make it private and mark it as "friend Lex_field_type_st::set". It should not be used for purposes other than exporting to packed forms.

<cut>

diff --git a/sql/log_event_server.cc b/sql/log_event_server.cc --- a/sql/log_event_server.cc +++ b/sql/log_event_server.cc @@ -1989,6 +1997,16 @@ int Query_log_event::do_apply_event(rpl_group_info *rgi, thd->variables.sql_mode= (sql_mode_t) ((thd->variables.sql_mode & MODE_NO_DIR_IN_CREATE) | (sql_mode & ~(sql_mode_t) MODE_NO_DIR_IN_CREATE)); + + size_t cslen= thd->variables.character_set_collations.from_binary(
+ character_set_collations.str,
+ character_set_collations.length);
+ if (cslen != character_set_collations.length)
+ {
+ thd->variables.character_set_collations.init();
+ goto compare_errors; // QQ: report an error here?
is it corrupted event or unknown collation?
It can be either - a corrupted event or an unknown collation. Both cases are fatal: replication cannot continue. <cut>
diff --git a/sql/sql_parse.cc b/sql/sql_parse.cc --- a/sql/sql_parse.cc +++ b/sql/sql_parse.cc @@ -6049,6 +6049,27 @@ mysql_execute_command(THD *thd, bool is_called_from_prepared_stmt) } thd->reset_kill_query(); } + + /* + If a non-default collation (in @@character_set_collations) + was used during the statement, the mysqlbinlog output for + the current statement will contain a sequence like this: + + SET character_set_collations='utf8mb3=utf8mb3_bin'; + INSERT INTO t1 VALUES (_utf8mb3'test'); + COMMIT; + + The statement (INSERT in this example) is already in the binlog at this point, + and the "SET character_set_collations" is written inside a + Q_CHARACTER_SET_COLLATIONS chunk in its log entry header. + The flag CHARACTER_SET_COLLATIONS_USED is not needed any more. + + Let's suppress the flag to avoid a Q_CHARACTER_SET_COLLATIONS chunk + inside the COMMIT log entry header - it would be useless and would + only waste space in the binary log. + */ + thd->used&= ~THD::CHARACTER_SET_COLLATIONS_USED;
Wouldn't that logic apply to other THD::xxx_USED flags? May be you should reset thd->used=0 here?
Sounds reasonable. I tried thd->used=0. rpl.rpl_hrtime failed with multiple chunks like this:

-SET TIMESTAMP=1293832861.123456/*!*/;
+SET TIMESTAMP=1293832861/*!*/;
 COMMIT

Do you know why the binlog always puts "SET TIMESTAMP" before "COMMIT"? This change would not be comfortable for me. I'd prefer if "SET TIMESTAMP" was not printed at all when it's not needed, instead of removing only the fractional part.

Also these tests: binlog.binlog_mysqlbinlog_row and binlog.binlog_mysqlbinlog_row_myisam failed with one chunk:

 SET TIMESTAMP=1000000000/*!*/;
-SET @@session.time_zone='SYSTEM'/*!*/;
 COMMIT

Btw, "grep SET.*zone mysql-test/suite/binlog/r/*" returns only these two lines:

mysql-test/suite/binlog/r/binlog_mysqlbinlog_row_myisam.result:SET @@session.time_zone='SYSTEM'/*!*/;
mysql-test/suite/binlog/r/binlog_mysqlbinlog_row.result:SET @@session.time_zone='SYSTEM'/*!*/;

both before COMMIT. Either @@session.time_zone is not printed before DML or DDL statements (which is probably a bug), or it's very poorly covered by tests. What do you suggest? I'd solve it in a separate task. It seems to need further investigation, not related to this MDEV.

<cut>

--- a/sql/sql_table.cc +++ b/sql/sql_table.cc @@ -12457,8 +12459,10 @@ bool HA_CREATE_INFO::

I'd prefer you didn't put a line break after :: see how useful it makes the hunk header?

{ // Make sure we don't do double resolution in direct SQL execution DBUG_ASSERT(!default_table_charset || thd->stmt_arena->is_stmt_execute()); + Character_set_collations_used used(thd); if (!(default_table_charset= - default_cscl.resolved_to_context(ctx))) + default_cscl.resolved_to_context(&used, + thd->variables.character_set_collations, ctx))) return true; }

Right, looks useless. Will try not to put line breaks after :: in the future. Thanks.
diff --git a/sql/sql_class.h b/sql/sql_class.h --- a/sql/sql_class.h +++ b/sql/sql_class.h @@ -5620,6 +5624,29 @@ class THD: public THD_count, /* this must be first */ };
+class Character_set_collations_used: public Charset_collation_map_st::Used +{ + THD *m_thd; +public: + Character_set_collations_used(THD *thd) + :m_thd(thd) + { } + ~Character_set_collations_used() + { + /* + Mark THD that the collation map was used, + no matter if a compiled or a mapped collation was
why no matter? as far as I understand, you didn't plan to change compiled-in defaults
This is needed for a different reason, not related to compiled defaults. The second part of this comment was intended to explain:
+ found during charset->collation resolution. + Even if the map was empty, we still need to print + SET @@session.character_set_collations=''; + in mariadb-binlog output.
We need to print the empty value too, so that the output of mysqlbinlog can be replayed safely, independently of the @@session.character_set_collations value on the replaying server. We agreed on this solution on Slack, though a few months ago :) Perhaps the comment should be extended. <cut>
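To make the intent concrete, the mariadb-binlog output is then expected to always pin the map right before the statement, even when the map is empty - roughly like this (a sketch in the same spirit as the comment in sql_parse.cc above, not copied from an actual test result):

  SET @@session.character_set_collations='';
  INSERT INTO t1 VALUES (_utf8mb3'test');
  COMMIT;

This way, replaying the dump cannot pick up whatever mapping happens to be set on the target server.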
diff --git a/sql/sql_prepare.cc b/sql/sql_prepare.cc --- a/sql/sql_prepare.cc +++ b/sql/sql_prepare.cc @@ -3538,6 +3544,13 @@ static void mysql_stmt_execute_common(THD *thd, DBUG_VOID_RETURN; }
+ if (stmt->prepare_time_charset_collation_map_version() != + thd->variables.character_set_collations.version()) + { + if (stmt->reprepare()) + DBUG_VOID_RETURN; + }
why do you reprepare here and not where reprepare() normally happens, in Prepared_statement::execute_loop()?
Can you please suggest a better place for this code?

<cut>

@@ -5220,6 +5252,13 @@ bool Prepared_statement::execute(String *expanded_query, bool open_cursor) MYSQL_QUERY_EXEC_START(thd->query(), thd->thread_id,
&thd->security_ctx->priv_user[0], (char *)
thd->get_db(), thd->security_ctx->host_or_ip, 1);
+ /* + If PREPARE used @@character_set_collations, + then we need to make sure binary log writes + the map in the event header. + */
why is it different from other THD::xx_USED flags?
CHARACTER_SET_COLLATIONS_USED is in most cases set during parsing only, and is not set at later stages. Other XXX_USED flags are set at later stages, either during fix_fields() or during evaluation. We don't need to care about copying them from m_prepare_time_thd_used_flags, because those flags will be set during EXECUTE (and in many cases they are not even set during PREPARE). <cut>
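For illustration, here is the scenario both sql_prepare.cc hunks are about (the statements are my own sketch, not taken from the test suite): the map is consulted at PREPARE time, so EXECUTE has to re-mark the THD for binlogging and has to re-prepare when the map version changed since PREPARE:

  SET character_set_collations='utf8mb3=utf8mb3_bin';
  PREPARE stmt FROM 'INSERT INTO t1 VALUES (_utf8mb3''test'')';
  EXECUTE stmt;
  # the charset->collation resolution happened at PREPARE time,
  # but the binlog event of this EXECUTE still needs the map chunk

  SET character_set_collations='utf8mb3=utf8mb3_general_ci';
  EXECUTE stmt;
  # the map version differs from the prepare-time version,
  # so the statement is transparently re-prepared first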
diff --git a/sql/sys_vars.cc b/sql/sys_vars.cc --- a/sql/sys_vars.cc +++ b/sql/sys_vars.cc @@ -436,6 +436,115 @@ static bool update_auto_increment_increment (sys_var *self, THD *thd, enum_var_t
#endif /* WITH_WSREP */
+ +class Sys_var_charset_collation_map: public sys_var
in sys_var.inl, please
Done. <cut>
+ static const uchar *make_value_ptr(THD *thd, + const Charset_collation_map_st &map) + { + size_t nbytes= map.text_format_nbytes_needed(); + char *buf= (char *) thd->alloc(nbytes); + size_t length= map.print(buf, nbytes); + return (uchar *) thd->strmake(buf, length);
Eh, sorry. I don't understand. You allocate memory twice? for two copies of the string?
Fixed.

<cut>

+ bool session_update(THD *thd, set_var *var) override + { + Charset_collation_map_st map; + if (!var->value) + { + session_save_default(thd, var); + return false; + } + if (charset_collation_map_from_item(&map, var->value, thd->get_utf8_flag()))

The idea is that var->value is only evaluated in check and is not reevaluated in update. check is supposed to store the value in set_var *var for update to use.

Fixed. <cut>
diff --git a/sql/sql_yacc.yy b/sql/sql_yacc.yy --- a/sql/sql_yacc.yy +++ b/sql/sql_yacc.yy @@ -9686,6 +9713,9 @@ column_default_non_parenthesized_expr: } | CONVERT_SYM '(' expr USING charset_name ')' { + Character_set_collations_used used(thd); + $5= thd->variables.character_set_collations. + get_collation_for_charset(&used, $5);
this is rather awkward pattern that you use everywhere, with
Character_set_collations_used used(thd);
why not to pass thd as an argument instead?
charset_collations.h and lex_charset.h use Charset_collation_map_st::Used. They neither know what its descendant Character_set_collations_used is, nor what THD is. I would like not to make charset_collations.h and lex_charset.h depend on the entire sql_class.h. As a solution, we can move the part of THD where the data type used_t and the enum with the XXX_USED flags are defined into a separate class in a separate header:

  class Statement_features_used // or we can try to find a better class name
  {
  public:
    typedef uint used_t;
    enum
    {
      RAND_USED= 1,
      TIME_ZONE_USED= 2,
      QUERY_START_SEC_PART_USED= 4,
      THREAD_SPECIFIC_USED= 8,
      CHARACTER_SET_COLLATIONS_USED= 16
    };
    used_t used;
  };

then derive THD from this class and change the charset_collations.h and lex_charset.h functions to take a Statement_features_used pointer instead of a Charset_collation_map_st::Used pointer. In this case we'll be able to pass "thd" directly to the charset_collations.h and lex_charset.h functions, because THD will be a descendant of Statement_features_used. <cut>
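On a related note, the user-visible effect of the CONVERT ... USING hunk quoted above should be roughly as follows (my expectation, not a copied test result):

  SET character_set_collations='utf8mb3=utf8mb3_bin';
  SELECT COLLATION(CONVERT('test' USING utf8mb3));
  # expected: utf8mb3_bin instead of the compiled-in default utf8mb3_general_ci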
diff --git a/sql/simple_tokenizer.h b/sql/simple_tokenizer.h --- /dev/null +++ b/sql/simple_tokenizer.h @@ -0,0 +1,83 @@ +/* Copyright (c) 2022, MariaDB Corporation.
2023
Fixed. <cut>
+class Simple_tokenizer +{ + const char *m_ptr; + const char *m_end; +public: + Simple_tokenizer(const char *str, size_t length) + :m_ptr(str), m_end(str + length) + { }
<cut>
that's too big and complex, but also not complex enough. note that we have quite a few of those already, in particular in typelib.c, in strfunc.cc, item_func.cc (find_in_set) etc.
and this one could've been 2-3 times smaller for
@@character_set_collations
only, and it still cannot replace any other parsers from above.
I'd keep just two methods: get_token() - which skips whitespace and returns the next substring of non-whitespace characters other than ',' or '=', multibyte-safe. And get_char(), which returns the next non-whitespace character.
this parser could be used in all use cases from above.
As discussed on Slack:
- I moved get_spaces() inside get_char() and get_ident(). The caller code became simpler indeed. Thanks for suggesting!
- This charset-unaware (ASCII-only) tokenizer can be reused for a number of system variables (sql_mode, optimizer_switch, log_slow_filter, myisam_recover_options, slave_transaction_retry_errors).
- A charset-aware tokenizer can be added later, as a new class deriving from Simple_tokenizer.
diff --git a/sql/charset_collations.h b/sql/charset_collations.h --- /dev/null +++ b/sql/charset_collations.h <cut> + + struct Elem_st + { + protected: + CHARSET_INFO *m_charset; + CHARSET_INFO *m_collation;
What is the difference between charset and collation? I was under impression that any CHARSET_INFO describes a collation, there is no special structure just for a charset. Am I wrong?
You're not wrong. It's just that conceptually we map from a charset to a collation here, hence the names. I can rename them to CHARSET_INFO *m_from_charset and CHARSET_INFO *m_to_collation to make it clearer. <cut>