Re: [Maria-developers] 940d028521f: MDEV-30164 System variable for default collations
Hi, Alexander, I'm sorry it took a while. I'm still thinking about the whole thing, it's a rather big change for a really fringe functionality. But I failed to come up with something better so far. Code-wise the patch is mostly fine. See few small comments below, and one slightly larger comment re. replication. On Mar 07, Alexander Barkov wrote:
revision-id: 940d028521f (mariadb-10.11.1-4-g940d028521f) parent(s): 71dedd0ebcd author: Alexander Barkov committer: Alexander Barkov timestamp: 2022-12-14 20:00:22 +0400 message:
MDEV-30164 System variable for default collations
This patch adds a way to override default collations (or "character set collations") for desired character sets.
diff --git a/sql/lex_charset.h b/sql/lex_charset.h --- a/sql/lex_charset.h +++ b/sql/lex_charset.h @@ -544,6 +699,20 @@ struct Lex_exact_charset_extended_collation_attrs_st { return m_ci; } + CHARSET_INFO *charset_info(const Charset_collation_map_st &map) const + { + switch (m_type) + { + case Lex_exact_charset_extended_collation_attrs_st::TYPE_CHARACTER_SET: + return map.get_collation_for_charset(m_ci); + case TYPE_EMPTY:
Lex_exact_charset_extended_collation_attrs_st::TYPE_EMPTY (or all case labels without a struct name)
+ case Lex_exact_charset_extended_collation_attrs_st::TYPE_CHARACTER_SET_COLLATE_EXACT: + case Lex_exact_charset_extended_collation_attrs_st::TYPE_COLLATE_CONTEXTUALLY_TYPED:
COLLATE DEFAULT is TYPE_COLLATE_CONTEXTUALLY_TYPED. shouldn't it use the map too?
+ case Lex_exact_charset_extended_collation_attrs_st::TYPE_COLLATE_EXACT: + break; + } + return m_ci; + } Type type() const { return m_type; diff --git a/sql/log_event.h b/sql/log_event.h --- a/sql/log_event.h +++ b/sql/log_event.h @@ -2147,6 +2153,8 @@ class Query_log_event: public Log_event bool sql_mode_inited; bool charset_inited;
+ LEX_CSTRING character_set_collations_str;
better LEX_CUSTRING, don't you think? It's not even a _str.
+ uint32 flags2; sql_mode_t sql_mode; ulong auto_increment_increment, auto_increment_offset; diff --git a/sql/log_event_server.cc b/sql/log_event_server.cc --- a/sql/log_event_server.cc +++ b/sql/log_event_server.cc @@ -1194,6 +1194,14 @@ bool Query_log_event::write() int2store(start+2, auto_increment_offset); start+= 4; } + + if (thd && thd->variables.character_set_collations.count()) + { + *start++= Q_COLLATIONS_SESSION; + size_t len= thd->variables.character_set_collations.to_binary((char*)start); + start+= len; + }
Perhaps, detect if it's needed? A cheap way of doing it would be to extend your Elem_st with a query_id. And every time you find_elem_by_charset, you set this elem's query_id to thd->query_id. And here you write only elements with the current query id. If any. Another approach would be to have a bitmap, like uchar used_default_coll_mapping; and in find_elem_by_charset() you set the bit, like used_default_coll_mapping |= 1 << i; and then, again, print affected collations, if any. Most often used_default_coll_mapping will likely be zero one more question. In, say, 10.10->11.1 replication master and slave will have different default collations, but thd->variables.character_set_collations will not reflect that. How do you plan to solve it?
+ if (charset_inited) { *start++= Q_CHARSET_CODE;
Regards, Sergei VP of MariaDB Server Engineering and security@mariadb.org
Hello Sergei, Thanks for the review. Please see comments inline: On 3/7/23 2:20 PM, Sergei Golubchik wrote:
Hi, Alexander,
I'm sorry it took a while.
I'm still thinking about the whole thing, it's a rather big change for a really fringe functionality. But I failed to come up with something better so far.
Code-wise the patch is mostly fine. See few small comments below, and one slightly larger comment re. replication.
As we discussed on slack, my plan was not to change the hard-coded default collations in the collation library. So in the library it will always be utf8mb4_general_ci as a default collation for utf8mb4. And the server default collations will be handled by this new variable character_set_collations. I did not mean this variable to be something fringe or temporary.
On Mar 07, Alexander Barkov wrote:
revision-id: 940d028521f (mariadb-10.11.1-4-g940d028521f) parent(s): 71dedd0ebcd author: Alexander Barkov committer: Alexander Barkov timestamp: 2022-12-14 20:00:22 +0400 message:
MDEV-30164 System variable for default collations
This patch adds a way to override default collations (or "character set collations") for desired character sets.
diff --git a/sql/lex_charset.h b/sql/lex_charset.h --- a/sql/lex_charset.h +++ b/sql/lex_charset.h @@ -544,6 +699,20 @@ struct Lex_exact_charset_extended_collation_attrs_st { return m_ci; } + CHARSET_INFO *charset_info(const Charset_collation_map_st &map) const + { + switch (m_type) + { + case Lex_exact_charset_extended_collation_attrs_st::TYPE_CHARACTER_SET: + return map.get_collation_for_charset(m_ci); + case TYPE_EMPTY:
Lex_exact_charset_extended_collation_attrs_st::TYPE_EMPTY (or all case labels without a struct name)
Correct, the qualifiers were redundant here. Removing.
+ case Lex_exact_charset_extended_collation_attrs_st::TYPE_CHARACTER_SET_COLLATE_EXACT: + case Lex_exact_charset_extended_collation_attrs_st::TYPE_COLLATE_CONTEXTUALLY_TYPED:
COLLATE DEFAULT is TYPE_COLLATE_CONTEXTUALLY_TYPED. shouldn't it use the map too?
This method charset_info() is used only in one place, in this method of class Column_definition: void set_charset_collation_attrs(const Charset_collation_map_st &map, const Lex_column_charset_collation_attrs_st &lc) { charset= lc.charset_info2(map); if (lc.is_contextually_typed_collation()) flags|= CONTEXT_COLLATION_FLAG; else flags&= ~CONTEXT_COLLATION_FLAG; } When this method is called from the parser, the character set of the column may not be know yet: CREATE TABLE t1 (a COLLATE DEFAULT /*HERE*/ ) CHARACTER SET utf8mb4; Notice, the character set comes later, from the table level. So in the method charset_info() we need to return the verbatim value of Lex_exact_charset_extended_collation_attrs_st::m_ci for all types (except TYPE_CHARACTER_SET, see below). The returned value is copied to Column_definition::charset, and the CONTEXT_COLLATION_FLAG bit is set or unset in Column_definition::flags according to "context-ness" of Lex_exact_charset_extended_collation_attrs_st. The resolution of COLLATE DEFAULT happens later, after parsing the whole CREATE TABLE statement, when we have the information about the table level character set. But if Lex_exact_charset_extended_collation_attrs_st::m_type is equal to TYPE_CHARACTER_SET, we can find the current default collation for the character set right here, inside Lex_exact_charset_extended_collation_attrs_st::charset_info(). There is no a need to postpone the resolution. And by the way, there is no even a way to postpone this resolution, because Column_definition does not distinguish between: a. Column_definition::charset is CHARACTER SET b. Column_definition::charset is CHARACTER SET with COLLATE The former needs to take into account @@character_set_collations, the latter does not need.
+ case Lex_exact_charset_extended_collation_attrs_st::TYPE_COLLATE_EXACT: + break; + } + return m_ci; + } Type type() const { return m_type; diff --git a/sql/log_event.h b/sql/log_event.h --- a/sql/log_event.h +++ b/sql/log_event.h @@ -2147,6 +2153,8 @@ class Query_log_event: public Log_event bool sql_mode_inited; bool charset_inited;
+ LEX_CSTRING character_set_collations_str;
better LEX_CUSTRING, don't you think? It's not even a _str.
Right, it's not _str. It's a binary data. Removed the suffix. What are benefits of LEX_CUSTRING? I tried to change the data type, it didn't make the code better.
+ uint32 flags2; sql_mode_t sql_mode; ulong auto_increment_increment, auto_increment_offset; diff --git a/sql/log_event_server.cc b/sql/log_event_server.cc --- a/sql/log_event_server.cc +++ b/sql/log_event_server.cc @@ -1194,6 +1194,14 @@ bool Query_log_event::write() int2store(start+2, auto_increment_offset); start+= 4; } + + if (thd && thd->variables.character_set_collations.count()) + { + *start++= Q_COLLATIONS_SESSION; + size_t len= thd->variables.character_set_collations.to_binary((char*)start); + start+= len; + }
Perhaps, detect if it's needed? A cheap way of doing it would be to extend your Elem_st with a query_id. And every time you find_elem_by_charset, you set this elem's query_id to thd->query_id. And here you write only elements with the current query id. If any.
Another approach would be to have a bitmap, like
uchar used_default_coll_mapping;
and in find_elem_by_charset() you set the bit, like
used_default_coll_mapping |= 1 << i;
and then, again, print affected collations, if any. Most often used_default_coll_mapping will likely be zero
This will need additional resources even if the server is not in replication at all. But on the other hand, when the server is in replication, the benefit on the slave side is not so obvious. I propose a different optimization. It's related to the question below:
one more question. In, say, 10.10->11.1 replication master and slave will have different default collations, but thd->variables.character_set_collations will not reflect that. How do you plan to solve it?
+ if (charset_inited) { *start++= Q_CHARSET_CODE;
Every major release will have its own default value for character_set_collations. For example, in 11.1 we plan to change it to: utf8mb3=uca1400_ai_ci,utf8mb4=uca1400_ai_ci,ucs2=uca1400_ai_ci,utf16=uca1400_ai_ci,utf16le=uca1400_ai_ci,utf32=uca1400_ai_ci while in 10.11 it will be empty by default. I propose to add server version into the binlog event header. (by the way it will be useful for many other purposes). If we know the master version on the slave side, we can optimize writing of the Q_COLLATION_SESSION chunk when character_set_collations is equal to its default value (according to the master version): - We can either omit writing Q_COLLATIONS_SESSION completely - Or at least write a very short version (e.g. with count==0) which will mean "use the default value of character_set_collations corresponding to my version". Oh, by the way I need to rename Q_COLLATIONS_SESSION to Q_CHARACTER_SET_COLLATIONS.
Regards, Sergei VP of MariaDB Server Engineering and security@mariadb.org
Hi, Alexander, On Mar 10, Alexander Barkov wrote:
On 3/7/23 2:20 PM, Sergei Golubchik wrote:
I'm sorry it took a while.
I'm still thinking about the whole thing, it's a rather big change for a really fringe functionality. But I failed to come up with something better so far.
Code-wise the patch is mostly fine. See few small comments below, and one slightly larger comment re. replication.
As we discussed on slack, my plan was not to change the hard-coded default collations in the collation library. So in the library it will always be utf8mb4_general_ci as a default collation for utf8mb4. And the server default collations will be handled by this new variable character_set_collations. I did not mean this variable to be something fringe or temporary.
I see. Then collations set as default this way must be undistinguishable in behavior from compiled-in defaults. In particular, SHOW COLLATIONS, I_S.COLLATIONS, etc (if anything) should show them as default. And performance considerations become more important, I wasn't paying much attention to that, as I thought it's a fringe feature. Like, a linear search in Charset_collation_map_st could be replaced with a binary. A HASH might be too much overhead, but a binary search in the array could be worth it. As it's a search that every single user will be using in the future, perhaps even in every single query. Or even many times per query.
On Mar 07, Alexander Barkov wrote:
revision-id: 940d028521f (mariadb-10.11.1-4-g940d028521f) parent(s): 71dedd0ebcd author: Alexander Barkov committer: Alexander Barkov timestamp: 2022-12-14 20:00:22 +0400 message:
MDEV-30164 System variable for default collations
This patch adds a way to override default collations (or "character set collations") for desired character sets.
diff --git a/sql/lex_charset.h b/sql/lex_charset.h --- a/sql/lex_charset.h +++ b/sql/lex_charset.h
+ case Lex_exact_charset_extended_collation_attrs_st::TYPE_CHARACTER_SET_COLLATE_EXACT: + case Lex_exact_charset_extended_collation_attrs_st::TYPE_COLLATE_CONTEXTUALLY_TYPED:
COLLATE DEFAULT is TYPE_COLLATE_CONTEXTUALLY_TYPED. shouldn't it use the map too?
This method charset_info() is used only in one place, in this method of class Column_definition: ...
Thanks for the explanation!
+ case Lex_exact_charset_extended_collation_attrs_st::TYPE_COLLATE_EXACT: + break; + } + return m_ci; + } Type type() const { return m_type; diff --git a/sql/log_event.h b/sql/log_event.h --- a/sql/log_event.h +++ b/sql/log_event.h @@ -2147,6 +2153,8 @@ class Query_log_event: public Log_event bool sql_mode_inited; bool charset_inited;
+ LEX_CSTRING character_set_collations_str;
better LEX_CUSTRING, don't you think? It's not even a _str.
Right, it's not _str. It's a binary data. Removed the suffix.
What are benefits of LEX_CUSTRING?
LEX_CSTRING is a string, char*. LEX_CUSTRING is an array of bytes, uchar*. I think it's more appropriate here.
diff --git a/sql/log_event_server.cc b/sql/log_event_server.cc --- a/sql/log_event_server.cc +++ b/sql/log_event_server.cc @@ -1194,6 +1194,14 @@ bool Query_log_event::write() int2store(start+2, auto_increment_offset); start+= 4; } + + if (thd && thd->variables.character_set_collations.count()) + { + *start++= Q_COLLATIONS_SESSION; + size_t len= thd->variables.character_set_collations.to_binary((char*)start); + start+= len; + }
Perhaps, detect if it's needed? A cheap way of doing it would be to extend your Elem_st with a query_id. And every time you find_elem_by_charset, you set this elem's query_id to thd->query_id. And here you write only elements with the current query id. If any.
Another approach would be to have a bitmap, like
uchar used_default_coll_mapping;
and in find_elem_by_charset() you set the bit, like
used_default_coll_mapping |= 1 << i;
and then, again, print affected collations, if any. Most often used_default_coll_mapping will likely be zero
This will need additional resources even if the server is not in replication at all.
8 bytes per collation in the map or one byte per thd? That's not noticeable.
But on the other hand, when the server is in replication, the benefit on the slave side is not so obvious.
The benefit is that it'll log only the affected collations (typically zero or one) not the whole map in every event.
I propose a different optimization. It's related to the question below:
one more question. In, say, 10.10->11.1 replication master and slave will have different default collations, but thd->variables.character_set_collations will not reflect that. How do you plan to solve it?
+ if (charset_inited) { *start++= Q_CHARSET_CODE;
Every major release will have its own default value for character_set_collations.
For example, in 11.1 we plan to change it to:
utf8mb3=uca1400_ai_ci,utf8mb4=uca1400_ai_ci,ucs2=uca1400_ai_ci,utf16=uca1400_ai_ci,utf16le=uca1400_ai_ci,utf32=uca1400_ai_ci
while in 10.11 it will be empty by default.
I propose to add server version into the binlog event header. (by the way it will be useful for many other purposes).
It's there already. see log_event.cc for Format_description_log_event::server_version_split and class Version.
If we know the master version on the slave side, we can optimize writing of the Q_COLLATION_SESSION chunk when character_set_collations is equal to its default value (according to the master version):
I don't know how it can work, if you replicate from 11.1 to an old slave that knows how to read Q_COLLATION_SESSION, but doesn't know what hard-coded defaults are in some later MariaDB version. Regards, Sergei VP of MariaDB Server Engineering and security@mariadb.org
Hello Sergei, Please review a new patch version: https://github.com/MariaDB/server/commit/8d51c6d234b1730d4ff3b2c1fe7828eeca8... Comments go inline:
As we discussed on slack, my plan was not to change the hard-coded default collations in the collation library. So in the library it will always be utf8mb4_general_ci as a default collation for utf8mb4. And the server default collations will be handled by this new variable character_set_collations. I did not mean this variable to be something fringe or temporary.
I see. Then collations set as default this way must be undistinguishable in behavior from compiled-in defaults. In particular, SHOW COLLATIONS, I_S.COLLATIONS, etc (if anything) should show them as default.
Done.
And performance considerations become more important, I wasn't paying much attention to that, as I thought it's a fringe feature.
Like, a linear search in Charset_collation_map_st could be replaced with a binary. A HASH might be too much overhead, but a binary search in the array could be worth it. As it's a search that every single user will be using in the future, perhaps even in every single query. Or even many times per query.
Done. It now uses binary search.
+ LEX_CSTRING character_set_collations_str;
better LEX_CUSTRING, don't you think? It's not even a _str.
Right, it's not _str. It's a binary data. Removed the suffix.
What are benefits of LEX_CUSTRING?
LEX_CSTRING is a string, char*.
LEX_CUSTRING is an array of bytes, uchar*. I think it's more appropriate here.
I tried LEX_CUSTRING. It required a little bit more code than the current version. I propose to keep it LEX_CSTRING.
diff --git a/sql/log_event_server.cc b/sql/log_event_server.cc --- a/sql/log_event_server.cc +++ b/sql/log_event_server.cc @@ -1194,6 +1194,14 @@ bool Query_log_event::write() int2store(start+2, auto_increment_offset); start+= 4; } + + if (thd && thd->variables.character_set_collations.count()) + { + *start++= Q_COLLATIONS_SESSION; + size_t len= thd->variables.character_set_collations.to_binary((char*)start); + start+= len; + }
Perhaps, detect if it's needed? A cheap way of doing it would be to extend your Elem_st with a query_id. And every time you find_elem_by_charset, you set this elem's query_id to thd->query_id. And here you write only elements with the current query id. If any.
As agreed on Slack, I did it in the following way: The map is now written to binlog only if the statement performed a "get default collation for a character set" resolution. Most DML statements do not do such resolution, so this approach is very good to reduce the number of map writes. And it's very cheap and simple: a new flag in THD::used. Additionally, as discussed on slack, I added automatic PS re-prepare if @@character_set_collations has changed between PREPARE and EXECUTE. Thanks.
participants (2)
-
Alexander Barkov
-
Sergei Golubchik