Hi, Alexander, On Mar 10, Alexander Barkov wrote:
On 3/7/23 2:20 PM, Sergei Golubchik wrote:
I'm sorry it took a while.
I'm still thinking about the whole thing, it's a rather big change for a really fringe functionality. But I failed to come up with something better so far.
Code-wise the patch is mostly fine. See few small comments below, and one slightly larger comment re. replication.
As we discussed on slack, my plan was not to change the hard-coded default collations in the collation library. So in the library it will always be utf8mb4_general_ci as a default collation for utf8mb4. And the server default collations will be handled by this new variable character_set_collations. I did not mean this variable to be something fringe or temporary.
I see. Then collations set as default this way must be undistinguishable in behavior from compiled-in defaults. In particular, SHOW COLLATIONS, I_S.COLLATIONS, etc (if anything) should show them as default. And performance considerations become more important, I wasn't paying much attention to that, as I thought it's a fringe feature. Like, a linear search in Charset_collation_map_st could be replaced with a binary. A HASH might be too much overhead, but a binary search in the array could be worth it. As it's a search that every single user will be using in the future, perhaps even in every single query. Or even many times per query.
On Mar 07, Alexander Barkov wrote:
revision-id: 940d028521f (mariadb-10.11.1-4-g940d028521f) parent(s): 71dedd0ebcd author: Alexander Barkov committer: Alexander Barkov timestamp: 2022-12-14 20:00:22 +0400 message:
MDEV-30164 System variable for default collations
This patch adds a way to override default collations (or "character set collations") for desired character sets.
diff --git a/sql/lex_charset.h b/sql/lex_charset.h --- a/sql/lex_charset.h +++ b/sql/lex_charset.h
+ case Lex_exact_charset_extended_collation_attrs_st::TYPE_CHARACTER_SET_COLLATE_EXACT: + case Lex_exact_charset_extended_collation_attrs_st::TYPE_COLLATE_CONTEXTUALLY_TYPED:
COLLATE DEFAULT is TYPE_COLLATE_CONTEXTUALLY_TYPED. shouldn't it use the map too?
This method charset_info() is used only in one place, in this method of class Column_definition: ...
Thanks for the explanation!
+ case Lex_exact_charset_extended_collation_attrs_st::TYPE_COLLATE_EXACT: + break; + } + return m_ci; + } Type type() const { return m_type; diff --git a/sql/log_event.h b/sql/log_event.h --- a/sql/log_event.h +++ b/sql/log_event.h @@ -2147,6 +2153,8 @@ class Query_log_event: public Log_event bool sql_mode_inited; bool charset_inited;
+ LEX_CSTRING character_set_collations_str;
better LEX_CUSTRING, don't you think? It's not even a _str.
Right, it's not _str. It's a binary data. Removed the suffix.
What are benefits of LEX_CUSTRING?
LEX_CSTRING is a string, char*. LEX_CUSTRING is an array of bytes, uchar*. I think it's more appropriate here.
diff --git a/sql/log_event_server.cc b/sql/log_event_server.cc --- a/sql/log_event_server.cc +++ b/sql/log_event_server.cc @@ -1194,6 +1194,14 @@ bool Query_log_event::write() int2store(start+2, auto_increment_offset); start+= 4; } + + if (thd && thd->variables.character_set_collations.count()) + { + *start++= Q_COLLATIONS_SESSION; + size_t len= thd->variables.character_set_collations.to_binary((char*)start); + start+= len; + }
Perhaps, detect if it's needed? A cheap way of doing it would be to extend your Elem_st with a query_id. And every time you find_elem_by_charset, you set this elem's query_id to thd->query_id. And here you write only elements with the current query id. If any.
Another approach would be to have a bitmap, like
uchar used_default_coll_mapping;
and in find_elem_by_charset() you set the bit, like
used_default_coll_mapping |= 1 << i;
and then, again, print affected collations, if any. Most often used_default_coll_mapping will likely be zero
This will need additional resources even if the server is not in replication at all.
8 bytes per collation in the map or one byte per thd? That's not noticeable.
But on the other hand, when the server is in replication, the benefit on the slave side is not so obvious.
The benefit is that it'll log only the affected collations (typically zero or one) not the whole map in every event.
I propose a different optimization. It's related to the question below:
one more question. In, say, 10.10->11.1 replication master and slave will have different default collations, but thd->variables.character_set_collations will not reflect that. How do you plan to solve it?
+ if (charset_inited) { *start++= Q_CHARSET_CODE;
Every major release will have its own default value for character_set_collations.
For example, in 11.1 we plan to change it to:
utf8mb3=uca1400_ai_ci,utf8mb4=uca1400_ai_ci,ucs2=uca1400_ai_ci,utf16=uca1400_ai_ci,utf16le=uca1400_ai_ci,utf32=uca1400_ai_ci
while in 10.11 it will be empty by default.
I propose to add server version into the binlog event header. (by the way it will be useful for many other purposes).
It's there already. see log_event.cc for Format_description_log_event::server_version_split and class Version.
If we know the master version on the slave side, we can optimize writing of the Q_COLLATION_SESSION chunk when character_set_collations is equal to its default value (according to the master version):
I don't know how it can work, if you replicate from 11.1 to an old slave that knows how to read Q_COLLATION_SESSION, but doesn't know what hard-coded defaults are in some later MariaDB version. Regards, Sergei VP of MariaDB Server Engineering and security@mariadb.org