Hello Sergei, Please review a new patch version: https://github.com/MariaDB/server/commit/8d51c6d234b1730d4ff3b2c1fe7828eeca8... Comments go inline:
As we discussed on slack, my plan was not to change the hard-coded default collations in the collation library. So in the library it will always be utf8mb4_general_ci as a default collation for utf8mb4. And the server default collations will be handled by this new variable character_set_collations. I did not mean this variable to be something fringe or temporary.
I see. Then collations set as default this way must be undistinguishable in behavior from compiled-in defaults. In particular, SHOW COLLATIONS, I_S.COLLATIONS, etc (if anything) should show them as default.
Done.
And performance considerations become more important, I wasn't paying much attention to that, as I thought it's a fringe feature.
Like, a linear search in Charset_collation_map_st could be replaced with a binary. A HASH might be too much overhead, but a binary search in the array could be worth it. As it's a search that every single user will be using in the future, perhaps even in every single query. Or even many times per query.
Done. It now uses binary search.
+ LEX_CSTRING character_set_collations_str;
better LEX_CUSTRING, don't you think? It's not even a _str.
Right, it's not _str. It's a binary data. Removed the suffix.
What are benefits of LEX_CUSTRING?
LEX_CSTRING is a string, char*.
LEX_CUSTRING is an array of bytes, uchar*. I think it's more appropriate here.
I tried LEX_CUSTRING. It required a little bit more code than the current version. I propose to keep it LEX_CSTRING.
diff --git a/sql/log_event_server.cc b/sql/log_event_server.cc --- a/sql/log_event_server.cc +++ b/sql/log_event_server.cc @@ -1194,6 +1194,14 @@ bool Query_log_event::write() int2store(start+2, auto_increment_offset); start+= 4; } + + if (thd && thd->variables.character_set_collations.count()) + { + *start++= Q_COLLATIONS_SESSION; + size_t len= thd->variables.character_set_collations.to_binary((char*)start); + start+= len; + }
Perhaps, detect if it's needed? A cheap way of doing it would be to extend your Elem_st with a query_id. And every time you find_elem_by_charset, you set this elem's query_id to thd->query_id. And here you write only elements with the current query id. If any.
As agreed on Slack, I did it in the following way: The map is now written to binlog only if the statement performed a "get default collation for a character set" resolution. Most DML statements do not do such resolution, so this approach is very good to reduce the number of map writes. And it's very cheap and simple: a new flag in THD::used. Additionally, as discussed on slack, I added automatic PS re-prepare if @@character_set_collations has changed between PREPARE and EXECUTE. Thanks.