Re: [Maria-developers] 49ecf935415: MDEV-27009 Add UCA-14.0.0 collations

Hi, Alexander, On Mar 14, Alexander Barkov wrote:
please, list all user visible changes there. Mainly that collations are now decoupled from charsets. New syntax in CREATE TABLE, changes in I_S tables, etc.
file names are confusing. better rename ctype_ucs_uca1400_ids.inc to something like ctype_convert_uca1400_ids and ctype_utf_uca1400_ids to ctype_set_names_uca1400_ids or something like that, to show what they do.
That's FULL_COLLATION_NAME_SIZE, right?
AI_CS and S3 don't seem to be used yet
I don't see why you need a new argument here. It's create_info.default_charset_collation, so, mysql_alter_table already gets it in create_info. All other mysql_alter_table invocations also take create_info argument and can get default_charset_collation from there
Same. Can be in part_create_info
eh... please, rename the class from Yesno to something like Yesempty or Yes_or_empty, something that says that the second should not be Lex_cstring(STRING_WITH_LEN("No"))
Hmm, is that correct? Could you check other invocation of thd->reset_db()? Perhaps they all need to switch charset? In that case it should be done inside THD::reset_db(). Or may be they have to use mysql_change_db_impl() instead?
How could this (and similar if()'s in other files) fail?
you can have C++ code in mysys too, you know, no need to put it in sql/mysys*
don't use && in asserts, please create two separate asserts instead: DBUG_ASSERT(strength > 0); DBUG_ASSERT(strength <= MY_STRXFRM_NLEVELS);
+ cs->levels_for_order= ((1 << strength) - 1);
why do you still use the old concept of "strength"? Why not to use bitmap consistently everywhere?
looks quite ugly. can you do, like, default_charset_collation.set(cl) ?
if it's generated, do you need to check it in? perhaps it should be generated during the build? you've checked in allkeys1400.txt anyway.
Hmm, what if one is doing ALTER TABLE db.test CHARSET DEFAULT and current db is not `db` but `test` ?
+ if (!(cs= tmp.charset_collation())) + return true; // Should not actually happen
assert?
I'd rather put this in assert, not in if(). Like - if (!strncasecmp(context_name.str, STRING_WITH_LEN("uca1400_"))) + DBUG_ASSERT(!strncasecmp(context_cl_name.str, STRING_WITH_LEN("uca1400_")));
Like above, better DBUG_ASSERT(!strncasecmp(context_cl_name.str, STRING_WITH_LEN("uca1400_")))
What's going on with myanmar? You removed a version here and added &my_uca_v520 below in its charset_info_st. What does this change mean?
What does this change?
Regards, Sergei VP of MariaDB Server Engineering and security@mariadb.org

Hello Sergei, Thanks for the review. Please review the new set of UCA-14.0.0 patches here: https://github.com/MariaDB/server/tree/bb-10.9-bar-uca14 Please see comments below: On 3/16/22 10:19 PM, Sergei Golubchik wrote:
Added. By the way, perhaps some of these statements should display short collation names: SHOW CREATE TABLE t1; SHOW CREATE DATABASE db1; SELECT COLLATION_NAME FROM INFORMATION_SCHEMA.COLUMNS; SELECT TABLE_COLLATION FROM INFORMATION_SCHEMA.TABLES; SELECT DEFAULT_COLLATION_NAME FROM INFORMATION_SCHEMA.SCHEMATA; Can we discuss this?
Renamed to ctype_uca1400_ids_using_convert.inc ctype_uca1400_ids_using_set_names.inc
I think we can have just one at this point, which fits any collation name (full and short).
Right, there are no old _AI_CS and _AS_CS (aka S3) collations. New _AI_CS and _AS_CS collations definitions are initialized by this function: my_uca1400_collation_definition_init(MY_CHARSET_LOADER *loader, struct charset_info_st *dst, uint id) Level flags are calculated by this function from "id". So there are no hard-coded definitions with MY_CS_COLL_LEVELS_AI_CS and MY_CS_COLL_LEVELS_S3 either. Should I remove these definitions?
I extracted this part and pushed it separately under terms of this bug fix: commit 208addf48444c0a36a2cc16cd2558ae694e905d5 Author: Alexander Barkov <bar@mariadb.com> Date: Tue May 17 12:52:23 2022 +0400 Main patch MDEV-27896 Wrong result upon `COLLATE latin1_bin CHARACTER SET latin1` on the table or the database level As you suggested, I did not add the new paramenter, I changed the data type of "create_info" instead: bool mysql_alter_table(THD *thd, const LEX_CSTRING *new_db, const LEX_CSTRING *new_name, - HA_CREATE_INFO *create_info, + Table_specification_st *create_info,
Same here: mysql_prepare_alter_table(THD *thd, TABLE *table, - HA_CREATE_INFO *create_info, + Table_specification_st *create_info,
Renamed and pushed as a separate commit: commit 821808c45dd3c5d4bc98cd04810732f647872747 (origin/bb-10.5-bar) Author: Alexander Barkov <bar@mariadb.com> Date: Thu Apr 28 11:23:12 2022 +0400 A clean-up for "MDEV-19772 Add helper classes for ST_FIELD_INFO" As agreed with Serg, renaming class Yesno to Yes_or_empty, to reflect better its behavior.
Note, this part was moved to MDEV-27896. It's not a part of UCA14 patches any more. Anyway, I checked invocation of thd->reset_db() and did not find a general rule quickly. From a glance, they mostly don't seem to need to switch the charset. But it needs to be investigated further. Should I create an MDEV for this?
It can fail in this scenario: CREATE TABLE t1 (a CHAR(10) COLLATE uca1400_cs_ci) CHARACTER SET latin1; UCA collations are not applicable to latin1 yet. Btw, this part now looks differenlty. See HA_CREATE_INFO::resolve_to_charset_collation_context() in sql_table.cc: if (!(default_table_charset= default_cscl.resolved_to_context(ctx))) return true;
This is a good idea. There was one problem: Charset_loader_mysys pushed errors and warnings into the server diagnostics area. So it could not sit in include/my_sys.h as is. I split it into two parts: - Charset_loader_mysys is defined in include/my_sys.h and does not send any errors/warnings. It is self-sufficient and is fully defined in include/my_sys.h. It does not have any method implementations in c++ files. - There is a new class Charset_loader_server. It is defined in lex_charset.h as follows: class Charset_loader_server: public Charset_loader_mysys It sends errors and warnings. And has parts implemented in lex_charset.cc.
Done.
The collation definition file Index.xml is based on the LDML syntax. It uses tags like this: <settings strength="2"/> This function is needed to handle these LDML tags. Btw, to define user-defined _AI_CS collations we'll need to add an LDML extension eventually.
This code migrated to MDEV-27896 and is not a part of UCA14 patches any more. Now it looks differently, there are no ::operator=(cl) any more. There are more constructors instead. Anyway, I'd like to comment: I agree that it does not look like something we often use in the MariaDB sources. But I like it better than set(), because set() would need the reader to jump over the sources to know what set() actually does. On the contrary, the line with the operator is very self descriptive. It's full of information: "default_charset_collation derives from Lex_charset_collation_st and here we initialize the Lex_charset_collation_st part of it". So I think direct use of operator=() makes reading easier. Ading various set() wrappers around operator=() makes reading harder.
Right, we can consider it. Btw, I've checked it all versions: $ ls mysql-test/std_data/unicode/ allkeys1400.txt allkeys400.txt allkeys520.txt So we can generate sources for all three UCA versions from these files. But I suggest we do it separately. Should I create an MDEV for this?
Right, thanks for noticing this. The problem that both DEFAULT CHARACTER SET and CONVERT TO did not work well in some cases existed for a long time. When I moved MDEV-27896 out of the UCA patches, I reported CONVERT TO problems in: MDEV-28644 Unexpected error on ALTER TABLE t1 CONVERT TO CHARACTER SET utf8mb3, DEFAULT CHARACTER SET utf8mb4 The final patch for MDEV-27896 fixed this problem as well, as it was very easy after fixing DEFAULT CHARACTER SET cs [COLLATE cl]. The idea is that both "DEFAULT CHARACTER SET" and "CONVERT TO" clauses are now fully independent, and both use the new class Lex_table_charset_collation_attrs_st as a storage: struct Table_specification_st: public HA_CREATE_INFO, public DDL_options_st { Lex_table_charset_collation_attrs_st default_charset_collation; Lex_table_charset_collation_attrs_st convert_charset_collation;
This code migrated to MDEV-27896 and was changed. There is no a line like this any more. Instead, there are classes Lex_exact_charset, Lex_exact_collation, Lex_context_collation. They catch NULL in constructors, e.g.: class Lex_exact_charset { CHARSET_INFO *m_ci; public: explicit Lex_exact_charset(CHARSET_INFO *ci) :m_ci(ci)
I fixed this in MDEV-27896. The patch for MDEV-27896 has this assert in a couple of places: DBUG_ASSERT(!strncmp(cl.charset_info()->coll_name.str, STRING_WITH_LEN("utf8mb4_uca1400_"))) <cut>
There are two ways to define the version: 1. Using the [version...] option in the tailoring. 2. Using the hardcoded initialization in the charset_info_st definition. Although, built-in collations should normally use #2, the approach #1 also worked without problems for built-in collations. But this just assumed the tailoring is used with one UCA version only! So I changed the old built-in myanmar collation to use #2 instead of #1. It changes nothing for the old myanmar collations. But the tailoring defined in "static const char myanmar[]" can now be reused in combination with multiple UCA versions.

Hi, Alexander, Few comments/questions below. Meanwhile I'm reviewing bb-10.9-bar-uca14 On May 26, Alexander Barkov wrote:
Short names, I guess. First two - for readability, the last three - so that one could join with INFORMATION_SCHEMA.COLLATIONS table.
as you like, but if you want to keep them - add a comment. otherwise anyone can remove them when they'll see those defines are unused
Hmm. Are you saying that LDML cannot describe a UCA 14.0.0 collation? Or L1+L3 without L2 isn't standard?
Why would you want to add to the history a file that is generated? It'll be there forever. You already have uca-dump.c in the tree, running it during the build is a few lines in CMakeLists.txt, basically copy-paste, because we already do it for gen_lex_hash and comp_err. Regards, Sergei VP of MariaDB Server Engineering and security@mariadb.org

Hello Sergei, thanks for the review. On 6/8/22 7:26 PM, Sergei Golubchik wrote:
In the branch preview-10.10-MDEV-27009-uca14, all these columns: SELECT COLLATION_NAME FROM INFORMATION_SCHEMA.COLUMNS; SELECT COLLATION_NAME FROM INFORMATION_SCHEMA.PARAMETERS; SELECT TABLE_COLLATION FROM INFORMATION_SCHEMA.TABLES; SELECT DEFAULT_COLLATION_NAME FROM INFORMATION_SCHEMA.SCHEMATA; SELECT COLLATION_NAME FROM INFORMATION_SCHEMA.ROUTINES; SELECT COLLATION_CONNECTION FROM INFORMATION_SCHEMA.EVENTS; SELECT DATABASE_COLLATION FROM INFORMATION_SCHEMA.EVENTS; SELECT COLLATION_CONNECTION FROM INFORMATION_SCHEMA.ROUTINES; SELECT DATABASE_COLLATION FROM INFORMATION_SCHEMA.ROUTINES; SELECT COLLATION_CONNECTION FROM INFORMATION_SCHEMA.TRIGGERS; SELECT DATABASE_COLLATION FROM INFORMATION_SCHEMA.TRIGGERS; SELECT COLLATION_CONNECTION FROM INFORMATION_SCHEMA.VIEWS; (and corresponding columns in SHOW commands) display long names. So we need to: - Fix these columns to display short names - Add corresponding CHARACTER_SET_XXX columns - Fix system variables @@collation_xxx to understand short names. - Fix mysqldump to set both @@character_set_connection and @@collation_connection. Now it sets only @@collation_connection. It's not absolutely necessary, but would be nice to do it in the same release. Is it too late for 10.10? Or can I try to prepare an additional patch on top of the preview branch? I think 1-2 days should be enough.
L1+L3 is not standard. The closest thing is <settings strength="1" caseLevel="On"> It's not precisely L1+L3, because L3 has more than just case characteristics. It also distinguishes various variants: wide, font, cycle, narrow, sub, super, square, and some other. This is an excerpt from UTR#35 about caseLevel:
I suggest we implement both: - An extension, something like strength="1,3" or strength="AI_CS". This is very easy to do. - Then add caseLevel="On" at some point in the future. This will need more efforts.
I've implemented generation of ctype-uca1400data.h from allkeys1400.txt. In which release should I fix to generate data for version 400 and 520 as well?
participants (2)
-
Alexander Barkov
-
Sergei Golubchik