Re: [Maria-developers] 49ecf935415: MDEV-27009 Add UCA-14.0.0 collations

16 Jun 2022

Hello Sergei,

thanks for the review.


On 6/8/22 7:26 PM, Sergei Golubchik wrote:
...
Hi, Alexander,
Few comments/questions below.
Meanwhile I'm reviewing bb-10.9-bar-uca14
On May 26, Alexander Barkov wrote:
...
By the way, perhaps some of these statements should display
short collation names:
SHOW CREATE TABLE t1;
    SHOW CREATE DATABASE db1;
    SELECT COLLATION_NAME FROM INFORMATION_SCHEMA.COLUMNS;
    SELECT TABLE_COLLATION FROM INFORMATION_SCHEMA.TABLES;
    SELECT DEFAULT_COLLATION_NAME FROM INFORMATION_SCHEMA.SCHEMATA;
Can we discuss this?
Short names, I guess. First two - for readability, the last three - so
that one could join with INFORMATION_SCHEMA.COLLATIONS table.
In the branch preview-10.10-MDEV-27009-uca14, all these columns:

       SELECT COLLATION_NAME FROM INFORMATION_SCHEMA.COLUMNS;
       SELECT COLLATION_NAME FROM INFORMATION_SCHEMA.PARAMETERS;
       SELECT TABLE_COLLATION FROM INFORMATION_SCHEMA.TABLES;
       SELECT DEFAULT_COLLATION_NAME FROM INFORMATION_SCHEMA.SCHEMATA;
       SELECT COLLATION_NAME FROM INFORMATION_SCHEMA.ROUTINES;
       SELECT COLLATION_CONNECTION FROM INFORMATION_SCHEMA.EVENTS;
       SELECT DATABASE_COLLATION FROM INFORMATION_SCHEMA.EVENTS;
       SELECT COLLATION_CONNECTION FROM INFORMATION_SCHEMA.ROUTINES;
       SELECT DATABASE_COLLATION FROM INFORMATION_SCHEMA.ROUTINES;
       SELECT COLLATION_CONNECTION FROM INFORMATION_SCHEMA.TRIGGERS;
       SELECT DATABASE_COLLATION FROM INFORMATION_SCHEMA.TRIGGERS;
       SELECT COLLATION_CONNECTION FROM INFORMATION_SCHEMA.VIEWS;

(and corresponding columns in SHOW commands) display long names.

So we need to:

- Fix these columns to display short names
- Add corresponding CHARACTER_SET_XXX columns
- Fix system variables @@collation_xxx to understand short names.
- Fix mysqldump to set both @@character_set_connection
   and @@collation_connection. Now it sets only @@collation_connection.

It's not absolutely necessary, but would be nice to do it in the same 
release.

Is it too late for 10.10? Or can I try to prepare an additional
patch on top of the preview branch? I think 1-2 days should be
enough.
...
...
...
...
--- a/strings/ctype-simple.c
+++ b/strings/ctype-simple.c
@@ -1940,13 +1941,26 @@ my_bool my_propagate_complex(CHARSET_INFO *cs __attribute__((unused)),
...
why do you still use the old concept of "strength"? Why not to use
bitmap consistently everywhere?
The collation definition file Index.xml is based on the LDML syntax.
It uses tags like this:
<settings strength="2"/>
This function is needed to handle these LDML tags.
Btw, to define user-defined _AI_CS collations we'll need
to add an LDML extension eventually.
Hmm. Are you saying that LDML cannot describe a UCA 14.0.0 collation?
Or L1+L3 without L2 isn't standard?
L1+L3 is not standard. The closest thing is

<settings strength="1" caseLevel="On">

It's not precisely L1+L3, because L3 has more than just case 
characteristics. It also distinguishes various variants:
wide, font, cycle, narrow, sub, super, square, and some other.

This is an excerpt from UTR#35 about caseLevel:
...
If set to on, a level consisting only of case characteristics will be 
inserted in front of tertiary level. To ignore accents but take cases 
into account, set strength to primary and case level to on.
I suggest we implement both:
- An extension, something like strength="1,3" or strength="AI_CS".
   This is very easy to do.

- Then add caseLevel="On" at some point in the future.
   This will need more efforts.
...
...
...
...
+}
+

diff --git a/strings/ctype-uca1400data.h b/strings/ctype-uca1400data.h
--- /dev/null
+++ b/strings/ctype-uca1400data.h
@@ -0,0 +1,44151 @@
+/*
+  Generated from allkeys.txt version '14.0.0'
+*/
if it's generated, do you need to check it in?
perhaps it should be generated during the build?
you've checked in allkeys1400.txt anyway.
Right, we can consider it.
Btw, I've checked it all versions:
$ ls mysql-test/std_data/unicode/
allkeys1400.txt
allkeys400.txt
allkeys520.txt
So we can generate sources for all three UCA versions
from these files.
But I suggest we do it separately.
Should I create an MDEV for this?
Why would you want to add to the history a file that is generated?
It'll be there forever.
You already have uca-dump.c in the tree, running it during the build is
a few lines in CMakeLists.txt, basically copy-paste, because we already
do it for gen_lex_hash and comp_err.
I've implemented generation of ctype-uca1400data.h from allkeys1400.txt.

In which release should I fix to generate data for version 400 and 520 
as well?
...
Regards,
Sergei
VP of MariaDB Server Engineering
and security@mariadb.org