Re: [Maria-developers] 49ecf935415: MDEV-27009 Add UCA-14.0.0 collations
Hi, Alexander, On Mar 14, Alexander Barkov wrote:
revision-id: 49ecf935415 (mariadb-10.6.1-335-g49ecf935415) parent(s): c67789f63c8 author: Alexander Barkov committer: Alexander Barkov timestamp: 2022-02-28 14:04:58 +0400 message:
MDEV-27009 Add UCA-14.0.0 collations
please, list all user visible changes there. Mainly that collations are now decoupled from charsets. New syntax in CREATE TABLE, changes in I_S tables, etc.
diff --git a/mysql-test/include/ctype_utf_uca1400_ids.inc b/mysql-test/include/ctype_utf_uca1400_ids.inc new file mode 100644 index 00000000000..09cf49fc0e7 --- /dev/null +++ b/mysql-test/include/ctype_utf_uca1400_ids.inc @@ -0,0 +1,17 @@
file names are confusing. better rename ctype_ucs_uca1400_ids.inc to something like ctype_convert_uca1400_ids and ctype_utf_uca1400_ids to ctype_set_names_uca1400_ids or something like that, to show what they do.
+ +--disable_ps_protocol +--enable_metadata +DELIMITER $$; +FOR rec IN (SELECT COLLATION_NAME + FROM INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY + WHERE CHARACTER_SET_NAME=@charset + AND COLLATION_NAME RLIKE 'uca1400' + ORDER BY ID) +DO + EXECUTE IMMEDIATE CONCAT('SET NAMES ',@charset,' COLLATE ', rec.COLLATION_NAME); + SELECT rec.COLLATION_NAME; +END FOR; +$$ +DELIMITER ;$$ +--disable_metadata +--enable_ps_protocol diff --git a/include/m_ctype.h b/include/m_ctype.h index 4c6628b72b3..706764ead2a 100644 --- a/include/m_ctype.h +++ b/include/m_ctype.h @@ -34,7 +34,9 @@ enum loglevel { extern "C" { #endif
-#define MY_CS_NAME_SIZE 32 +#define MY_CS_CHARACTER_SET_NAME_SIZE 32 +#define MY_CS_COLLATION_NAME_SIZE 64
That's FULL_COLLATION_NAME_SIZE, right?
+ #define MY_CS_CTYPE_TABLE_SIZE 257 #define MY_CS_TO_LOWER_TABLE_SIZE 256 #define MY_CS_TO_UPPER_TABLE_SIZE 256 @@ -240,6 +242,46 @@ typedef enum enum_repertoire_t } my_repertoire_t;
+/* ID compatibility */ +typedef enum enum_collation_id_type +{ + MY_COLLATION_ID_TYPE_PRECISE= 0, + MY_COLLATION_ID_TYPE_COMPAT_100800= 1 +} my_collation_id_type_t; + + +/* Collation name display modes */ +typedef enum enum_collation_name_mode +{ + MY_COLLATION_NAME_MODE_FULL= 0, + MY_COLLATION_NAME_MODE_CONTEXT= 1 +} my_collation_name_mode_t; + + +/* Level flags */ +#define MY_CS_LEVEL_BIT_PRIMARY 0x00 +#define MY_CS_LEVEL_BIT_SECONDARY 0x01 +#define MY_CS_LEVEL_BIT_TERTIARY 0x02 +#define MY_CS_LEVEL_BIT_QUATERNARY 0x03 + +#define MY_CS_COLL_LEVELS_S1 (1<<MY_CS_LEVEL_BIT_PRIMARY) + +#define MY_CS_COLL_LEVELS_AI_CS (1<<MY_CS_LEVEL_BIT_PRIMARY)| \ + (1<<MY_CS_LEVEL_BIT_TERTIARY) + +#define MY_CS_COLL_LEVELS_S2 (1<<MY_CS_LEVEL_BIT_PRIMARY)| \ + (1<<MY_CS_LEVEL_BIT_SECONDARY) + +#define MY_CS_COLL_LEVELS_S3 (1<<MY_CS_LEVEL_BIT_PRIMARY)| \ + (1<<MY_CS_LEVEL_BIT_SECONDARY) | \ + (1<<MY_CS_LEVEL_BIT_TERTIARY)
AI_CS and S3 don't seem to be used yet
+ +#define MY_CS_COLL_LEVELS_S4 (1<<MY_CS_LEVEL_BIT_PRIMARY)| \ + (1<<MY_CS_LEVEL_BIT_SECONDARY) | \ + (1<<MY_CS_LEVEL_BIT_TERTIARY) | \ + (1<<MY_CS_LEVEL_BIT_QUATERNARY) + + /* Flags for strxfrm */ #define MY_STRXFRM_LEVEL1 0x00000001 /* for primary weights */ #define MY_STRXFRM_LEVEL2 0x00000002 /* for secondary weights */ diff --git a/sql/sql_alter.cc b/sql/sql_alter.cc index 86c6e9a27f8..9ddd482ad57 100644 --- a/sql/sql_alter.cc +++ b/sql/sql_alter.cc @@ -546,6 +546,7 @@ bool Sql_cmd_alter_table::execute(THD *thd)
result= mysql_alter_table(thd, &select_lex->db, &lex->name, &create_info, + lex->create_info.default_charset_collation,
I don't see why you need a new argument here. It's create_info.default_charset_collation, so, mysql_alter_table already gets it in create_info. All other mysql_alter_table invocations also take create_info argument and can get default_charset_collation from there
first_table, &alter_info, select_lex->order_list.elements, diff --git a/sql/sql_partition_admin.cc b/sql/sql_partition_admin.cc index fb1ae0d5fc7..4188dde252b 100644 --- a/sql/sql_partition_admin.cc +++ b/sql/sql_partition_admin.cc @@ -211,6 +211,7 @@ bool compare_table_with_partition(THD *thd, TABLE *table, TABLE *part_table, part_table->use_all_columns(); table->use_all_columns(); if (unlikely(mysql_prepare_alter_table(thd, part_table, &part_create_info, + Lex_maybe_default_charset_collation(),
Same. Can be in part_create_info
&part_alter_info, &part_alter_ctx))) { my_error(ER_TABLES_DIFFERENT_METADATA, MYF(0)); diff --git a/sql/sql_i_s.h b/sql/sql_i_s.h index bed2e886718..5ff06d32231 100644 --- a/sql/sql_i_s.h +++ b/sql/sql_i_s.h @@ -162,6 +162,11 @@ class Yesno: public Varchar { public: Yesno(): Varchar(3) { } + static LEX_CSTRING value(bool val) + { + return val ? Lex_cstring(STRING_WITH_LEN("Yes")) : + Lex_cstring(); + }
eh... please, rename the class from Yesno to something like Yesempty or Yes_or_empty, something that says that the second should not be Lex_cstring(STRING_WITH_LEN("No"))
};
diff --git a/sql/table.cc b/sql/table.cc index a683a78ff49..c28cb2bd928 100644 --- a/sql/table.cc +++ b/sql/table.cc @@ -3491,6 +3493,16 @@ int TABLE_SHARE::init_from_sql_statement_string(THD *thd, bool write, else thd->set_n_backup_active_arena(arena, &backup);
+ /* + THD::reset_db() does not set THD::db_charset, + so it keeps pointing to the character set and collation + of the current database, rather than the database of the + new initialized table.
Hmm, is that correct? Could you check the other invocations of thd->reset_db()? Perhaps they all need to switch charset? In that case it should be done inside THD::reset_db(). Or maybe they have to use mysql_change_db_impl() instead?
+ Let's call get_default_db_collation() before reset_db(). + This forces the db.opt file to be loaded. + */ + db_cs= get_default_db_collation(thd, db.str); + thd->reset_db(&db); lex_start(thd);
@@ -3498,6 +3510,11 @@ int TABLE_SHARE::init_from_sql_statement_string(THD *thd, bool write, sql_unusable_for_discovery(thd, hton, sql_copy)))) goto ret;
+ if (!(thd->lex->create_info.default_table_charset= + thd->lex->create_info.default_charset_collation. + resolved_to_character_set(db_cs, db_cs))) + DBUG_RETURN(true);
How could this (and similar if()'s in other files) fail?
+ thd->lex->create_info.db_type= hton; #ifdef WITH_PARTITION_STORAGE_ENGINE thd->work_part_info= 0; // For partitioning diff --git a/sql/mysys_charset.h b/sql/mysys_charset.h new file mode 100644 index 00000000000..86eaeedd432 --- /dev/null +++ b/sql/mysys_charset.h @@ -0,0 +1,44 @@ +#ifndef MYSYS_CHARSET +#define MYSYS_CHARSET + +/* Copyright (c) 2021, MariaDB Corporation. + + This program is free software; you can redistribute it and/or modify + it under the terms of the GNU General Public License as published by + the Free Software Foundation; version 2 of the License. + + This program is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + GNU General Public License for more details. + + You should have received a copy of the GNU General Public License + along with this program; if not, write to the Free Software + Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1335 USA */ + + +#include "my_sys.h" + + +class Charset_loader_mysys: public MY_CHARSET_LOADER +{ +public: + Charset_loader_mysys() + { + my_charset_loader_init_mysys(this); + } + void raise_unknown_collation_error(const char *name, + CHARSET_INFO *name_cs) const; + CHARSET_INFO *get_charset(const char *cs_name, uint cs_flags, myf my_flags); + CHARSET_INFO *get_exact_collation(const char *name, myf utf8_flag); + CHARSET_INFO *get_contextually_typed_collation(CHARSET_INFO *cs, + const char *name); + CHARSET_INFO *get_contextually_typed_collation(const char *name); + CHARSET_INFO *get_contextually_typed_collation_or_error(CHARSET_INFO *cs, + const char *name); + CHARSET_INFO *find_default_collation(CHARSET_INFO *cs); + CHARSET_INFO *find_bin_collation_or_error(CHARSET_INFO *cs); +};
you can have C++ code in mysys too, you know, no need to put it in sql/mysys*
+ +#endif // MYSYS_CHARSET + diff --git a/strings/ctype-simple.c b/strings/ctype-simple.c index b579f0af203..d09dfba86ed 100644 --- a/strings/ctype-simple.c +++ b/strings/ctype-simple.c @@ -1940,13 +1941,26 @@ my_bool my_propagate_complex(CHARSET_INFO *cs __attribute__((unused)), }
+void my_ci_set_strength(struct charset_info_st *cs, uint strength) +{ + DBUG_ASSERT(strength > 0 && strength <= MY_STRXFRM_NLEVELS);
don't use && in asserts, please create two separate asserts instead: DBUG_ASSERT(strength > 0); DBUG_ASSERT(strength <= MY_STRXFRM_NLEVELS);
+ cs->levels_for_order= ((1 << strength) - 1);
why do you still use the old concept of "strength"? Why not use a bitmap consistently everywhere?
+} + + +void my_ci_set_level_flags(struct charset_info_st *cs, uint flags) +{ + DBUG_ASSERT(flags < (1<<MY_STRXFRM_NLEVELS)); + cs->levels_for_order= flags; +} + /* Normalize strxfrm flags
SYNOPSIS: my_strxfrm_flag_normalize() + cs - the CHARSET_INFO pointer flags - non-normalized flags - nlevels - number of levels
NOTES: If levels are omitted, then 1-maximum is assumed. diff --git a/sql/handler.h b/sql/handler.h index 8ad521e189a..1e82f37b1e7 100644 --- a/sql/handler.h +++ b/sql/handler.h @@ -2409,7 +2386,32 @@ struct Table_specification_st: public HA_CREATE_INFO, { HA_CREATE_INFO::options= 0; DDL_options_st::init(); + default_charset_collation.init(); + } + + bool + add_alter_list_item_convert_to_charset(const Lex_charset_collation_st &cl) + { + /* + cs cannot be NULL, as sql_yacc.yy translates + CONVERT TO CHARACTER SET DEFAULT + to + CONVERT TO CHARACTER SET <character-set-of-the-current-database> + TODO: Shouldn't we postpone resolution of DEFAULT until the + character set of the table owner database is loaded from its db.opt? + */ + DBUG_ASSERT(cl.charset_collation()); + DBUG_ASSERT(!cl.is_contextually_typed_collation()); + alter_table_convert_to_charset= cl.charset_collation(); + default_charset_collation.Lex_charset_collation_st::operator=(cl);
looks quite ugly. can you do, like, default_charset_collation.set(cl) ?
+ used_fields|= (HA_CREATE_USED_CHARSET | HA_CREATE_USED_DEFAULT_CHARSET); + return false; } + bool add_table_option_default_charset(CHARSET_INFO *cs); + bool add_table_option_default_collation(const Lex_charset_collation_st &cl); + bool resolve_db_charset_and_collation(THD *thd, + const LEX_CSTRING &db, + bool is_alter); };
diff --git a/strings/ctype-uca1400data.h b/strings/ctype-uca1400data.h new file mode 100644 index 00000000000..da95dcfde54 --- /dev/null +++ b/strings/ctype-uca1400data.h @@ -0,0 +1,44151 @@ +/* + Generated from allkeys.txt version '14.0.0' +*/
if it's generated, do you need to check it in? perhaps it should be generated during the build? you've checked in allkeys1400.txt anyway.
+static const uint16 uca1400_p000[]= { /* 0000 (4 weights per char) */ +0x0000,0x0000,0x0000,0x0000, 0x0000,0x0000,0x0000,0x0000, /* 0000 */ +0x0000,0x0000,0x0000,0x0000, 0x0000,0x0000,0x0000,0x0000, /* 0002 */ diff --git a/sql/sql_lex.cc b/sql/sql_lex.cc index 6ca10267187..d115401a855 100644 --- a/sql/sql_lex.cc +++ b/sql/sql_lex.cc @@ -542,6 +542,30 @@ bool LEX::add_alter_list(LEX_CSTRING name, LEX_CSTRING new_name, bool exists) }
+bool LEX::add_alter_list_item_convert_to_charset( + THD *thd, + CHARSET_INFO *cs, + const Lex_charset_collation_st &cl) +{ + if (!cs) + { + Lex_charset_collation_st tmp; + tmp.set_charset_collate_default(thd->variables.collation_database);
Hmm, what if one is doing ALTER TABLE db.test CHARSET DEFAULT and current db is not `db` but `test` ?
+ if (!(cs= tmp.charset_collation())) + return true; // Should not actually happen
assert?
+ } + + Lex_explicit_charset_opt_collate tmp(cs, false); + if (tmp.merge_opt_collate_or_error(cl) || + create_info.add_alter_list_item_convert_to_charset( + Lex_charset_collation(tmp))) + return true; + + alter_info.flags|= ALTER_CONVERT_TO; + return false; +} + + void LEX::init_last_field(Column_definition *field, const LEX_CSTRING *field_name) { @@ -11871,29 +11869,41 @@ CHARSET_INFO *Lex_collation_st::find_default_collation(CHARSET_INFO *cs) "def" is the upper level CHARACTER SET clause (e.g. of a table) */ CHARSET_INFO * -Lex_collation_st::resolved_to_character_set(CHARSET_INFO *def) const +Lex_charset_collation_st::resolved_to_character_set(CHARSET_INFO *def) const { DBUG_ASSERT(def); - if (m_type != TYPE_CONTEXTUALLY_TYPED) - { - if (!m_collation) - return def; // Empty - not typed at all - return m_collation; // Explicitly typed + + switch (m_type) { + case TYPE_EMPTY: + return def; + case TYPE_CHARACTER_SET: + DBUG_ASSERT(m_ci); + return m_ci; + case TYPE_COLLATE_EXACT: + DBUG_ASSERT(m_ci); + return m_ci; + case TYPE_COLLATE_CONTEXTUALLY_TYPED: + break; }
// Contextually typed - DBUG_ASSERT(m_collation); + DBUG_ASSERT(m_ci);
- if (m_collation == &my_charset_bin) // CHAR(10) BINARY - return find_bin_collation(def); + Charset_loader_mysys loader; + if (is_contextually_typed_binary_style()) // CHAR(10) BINARY + return loader.find_bin_collation_or_error(def);
- if (m_collation == &my_charset_latin1) // CHAR(10) COLLATE DEFAULT - return find_default_collation(def); + if (is_contextually_typed_collate_default()) // CHAR(10) COLLATE DEFAULT + return loader.find_default_collation(def); + + const LEX_CSTRING context_name= collation_name_context_suffix();
I'd rather put this in an assert, not in an if(). Like
- if (!strncasecmp(context_name.str, STRING_WITH_LEN("uca1400_")))
+ DBUG_ASSERT(!strncasecmp(context_name.str, STRING_WITH_LEN("uca1400_")));
+ if (!strncasecmp(context_name.str, STRING_WITH_LEN("uca1400_"))) + return loader.get_contextually_typed_collation_or_error(def, + context_name.str);
/* - Non-binary and non-default contextually typed collation. + Non-binary, non-default, non-uca1400 contextually typed collation. We don't have such yet - the parser cannot produce this. - But will have soon, e.g. "uca1400_as_ci". */ DBUG_ASSERT(0); return NULL; @@ -11944,58 +11972,106 @@ bool Lex_collation_st:: CHAR(10) BINARY .. COLLATE latin1_bin CHAR(10) COLLATE uca1400_as_ci .. COLLATE latin1_bin */ - if (collation() == &my_charset_latin1 && - !(cl.collation()->state & MY_CS_PRIMARY)) + if (is_contextually_typed_collate_default() && + !cl.charset_collation()->default_flag()) { - my_error(ER_CONFLICTING_DECLARATIONS, MYF(0), - "COLLATE ", "DEFAULT", "COLLATE ", - cl.collation()->coll_name.str); + error_conflicting_collations_or_styles(*this, cl); return true; } - if (collation() == &my_charset_bin && - !(cl.collation()->state & MY_CS_BINSORT)) + + if (is_contextually_typed_binary_style() && + !cl.charset_collation()->binsort_flag()) { - my_error(ER_CONFLICTING_DECLARATIONS, MYF(0), - "", "BINARY", "COLLATE ", cl.collation()->coll_name.str); + error_conflicting_collations_or_styles(*this, cl); return true; } *this= cl; return false; }
- if (cl.is_contextually_typed_collation()) - { + DBUG_ASSERT(0); + return false; +} + + +bool +Lex_explicit_charset_opt_collate:: + merge_collate_or_error(const Lex_charset_collation_st &cl) +{ + DBUG_ASSERT(cl.type() != Lex_charset_collation_st::TYPE_CHARACTER_SET); + + switch (cl.type()) { + case Lex_charset_collation_st::TYPE_EMPTY: + return false; + case Lex_charset_collation_st::TYPE_CHARACTER_SET: + DBUG_ASSERT(0); + return false; + case Lex_charset_collation_st::TYPE_COLLATE_EXACT: /* - EXPLICIT + CONTEXT - CHAR(10) COLLATE latin1_bin .. COLLATE DEFAULT - not supported - CHAR(10) COLLATE latin1_bin .. COLLATE uca1400_as_ci - not yet + EXPLICIT + EXPLICIT + CHAR(10) CHARACTER SET latin1 .. COLLATE latin1_bin + CHAR(10) CHARACTER SET latin1 COLLATE latin1_bin .. COLLATE latin1_bin + CHAR(10) COLLATE latin1_bin .. COLLATE latin1_bin + CHAR(10) COLLATE latin1_bin .. COLLATE latin1_bin + CHAR(10) CHARACTER SET latin1 BINARY .. COLLATE latin1_bin */ - DBUG_ASSERT(0); // Not possible yet + if (m_with_collate && m_ci != cl.charset_collation()) + { + my_error(ER_CONFLICTING_DECLARATIONS, MYF(0), + "COLLATE ", m_ci->coll_name.str, + "COLLATE ", cl.charset_collation()->coll_name.str); + return true; + } + if (!my_charset_same(m_ci, cl.charset_collation())) + { + my_error(ER_COLLATION_CHARSET_MISMATCH, MYF(0), + cl.charset_collation()->coll_name.str, m_ci->cs_name.str); + return true; + } + m_ci= cl.charset_collation(); + m_with_collate= true; return false; - }
- /* - EXPLICIT + EXPLICIT - CHAR(10) CHARACTER SET latin1 .. COLLATE latin1_bin - CHAR(10) CHARACTER SET latin1 COLLATE latin1_bin .. COLLATE latin1_bin - CHAR(10) COLLATE latin1_bin .. COLLATE latin1_bin - CHAR(10) COLLATE latin1_bin .. COLLATE latin1_bin - CHAR(10) CHARACTER SET latin1 BINARY .. COLLATE latin1_bin - */ - if (type() == TYPE_EXPLICIT && collation() != cl.collation()) - { - my_error(ER_CONFLICTING_DECLARATIONS, MYF(0), - "COLLATE ", collation()->coll_name.str, - "COLLATE ", cl.collation()->coll_name.str); - return true; - } - if (!my_charset_same(collation(), cl.collation())) - { - my_error(ER_COLLATION_CHARSET_MISMATCH, MYF(0), - cl.collation()->coll_name.str, collation()->cs_name.str); - return true; + case Lex_charset_collation_st::TYPE_COLLATE_CONTEXTUALLY_TYPED: + if (cl.is_contextually_typed_collate_default()) + { + /* + SET NAMES latin1 COLLATE DEFAULT; + ALTER TABLE t1 CONVERT TO CHARACTER SET latin1 COLLATE DEFAULT; + */ + CHARSET_INFO *tmp= Charset_loader_mysys().find_default_collation(m_ci); + if (!tmp) + return true; + m_ci= tmp; + m_with_collate= true; + return false; + } + else + { + /* + EXPLICIT + CONTEXT + CHAR(10) COLLATE latin1_bin .. COLLATE DEFAULT not possible yet + CHAR(10) COLLATE latin1_bin .. COLLATE uca1400_as_ci + */ + + const LEX_CSTRING context_cl_name= cl.collation_name_context_suffix(); + if (!strncasecmp(context_cl_name.str, STRING_WITH_LEN("uca1400_")))
Like above, better DBUG_ASSERT(!strncasecmp(context_cl_name.str, STRING_WITH_LEN("uca1400_")))
+ { + CHARSET_INFO *tmp; + Charset_loader_mysys loader; + if (!(tmp= loader.get_contextually_typed_collation_or_error(m_ci, + context_cl_name.str))) + return true; + m_with_collate= true; + m_ci= tmp; + return false; + } + + DBUG_ASSERT(0); // Not possible yet + return false; + } } - *this= cl; + DBUG_ASSERT(0); return false; }
diff --git a/strings/ctype-uca.c b/strings/ctype-uca.c index b89916f3b20..3e6b4e4ce43 100644 --- a/strings/ctype-uca.c +++ b/strings/ctype-uca.c @@ -30542,7 +30613,7 @@ static const char vietnamese[]= Myanmar, according to CLDR Revision 8900. http://unicode.org/cldr/trac/browser/trunk/common/collation/my.xml */ -static const char myanmar[]= "[shift-after-method expand][version 5.2.0]" +static const char myanmar[]= "[shift-after-method expand]"
What's going on with myanmar? You removed a version here and added &my_uca_v520 below in its charset_info_st. What does this change mean?
/* Tones */ "&\\u108C" "<\\u1037" @@ -37627,7 +37825,7 @@ struct charset_info_st my_charset_utf32_myanmar_uca_ci= NULL, /* to_lower */ NULL, /* to_upper */ NULL, /* sort_order */ - NULL, /* uca */ + &my_uca_v520, /* uca */
What does this change?
NULL, /* tab_to_uni */ NULL, /* tab_from_uni */ &my_unicase_unicode520,/* caseinfo */
Regards, Sergei VP of MariaDB Server Engineering and security@mariadb.org
Hello Sergei,

Thanks for the review.

Please review the new set of UCA-14.0.0 patches here:
https://github.com/MariaDB/server/tree/bb-10.9-bar-uca14

Please see comments below:

On 3/16/22 10:19 PM, Sergei Golubchik wrote:
Hi, Alexander,
On Mar 14, Alexander Barkov wrote:
revision-id: 49ecf935415 (mariadb-10.6.1-335-g49ecf935415) parent(s): c67789f63c8 author: Alexander Barkov committer: Alexander Barkov timestamp: 2022-02-28 14:04:58 +0400 message:
MDEV-27009 Add UCA-14.0.0 collations
please, list all user visible changes there. Mainly that collations are now decoupled from charsets. New syntax in CREATE TABLE, changes in I_S tables, etc.
Added.

By the way, perhaps some of these statements should display short collation names:

SHOW CREATE TABLE t1;
SHOW CREATE DATABASE db1;
SELECT COLLATION_NAME FROM INFORMATION_SCHEMA.COLUMNS;
SELECT TABLE_COLLATION FROM INFORMATION_SCHEMA.TABLES;
SELECT DEFAULT_COLLATION_NAME FROM INFORMATION_SCHEMA.SCHEMATA;

Can we discuss this?
diff --git a/mysql-test/include/ctype_utf_uca1400_ids.inc b/mysql-test/include/ctype_utf_uca1400_ids.inc new file mode 100644 index 00000000000..09cf49fc0e7 --- /dev/null +++ b/mysql-test/include/ctype_utf_uca1400_ids.inc @@ -0,0 +1,17 @@
file names are confusing. better rename ctype_ucs_uca1400_ids.inc to something like ctype_convert_uca1400_ids and ctype_utf_uca1400_ids to ctype_set_names_uca1400_ids or something like that, to show what they do.
Renamed to:
ctype_uca1400_ids_using_convert.inc
ctype_uca1400_ids_using_set_names.inc
+ +--disable_ps_protocol +--enable_metadata +DELIMITER $$; +FOR rec IN (SELECT COLLATION_NAME + FROM INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY + WHERE CHARACTER_SET_NAME=@charset + AND COLLATION_NAME RLIKE 'uca1400' + ORDER BY ID) +DO + EXECUTE IMMEDIATE CONCAT('SET NAMES ',@charset,' COLLATE ', rec.COLLATION_NAME); + SELECT rec.COLLATION_NAME; +END FOR; +$$ +DELIMITER ;$$ +--disable_metadata +--enable_ps_protocol diff --git a/include/m_ctype.h b/include/m_ctype.h index 4c6628b72b3..706764ead2a 100644 --- a/include/m_ctype.h +++ b/include/m_ctype.h @@ -34,7 +34,9 @@ enum loglevel { extern "C" { #endif
-#define MY_CS_NAME_SIZE 32 +#define MY_CS_CHARACTER_SET_NAME_SIZE 32 +#define MY_CS_COLLATION_NAME_SIZE 64
That's FULL_COLLATION_NAME_SIZE, right?
I think we can have just one at this point, which fits any collation name (full and short).
+ #define MY_CS_CTYPE_TABLE_SIZE 257 #define MY_CS_TO_LOWER_TABLE_SIZE 256 #define MY_CS_TO_UPPER_TABLE_SIZE 256 @@ -240,6 +242,46 @@ typedef enum enum_repertoire_t } my_repertoire_t;
+/* ID compatibility */ +typedef enum enum_collation_id_type +{ + MY_COLLATION_ID_TYPE_PRECISE= 0, + MY_COLLATION_ID_TYPE_COMPAT_100800= 1 +} my_collation_id_type_t; + + +/* Collation name display modes */ +typedef enum enum_collation_name_mode +{ + MY_COLLATION_NAME_MODE_FULL= 0, + MY_COLLATION_NAME_MODE_CONTEXT= 1 +} my_collation_name_mode_t; + + +/* Level flags */ +#define MY_CS_LEVEL_BIT_PRIMARY 0x00 +#define MY_CS_LEVEL_BIT_SECONDARY 0x01 +#define MY_CS_LEVEL_BIT_TERTIARY 0x02 +#define MY_CS_LEVEL_BIT_QUATERNARY 0x03 + +#define MY_CS_COLL_LEVELS_S1 (1<<MY_CS_LEVEL_BIT_PRIMARY) + +#define MY_CS_COLL_LEVELS_AI_CS (1<<MY_CS_LEVEL_BIT_PRIMARY)| \ + (1<<MY_CS_LEVEL_BIT_TERTIARY) + +#define MY_CS_COLL_LEVELS_S2 (1<<MY_CS_LEVEL_BIT_PRIMARY)| \ + (1<<MY_CS_LEVEL_BIT_SECONDARY) + +#define MY_CS_COLL_LEVELS_S3 (1<<MY_CS_LEVEL_BIT_PRIMARY)| \ + (1<<MY_CS_LEVEL_BIT_SECONDARY) | \ + (1<<MY_CS_LEVEL_BIT_TERTIARY)
AI_CS and S3 don't seem to be used yet
Right, there are no old _AI_CS and _AS_CS (aka S3) collations. New _AI_CS and _AS_CS collation definitions are initialized by this function:

my_uca1400_collation_definition_init(MY_CHARSET_LOADER *loader, struct charset_info_st *dst, uint id)

Level flags are calculated by this function from "id". So there are no hard-coded definitions with MY_CS_COLL_LEVELS_AI_CS and MY_CS_COLL_LEVELS_S3 either. Should I remove these definitions?
+ +#define MY_CS_COLL_LEVELS_S4 (1<<MY_CS_LEVEL_BIT_PRIMARY)| \ + (1<<MY_CS_LEVEL_BIT_SECONDARY) | \ + (1<<MY_CS_LEVEL_BIT_TERTIARY) | \ + (1<<MY_CS_LEVEL_BIT_QUATERNARY) + + /* Flags for strxfrm */ #define MY_STRXFRM_LEVEL1 0x00000001 /* for primary weights */ #define MY_STRXFRM_LEVEL2 0x00000002 /* for secondary weights */ diff --git a/sql/sql_alter.cc b/sql/sql_alter.cc index 86c6e9a27f8..9ddd482ad57 100644 --- a/sql/sql_alter.cc +++ b/sql/sql_alter.cc @@ -546,6 +546,7 @@ bool Sql_cmd_alter_table::execute(THD *thd)
result= mysql_alter_table(thd, &select_lex->db, &lex->name, &create_info, + lex->create_info.default_charset_collation,
I don't see why you need a new argument here. It's create_info.default_charset_collation, so, mysql_alter_table already gets it in create_info. All other mysql_alter_table invocations also take create_info argument and can get default_charset_collation from there
I extracted this part and pushed it separately as part of this bug fix:

commit 208addf48444c0a36a2cc16cd2558ae694e905d5
Author: Alexander Barkov <bar@mariadb.com>
Date: Tue May 17 12:52:23 2022 +0400

    Main patch MDEV-27896 Wrong result upon `COLLATE latin1_bin CHARACTER SET latin1` on the table or the database level

As you suggested, I did not add the new parameter. I changed the data type of "create_info" instead:

bool mysql_alter_table(THD *thd, const LEX_CSTRING *new_db, const LEX_CSTRING *new_name,
- HA_CREATE_INFO *create_info,
+ Table_specification_st *create_info,
first_table, &alter_info, select_lex->order_list.elements, diff --git a/sql/sql_partition_admin.cc b/sql/sql_partition_admin.cc index fb1ae0d5fc7..4188dde252b 100644 --- a/sql/sql_partition_admin.cc +++ b/sql/sql_partition_admin.cc @@ -211,6 +211,7 @@ bool compare_table_with_partition(THD *thd, TABLE *table, TABLE *part_table, part_table->use_all_columns(); table->use_all_columns(); if (unlikely(mysql_prepare_alter_table(thd, part_table, &part_create_info, + Lex_maybe_default_charset_collation(),
Same. Can be in part_create_info
Same here:

mysql_prepare_alter_table(THD *thd, TABLE *table,
- HA_CREATE_INFO *create_info,
+ Table_specification_st *create_info,
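As an illustration of the change described above (widening the parameter type instead of adding an argument), here is a minimal sketch. CreateInfo and TableSpecification are simplified stand-ins for HA_CREATE_INFO and Table_specification_st, and both describe functions are hypothetical; none of this is the server code.

```cpp
#include <cassert>
#include <string>

// Simplified stand-ins; the real structures carry many more members.
struct CreateInfo                        // plays the role of HA_CREATE_INFO
{
  std::string default_table_charset;
};

struct TableSpecification : CreateInfo   // plays the role of Table_specification_st
{
  std::string default_charset_collation;
};

// Before: the extra value travels as a separate argument.
std::string describe_with_extra_arg(const CreateInfo *info,
                                    const std::string &charset_collation)
{
  assert(info != nullptr);
  return info->default_table_charset + "/" + charset_collation;
}

// After: the wider parameter type already carries the extra member,
// so callers that own a TableSpecification pass nothing else.
std::string describe(const TableSpecification *spec)
{
  assert(spec != nullptr);
  return spec->default_table_charset + "/" + spec->default_charset_collation;
}
```

Both functions return the same string for the same input; the second one simply reads the collation from the object it already receives.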
&part_alter_info, &part_alter_ctx))) { my_error(ER_TABLES_DIFFERENT_METADATA, MYF(0)); diff --git a/sql/sql_i_s.h b/sql/sql_i_s.h index bed2e886718..5ff06d32231 100644 --- a/sql/sql_i_s.h +++ b/sql/sql_i_s.h @@ -162,6 +162,11 @@ class Yesno: public Varchar { public: Yesno(): Varchar(3) { } + static LEX_CSTRING value(bool val) + { + return val ? Lex_cstring(STRING_WITH_LEN("Yes")) : + Lex_cstring(); + }
eh... please, rename the class from Yesno to something like Yesempty or Yes_or_empty, something that says that the second should not be Lex_cstring(STRING_WITH_LEN("No"))
Renamed and pushed as a separate commit:

commit 821808c45dd3c5d4bc98cd04810732f647872747 (origin/bb-10.5-bar)
Author: Alexander Barkov <bar@mariadb.com>
Date: Thu Apr 28 11:23:12 2022 +0400

    A clean-up for "MDEV-19772 Add helper classes for ST_FIELD_INFO"

    As agreed with Serg, renaming class Yesno to Yes_or_empty, to reflect better its behavior.
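For readers who have not seen the helper, the behavior that motivated the rename can be sketched as follows. This is a simplified stand-in, assuming std::string instead of LEX_CSTRING and omitting the Varchar base class; the name YesOrEmpty is only an illustration.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for the renamed helper class.
struct YesOrEmpty
{
  // Returns "Yes" for true and an empty string (not "No") for false.
  static std::string value(bool val)
  {
    return val ? std::string("Yes") : std::string();
  }
};

// Usage check: the "false" case is empty, which is why "Yesno" was misleading.
inline void yes_or_empty_examples()
{
  assert(YesOrEmpty::value(true) == "Yes");
  assert(YesOrEmpty::value(false).empty());
}
```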
};
diff --git a/sql/table.cc b/sql/table.cc index a683a78ff49..c28cb2bd928 100644 --- a/sql/table.cc +++ b/sql/table.cc @@ -3491,6 +3493,16 @@ int TABLE_SHARE::init_from_sql_statement_string(THD *thd, bool write, else thd->set_n_backup_active_arena(arena, &backup);
+ /* + THD::reset_db() does not set THD::db_charset, + so it keeps pointing to the character set and collation + of the current database, rather than the database of the + new initialized table.
Hmm, is that correct? Could you check the other invocations of thd->reset_db()? Perhaps they all need to switch charset? In that case it should be done inside THD::reset_db().
Or maybe they have to use mysql_change_db_impl() instead?
Note, this part was moved to MDEV-27896. It's not a part of the UCA14 patches any more. Anyway, I checked the invocations of thd->reset_db() and did not find a general rule quickly. At a glance, they mostly don't seem to need to switch the charset. But it needs to be investigated further. Should I create an MDEV for this?
+ Let's call get_default_db_collation() before reset_db(). + This forces the db.opt file to be loaded. + */ + db_cs= get_default_db_collation(thd, db.str); + thd->reset_db(&db); lex_start(thd);
@@ -3498,6 +3510,11 @@ int TABLE_SHARE::init_from_sql_statement_string(THD *thd, bool write, sql_unusable_for_discovery(thd, hton, sql_copy)))) goto ret;
+ if (!(thd->lex->create_info.default_table_charset= + thd->lex->create_info.default_charset_collation. + resolved_to_character_set(db_cs, db_cs))) + DBUG_RETURN(true);
How could this (and similar if()'s in other files) fail?
It can fail in this scenario:

CREATE TABLE t1 (a CHAR(10) COLLATE uca1400_cs_ci) CHARACTER SET latin1;

UCA collations are not applicable to latin1 yet.

Btw, this part now looks different. See HA_CREATE_INFO::resolve_to_charset_collation_context() in sql_table.cc:

if (!(default_table_charset= default_cscl.resolved_to_context(ctx)))
  return true;
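The failure mode can be sketched outside the server as a lookup that has no entry for a given character set. The Registry type, resolve_in_context() and the single example entry used below are assumptions made for illustration; the real resolution goes through the charset loader.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>

// Hypothetical registry keyed by (character set, contextual collation name),
// mapping to a full collation name.
using Registry = std::map<std::pair<std::string, std::string>, std::string>;

// Returns a pointer to the full collation name, or nullptr when the
// contextual name is not applicable to the given character set.
const std::string *resolve_in_context(const Registry &registry,
                                      const std::string &charset,
                                      const std::string &context_name)
{
  assert(!charset.empty());
  Registry::const_iterator it=
    registry.find(std::make_pair(charset, context_name));
  return it == registry.end() ? nullptr : &it->second;
}
```

With a registry that only knows the pair ("utf8mb4", "uca1400_as_ci"), a lookup for latin1 returns nullptr, which corresponds to the `return true` error path in the snippet quoted above.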
+ thd->lex->create_info.db_type= hton; #ifdef WITH_PARTITION_STORAGE_ENGINE thd->work_part_info= 0; // For partitioning diff --git a/sql/mysys_charset.h b/sql/mysys_charset.h new file mode 100644 index 00000000000..86eaeedd432 --- /dev/null +++ b/sql/mysys_charset.h @@ -0,0 +1,44 @@ +#ifndef MYSYS_CHARSET +#define MYSYS_CHARSET + +/* Copyright (c) 2021, MariaDB Corporation. + + This program is free software; you can redistribute it and/or modify + it under the terms of the GNU General Public License as published by + the Free Software Foundation; version 2 of the License. + + This program is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + GNU General Public License for more details. + + You should have received a copy of the GNU General Public License + along with this program; if not, write to the Free Software + Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1335 USA */ + + +#include "my_sys.h" + + +class Charset_loader_mysys: public MY_CHARSET_LOADER +{ +public: + Charset_loader_mysys() + { + my_charset_loader_init_mysys(this); + } + void raise_unknown_collation_error(const char *name, + CHARSET_INFO *name_cs) const; + CHARSET_INFO *get_charset(const char *cs_name, uint cs_flags, myf my_flags); + CHARSET_INFO *get_exact_collation(const char *name, myf utf8_flag); + CHARSET_INFO *get_contextually_typed_collation(CHARSET_INFO *cs, + const char *name); + CHARSET_INFO *get_contextually_typed_collation(const char *name); + CHARSET_INFO *get_contextually_typed_collation_or_error(CHARSET_INFO *cs, + const char *name); + CHARSET_INFO *find_default_collation(CHARSET_INFO *cs); + CHARSET_INFO *find_bin_collation_or_error(CHARSET_INFO *cs); +};
you can have C++ code in mysys too, you know, no need to put it in sql/mysys*
This is a good idea. There was one problem: Charset_loader_mysys pushed errors and warnings into the server diagnostics area, so it could not sit in include/my_sys.h as is. I split it into two parts:

- Charset_loader_mysys is defined in include/my_sys.h and does not send any errors/warnings. It is self-sufficient and is fully defined in include/my_sys.h. It does not have any method implementations in C++ files.

- There is a new class Charset_loader_server. It is defined in lex_charset.h as follows:
  class Charset_loader_server: public Charset_loader_mysys
  It sends errors and warnings, and has parts implemented in lex_charset.cc.
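The split described above can be pictured with a small, self-contained model. LoaderBase and LoaderWithDiagnostics are hypothetical names that only mirror the roles of Charset_loader_mysys and Charset_loader_server; the lookup logic and the error text are invented for the sketch.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Role of Charset_loader_mysys: pure lookup, no error reporting.
class LoaderBase
{
public:
  // Reports failure only through the return value.
  bool find(const std::string &name) const
  {
    assert(!name.empty());
    return name == "utf8mb4_bin";      // illustrative single known name
  }
};

// Role of Charset_loader_server: same lookup plus diagnostics.
class LoaderWithDiagnostics : public LoaderBase
{
public:
  // Records an error message when the lookup fails.
  bool find_or_error(const std::string &name)
  {
    if (find(name))
      return true;
    errors.push_back("Unknown collation: '" + name + "'");
    return false;
  }

  std::vector<std::string> errors;     // stands in for the diagnostics area
};
```

The base class stays usable from code that has no diagnostics area, while callers that need user-visible errors go through the derived class.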
+ +#endif // MYSYS_CHARSET + diff --git a/strings/ctype-simple.c b/strings/ctype-simple.c index b579f0af203..d09dfba86ed 100644 --- a/strings/ctype-simple.c +++ b/strings/ctype-simple.c @@ -1940,13 +1941,26 @@ my_bool my_propagate_complex(CHARSET_INFO *cs __attribute__((unused)), }
+void my_ci_set_strength(struct charset_info_st *cs, uint strength) +{ + DBUG_ASSERT(strength > 0 && strength <= MY_STRXFRM_NLEVELS);
don't use && in asserts, please create two separate asserts instead:
DBUG_ASSERT(strength > 0); DBUG_ASSERT(strength <= MY_STRXFRM_NLEVELS);
Done.
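A likely reason for the request is that a failing assertion then names the exact condition that was violated rather than the whole conjunction. A minimal sketch with the standard assert macro follows; kLevels is a hypothetical stand-in for MY_STRXFRM_NLEVELS and its value is chosen only for the example.

```cpp
#include <cassert>

constexpr unsigned kLevels= 6;   // hypothetical stand-in for MY_STRXFRM_NLEVELS

// Two separate assertions: if one fires, the message shows which bound
// was broken (too small vs. too large).
unsigned checked_strength(unsigned strength)
{
  assert(strength > 0);
  assert(strength <= kLevels);
  return strength;
}
```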
+ cs->levels_for_order= ((1 << strength) - 1);
why do you still use the old concept of "strength"? Why not use a bitmap consistently everywhere?
The collation definition file Index.xml is based on the LDML syntax. It uses tags like this:

<settings strength="2"/>

This function is needed to handle these LDML tags. Btw, to define user-defined _AI_CS collations we'll need to add an LDML extension eventually.
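The relation between an LDML strength and the level bitmap can be worked through in a few lines. The bit positions and the (1 << strength) - 1 arithmetic come from the quoted patch; the function names and the upper bound of four levels are assumptions for this sketch.

```cpp
#include <cassert>

// Bit positions as in the quoted patch (MY_CS_LEVEL_BIT_*).
constexpr unsigned kPrimary= 0;
constexpr unsigned kSecondary= 1;
constexpr unsigned kTertiary= 2;
constexpr unsigned kQuaternary= 3;

// Same arithmetic as my_ci_set_strength(): strength N enables levels 1..N.
inline unsigned strength_to_level_flags(unsigned strength)
{
  assert(strength > 0);
  assert(strength <= kQuaternary + 1);
  return (1u << strength) - 1;
}

// True when the bitmap is a contiguous run of low bits, i.e. when it can be
// written as a plain strength value.
inline bool is_plain_strength(unsigned flags)
{
  return flags != 0 && (flags & (flags + 1)) == 0;
}

// Accent-insensitive, case-sensitive: primary and tertiary, no secondary.
constexpr unsigned kAiCs= (1u << kPrimary) | (1u << kTertiary);
```

Here strength 2 maps to the bitmap 0x3 (primary and secondary), while kAiCs is 0x5 and fails is_plain_strength(), which is why a plain strength attribute cannot describe an _AI_CS collation.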
+} + + +void my_ci_set_level_flags(struct charset_info_st *cs, uint flags) +{ + DBUG_ASSERT(flags < (1<<MY_STRXFRM_NLEVELS)); + cs->levels_for_order= flags; +} + /* Normalize strxfrm flags
SYNOPSIS: my_strxfrm_flag_normalize() + cs - the CHARSET_INFO pointer flags - non-normalized flags - nlevels - number of levels
NOTES: If levels are omitted, then 1-maximum is assumed. diff --git a/sql/handler.h b/sql/handler.h index 8ad521e189a..1e82f37b1e7 100644 --- a/sql/handler.h +++ b/sql/handler.h @@ -2409,7 +2386,32 @@ struct Table_specification_st: public HA_CREATE_INFO, { HA_CREATE_INFO::options= 0; DDL_options_st::init(); + default_charset_collation.init(); + } + + bool + add_alter_list_item_convert_to_charset(const Lex_charset_collation_st &cl) + { + /* + cs cannot be NULL, as sql_yacc.yy translates + CONVERT TO CHARACTER SET DEFAULT + to + CONVERT TO CHARACTER SET <character-set-of-the-current-database> + TODO: Shouldn't we postpone resolution of DEFAULT until the + character set of the table owner database is loaded from its db.opt? + */ + DBUG_ASSERT(cl.charset_collation()); + DBUG_ASSERT(!cl.is_contextually_typed_collation()); + alter_table_convert_to_charset= cl.charset_collation(); + default_charset_collation.Lex_charset_collation_st::operator=(cl);
looks quite ugly. can you do, like, default_charset_collation.set(cl) ?
This code migrated to MDEV-27896 and is no longer part of the UCA14 patches. It looks different now: there is no ::operator=(cl) any more; there are more constructors instead. Anyway, I'd like to comment: I agree that it does not look like something we often use in the MariaDB sources. But I like it better than set(), because set() would force the reader to jump around the sources to find out what set() actually does. By contrast, the line with the operator is self-descriptive. It's full of information: "default_charset_collation derives from Lex_charset_collation_st and here we initialize the Lex_charset_collation_st part of it". So I think direct use of operator=() makes reading easier. Adding various set() wrappers around operator=() makes reading harder.
+ used_fields|= (HA_CREATE_USED_CHARSET | HA_CREATE_USED_DEFAULT_CHARSET); + return false; } + bool add_table_option_default_charset(CHARSET_INFO *cs); + bool add_table_option_default_collation(const Lex_charset_collation_st &cl); + bool resolve_db_charset_and_collation(THD *thd, + const LEX_CSTRING &db, + bool is_alter); };
diff --git a/strings/ctype-uca1400data.h b/strings/ctype-uca1400data.h new file mode 100644 index 00000000000..da95dcfde54 --- /dev/null +++ b/strings/ctype-uca1400data.h @@ -0,0 +1,44151 @@ +/* + Generated from allkeys.txt version '14.0.0' +*/
if it's generated, do you need to check it in? perhaps it should be generated during the build? you've checked in allkeys1400.txt anyway.
Right, we can consider it. Btw, I've checked in all versions:
$ ls mysql-test/std_data/unicode/
allkeys1400.txt allkeys400.txt allkeys520.txt
So we can generate sources for all three UCA versions from these files. But I suggest we do it separately. Should I create an MDEV for this?
+static const uint16 uca1400_p000[]= { /* 0000 (4 weights per char) */ +0x0000,0x0000,0x0000,0x0000, 0x0000,0x0000,0x0000,0x0000, /* 0000 */ +0x0000,0x0000,0x0000,0x0000, 0x0000,0x0000,0x0000,0x0000, /* 0002 */ diff --git a/sql/sql_lex.cc b/sql/sql_lex.cc index 6ca10267187..d115401a855 100644 --- a/sql/sql_lex.cc +++ b/sql/sql_lex.cc @@ -542,6 +542,30 @@ bool LEX::add_alter_list(LEX_CSTRING name, LEX_CSTRING new_name, bool exists) }
+bool LEX::add_alter_list_item_convert_to_charset( + THD *thd, + CHARSET_INFO *cs, + const Lex_charset_collation_st &cl) +{ + if (!cs) + { + Lex_charset_collation_st tmp; + tmp.set_charset_collate_default(thd->variables.collation_database);
Hmm, what if one is doing ALTER TABLE db.test CHARSET DEFAULT and current db is not `db` but `test` ?
Right, thanks for noticing this. Both DEFAULT CHARACTER SET and CONVERT TO have misbehaved in some cases for a long time. When I moved MDEV-27896 out of the UCA patches, I reported the CONVERT TO problems in:
MDEV-28644 Unexpected error on ALTER TABLE t1 CONVERT TO CHARACTER SET utf8mb3, DEFAULT CHARACTER SET utf8mb4
The final patch for MDEV-27896 fixed this problem as well, as it was very easy after fixing DEFAULT CHARACTER SET cs [COLLATE cl]. The idea is that the "DEFAULT CHARACTER SET" and "CONVERT TO" clauses are now fully independent, and both use the new class Lex_table_charset_collation_attrs_st as storage:
struct Table_specification_st: public HA_CREATE_INFO,
                               public DDL_options_st
{
  Lex_table_charset_collation_attrs_st default_charset_collation;
  Lex_table_charset_collation_attrs_st convert_charset_collation;
+ if (!(cs= tmp.charset_collation())) + return true; // Should not actually happen
assert?
This code migrated to MDEV-27896 and was changed. There is no line like this any more. Instead, there are classes Lex_exact_charset, Lex_exact_collation, Lex_context_collation. They catch NULL in constructors, e.g.:
class Lex_exact_charset
{
  CHARSET_INFO *m_ci;
public:
  explicit Lex_exact_charset(CHARSET_INFO *ci)
   :m_ci(ci)
+ } + + Lex_explicit_charset_opt_collate tmp(cs, false); + if (tmp.merge_opt_collate_or_error(cl) || + create_info.add_alter_list_item_convert_to_charset( + Lex_charset_collation(tmp))) + return true; + + alter_info.flags|= ALTER_CONVERT_TO; + return false; +} + + void LEX::init_last_field(Column_definition *field, const LEX_CSTRING *field_name) { @@ -11871,29 +11869,41 @@ CHARSET_INFO *Lex_collation_st::find_default_collation(CHARSET_INFO *cs) "def" is the upper level CHARACTER SET clause (e.g. of a table) */ CHARSET_INFO * -Lex_collation_st::resolved_to_character_set(CHARSET_INFO *def) const +Lex_charset_collation_st::resolved_to_character_set(CHARSET_INFO *def) const { DBUG_ASSERT(def); - if (m_type != TYPE_CONTEXTUALLY_TYPED) - { - if (!m_collation) - return def; // Empty - not typed at all - return m_collation; // Explicitly typed + + switch (m_type) { + case TYPE_EMPTY: + return def; + case TYPE_CHARACTER_SET: + DBUG_ASSERT(m_ci); + return m_ci; + case TYPE_COLLATE_EXACT: + DBUG_ASSERT(m_ci); + return m_ci; + case TYPE_COLLATE_CONTEXTUALLY_TYPED: + break; }
// Contextually typed - DBUG_ASSERT(m_collation); + DBUG_ASSERT(m_ci);
- if (m_collation == &my_charset_bin) // CHAR(10) BINARY - return find_bin_collation(def); + Charset_loader_mysys loader; + if (is_contextually_typed_binary_style()) // CHAR(10) BINARY + return loader.find_bin_collation_or_error(def);
- if (m_collation == &my_charset_latin1) // CHAR(10) COLLATE DEFAULT - return find_default_collation(def); + if (is_contextually_typed_collate_default()) // CHAR(10) COLLATE DEFAULT + return loader.find_default_collation(def); + + const LEX_CSTRING context_name= collation_name_context_suffix();
I'd rather put this in assert, not in if(). Like
I fixed this in MDEV-27896. The patch for MDEV-27896 has this assert in a couple of places:
DBUG_ASSERT(!strncmp(cl.charset_info()->coll_name.str, STRING_WITH_LEN("utf8mb4_uca1400_")))
<cut>
diff --git a/strings/ctype-uca.c b/strings/ctype-uca.c index b89916f3b20..3e6b4e4ce43 100644 --- a/strings/ctype-uca.c +++ b/strings/ctype-uca.c @@ -30542,7 +30613,7 @@ static const char vietnamese[]= Myanmar, according to CLDR Revision 8900. http://unicode.org/cldr/trac/browser/trunk/common/collation/my.xml */ -static const char myanmar[]= "[shift-after-method expand][version 5.2.0]" +static const char myanmar[]= "[shift-after-method expand]"
What's going on with myanmar? You removed a version here and added &my_uca_v520 below in its charset_info_st. What does this change mean?
/* Tones */ "&\\u108C" "<\\u1037" @@ -37627,7 +37825,7 @@ struct charset_info_st my_charset_utf32_myanmar_uca_ci= NULL, /* to_lower */ NULL, /* to_upper */ NULL, /* sort_order */ - NULL, /* uca */ + &my_uca_v520, /* uca */
What does this change?
There are two ways to define the version:
1. Using the [version...] option in the tailoring.
2. Using the hardcoded initialization in the charset_info_st definition.
Although built-in collations should normally use #2, approach #1 also worked without problems for built-in collations. But that only held as long as the tailoring was used with a single UCA version! So I changed the old built-in myanmar collation to use #2 instead of #1. It changes nothing for the old myanmar collations. But the tailoring defined in "static const char myanmar[]" can now be reused in combination with multiple UCA versions.
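The difference between the two approaches can be sketched in C as follows. Everything here is hypothetical and only for illustration: struct uca_info, struct collation_def, the collation names and effective_version() are not MariaDB types or functions, and the rule that an explicit pointer takes precedence over a [version ...] option is an assumption of the sketch.

```c
#include <stddef.h>
#include <string.h>

struct uca_info { const char *version; };   /* hypothetical stand-in */

struct collation_def                        /* hypothetical stand-in */
{
  const char *name;
  const char *tailoring;
  const struct uca_info *uca;  /* NULL: take the version from the tailoring */
};

static const struct uca_info uca_v520=  {"5.2.0"};
static const struct uca_info uca_v1400= {"14.0.0"};

/* Approach #2: a version-free tailoring shared by several definitions. */
static const char myanmar_tailoring[]= "[shift-after-method expand]";
static const struct collation_def myanmar_520=
  {"myanmar_520", myanmar_tailoring, &uca_v520};
static const struct collation_def myanmar_1400=
  {"myanmar_1400", myanmar_tailoring, &uca_v1400};

/* Approach #1: the version is baked into the tailoring text itself. */
static const struct collation_def myanmar_old_style=
  {"myanmar_old_style", "[shift-after-method expand][version 5.2.0]", NULL};

/* Copy len bytes of src into buf as a C string; return 0 if it does not fit. */
static int copy_version(const char *src, size_t len, char *buf, size_t buflen)
{
  if (len >= buflen)
    return 0;
  memcpy(buf, src, len);
  buf[len]= '\0';
  return 1;
}

/* Store the version a definition resolves to in buf; return 1 on success. */
static int effective_version(const struct collation_def *def,
                             char *buf, size_t buflen)
{
  static const char option[]= "[version ";
  const char *start, *end;
  if (def->uca)
    return copy_version(def->uca->version, strlen(def->uca->version),
                        buf, buflen);
  if (!(start= strstr(def->tailoring, option)))
    return 0;
  start+= sizeof(option) - 1;
  if (!(end= strchr(start, ']')))
    return 0;
  return copy_version(start, (size_t) (end - start), buf, buflen);
}
```

The point of the sketch is that myanmar_520 and myanmar_1400 share one tailoring string, which is impossible once a [version ...] option is embedded in that string.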
NULL, /* tab_to_uni */ NULL, /* tab_from_uni */ &my_unicase_unicode520,/* caseinfo */
Regards, Sergei VP of MariaDB Server Engineering and security@mariadb.org
Hi, Alexander,
Few comments/questions below. Meanwhile I'm reviewing bb-10.9-bar-uca14
On May 26, Alexander Barkov wrote:
By the way, perhaps some of these statements should display short collation names:
SHOW CREATE TABLE t1;
SHOW CREATE DATABASE db1;
SELECT COLLATION_NAME FROM INFORMATION_SCHEMA.COLUMNS;
SELECT TABLE_COLLATION FROM INFORMATION_SCHEMA.TABLES;
SELECT DEFAULT_COLLATION_NAME FROM INFORMATION_SCHEMA.SCHEMATA;
Can we discuss this?
Short names, I guess. First two - for readability, the last three - so that one could join with INFORMATION_SCHEMA.COLLATIONS table.
diff --git a/include/m_ctype.h b/include/m_ctype.h index 4c6628b72b3..706764ead2a 100644 --- a/include/m_ctype.h +++ b/include/m_ctype.h @@ -240,6 +242,46 @@ typedef enum enum_repertoire_t } my_repertoire_t;
+/* ID compatibility */ +typedef enum enum_collation_id_type +{ + MY_COLLATION_ID_TYPE_PRECISE= 0, + MY_COLLATION_ID_TYPE_COMPAT_100800= 1 +} my_collation_id_type_t; + +/* Collation name display modes */ +typedef enum enum_collation_name_mode +{ + MY_COLLATION_NAME_MODE_FULL= 0, + MY_COLLATION_NAME_MODE_CONTEXT= 1 +} my_collation_name_mode_t; + +/* Level flags */ +#define MY_CS_LEVEL_BIT_PRIMARY 0x00 +#define MY_CS_LEVEL_BIT_SECONDARY 0x01 +#define MY_CS_LEVEL_BIT_TERTIARY 0x02 +#define MY_CS_LEVEL_BIT_QUATERNARY 0x03 + +#define MY_CS_COLL_LEVELS_S1 (1<<MY_CS_LEVEL_BIT_PRIMARY) + +#define MY_CS_COLL_LEVELS_AI_CS (1<<MY_CS_LEVEL_BIT_PRIMARY)| \ + (1<<MY_CS_LEVEL_BIT_TERTIARY) + +#define MY_CS_COLL_LEVELS_S2 (1<<MY_CS_LEVEL_BIT_PRIMARY)| \ + (1<<MY_CS_LEVEL_BIT_SECONDARY) + +#define MY_CS_COLL_LEVELS_S3 (1<<MY_CS_LEVEL_BIT_PRIMARY)| \ + (1<<MY_CS_LEVEL_BIT_SECONDARY) | \ + (1<<MY_CS_LEVEL_BIT_TERTIARY)
AI_CS and S3 don't seem to be used yet
Right, there are no old _AI_CS and _AS_CS (aka S3) collations.
New _AI_CS and _AS_CS collations definitions are initialized by this function:
my_uca1400_collation_definition_init(MY_CHARSET_LOADER *loader, struct charset_info_st *dst, uint id)
Level flags are calculated by this function from "id".
So there are no hard-coded definitions with MY_CS_COLL_LEVELS_AI_CS and MY_CS_COLL_LEVELS_S3 either.
Should I remove these definitions?
as you like, but if you want to keep them - add a comment. otherwise anyone can remove them when they see that those defines are unused
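For reference, the level masks under discussion evaluate as shown in the sketch below. It uses local names rather than the MY_CS_* macros, and wraps every mask in an outer pair of parentheses; that wrapping is an addition of this sketch, not something the patch does. The comment on the two unused masks is just one possible wording.

```c
/* Bit positions, mirroring the MY_CS_LEVEL_BIT_* values from the patch. */
enum
{
  LEVEL_BIT_PRIMARY= 0,
  LEVEL_BIT_SECONDARY= 1,
  LEVEL_BIT_TERTIARY= 2,
  LEVEL_BIT_QUATERNARY= 3
};

#define LEVELS_S1    (1u << LEVEL_BIT_PRIMARY)
#define LEVELS_S2    ((1u << LEVEL_BIT_PRIMARY) | \
                      (1u << LEVEL_BIT_SECONDARY))
/*
  AI_CS and S3 are not referenced by any hard-coded collation definition:
  the corresponding flags are calculated at run time. They are kept as
  named values for documentation purposes.
*/
#define LEVELS_AI_CS ((1u << LEVEL_BIT_PRIMARY) | \
                      (1u << LEVEL_BIT_TERTIARY))
#define LEVELS_S3    ((1u << LEVEL_BIT_PRIMARY) | \
                      (1u << LEVEL_BIT_SECONDARY) | \
                      (1u << LEVEL_BIT_TERTIARY))
#define LEVELS_S4    ((1u << LEVEL_BIT_PRIMARY) | \
                      (1u << LEVEL_BIT_SECONDARY) | \
                      (1u << LEVEL_BIT_TERTIARY) | \
                      (1u << LEVEL_BIT_QUATERNARY))

/* Return non-zero if the given level bit is enabled in the mask. */
static int level_is_used(unsigned mask, unsigned bit)
{
  return (int) ((mask >> bit) & 1u);
}
```

The outer parentheses matter as soon as a mask appears inside a larger expression: without them, something like flags & LEVELS_S2 would parse as (flags & first_bit) | second_bit.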
+ +#define MY_CS_COLL_LEVELS_S4 (1<<MY_CS_LEVEL_BIT_PRIMARY)| \ + (1<<MY_CS_LEVEL_BIT_SECONDARY) | \ + (1<<MY_CS_LEVEL_BIT_TERTIARY) | \ + (1<<MY_CS_LEVEL_BIT_QUATERNARY) + /* Flags for strxfrm */ #define MY_STRXFRM_LEVEL1 0x00000001 /* for primary weights */ #define MY_STRXFRM_LEVEL2 0x00000002 /* for secondary weights */ diff --git a/strings/ctype-simple.c b/strings/ctype-simple.c index b579f0af203..d09dfba86ed 100644 --- a/strings/ctype-simple.c +++ b/strings/ctype-simple.c @@ -1940,13 +1941,26 @@ my_bool my_propagate_complex(CHARSET_INFO *cs __attribute__((unused)), ...
why do you still use the old concept of "strength"? Why not use a bitmap consistently everywhere?
The collation definition file Index.xml is based on the LDML syntax.
It uses tags like this:
<settings strength="2"/>
This function is needed to handle these LDML tags.
Btw, to define user-defined _AI_CS collations we'll need to add an LDML extension eventually.
Hmm. Are you saying that LDML cannot describe a UCA 14.0.0 collation? Or L1+L3 without L2 isn't standard?
+} + diff --git a/strings/ctype-uca1400data.h b/strings/ctype-uca1400data.h --- /dev/null +++ b/strings/ctype-uca1400data.h @@ -0,0 +1,44151 @@ +/* + Generated from allkeys.txt version '14.0.0' +*/
if it's generated, do you need to check it in? perhaps it should be generated during the build? you've checked in allkeys1400.txt anyway.
Right, we can consider it.
Btw, I've checked in all versions:
$ ls mysql-test/std_data/unicode/
allkeys1400.txt allkeys400.txt allkeys520.txt
So we can generate sources for all three UCA versions from these files.
But I suggest we do it separately. Should I create an MDEV for this?
Why would you want to add to the history a file that is generated? It'll be there forever. You already have uca-dump.c in the tree, running it during the build is a few lines in CMakeLists.txt, basically copy-paste, because we already do it for gen_lex_hash and comp_err. Regards, Sergei VP of MariaDB Server Engineering and security@mariadb.org
Hello Sergei, thanks for the review. On 6/8/22 7:26 PM, Sergei Golubchik wrote:
Hi, Alexander,
Few comments/questions below. Meanwhile I'm reviewing bb-10.9-bar-uca14
On May 26, Alexander Barkov wrote:
By the way, perhaps some of these statements should display short collation names:
SHOW CREATE TABLE t1;
SHOW CREATE DATABASE db1;
SELECT COLLATION_NAME FROM INFORMATION_SCHEMA.COLUMNS;
SELECT TABLE_COLLATION FROM INFORMATION_SCHEMA.TABLES;
SELECT DEFAULT_COLLATION_NAME FROM INFORMATION_SCHEMA.SCHEMATA;
Can we discuss this?
Short names, I guess. First two - for readability, the last three - so that one could join with INFORMATION_SCHEMA.COLLATIONS table.
In the branch preview-10.10-MDEV-27009-uca14, all these columns:
SELECT COLLATION_NAME FROM INFORMATION_SCHEMA.COLUMNS;
SELECT COLLATION_NAME FROM INFORMATION_SCHEMA.PARAMETERS;
SELECT TABLE_COLLATION FROM INFORMATION_SCHEMA.TABLES;
SELECT DEFAULT_COLLATION_NAME FROM INFORMATION_SCHEMA.SCHEMATA;
SELECT COLLATION_NAME FROM INFORMATION_SCHEMA.ROUTINES;
SELECT COLLATION_CONNECTION FROM INFORMATION_SCHEMA.EVENTS;
SELECT DATABASE_COLLATION FROM INFORMATION_SCHEMA.EVENTS;
SELECT COLLATION_CONNECTION FROM INFORMATION_SCHEMA.ROUTINES;
SELECT DATABASE_COLLATION FROM INFORMATION_SCHEMA.ROUTINES;
SELECT COLLATION_CONNECTION FROM INFORMATION_SCHEMA.TRIGGERS;
SELECT DATABASE_COLLATION FROM INFORMATION_SCHEMA.TRIGGERS;
SELECT COLLATION_CONNECTION FROM INFORMATION_SCHEMA.VIEWS;
(and corresponding columns in SHOW commands) display long names. So we need to:
- Fix these columns to display short names
- Add corresponding CHARACTER_SET_XXX columns
- Fix system variables @@collation_xxx to understand short names.
- Fix mysqldump to set both @@character_set_connection and @@collation_connection. Now it sets only @@collation_connection.
It's not absolutely necessary, but would be nice to do it in the same release. Is it too late for 10.10? Or can I try to prepare an additional patch on top of the preview branch? I think 1-2 days should be enough.
--- a/strings/ctype-simple.c +++ b/strings/ctype-simple.c @@ -1940,13 +1941,26 @@ my_bool my_propagate_complex(CHARSET_INFO *cs __attribute__((unused)), ...
why do you still use the old concept of "strength"? Why not use a bitmap consistently everywhere?
The collation definition file Index.xml is based on the LDML syntax.
It uses tags like this:
<settings strength="2"/>
This function is needed to handle these LDML tags.
Btw, to define user-defined _AI_CS collations we'll need to add an LDML extension eventually.
Hmm. Are you saying that LDML cannot describe a UCA 14.0.0 collation?
Or L1+L3 without L2 isn't standard?
L1+L3 is not standard. The closest thing is
<settings strength="1" caseLevel="On">
It's not precisely L1+L3, because L3 carries more than just case characteristics. It also distinguishes various variants: wide, font, circle, narrow, sub, super, square, and some others. This is an excerpt from UTR#35 about caseLevel:
If set to on, a level consisting only of case characteristics will be inserted in front of tertiary level. To ignore accents but take cases into account, set strength to primary and case level to on.
I suggest we implement both:
- An extension, something like strength="1,3" or strength="AI_CS". This is very easy to do.
- Then add caseLevel="On" at some point in the future. This will need more effort.
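What an L1+L3 bitmap means for comparison can be shown with a small sketch in C. The weights below are made up for illustration and are not taken from allkeys.txt; struct weights and compare_by_levels() are hypothetical names. A real UCA comparison works on whole strings level by level; for single collation elements, as here, that reduces to the loop below.

```c
#define NLEVELS 3

/* One collation element: made-up primary, secondary, tertiary weights. */
struct weights
{
  unsigned w[NLEVELS];
};

/* Compare two elements, looking only at the levels enabled in the bitmap. */
static int compare_by_levels(const struct weights *a,
                             const struct weights *b,
                             unsigned levels)
{
  unsigned level;
  for (level= 0; level < NLEVELS; level++)
  {
    if (!(levels & (1u << level)))
      continue;                       /* this level is ignored */
    if (a->w[level] != b->w[level])
      return a->w[level] < b->w[level] ? -1 : 1;
  }
  return 0;
}

static const struct weights LOWER_A=       {{0x10, 0x20, 0x02}}; /* a */
static const struct weights UPPER_A=       {{0x10, 0x20, 0x08}}; /* A */
static const struct weights LOWER_A_ACUTE= {{0x10, 0x24, 0x02}}; /* a + acute */
```

With the bitmap 0x05 (L1+L3) the accented and unaccented letters compare equal while the two cases differ; with 0x03 (strength 2) it is the other way round.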
+} + diff --git a/strings/ctype-uca1400data.h b/strings/ctype-uca1400data.h --- /dev/null +++ b/strings/ctype-uca1400data.h @@ -0,0 +1,44151 @@ +/* + Generated from allkeys.txt version '14.0.0' +*/
if it's generated, do you need to check it in? perhaps it should be generated during the build? you've checked in allkeys1400.txt anyway.
Right, we can consider it.
Btw, I've checked in all versions:
$ ls mysql-test/std_data/unicode/
allkeys1400.txt allkeys400.txt allkeys520.txt
So we can generate sources for all three UCA versions from these files.
But I suggest we do it separately. Should I create an MDEV for this?
Why would you want to add to the history a file that is generated? It'll be there forever.
You already have uca-dump.c in the tree, running it during the build is a few lines in CMakeLists.txt, basically copy-paste, because we already do it for gen_lex_hash and comp_err.
I've implemented generation of ctype-uca1400data.h from allkeys1400.txt. In which release should I also add generation of the data for versions 400 and 520?
Regards, Sergei VP of MariaDB Server Engineering and security@mariadb.org