[Maria-developers] MDEV-8214 Asian MB2 charsets: compare broken bytes as "greater than any non-broken character"
Hi Sergei, For simplicity I decided to split: MDEV-8036 Fix all collations to compare broken bytes as "greater than any non-broken character" into sub-tasks. Please review a patch for the first sub-task MDEV-8214. The overall plan is: - MDEV-8214, for the Asian charsets with mbmaxlen=2 - MDEV-8215, for the Asian charsets with mbmaxlen=3, based on MDEV-8214 - MDEV-8036, the rest (Unicode character sets) Thanks.
Hi, Alexander! On May 25, Alexander Barkov wrote:
Hi Sergei,
For simplicity I decided to split:
MDEV-8036 Fix all collations to compare broken bytes as "greater than any non-broken character"
into sub-tasks.
Please review a patch for the first sub-task MDEV-8214.
I don't know why you wanted me to review it :) Looks pretty ok. Of course there were few comments and questions, but nothing serious.
diff --git a/strings/ctype-big5.c b/strings/ctype-big5.c index eda81c0..aa06a7a 100644 --- a/strings/ctype-big5.c +++ b/strings/ctype-big5.c @@ -6853,11 +6771,29 @@ my_mb_wc_big5(CHARSET_INFO *cs __attribute__((unused)), }
-static MY_COLLATION_HANDLER my_collation_big5_chinese_ci_handler = +#undef MY_FUNCTION_NAME +#undef WEIGHT_MB1 +#undef WEIGHT_MB2
why do you start from undef-ing macros that couldn't be defined here?
+#define MY_FUNCTION_NAME(x) my_ ## x ## _big5_chinese_ci +#define WEIGHT_MB1(x) (sort_order_big5[(uchar) (x)]) +#define WEIGHT_MB2(x,y) (big5code(x, y)) +#include "ctype-strcoll.ic" + + +#undef MY_FUNCTION_NAME +#undef WEIGHT_MB1 +#undef WEIGHT_MB2
alternatively you could undef them at the end of ctype-strcoll.ic it'll also be a protection against including this file twice (although it cannot be done now anyway)
+#define MY_FUNCTION_NAME(x) my_ ## x ## _big5_bin +#define WEIGHT_MB1(x) ((uchar) (x)) +#define WEIGHT_MB2(x,y) (big5code(x, y)) +#include "ctype-strcoll.ic" + + +static MY_COLLATION_HANDLER my_collation_handler_big5_chinese_ci= { NULL, /* init */ - my_strnncoll_big5, - my_strnncollsp_big5, + my_strnncoll_big5_chinese_ci, + my_strnncollsp_big5_chinese_ci, my_strnxfrm_big5, my_strnxfrmlen_simple, my_like_range_mb, diff --git a/strings/ctype-euc_kr.c b/strings/ctype-euc_kr.c index a2c95bf..db47b3d 100644 --- a/strings/ctype-euc_kr.c +++ b/strings/ctype-euc_kr.c @@ -9938,21 +9940,56 @@ my_mb_wc_euc_kr(CHARSET_INFO *cs __attribute__((unused)), }
-static MY_COLLATION_HANDLER my_collation_ci_handler = +#undef MY_FUNCTION_NAME +#undef WEIGHT_MB1 +#undef WEIGHT_MB2 +#define MY_FUNCTION_NAME(x) my_ ## x ## _euckr_korean_ci +#define WEIGHT_MB1(x) (sort_order_euc_kr[(uchar) (x)]) +#define WEIGHT_MB2(x,y) (euckrcode(x, y)) +#include "ctype-strcoll.ic" + + +#undef MY_FUNCTION_NAME +#undef WEIGHT_MB1 +#undef WEIGHT_MB2 +#define MY_FUNCTION_NAME(x) my_ ## x ## _euckr_bin +#define WEIGHT_MB1(x) ((uchar) (x)) +#define WEIGHT_MB2(x,y) (euckrcode(x, y)) +#include "ctype-strcoll.ic" + + +static MY_COLLATION_HANDLER my_collation_handler_euckr_korean_ci= { - NULL, /* init */ - my_strnncoll_simple, /* strnncoll */ - my_strnncollsp_simple, - my_strnxfrm_mb, /* strnxfrm */ + NULL, /* init */ + my_strnncoll_euckr_korean_ci, + my_strnncollsp_euckr_korean_ci,
hmm, so is euckr multi-byte or not? wikipedia says it is, my_strnxfrm_mb. but why my_strnncoll_simple?
+ my_strnxfrm_mb, my_strnxfrmlen_simple, - my_like_range_mb, /* like_range */ - my_wildcmp_mb, /* wildcmp */ + my_like_range_mb, + my_wildcmp_mb, my_strcasecmp_mb, my_instr_mb, my_hash_sort_simple, my_propagate_simple };
+ +static MY_COLLATION_HANDLER my_collation_handler_euckr_bin= +{ + NULL, /* init */ + my_strnncoll_euckr_bin, + my_strnncollsp_euckr_bin, + my_strnxfrm_mb, + my_strnxfrmlen_simple, + my_like_range_mb, + my_wildcmp_mb_bin, + my_strcasecmp_mb_bin, + my_instr_mb, + my_hash_sort_mb_bin, + my_propagate_simple +}; + + static MY_CHARSET_HANDLER my_charset_handler= { NULL, /* init */ diff --git a/strings/ctype-strcoll.ic b/strings/ctype-strcoll.ic new file mode 100644 index 0000000..7217f99 --- /dev/null +++ b/strings/ctype-strcoll.ic @@ -0,0 +1,208 @@ +/* + Copyright (c) 2015, MariaDB Foundation + + This program is free software; you can redistribute it and/or modify + it under the terms of the GNU General Public License as published by + the Free Software Foundation; version 2 of the License. + + This program is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + GNU General Public License for more details. + + You should have received a copy of the GNU General Public License + along with this program; if not, write to the Free Software + Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA +*/ + + +#ifndef MY_FUNCTION_NAME +#error MY_FUNCTION_NAME is not defined +#endif + + +/* + The weight for automatically padded spaces when comparing strings with + the PAD SPACE property. + Should normally be equal to the weight of a regular space.
it is not a space for sjis and cp932? what does that mean? padding space is not the same as a regular space? why?
+*/ +#ifndef WEIGHT_PAD_SPACE +#define WEIGHT_PAD_SPACE (' ') +#endif + + +/* + Weight of an illegal byte. + Must be greater than weight of any normal character. + Two bad bytes are compared binary. +*/ +#ifndef WEIGHT_ILSEQ
why #ifndef? it creates an impression that the caller (includer :) can redefine it. better define WEIGHT_ILSEQ here unconditionally and undefine it at the end
+#define WEIGHT_ILSEQ(x) (0xFF00 + (x)) +#endif + + +/** + Scan a valid character, or a bad byte, or an auto-padded space + from a string and calculate the weight of the scanned sequence. + + @param [OUT] weight - the weight is returned here + @param str - the string + @param end - the end of the string + @return - the number of bytes scanned + + The including source file must define the following macros: + IS_MB1_CHAR(x) + IS_MB2_CHAR(x,y) + WEIGHT_PAD_SPACE + WEIGHT_MB1(x) + WEIGHT_MB2(x,y) + WEIGHT_ILSEQ(x) +*/ +static inline uint +MY_FUNCTION_NAME(scan_weight)(int *weight, const uchar *str, const uchar *end) +{ + if (str >= end) + { + *weight= WEIGHT_PAD_SPACE; + return 0; + } + + if (IS_MB1_CHAR(*str)) + { + *weight= WEIGHT_MB1(*str); /* A valid single byte character*/ + return 1; + } + + if (str + 2 > end) /* The string ended unexpectedly */ + goto bad; /* Treat as a bad byte */ + + if (IS_MB2_CHAR(str[0], str[1])) + { + *weight= WEIGHT_MB2(str[0], str[1]); + return 2; /* A valid two-byte character */ + } + +bad: + *weight= WEIGHT_ILSEQ(str[0]); /* Bad byte */ + return 1; +} + + +/** + Compare two strings according to the collation, + without handling the PAD SPACE property. + + Note, cs->coll->strnncoll() is usually used to compare identifiers. + Perhaps we should eventually (in 10.2?) create a new collation + my_charset_utf8_general_ci_no_pad and have only one comparison function + in MY_COLLATION_HANDLER. + + @param cs - the character set and collation + @param a - the left string + @param a_length - the length of the left string + @param b - the right string + @param b_length - the length of the right string + @param b_is_prefix - if the caller wants to check if "b" is a prefix of "a" + @return - the comparison result +*/ +static int +MY_FUNCTION_NAME(strnncoll)(CHARSET_INFO *cs __attribute__((unused)), + const uchar *a, size_t a_length, + const uchar *b, size_t b_length, + my_bool b_is_prefix) +{ + const uchar *a_end= a + a_length; + const uchar *b_end= b + b_length; + for ( ; ; ) + { + int a_weight, b_weight, res; + uint a_wlen= MY_FUNCTION_NAME(scan_weight)(&a_weight, a, a_end); + uint b_wlen= MY_FUNCTION_NAME(scan_weight)(&b_weight, b, b_end); + /* + a_wlen b_wlen Comment + ------ ------ ------- + 0 0 Strings ended simultaneously, "a" and "b" are equal. + 0 >0 "a" is a prefix of "b", so "a" is smaller. + >0 0 "b" is a prefix of "a", check b_is_prefix. + >0 >0 Two weights were scanned, check weight difference. + */ + if (!a_wlen) + return b_wlen ? -b_weight : 0;
can weight be negative? if not, the code will be clearer if you declare it unsigned
+ + if (!b_wlen) + return b_is_prefix ? 0 : a_weight; + + if ((res= (a_weight - b_weight))) + return res; + /* + None of the strings has ended yet. + */ + DBUG_ASSERT(a < a_end); + DBUG_ASSERT(b < b_end); + a+= a_wlen; + b+= b_wlen; + } + DBUG_ASSERT(0); + return 0; +} + + +/** + Compare two strings according to the collation, with PAD SPACE handling. + + @param cs - the character set and collation + @param a - the left string + @param a_length - the length of the left string + @param b - the right string + @param b_length - the length of the right string + @param diff_if_only_endspace_difference - not used in the code. + TODO: this should be eventually removed (in 10.2?)
why not now (in a separate changeset, but in 10.1 still)?
+ @return - the comparison result +*/ + +static int +MY_FUNCTION_NAME(strnncollsp)(CHARSET_INFO *cs __attribute__((unused)), + const uchar *a, size_t a_length, + const uchar *b, size_t b_length, + my_bool diff_if_only_endspace_difference + __attribute__((unused))) +{ + const uchar *a_end= a + a_length; + const uchar *b_end= b + b_length; + for ( ; ; ) + { + int a_weight, b_weight, res; + uint a_wlen= MY_FUNCTION_NAME(scan_weight)(&a_weight, a, a_end); + uint b_wlen= MY_FUNCTION_NAME(scan_weight)(&b_weight, b, b_end); + if ((res= (a_weight - b_weight))) + { + /* + Got two different weights. Each weight can be generated by either of: + - a real character + - a bad byte sequence or an incomplete byte sequence + - an auto-generated trailing space (PAD SPACE) + It does not matter how exactly each weight was generated. + Just return the weight difference. + */ + return res; + } + if (!a_wlen && !b_wlen) + { + /* + Got two auto-generated trailing spaces, i.e. + both strings have now ended, so they are equal. + */ + DBUG_ASSERT(a == a_end); + DBUG_ASSERT(b == b_end); + return 0; + } + /* + At least one of the strings has not ended yet, continue comparison. + */ + DBUG_ASSERT(a < a_end || b < b_end); + a+= a_wlen; + b+= b_wlen; + } + DBUG_ASSERT(0); + return 0; +}
Regards, Sergei
Hi Sergei, Thanks for your review. Please see comments inline. On 06/19/2015 12:04 AM, Sergei Golubchik wrote:
Hi, Alexander!
On May 25, Alexander Barkov wrote:
Hi Sergei,
For simplicity I decided to split:
MDEV-8036 Fix all collations to compare broken bytes as "greater than any non-broken character"
into sub-tasks.
Please review a patch for the first sub-task MDEV-8214.
I don't know why you wanted me to review it :)
I felt so insecure and felt that I needed help like I'd never done before :) Seriously, there are some thing that did not look nice, for example this undef/ifdef stuff.
Looks pretty ok. Of course there were few comments and questions, but nothing serious.
diff --git a/strings/ctype-big5.c b/strings/ctype-big5.c index eda81c0..aa06a7a 100644 --- a/strings/ctype-big5.c +++ b/strings/ctype-big5.c @@ -6853,11 +6771,29 @@ my_mb_wc_big5(CHARSET_INFO *cs __attribute__((unused)), }
-static MY_COLLATION_HANDLER my_collation_big5_chinese_ci_handler = +#undef MY_FUNCTION_NAME +#undef WEIGHT_MB1 +#undef WEIGHT_MB2
why do you start from undef-ing macros that couldn't be defined here?
ctype-strcoll.ic is included two times from every file, for case insensitive and for binary collations respectively. My intention was to make every inclusion of ctype-strcoll.ic look exactly the same, no matter if it's the first or the second inclusion. So whenever I need to include it from a new place, I don't have to think and just copy and paste from any place.
+#define MY_FUNCTION_NAME(x) my_ ## x ## _big5_chinese_ci +#define WEIGHT_MB1(x) (sort_order_big5[(uchar) (x)]) +#define WEIGHT_MB2(x,y) (big5code(x, y)) +#include "ctype-strcoll.ic" + + +#undef MY_FUNCTION_NAME +#undef WEIGHT_MB1 +#undef WEIGHT_MB2
alternatively you could undef them at the end of ctype-strcoll.ic it'll also be a protection against including this file twice (although it cannot be done now anyway)
Thanks for the idea. This can certainly be done for WEIGHT_MB1 and WEIGHT_MB2. What do you suggest to do with MY_FUNCTION_NAME? It's now used in ctype-strcoll.ic in collation name context, and in ctype-mb.ic in character name context. So the structure of a file is about like this: #define MY_FUNCTION_NAME(x) my_ ## x ## _sjis #define ... character set specific definitions #include "ctype-mb.ic" ... #undef MY_FUNCTION_NAME #define MY_FUNCTION_NAME(x) my_ ## x ## _sjis_japanese_ci #define ... ci collation specific definitions #include "ctype-strcoll.ic" ... #undef MY_FUNCTION_NAME #define MY_FUNCTION_NAME(x) my_ ## x ## _sjis_bin #define ... bin collation specific definitions #include "ctype-strcoll.ic" ... Perhaps I should undef MY_FUNCTION_NAME in both ctype-mb.ic and ctype-strncoll.ic Another approach would be to have two things: MY_CHARSET_FUNCTION_NAME MY_COLLATION_FUNCTION_NAME What will be easier to read?
+#define MY_FUNCTION_NAME(x) my_ ## x ## _big5_bin +#define WEIGHT_MB1(x) ((uchar) (x)) +#define WEIGHT_MB2(x,y) (big5code(x, y)) +#include "ctype-strcoll.ic" + + +static MY_COLLATION_HANDLER my_collation_handler_big5_chinese_ci= { NULL, /* init */ - my_strnncoll_big5, - my_strnncollsp_big5, + my_strnncoll_big5_chinese_ci, + my_strnncollsp_big5_chinese_ci, my_strnxfrm_big5, my_strnxfrmlen_simple, my_like_range_mb, diff --git a/strings/ctype-euc_kr.c b/strings/ctype-euc_kr.c index a2c95bf..db47b3d 100644 --- a/strings/ctype-euc_kr.c +++ b/strings/ctype-euc_kr.c @@ -9938,21 +9940,56 @@ my_mb_wc_euc_kr(CHARSET_INFO *cs __attribute__((unused)), }
-static MY_COLLATION_HANDLER my_collation_ci_handler = +#undef MY_FUNCTION_NAME +#undef WEIGHT_MB1 +#undef WEIGHT_MB2 +#define MY_FUNCTION_NAME(x) my_ ## x ## _euckr_korean_ci +#define WEIGHT_MB1(x) (sort_order_euc_kr[(uchar) (x)]) +#define WEIGHT_MB2(x,y) (euckrcode(x, y)) +#include "ctype-strcoll.ic" + + +#undef MY_FUNCTION_NAME +#undef WEIGHT_MB1 +#undef WEIGHT_MB2 +#define MY_FUNCTION_NAME(x) my_ ## x ## _euckr_bin +#define WEIGHT_MB1(x) ((uchar) (x)) +#define WEIGHT_MB2(x,y) (euckrcode(x, y)) +#include "ctype-strcoll.ic" + + +static MY_COLLATION_HANDLER my_collation_handler_euckr_korean_ci= { - NULL, /* init */ - my_strnncoll_simple, /* strnncoll */ - my_strnncollsp_simple, - my_strnxfrm_mb, /* strnxfrm */ + NULL, /* init */ + my_strnncoll_euckr_korean_ci, + my_strnncollsp_euckr_korean_ci,
hmm, so is euckr multi-byte or not? wikipedia says it is, my_strnxfrm_mb. but why my_strnncoll_simple?
It is multi-byte, but a very simple one: - a byte in the range 0x00..0x7F cannot be a part of a multi-byte character, - a weight of a multi-byte character is just equal to its multi-byte representation - all multi-byte characters have its own distinct weight So before this change strnncoll[sp] just needed to do some mapping on the byte range 0x00..0x7F and do not do any mapping on the range 0x80..0xFF, and this is what my_strnncoll[sp]_simple can perfectly do. Now my_strnncoll[sp]_simple do not work any more, because we need to handle bad byte sequences differently.
+ my_strnxfrm_mb, my_strnxfrmlen_simple, - my_like_range_mb, /* like_range */ - my_wildcmp_mb, /* wildcmp */ + my_like_range_mb, + my_wildcmp_mb, my_strcasecmp_mb, my_instr_mb, my_hash_sort_simple, my_propagate_simple };
+ +static MY_COLLATION_HANDLER my_collation_handler_euckr_bin= +{ + NULL, /* init */ + my_strnncoll_euckr_bin, + my_strnncollsp_euckr_bin, + my_strnxfrm_mb, + my_strnxfrmlen_simple, + my_like_range_mb, + my_wildcmp_mb_bin, + my_strcasecmp_mb_bin, + my_instr_mb, + my_hash_sort_mb_bin, + my_propagate_simple +}; + + static MY_CHARSET_HANDLER my_charset_handler= { NULL, /* init */ diff --git a/strings/ctype-strcoll.ic b/strings/ctype-strcoll.ic new file mode 100644 index 0000000..7217f99 --- /dev/null +++ b/strings/ctype-strcoll.ic @@ -0,0 +1,208 @@ +/* + Copyright (c) 2015, MariaDB Foundation + + This program is free software; you can redistribute it and/or modify + it under the terms of the GNU General Public License as published by + the Free Software Foundation; version 2 of the License. + + This program is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + GNU General Public License for more details. + + You should have received a copy of the GNU General Public License + along with this program; if not, write to the Free Software + Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA +*/ + + +#ifndef MY_FUNCTION_NAME +#error MY_FUNCTION_NAME is not defined +#endif + + +/* + The weight for automatically padded spaces when comparing strings with + the PAD SPACE property. + Should normally be equal to the weight of a regular space.
it is not a space for sjis and cp932? what does that mean? padding space is not the same as a regular space? why?
It is the same. - For other character sets it looks like: #define WEIGHT_MB1(x) (sort_order_euc_kr[(uchar) (x)]) #define WEIGHT_PAD_SPACE ' ' So regular and pad spaces have the same weight: 0x20. - For sjis and cp932 wight for a single byte character is just defined in a more complex way: #define WEIGHT_MB1(x) (256 * (int) (uchar) (x)) #define WEIGHT_PAD_SPACE (256 * (int) ' ') So regular and pad spaces have the same weight again: 0x2000 This is to preserve this orderding: /* cp932_chinese_ci and cp932_bin sort character blocks in this order: 1. [00..7F] - 7BIT characters (ASCII) 2. [81..9F][40..7E,80..FC] - MB2 characters, part1 3. [A1..DF] - 8BIT characters (Kana) 4. [E0..FC][40..7E,80..FC] - MB2 characters, part2 */ I just decided to use 16 bit weights to handle all single byte characters (both 00..7F and A1..DF) in the same way, to make the code simpler.
+*/ +#ifndef WEIGHT_PAD_SPACE +#define WEIGHT_PAD_SPACE (' ') +#endif + + +/* + Weight of an illegal byte. + Must be greater than weight of any normal character. + Two bad bytes are compared binary. +*/ +#ifndef WEIGHT_ILSEQ
why #ifndef? it creates an impression that the caller (includer :) can redefine it. better define WEIGHT_ILSEQ here unconditionally and undefine it at the end
You got a correct impression. The next step is to implement Asian MB3 character sets, eucjpms and ujis. WEIGHT_ILSEQ will be a 24-bit value for them, so they will redefine it, most likely like this: #define WEIGHT_ILSEQ(x) (0xFFFF00 + (x))
+#define WEIGHT_ILSEQ(x) (0xFF00 + (x)) +#endif + + +/** + Scan a valid character, or a bad byte, or an auto-padded space + from a string and calculate the weight of the scanned sequence. + + @param [OUT] weight - the weight is returned here + @param str - the string + @param end - the end of the string + @return - the number of bytes scanned + + The including source file must define the following macros: + IS_MB1_CHAR(x) + IS_MB2_CHAR(x,y) + WEIGHT_PAD_SPACE + WEIGHT_MB1(x) + WEIGHT_MB2(x,y) + WEIGHT_ILSEQ(x) +*/ +static inline uint +MY_FUNCTION_NAME(scan_weight)(int *weight, const uchar *str, const uchar *end) +{ + if (str >= end) + { + *weight= WEIGHT_PAD_SPACE; + return 0; + } + + if (IS_MB1_CHAR(*str)) + { + *weight= WEIGHT_MB1(*str); /* A valid single byte character*/ + return 1; + } + + if (str + 2 > end) /* The string ended unexpectedly */ + goto bad; /* Treat as a bad byte */ + + if (IS_MB2_CHAR(str[0], str[1])) + { + *weight= WEIGHT_MB2(str[0], str[1]); + return 2; /* A valid two-byte character */ + } + +bad: + *weight= WEIGHT_ILSEQ(str[0]); /* Bad byte */ + return 1; +} + + +/** + Compare two strings according to the collation, + without handling the PAD SPACE property. + + Note, cs->coll->strnncoll() is usually used to compare identifiers. + Perhaps we should eventually (in 10.2?) create a new collation + my_charset_utf8_general_ci_no_pad and have only one comparison function + in MY_COLLATION_HANDLER. + + @param cs - the character set and collation + @param a - the left string + @param a_length - the length of the left string + @param b - the right string + @param b_length - the length of the right string + @param b_is_prefix - if the caller wants to check if "b" is a prefix of "a" + @return - the comparison result +*/ +static int +MY_FUNCTION_NAME(strnncoll)(CHARSET_INFO *cs __attribute__((unused)), + const uchar *a, size_t a_length, + const uchar *b, size_t b_length, + my_bool b_is_prefix) +{ + const uchar *a_end= a + a_length; + const uchar *b_end= b + b_length; + for ( ; ; ) + { + int a_weight, b_weight, res; + uint a_wlen= MY_FUNCTION_NAME(scan_weight)(&a_weight, a, a_end); + uint b_wlen= MY_FUNCTION_NAME(scan_weight)(&b_weight, b, b_end); + /* + a_wlen b_wlen Comment + ------ ------ ------- + 0 0 Strings ended simultaneously, "a" and "b" are equal. + 0 >0 "a" is a prefix of "b", so "a" is smaller. + >0 0 "b" is a prefix of "a", check b_is_prefix. + >0 >0 Two weights were scanned, check weight difference. + */ + if (!a_wlen) + return b_wlen ? -b_weight : 0;
can weight be negative? if not, the code will be clearer if you declare it unsigned
Weights cannot be negative. But as I need things like: (-b_weight) and (a_weight-b_weight) I thought signed will be properer here.
+ + if (!b_wlen) + return b_is_prefix ? 0 : a_weight; + + if ((res= (a_weight - b_weight))) + return res; + /* + None of the strings has ended yet. + */ + DBUG_ASSERT(a < a_end); + DBUG_ASSERT(b < b_end); + a+= a_wlen; + b+= b_wlen; + } + DBUG_ASSERT(0); + return 0; +} + + +/** + Compare two strings according to the collation, with PAD SPACE handling. + + @param cs - the character set and collation + @param a - the left string + @param a_length - the length of the left string + @param b - the right string + @param b_length - the length of the right string + @param diff_if_only_endspace_difference - not used in the code. + TODO: this should be eventually removed (in 10.2?)
why not now (in a separate changeset, but in 10.1 still)?
I created: MDEV-8360 Clean-up CHARSET_INFO: strnncollsp: diff_if_only_endspace_difference
+ @return - the comparison result +*/ + +static int +MY_FUNCTION_NAME(strnncollsp)(CHARSET_INFO *cs __attribute__((unused)), + const uchar *a, size_t a_length, + const uchar *b, size_t b_length, + my_bool diff_if_only_endspace_difference + __attribute__((unused))) +{ + const uchar *a_end= a + a_length; + const uchar *b_end= b + b_length; + for ( ; ; ) + { + int a_weight, b_weight, res; + uint a_wlen= MY_FUNCTION_NAME(scan_weight)(&a_weight, a, a_end); + uint b_wlen= MY_FUNCTION_NAME(scan_weight)(&b_weight, b, b_end); + if ((res= (a_weight - b_weight))) + { + /* + Got two different weights. Each weight can be generated by either of: + - a real character + - a bad byte sequence or an incomplete byte sequence + - an auto-generated trailing space (PAD SPACE) + It does not matter how exactly each weight was generated. + Just return the weight difference. + */ + return res; + } + if (!a_wlen && !b_wlen) + { + /* + Got two auto-generated trailing spaces, i.e. + both strings have now ended, so they are equal. + */ + DBUG_ASSERT(a == a_end); + DBUG_ASSERT(b == b_end); + return 0; + } + /* + At least one of the strings has not ended yet, continue comparison. + */ + DBUG_ASSERT(a < a_end || b < b_end); + a+= a_wlen; + b+= b_wlen; + } + DBUG_ASSERT(0); + return 0; +}
Regards, Sergei
Hi, Alexander! On Jun 23, Alexander Barkov wrote:
+#define MY_FUNCTION_NAME(x) my_ ## x ## _big5_chinese_ci +#define WEIGHT_MB1(x) (sort_order_big5[(uchar) (x)]) +#define WEIGHT_MB2(x,y) (big5code(x, y)) +#include "ctype-strcoll.ic" + + +#undef MY_FUNCTION_NAME +#undef WEIGHT_MB1 +#undef WEIGHT_MB2
alternatively you could undef them at the end of ctype-strcoll.ic it'll also be a protection against including this file twice (although it cannot be done now anyway)
Thanks for the idea. This can certainly be done for WEIGHT_MB1 and WEIGHT_MB2.
What do you suggest to do with MY_FUNCTION_NAME? It's now used in ctype-strcoll.ic in collation name context, and in ctype-mb.ic in character name context.
So the structure of a file is about like this:
#define MY_FUNCTION_NAME(x) my_ ## x ## _sjis #define ... character set specific definitions #include "ctype-mb.ic" ... #undef MY_FUNCTION_NAME #define MY_FUNCTION_NAME(x) my_ ## x ## _sjis_japanese_ci #define ... ci collation specific definitions #include "ctype-strcoll.ic" ... #undef MY_FUNCTION_NAME #define MY_FUNCTION_NAME(x) my_ ## x ## _sjis_bin #define ... bin collation specific definitions #include "ctype-strcoll.ic" ...
Perhaps I should undef MY_FUNCTION_NAME in both ctype-mb.ic and ctype-strncoll.ic
Yes, I would've done that, I guess.
Another approach would be to have two things:
MY_CHARSET_FUNCTION_NAME MY_COLLATION_FUNCTION_NAME
What will be easier to read?
I don't think so. Unless you need both MY_CHARSET_FUNCTION_NAME and MY_COLLATION_FUNCTION_NAME in one include file. But you apparently don't. By the way, could you please rename sql/sql_plugin_services.h to sql/sql_plugin_services.ic and sql/sys_vars.h to sql/sys_vars.ic? As you're introducing a new convention, let's follow it consistently.
+#define MY_FUNCTION_NAME(x) my_ ## x ## _big5_bin +#define WEIGHT_MB1(x) ((uchar) (x)) +#define WEIGHT_MB2(x,y) (big5code(x, y)) +#include "ctype-strcoll.ic" + diff --git a/strings/ctype-strcoll.ic b/strings/ctype-strcoll.ic new file mode 100644 index 0000000..7217f99 --- /dev/null +++ b/strings/ctype-strcoll.ic @@ -0,0 +1,208 @@ +/* + Copyright (c) 2015, MariaDB Foundation + + This program is free software; you can redistribute it and/or modify + it under the terms of the GNU General Public License as published by + the Free Software Foundation; version 2 of the License. + + This program is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + GNU General Public License for more details. + + You should have received a copy of the GNU General Public License + along with this program; if not, write to the Free Software + Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA +*/ + + +#ifndef MY_FUNCTION_NAME +#error MY_FUNCTION_NAME is not defined +#endif + + +/* + The weight for automatically padded spaces when comparing strings with + the PAD SPACE property. + Should normally be equal to the weight of a regular space.
it is not a space for sjis and cp932? what does that mean? padding space is not the same as a regular space? why?
It is the same.
- For sjis and cp932 wight for a single byte character is just defined in a more complex way:
#define WEIGHT_MB1(x) (256 * (int) (uchar) (x)) #define WEIGHT_PAD_SPACE (256 * (int) ' ')
So regular and pad spaces have the same weight again: 0x2000
In that case it might've been easier to read (less questions asked) if you'd define it unconditionally as #define WEIGHT_PAD_SPACE WEIGHT_MB1(' ')
+/* + Weight of an illegal byte. + Must be greater than weight of any normal character. + Two bad bytes are compared binary. +*/ +#ifndef WEIGHT_ILSEQ +#define WEIGHT_ILSEQ(x) (0xFF00 + (x))
why #ifndef? it creates an impression that the caller (includer :) can redefine it. better define WEIGHT_ILSEQ here unconditionally and undefine it at the end
You got a correct impression.
The next step is to implement Asian MB3 character sets, eucjpms and ujis. WEIGHT_ILSEQ will be a 24-bit value for them, so they will redefine it, most likely like this:
#define WEIGHT_ILSEQ(x) (0xFFFF00 + (x))
I don't particularly like that. Same as with WEIGHT_PAD_SPACE - one thinks that WEIGHT_ILSEQ depends on the charset. While in fact it only depends on the max_char_length. I cannot immediately suggest a better definition though (constant and easy to understand), so it's ok to keep it like you've done. Regards, Sergei
Hi Sergei, On 06/24/2015 08:46 PM, Sergei Golubchik wrote:
Hi, Alexander!
On Jun 23, Alexander Barkov wrote:
+#define MY_FUNCTION_NAME(x) my_ ## x ## _big5_chinese_ci +#define WEIGHT_MB1(x) (sort_order_big5[(uchar) (x)]) +#define WEIGHT_MB2(x,y) (big5code(x, y)) +#include "ctype-strcoll.ic" + + +#undef MY_FUNCTION_NAME +#undef WEIGHT_MB1 +#undef WEIGHT_MB2
alternatively you could undef them at the end of ctype-strcoll.ic it'll also be a protection against including this file twice (although it cannot be done now anyway)
Thanks for the idea. This can certainly be done for WEIGHT_MB1 and WEIGHT_MB2.
Done. Now strcoll.ic ends with: #undef MY_FUNCTION_NAME #undef WEIGHT_ILSEQ #undef WEIGHT_MB1 #undef WEIGHT_MB2 #undef WEIGHT_PAD_SPACE So the files that include strcoll.ic do not have to undefine. Looks better. <skip>
Perhaps I should undef MY_FUNCTION_NAME in both ctype-mb.ic and ctype-strncoll.ic
Yes, I would've done that, I guess.
Done. <skip>
By the way, could you please rename sql/sql_plugin_services.h to sql/sql_plugin_services.ic and sql/sys_vars.h to sql/sys_vars.ic?
As you're introducing a new convention, let's follow it consistently.
Agree. Note, I stole the idea from InnoDB, they use this *.ic extension widely. I created a separate task to rename the other files: MDEV-8384 Rename included source files to *.ic <skip>
+/* + The weight for automatically padded spaces when comparing strings with + the PAD SPACE property. + Should normally be equal to the weight of a regular space.
it is not a space for sjis and cp932? what does that mean? padding space is not the same as a regular space? why?
It is the same.
- For sjis and cp932 wight for a single byte character is just defined in a more complex way:
#define WEIGHT_MB1(x) (256 * (int) (uchar) (x)) #define WEIGHT_PAD_SPACE (256 * (int) ' ')
So regular and pad spaces have the same weight again: 0x2000
In that case it might've been easier to read (less questions asked) if you'd define it unconditionally as
#define WEIGHT_PAD_SPACE WEIGHT_MB1(' ')
In case of a _ci collation WEIGHT_MB1 uses a conversion table, to provide case insensitivity: #define WEIGHT_MB1(x) (256 * (int) sort_order_sjis[(uchar) (x)]) So WEIGHT_PAD_SPACE defined through WEIGHT_MB1 would be: (256 * (int) sort_order_sjis[(uchar) ' ']) instead of just: (256 * (int) (uchar) (x)) So I decided to keep it in the way I've done, and added more comments near the default definition in strcoll.ic
+/* + Weight of an illegal byte. + Must be greater than weight of any normal character. + Two bad bytes are compared binary. +*/ +#ifndef WEIGHT_ILSEQ +#define WEIGHT_ILSEQ(x) (0xFF00 + (x))
why #ifndef? it creates an impression that the caller (includer :) can redefine it. better define WEIGHT_ILSEQ here unconditionally and undefine it at the end
You got a correct impression.
The next step is to implement Asian MB3 character sets, eucjpms and ujis. WEIGHT_ILSEQ will be a 24-bit value for them, so they will redefine it, most likely like this:
#define WEIGHT_ILSEQ(x) (0xFFFF00 + (x))
I don't particularly like that. Same as with WEIGHT_PAD_SPACE - one thinks that WEIGHT_ILSEQ depends on the charset. While in fact it only depends on the max_char_length.
I cannot immediately suggest a better definition though (constant and easy to understand), so it's ok to keep it like you've done.
Okey. Running mtr last time now, then will push it. Thanks for your suggestions!
Regards, Sergei
participants (2)
-
Alexander Barkov
-
Sergei Golubchik