Hi Monty, thanks for review! Pleas find the new version of the patch attached. I tried to address most of your suggestions. Please see details inline: On 03/17/2013 03:26 PM, Michael Widenius wrote:
Hi!
I cc: mariadb-developers to ensure that others can also comment on the patch. Because the patch was not originally submitted there, I am not cutting out ok code parts as I usually do.
"Alexander" == Alexander Barkov <bar@mariadb.org> writes:
Alexander> Hi Monty, Alexander> Please review a merge from MySQL-5.6 for utf16le.
Alexander> I also merged a few related bug fixes, Alexander> which affected the test "ctype_utf16le":
Alexander> - Unreadable warnings in CAST(<ucs2, utf16, utf32 value>
AS SIGNED)
Alexander> - Errors about unknown collation and unknown charset in cases Alexander> when the SET queries should actually succeed:
Alexander> SET collation_connection=<ucs2, utf16, utf32 value>; Alexander> SET character_set_connection=<ucs2, utf16, utf32 value>;
Alexander> - MySQL Bug#59145 valgrind warnings for uninitialized
values in
Alexander> my_strtoll10_mb2
Alexander> - MySQL Bug#32859 Character sets: no warning with non-fitting chariot wheel
<cut>
=== added file 'mysql-test/include/ctype_heap.inc' --- mysql-test/include/ctype_heap.inc 1970-01-01 00:00:00 +0000 +++ mysql-test/include/ctype_heap.inc 2012-12-24 16:09:42 +0000 @@ -0,0 +1,11 @@ +SELECT @@collation_connection; +CREATE TABLE t1 ENGINE=HEAP AS SELECT REPEAT (' ', 10) AS a LIMIT 0; +ALTER TABLE t1 ADD KEY (a); +CREATE TABLE t2 (a VARCHAR(10)); +INSERT INTO t2 VALUES ('0'),('1'),('2'),('3'),('4'),('5'),('6'),('7'); +INSERT INTO t2 VALUES ('8'),('9'),('A'),('B'),('C'),('D'),('E'),('F'); +INSERT INTO t1 SELECT CONCAT('a',t21.a,t22.a) FROM t2 t21, t2 t22
ORDER BY 1;
+DROP TABLE t2; +INSERT INTO t1 VALUES ('a '); +SELECT a, HEX(a) FROM t1 WHERE a='a'; +DROP TABLE t1;
Monty: What new code / issue does the above test? Please add a comment before the test.
Added this comment: --echo # --echo # Test that cs->coll->hash_sort() ignores trailing spaces --echo # <cut>
=== added file 'mysql-test/r/ctype_utf16le.result' --- mysql-test/r/ctype_utf16le.result 1970-01-01 00:00:00 +0000 +++ mysql-test/r/ctype_utf16le.result 2013-03-14 15:04:17 +0000
<skip>
+# +# Check LPAD/RPAD +# +CREATE TABLE t1 (a VARCHAR(10), pad INT, b VARCHAR(10)) CHARACTER
SET utf16le;
+INSERT INTO t1 VALUES (_ucs2 X'0420', 10, _ucs2 X'0421'); +INSERT INTO t1 VALUES (_ucs2 X'0420', 10, _ucs2 X'04210422'); +INSERT INTO t1 VALUES (_ucs2 X'0420', 10, _ucs2 X'042104220423'); +INSERT INTO t1 VALUES (_ucs2 X'0420042104220423042404250426042704280429042A042B',10,_ucs2 X'042104220423'); +Warnings: +Warning 1265 Data truncated for column 'a' at row 1 +INSERT INTO t1 VALUES (_utf32 X'010000', 10, _ucs2 X'0421'); +INSERT INTO t1 VALUES (_ucs2 X'0421', 10, _utf32 X'010000'); +SELECT a, pad, b, LPAD(a, pad, b), HEX(LPAD(a, pad, b)) FROM t1; +a pad b LPAD(a, pad, b) HEX(LPAD(a, pad, b)) +Р10 С СССССССССР2104210421042104210421042104210421042004 +Р10 СТ СТСТСТСТСР2104220421042204210422042104220421042004 +Р10 СТУ СТУСТУСТУР2104220423042104220423042104220423042004 +РСТУФХЦЧШЩ 10 СТУ РСТУФХЦЧШЩ 2004210422042304240425042604270428042904 +? 10 С ССССССССС? 21042104210421042104210421042104210400D800DC +С 10 ? ?????????С 00D800DC00D800DC00D800DC00D800DC00D800DC00D800DC00D800DC00D800DC00D800DC2104
Monty: Is the last line with '?' right
The '?' was right: The test ctype_utf16le uses "SET NAMES utf8" all around the code. Character set "utf8" in MySQL covers only Basic Multilingual Plane ("BMP" Unicode range U+0000..U+FFFF). The character which is printed as '?' is U+10000, it's outside of BMP. It's not a problem from the coverage point of view, because the test also prints HEX version of the value. I fixed this piece of the test to do "SET NAMES utf8mb4" instead. "utf8mb4" covers the full Unicode range (U+0000..U+10FFFF). Now the character U+10000 is printed with its real shape instead of '?'.
Note that for test cases, it may sometimes be better to just output the HEX part.
Result files with 'wrong' letters will sooner or later cause a problem when someone tries to edit the file for fixing a test result case as the editor may 'scramble some characters.
You saw 'wrong' letters because the email program interpreted the diff file as latin1, while in fact it is utf8. If you safe the diff file and open it in a utf8 editor, you should see 'good' letters instead :) I agree that HEX output is very helpful. But using *only* HEX output for character set related tests will make the tests very hard to read and modify. When it makes sense, I try to use both plain text and HEX output in these tests. Note, *.test and *.result files for these tests: - ctype_utf16 - ctype_utf16le - ctype_utf32 are encoded in utf8, which is now the standard character set on Linux. If someone edits the files on Linux, there should not be any problems. However, there is a chance that someone can break the files using a non-utf8 editor on Windows. Probably we should add a comment in the beginning of these test files asking to make sure to use an utf8 editor only. Ideally, it would be nice to have a bzr trigger that would check certain files for their content types, which should be "UTF-8 Unicode text" (as reported by file(1)) for the mentioned files. <cut>
=== added file 'mysql-test/t/ctype_utf16le.test' --- mysql-test/t/ctype_utf16le.test 1970-01-01 00:00:00 +0000 +++ mysql-test/t/ctype_utf16le.test 2013-03-14 15:03:57 +0000
<cut>
+--echo # +--echo # Testing cs->cset->strntoll and cs->cset->strntoull +--echo # +SET NAMES latin1; +SELECT HEX(CONV(CONVERT('123' USING utf16le), -10, 16)); +SELECT HEX(CONV(CONVERT('123' USING utf16le), 10, 16)); +SET NAMES utf8, collation_connection=utf16le_general_ci;
Monty: You have here two idenitcal SET NAMES in a row.
Yes, there is some redundancy in "SET NAMES" statements. This makes easier to move the test chucks within the *.test file, or move the chucks into shared *.inc file. I'd keep them. They are not harmful anyway.
=== modified file 'sql/sys_vars.cc' --- sql/sys_vars.cc 2013-02-25 23:20:17 +0000 +++ sql/sys_vars.cc 2013-03-15 04:23:14 +0000 @@ -385,16 +385,19 @@ if (var->value->result_type() == STRING_RESULT) { String str(buff, sizeof(buff), system_charset_info), *res; - if (!(res=var->value->val_str(&str))) + if (!(res= var->value->val_str(&str))) var->save_result.ptr= NULL; - else if (!(var->save_result.ptr=
get_charset_by_csname(res->c_ptr(),
- MY_CS_PRIMARY, - MYF(0))) && - !(var->save_result.ptr=get_old_charset_by_name(res->c_ptr()))) + else { - ErrConvString err(res); - my_error(ER_UNKNOWN_CHARACTER_SET, MYF(0), err.ptr()); - return true; + ErrConvString err(res); /* Get utf8 '\0' terminated string */ + if (!(var->save_result.ptr= get_charset_by_csname(err.ptr(), + MY_CS_PRIMARY, + MYF(0))) && + !(var->save_result.ptr= get_old_charset_by_name(err.ptr()))) + { + my_error(ER_UNKNOWN_CHARACTER_SET, MYF(0), err.ptr()); + return true; + } }
Monty: Please add a comment why we need to use ErrConvString instead of the original string. (As ErrConvString should normally only be used for errors).
I assume this is because of:
SET collation_connection=<ucs2, utf16, utf32 value>; SET character_set_connection=<ucs2, utf16, utf32 value>;
Yes, the functions "get_charset_by_[cs]name()" expect a '\0'-terminated C-string. Character sets ucs2, utf16, utf16le, utf32 are not compatible with C-strings, this is why conversion is needed.
Wouldn't it be better to have a function
String change_character_set(CHARSET_INFO *cs)
for cases like this?
(I am mostly worried someone will try to copy the above code for things where ErrConvString is not suitable)
After all, instead of using ErrConvString() I assume one could simple
have:
char buff[256]; String utf8(buff, sizeof(buff), system_charset_info); utf8.copy(res);
Which I think is probably faster and more clear than using ErrConvString().
I had three reasons to use ErrConvString here: - Using ErrConvString() should be as fast as using String::copy(), or even faster! because ErrConvString() constructor has to do even less initialization comparing to the String constructor. - Also, ErrConvString() it's very convenient: a one line call for ErrConvString::ErrConvString() makes everything we need. In case of String you needed 3 lines. - And finally, in case when "get_charset_by_[cs]name()" returns an error, the charset/collation name should be passed to my_error(), which needs ErrConvString() anyway. I think ErrConvString() quite well suites this particular piece of the code. Should I put this explanation as a comment?
} else // INT_RESULT @@ -505,11 +508,14 @@ String str(buff, sizeof(buff), system_charset_info), *res; if (!(res= var->value->val_str(&str))) var->save_result.ptr= NULL; - else if (!(var->save_result.ptr=
get_charset_by_name(res->c_ptr(), MYF(0))))
+ else { - ErrConvString err(res); - my_error(ER_UNKNOWN_COLLATION, MYF(0), err.ptr()); - return true; + ErrConvString err(res); /* Get utf8 '\0'-terminated string */ + if (!(var->save_result.ptr= get_charset_by_name(err.ptr(), MYF(0)))) + { + my_error(ER_UNKNOWN_COLLATION, MYF(0), err.ptr()); + return true; + } } }
Monty: same argument as above.
else // INT_RESULT
=== modified file 'strings/ctype-ucs2.c' --- strings/ctype-ucs2.c 2013-01-29 14:10:47 +0000 +++ strings/ctype-ucs2.c 2013-03-15 04:27:10 +0000
<cut>
@@ -738,13 +751,14 @@
/* Check for a sign. */ negative= 0; - if (!s[0] && s[1] == '-') + if (wc == '-') { *error= -1; /* Mark as
negative number */
negative= 1; - s+= 2; - if (s == end) + res= cs->cset->mb_wc(cs, &wc, (const uchar *) s, (const uchar
*) end);
+ if (res < 0) goto no_conv; + s+= res;
Monty: Add a comment:
/* wc should contain the first number in the code below */
I added this comment: /* The code below assumes that 'wc' holds the first digit and 's' points to the next character after it. */
It would also be nice to have a comment why we can't assume for this code that each character is 2 bytes (like the old code assumed). I assume it's to detect wrong characters ?
The old assumption was a trick for simplicity. One character can actually occupy 2 or 4 bytes in utf16: - Characters U+0000..U+FFFF occupy 2 bytes. - Characters U+10000..U+10FFFF occupy 4 bytes. Before adding of utf16le we did not really have to distinguish between 2-byte and 4-byte characters, because in case of a 4-byte characters the conversion would stop anyway: the leading 2 bytes would not represent a digit. The utf16le patch switches from direct manipulation on the input byte stream to calls for cs->cset->mb_wc(), which is needed to scan these two charsets types properly: - ucs2 and utf16 (which use big endian) versus - utf16le (which uses little endian).
cutoff= MAX_NEGATIVE_NUMBER / LFACTOR2; cutoff2= (MAX_NEGATIVE_NUMBER % LFACTOR2) / 100; cutoff3= MAX_NEGATIVE_NUMBER % 100; @@ -752,46 +766,55 @@ else { *error= 0; - if (!s[0] && s[1] == '+') + if (wc == '+') { - s+= 2; - if (s == end) + res= cs->cset->mb_wc(cs, &wc, (const uchar *) s, (const
uchar *) end);
+ if (res < 0) goto no_conv; + s+= res;
Monty: Add same comment as for the earlier comment
Done.
} cutoff= ULONGLONG_MAX / LFACTOR2; cutoff2= ULONGLONG_MAX % LFACTOR2 / 100; cutoff3= ULONGLONG_MAX % 100; }
+ /* Handle case where we have a lot of pre-zero */ - if (!s[0] && s[1] == '0') + if (wc == '0') { i= 0; - do + for ( ; ; s+= res) { - s+= 2; if (s == end) goto end_i; /* Return 0 */ + res= cs->cset->mb_wc(cs, &wc, (const uchar *) s, (const
uchar *) end);
+ if (res < 0) + goto no_conv; + if (wc != '0') + break; } - while (!s[0] && s[1] == '0'); + while (wc == '0');
Monty: The above line is a bug (even if condition is impossible). Remove it! (The above code is probably wrongly left there when you converted the do-while to a for()).
Exactly, this is what happened. Thanks for noticing this! Fixed.
n_end= s + 2 * INIT_CNT; } else { /* Read first digit to check that it's a valid number */ - if (s[0] || (c= (s[1]-'0')) > 9) + if ((c= (wc - '0')) > 9) goto no_conv; i= c; - s+= 2; n_end= s + 2 * (INIT_CNT-1); }
Monty: Above you can set 'i' directly. No reason to use 'c'
Done.
/* Handle first 9 digits and store them in i */ if (n_end > end) n_end= end; - for (; s != n_end ; s+= 2) + for ( ; ; ) { - if (s[0] || (c= (s[1]-'0')) > 9) + res= cs->cset->mb_wc(cs, &wc, (const uchar *) s, (const uchar
*) n_end);
+ if (res < 0) + break;
Monty: In general I prefer the code:
if ((res= cs->cset->mb_wc(cs, &wc, (const uchar *) s, (const uchar *) n_end) < 0)
Me too :) Done.
By the way, wouldn't the code be simpler if you would change the type of 's' and 'end' and 'n_end' to const uchar * ?
This is a good idea. It reduced the number of casts from 8 to 3 and made the long lines shorter. Together with adding a function pointer variable: "my_charset_conv_mb_wc mb_wc= cs->cset->mb_wc;" it made the code as short as: if ((res= mb_wc(cs, &wc, s, end)) <= 0) goto no_conv; Btw, I also changed "<" to "<=". In case of utf16/utf16le it does not really matter (even on wrong input sequence "wc" does get initialized to ((hi<<8)+lo) anyway, so the loop just breaks on non-digit). But logically "<=" is more correct. That will be important if we ever decide to reuse this function for other character sets as well.
+ s+= res; + if ((c= (wc - '0')) > 9) goto end_i; i= i*10+c; }
<cut>
@@ -817,20 +843,28 @@ goto end_i_and_j; goto end3; } - if (s[0] || (c= (s[1]-'0')) > 9) + res= cs->cset->mb_wc(cs, &wc, (const uchar *) s, (const uchar *)
end);
+ if (res < 0) + goto no_conv; + s+= res; + if ((c= (wc - '0')) > 9) goto end3;
Monty: Here you could store directly in 'k' and remove next line
Done.
Isn't it wrong to increment 's' before going to to end3? 's' should point to the first not-number character. Now it will point one character past it.
Why wasn't this shown in the test cases? Maybe because you didn't test with big numbers?
Right, my_strtoll10_mb2() was not well covered in the tests. I added a shared test file include/ctype_strtoll10.inc and included it from ctype_utf16.test and ctype_utf16le.test. I also included it from ctype_utf32.test. It revealed a bug in handling big numbers in my_strtoll10_utf32() too. Fixed. Btw, we could join my_strtoll10_mb2() and my_strtoll10_utf32(), to reduce the amount of very similar code, but this may slightly slow down utf32 (because of switching from direct operation on the bytes to call for mb_wc()). Don't know what is better: a slightly faster but duplicate code, or a slightly slower code, but reused in multiple places.
/* Handle the next 1 or 2 digits and store them in k */ k=c; - s+= 2; - if (s == end || s[0] || (c= (s[1]-'0')) > 9) + if (s == end) + goto end4; + res= cs->cset->mb_wc(cs, &wc, (const uchar *) s, (const uchar *)
end);
+ if (res < 0) + goto no_conv; + s+= res; + if ((c= (wc - '0')) > 9) goto end4;
Monty: Isn't it wrong to increment 's' before going to to end4? 's' should point to the first not-number character. Now it will point one character past it.
Why wasn't this shown in the test cases?
Please check the same issue with all the other calls to end_i, end_i_and_j, end3 and end4.
Done. Incrementing should happen after "goto" in all cases. Fixed.
Even some of the earlier code looks not safe.
k= k*10+c; - s+= 2; *endptr= (char*) s;
/* number string should have ended here */ - if (s != end && !s[0] && (c= (s[1]-'0')) <= 9) + if (s != end && (c= (wc - '0')) <= 9) goto overflow;
/* Check that we didn't get an overflow with the last digit */
<skip>
@@ -900,11 +938,34 @@
static void -my_fill_mb2(CHARSET_INFO *cs __attribute__((unused)), - char *s, size_t l, int fill) +my_fill_mb2(CHARSET_INFO *cs, char *s, size_t slen, int fill) { - DBUG_ASSERT(fill <= 0xFFFF); - for ( ; l >= 2; s[0]= (fill >> 8), s[1]= (fill & 0xFF), s+= 2,
l-= 2);
+ char buf[10]; + int buflen; + + DBUG_ASSERT((slen % 2) == 0); + + buflen= cs->cset->wc_mb(cs, (my_wc_t) fill, (uchar*) buf, + (uchar*) buf + sizeof(buf)); + + DBUG_ASSERT(buflen > 0); + + while (slen >= (size_t) buflen) + { + /* Enough space for the characer */ + memcpy(s, buf, (size_t) buflen); + s+= buflen; + slen-= buflen; + }
Monty: I generally prefer loops like:
end= s + slen; for ( ; s < end; s+=buflen) memcpy(s, buf, (size_t) buflen);
Better to only have to decrement one variable instead of two.
Done.
+ + /* + If there are some more space which is not enough + for the whole multibyte character, then add trailing zeros. + */ + for ( ; slen; slen--) + { + *s++= 0x00; + }
Monty: Use instead:
if (slen) bzero(s, slen);
Done. <cut>
@@ -1141,10 +1208,10 @@ DBUG_ASSERT(src == dst && srclen == dstlen);
while ((src < srcend) && - (res= my_utf16_uni(cs, &wc, (uchar *)src, (uchar*)
srcend)) > 0)
+ (res= cs->cset->mb_wc(cs, &wc, (uchar *) src, (uchar *) srcend)) > 0) { my_toupper_utf16(uni_plane, &wc); - if (res != my_uni_utf16(cs, wc, (uchar*) src, (uchar*) srcend)) + if (res != cs->cset->wc_mb(cs, wc, (uchar *) src, (uchar *) srcend)) break; src+= res; }
Monty: Check the assembler if it would make sence to store cs->cset->mb_wc in a register.
(It probably would as the assembler can't assume that the pointer is unchanged over calls).
This applies to all cases where you use 'cs->cset->mb_wc' in a loop.
Done. I did not check assember. But this should be surely faster than dereferencing of two pointers (in case if compiler does not assume). Also, the code looks nicer this way, because of shorter lines. <cut>
+my_uni_utf16le(const CHARSET_INFO *cs __attribute__((unused)), + my_wc_t wc, uchar *s, uchar *e) +{ + if (wc < MY_UTF16_SURROGATE_HIGH_FIRST || + (wc > MY_UTF16_SURROGATE_LOW_LAST && + wc <= 0xFFFF)) + { + if (s + 2 > e) + return MY_CS_TOOSMALL2; + int2store(s, wc); + return 2; /* [0000-D7FF,E000-FFFF] */ + } + + if (wc < 0xFFFF || wc > 0x10FFFF) + return MY_CS_ILUNI; /* [D800-DFFF,10FFFF+] */ + + if (s + 4 > e) + return MY_CS_TOOSMALL4; + + wc-= 0x10000; + int2store(s, (0xD800 | ((wc >> 10) & 0x3FF))); s+= 2; + int2store(s, (0xDC00 | (wc & 0x3FF)));
Monty: Use int2store(s+2... for the second row and remove s+2. Makes the line easier to read...
Note that as int2store is a macro, doing it this is maybe faster:
uint16 first_part, second_part; uint32 total; first_part= (0xD800 | ((wc >> 10) & 0x3FF))); second_part= (0xDC00 | (wc & 0x3FF))); total= first_part + (second_part << 16) int4store(s, total);
Done: changed to int4store(). It copies 4 bytes at a time on i386, so it should be faster.
+ return 4; /* [010000-10FFFF] */ +} + + +static size_t +my_lengthsp_utf16le(const CHARSET_INFO *cs __attribute__((unused)), + const char *ptr, size_t length) +{ + const char *end= ptr + length; + while (end > ptr + 1 && uint2korr(end - 2) == 0x20)
Monty Using ' ' instead of 0x20 is maybe more clear?
Done. <cut>
@@ -2708,6 +2924,35 @@ }
+static inline void +my_tolower_ucs2(MY_UNICASE_INFO *const *uni_plane, my_wc_t *wc) +{ + int page= *wc >> 8; + DBUG_ASSERT(page < 256);
Monty: make page uint both here and in other places.
Done. <cut>
+static void +my_fill_ucs2(CHARSET_INFO *cs __attribute__((unused)), + char *s, size_t l, int fill)
Monty: Is it time to make 'fill' an unsigned int?
"my_wc_t" would probably be even better. I'd suggest to do it in a separate patch, as all strings/ctype-*.c will be affected. But do you really want to do so? It will diverge the code between MySQL and MariaDB without giving any advantages.
+{ + DBUG_ASSERT(fill <= 0xFFFF); + for ( ; l >= 2; s[0]= (fill >> 8), s[1]= (fill & 0xFF), s+= 2,
l-= 2);
+}
Monty: You can make the above faster if you store fill >>8 and fill & 0xff in registers.
Done. Rewrote using swapping bytes followed by int2store(). Should be even faster. <cut>