Hi Tom, On 03/18/2016 07:30 PM, Tom Worster wrote:
This conversation made me curious.
Which code points does a PADSPACE collation treat as trailing spaces?
Usually, just U+0020 SPACE. But this actually depends on an exact collation. This one treats U+00A0 NO-BREAK SPACE as equal to U+0020: http://collation-charts.org/mysql60/mysql604.cp1250_general_ci.html and thus ignores both trailing regular spaces and no-break spaces.
Would the imagined utf8_*_nopad_* collations treat all of these code points as significant?
Yes, this is the idea.
Tom
On 3/11/16, 6:20 AM, "Maria-discuss on behalf of Alexander Barkov" <maria-discuss-bounces+fsb=thefsb.org@lists.launchpad.net on behalf of bar@mariadb.org> wrote:
Hello Binarus, Kristian,
On 03/11/2016 02:25 PM, Kristian Nielsen wrote:
Binarus <lists@binarus.de> writes:
"All MySQL collations are of type PADSPACE. This means that all CHAR, VARCHAR, and TEXT values in MySQL are compared without regard to any trailing spaces. “Comparison” in this context does not include the
Yes, I have always found this terminally stupid as well. But I think it comes from the SQL standard.
Probably this comes from here:
2) A <standard character set name> specifies the name of a character set that is defined by a national or international standard. The character repertoire of CS is defined by the standard defining the character set identified by that <standard character set name>. The default collation of the character set is defined by the order of the characters in the standard and has the PAD SPACE characteristic.
So we follow the PAD SPACE requirement for default collations here.
I guess the reasoning here is to treat CHAR and VARCHAR in a similar way by default. I agree with the standard here. We don't want different behavior for CHAR and VARCHAR.
Btw, we don't follow the requirement that the default collation must be defined according to "the order of the characters in the standard". Default collations are traditionally case insensitive in MariaDB/MySQL.
The only workaround I know of is to use VARBINARY instead of VARCHAR. I think it works much the same in most respects. But obviously some semantics is lost when the server no longer is aware of the character set used.
Correct.
Since the index behaviour obviously depends on the collation, would building an own collation which does not PADSPACE be an option? I have
That would be interesting, actually. I don't know what support there is for non-PADSPACE collations. Maybe bar knows (Cc:'ed)?
We don't have NO PAD collations yet. I don't remember that anybody ever asked for this before.
From a glance, this should not be too hard to implement.
You previously wrote you need utf8. So there are three options to make a new collation, based on one of these existing collations:
- utf8_general_ci - utf8_unicode_ci - utf8_bin
but with NO PAD characteristics.
Suppose you need utf8_general_nopad_ci (i.e. based on utf8_general_ci)
utf8_general_ci is implemented in strings/ctype-utf8.c
A new collation handler should be added, similar to this one:
static MY_COLLATION_HANDLER my_collation_utf8_general_ci_handler = { NULL, /* init */ my_strnncoll_utf8_general_ci, my_strnncollsp_utf8_general_ci, my_strnxfrm_unicode, my_strnxfrmlen_unicode, my_like_range_mb, my_wildcmp_utf8, my_strcasecmp_utf8, my_instr_mb, my_hash_sort_utf8, my_propagate_complex };
but these three virtual functions must be redefined to new similar functions that do not ignore trailing spaces:
- my_strnncollsp_utf8_general_ci - this is used for BTREE indexes - my_hash_sort_utf8 - this is used for HASH indexes - my_strnxfrm_unicode - this is used for filesort (non-indexed ORDER BY)
All other functions can be reused from the existing PAD SPACE collation.
Then a new "struct charset_info_st" should be defined, similar to my_charset_utf8_general_ci.
Sounds like a few hours of work.
But then thorough testing will be needed, which will be the most time consuming part.
Also, for consistency, it's worthy to implement at least utf8_nopad_bin and utf8_unicode_nopad_ci at once, and then eventually NO PAD collations for all other character sets.
- Kristian.
_______________________________________________ Mailing list: https://launchpad.net/~maria-discuss Post to : maria-discuss@lists.launchpad.net Unsubscribe : https://launchpad.net/~maria-discuss More help : https://help.launchpad.net/ListHelp