[Maria-discuss] Collations, trailing spaces and unique indexes
Dear all, I already have done some MySQL development and now eventually would like to switch to MariaDB because I no longer can stand the following: "All MySQL collations are of type PADSPACE. This means that all CHAR, VARCHAR, and TEXT values in MySQL are compared without regard to any trailing spaces. “Comparison” in this context does not include the LIKE pattern-matching operator, for which trailing spaces are significant." "For those cases where trailing pad characters are stripped or comparisons ignore them, if a column has an index that requires unique values, inserting into the column values that differ only in number of trailing pad characters will result in a duplicate-key error. For example, if a table contains 'a', an attempt to store 'a ' causes a duplicate-key error." (taken from https://dev.mysql.com/doc/refman/5.5/en/char.html) I did not find any information whether the collations which MariaDB provides are of type PADSPACE as in MySQL (none of the collation tables from MariaDB's documentation does contain this information) or not. To make a long story short: I would like to have CHAR and VARCHAR fields with unique indexes which consider 'a' and 'a ' to be different values (which is the only reasonable point of view IMHO). This is not possible with MySQL. Is it possible with MariaDB? I am generally working with charset UTF8 and would be happy with a general solution, but also with a non-general solution which only works with UTF8. Since the index behaviour obviously depends on the collation, would building an own collation which does not PADSPACE be an option? I have read vague hints about that in MariaDB's documentation, but didn't try yet (before doing so, I would like to know if there is an easy, clean solution). Thank you very much for any ideas, Binarus
Am 11.03.2016 um 09:19 schrieb Binarus:
To make a long story short: I would like to have CHAR and VARCHAR fields with unique indexes which consider 'a' and 'a ' to be different values (which is the only reasonable point of view IMHO). This is not possible with MySQL. Is it possible with MariaDB?
since MariaDB is supposed to be a drop-in-replacement a different behavior would be a bug and break existing applications
On 11.03.2016 10:30, Reindl Harald wrote:
Am 11.03.2016 um 09:19 schrieb Binarus:
To make a long story short: I would like to have CHAR and VARCHAR fields with unique indexes which consider 'a' and 'a ' to be different values (which is the only reasonable point of view IMHO). This is not possible with MySQL. Is it possible with MariaDB?
since MariaDB is supposed to be a drop-in-replacement a different behavior would be a bug and break existing applications
Not necessarily IMHO. If you could configure it (i.e. if you had to turn on such behaviour explicitly) then it wouldn't be a bug. Regards, Binarus
Just a quick idea... Perhaps you can include an additional field in the unique index containing the length on the string. So Str, str_len 'a', 1 'a ', 2 Unq index on (str, str_len). Rhys -----Original Message----- From: Maria-discuss [mailto:maria-discuss-bounces+rhys.campbell=swisscom.com@lists.launchpad.net] On Behalf Of Binarus Sent: Friday, March 11, 2016 9:19 AM To: maria-discuss@lists.launchpad.net Subject: [Maria-discuss] Collations, trailing spaces and unique indexes Dear all, I already have done some MySQL development and now eventually would like to switch to MariaDB because I no longer can stand the following: "All MySQL collations are of type PADSPACE. This means that all CHAR, VARCHAR, and TEXT values in MySQL are compared without regard to any trailing spaces. “Comparison” in this context does not include the LIKE pattern-matching operator, for which trailing spaces are significant." "For those cases where trailing pad characters are stripped or comparisons ignore them, if a column has an index that requires unique values, inserting into the column values that differ only in number of trailing pad characters will result in a duplicate-key error. For example, if a table contains 'a', an attempt to store 'a ' causes a duplicate-key error." (taken from https://dev.mysql.com/doc/refman/5.5/en/char.html) I did not find any information whether the collations which MariaDB provides are of type PADSPACE as in MySQL (none of the collation tables from MariaDB's documentation does contain this information) or not. To make a long story short: I would like to have CHAR and VARCHAR fields with unique indexes which consider 'a' and 'a ' to be different values (which is the only reasonable point of view IMHO). This is not possible with MySQL. Is it possible with MariaDB? I am generally working with charset UTF8 and would be happy with a general solution, but also with a non-general solution which only works with UTF8. Since the index behaviour obviously depends on the collation, would building an own collation which does not PADSPACE be an option? I have read vague hints about that in MariaDB's documentation, but didn't try yet (before doing so, I would like to know if there is an easy, clean solution). Thank you very much for any ideas, Binarus _______________________________________________ Mailing list: https://launchpad.net/~maria-discuss Post to : maria-discuss@lists.launchpad.net Unsubscribe : https://launchpad.net/~maria-discuss More help : https://help.launchpad.net/ListHelp
On 11.03.2016 10:56, Rhys.Campbell@swisscom.com wrote:
Just a quick idea... Perhaps you can include an additional field in the unique index containing the length on the string. So
Str, str_len 'a', 1 'a ', 2
Unq index on (str, str_len).
Thanks for the suggestion. I already have read about that idea when researching the same problem for MySQL. I think I could try this as a last resort, but it would require changes in the existing applications, and -more difficult- I should find out the maximum index length in indexes which contain several columns. As far as I know, the maximum index length with InnoDB is 767 bytes, and I already need these because my columns are mostly VARCHAR(255), charset UTF8. Now, if I add one of those columns to a unique index, thereby using all available index bytes, could I add a second column to the same index at all? If I have to go this way as a last resort, I'll investigate ... Binarus
I think I could try this as a last resort, but it would require changes in the existing applications. Could do it with a trigger to avoid application changes although that's something I'd probably try to avoid. If you can afford it, then you could only index part of your VRAHCAR columns to make space for the extra field, i.e. CREATE INDEX part_of_name ON customer (name(200), str(200), str_length); -----Original Message----- From: Maria-discuss [mailto:maria-discuss-bounces+rhys.campbell=swisscom.com@lists.launchpad.net] On Behalf Of Binarus Sent: Friday, March 11, 2016 11:55 AM To: maria-discuss@lists.launchpad.net Subject: Re: [Maria-discuss] Collations, trailing spaces and unique indexes On 11.03.2016 10:56, Rhys.Campbell@swisscom.com wrote:
Just a quick idea... Perhaps you can include an additional field in the unique index containing the length on the string. So
Str, str_len 'a', 1 'a ', 2
Unq index on (str, str_len).
Thanks for the suggestion. I already have read about that idea when researching the same problem for MySQL. I think I could try this as a last resort, but it would require changes in the existing applications, and -more difficult- I should find out the maximum index length in indexes which contain several columns. As far as I know, the maximum index length with InnoDB is 767 bytes, and I already need these because my columns are mostly VARCHAR(255), charset UTF8. Now, if I add one of those columns to a unique index, thereby using all available index bytes, could I add a second column to the same index at all? If I have to go this way as a last resort, I'll investigate ... Binarus _______________________________________________ Mailing list: https://launchpad.net/~maria-discuss Post to : maria-discuss@lists.launchpad.net Unsubscribe : https://launchpad.net/~maria-discuss More help : https://help.launchpad.net/ListHelp
An idea that may work for you could be to TRIM() - http://dev.mysql.com/doc/refman/5.7/en/string-functions.html#function_trim - the original varchar column into a virtual (PERSISTENT) column, and add an index on this column? Peter On Fri, Mar 11, 2016 at 12:59 PM, <Rhys.Campbell@swisscom.com> wrote:
I think I could try this as a last resort, but it would require changes in the existing applications.
Could do it with a trigger to avoid application changes although that's something I'd probably try to avoid.
If you can afford it, then you could only index part of your VRAHCAR columns to make space for the extra field, i.e.
CREATE INDEX part_of_name ON customer (name(200), str(200), str_length);
-----Original Message----- From: Maria-discuss [mailto:maria-discuss-bounces+rhys.campbell= swisscom.com@lists.launchpad.net] On Behalf Of Binarus Sent: Friday, March 11, 2016 11:55 AM To: maria-discuss@lists.launchpad.net Subject: Re: [Maria-discuss] Collations, trailing spaces and unique indexes
On 11.03.2016 10:56, Rhys.Campbell@swisscom.com wrote:
Just a quick idea... Perhaps you can include an additional field in the unique index containing the length on the string. So
Str, str_len 'a', 1 'a ', 2
Unq index on (str, str_len).
Thanks for the suggestion. I already have read about that idea when researching the same problem for MySQL.
I think I could try this as a last resort, but it would require changes in the existing applications, and -more difficult- I should find out the maximum index length in indexes which contain several columns. As far as I know, the maximum index length with InnoDB is 767 bytes, and I already need these because my columns are mostly VARCHAR(255), charset UTF8. Now, if I add one of those columns to a unique index, thereby using all available index bytes, could I add a second column to the same index at all?
If I have to go this way as a last resort, I'll investigate ...
Binarus
_______________________________________________ Mailing list: https://launchpad.net/~maria-discuss Post to : maria-discuss@lists.launchpad.net Unsubscribe : https://launchpad.net/~maria-discuss More help : https://help.launchpad.net/ListHelp
_______________________________________________ Mailing list: https://launchpad.net/~maria-discuss Post to : maria-discuss@lists.launchpad.net Unsubscribe : https://launchpad.net/~maria-discuss More help : https://help.launchpad.net/ListHelp
On 11.03.2016 13:06, Peter Laursen wrote:
An idea that may work for you could be to TRIM() - http://dev.mysql.com/doc/refman/5.7/en/string-functions.html#function_trim - the original varchar column into a virtual (PERSISTENT) column, and add an index on this column?
I'm not sure if I understand you correctly, but using TRIM() on that column would make the value 'a' the same as the value 'a ' which is exactly how MariaDB's indexes already work by default and the thing which I would like to avoid. Have I misunderstood something? Regards, Binarus
On 11.03.2016 12:59, Rhys.Campbell@swisscom.com wrote:
I think I could try this as a last resort, but it would require changes in the existing applications.
Could do it with a trigger to avoid application changes although that's something I'd probably try to avoid.
Agreed. I want to avoid this, too.
If you can afford it, then you could only index part of your VRAHCAR columns to make space for the extra field, i.e.
CREATE INDEX part_of_name ON customer (name(200), str(200), str_length);
The problem with this special application is that a) the column is VARCHAR(255), b) there are rows where the column is filled completely, i.e. where 255 chars are in the column, c) that the index is a unique one, d) 254*'a' + ' ' must (index wise) be another value than 254*'a'. Furthermore, the changes in the applications which are required for that probably are substantial. I know that this might be a special requirement, but on the other hand please see my reply to Alexander Barkov's post where I (hopefully) can explain why ignoring these trailing spaces in unique indexes is a faulty design from the beginning on, may it be standard or not. So, even if I could circumvent the problem with this special application by some mad index constructions and programming techniques, I still would really like to have a clean solution. Alexander has made me hope that he or one of his fellows might be willing to implement this, and I will try to give him arguments for doing so. Therefore, I now have two options: 1) If it will be implemented in MariaDB, I'll switch our applications from MySQL to MariaDB. 2) If it will not be implemented, I will stick with MySQL and switch all respective rows in all tables from VARCHAR(255) to VARBINARY(765). Thank you very much again, Binarus
Le 11 Mar 2016 à 14:39, Binarus <lists@binarus.de> a écrit :
On 11.03.2016 12:59, Rhys.Campbell@swisscom.com wrote:
I think I could try this as a last resort, but it would require changes in the existing applications.
Could do it with a trigger to avoid application changes although that's something I'd probably try to avoid.
Agreed. I want to avoid this, too.
If you can afford it, then you could only index part of your VRAHCAR columns to make space for the extra field, i.e.
CREATE INDEX part_of_name ON customer (name(200), str(200), str_length);
The problem with this special application is that a) the column is VARCHAR(255), b) there are rows where the column is filled completely, i.e. where 255 chars are in the column, c) that the index is a unique one, d) 254*'a' + ' ' must (index wise) be another value than 254*'a'. Furthermore, the changes in the applications which are required for that probably are substantial.
I know that this might be a special requirement, but on the other hand please see my reply to Alexander Barkov's post where I (hopefully) can explain why ignoring these trailing spaces in unique indexes is a faulty design from the beginning on, may it be standard or not.
So, even if I could circumvent the problem with this special application by some mad index constructions and programming techniques, I still would really like to have a clean solution.
Alexander has made me hope that he or one of his fellows might be willing to implement this, and I will try to give him arguments for doing so.
Therefore, I now have two options:
1) If it will be implemented in MariaDB, I'll switch our applications from MySQL to MariaDB.
2) If it will not be implemented, I will stick with MySQL and switch all respective rows in all tables from VARCHAR(255) to VARBINARY(765). Hi Binarus,
I see 3 options here: 1) MariaDB captains can help you implementing this, we always welcome contributions and Alexander already point you where to start. 2) You really need this feature and can switch to MariaDB, please contact one of our sale reps to quote you some NRE work . 3) Captains fill this is a generic issue that need to be fixed for the benefit of all, but also prioritise it on other tasks they are working on. Tell us the time you are ready to wait, How much you can afford to help etc. ..? An other option is that you switch to MariaDB and use VarBinary, that would help our captains not to forget your request! Tx /svar
Thank you very much again,
Binarus
_______________________________________________ Mailing list: https://launchpad.net/~maria-discuss Post to : maria-discuss@lists.launchpad.net Unsubscribe : https://launchpad.net/~maria-discuss More help : https://help.launchpad.net/ListHelp
Hi Binarus,
I see 3 options here:
1) MariaDB captains can help you implementing this, we always welcome contributions and Alexander already point you where to start.
2) You really need this feature and can switch to MariaDB, please contact one of our sale reps to quote you some NRE work .
3) Captains fill this is a generic issue that need to be fixed for the benefit of all, but also prioritise it on other tasks they are working on.
Tell us the time you are ready to wait, How much you can afford to help etc. ..?
An other option is that you switch to MariaDB and use VarBinary, that would help our captains not to forget your request!
I'll try to take option 1) and have a look into the code over the weekend. I would be proud to make a small contribution to your project. But since I never have read any of the sources of MySQL or MariaDB until now, I am unsure if I will be able to do anything meaningful to the code. Furthermore, I (currently) don't know how to submit patches. But I'll find out ... Regards, Binarus
Binarus <lists@binarus.de> writes:
"All MySQL collations are of type PADSPACE. This means that all CHAR, VARCHAR, and TEXT values in MySQL are compared without regard to any trailing spaces. “Comparison” in this context does not include the
Yes, I have always found this terminally stupid as well. But I think it comes from the SQL standard. The only workaround I know of is to use VARBINARY instead of VARCHAR. I think it works much the same in most respects. But obviously some semantics is lost when the server no longer is aware of the character set used.
Since the index behaviour obviously depends on the collation, would building an own collation which does not PADSPACE be an option? I have
That would be interesting, actually. I don't know what support there is for non-PADSPACE collations. Maybe bar knows (Cc:'ed)? - Kristian.
Hello Binarus, Kristian, On 03/11/2016 02:25 PM, Kristian Nielsen wrote:
Binarus <lists@binarus.de> writes:
"All MySQL collations are of type PADSPACE. This means that all CHAR, VARCHAR, and TEXT values in MySQL are compared without regard to any trailing spaces. “Comparison” in this context does not include the
Yes, I have always found this terminally stupid as well. But I think it comes from the SQL standard.
Probably this comes from here:
2) A <standard character set name> specifies the name of a character set that is defined by a national or international standard. The character repertoire of CS is defined by the standard defining the character set identified by that <standard character set name>. The default collation of the character set is defined by the order of the characters in the standard and has the PAD SPACE characteristic.
So we follow the PAD SPACE requirement for default collations here. I guess the reasoning here is to treat CHAR and VARCHAR in a similar way by default. I agree with the standard here. We don't want different behavior for CHAR and VARCHAR. Btw, we don't follow the requirement that the default collation must be defined according to "the order of the characters in the standard". Default collations are traditionally case insensitive in MariaDB/MySQL.
The only workaround I know of is to use VARBINARY instead of VARCHAR. I think it works much the same in most respects. But obviously some semantics is lost when the server no longer is aware of the character set used.
Correct.
Since the index behaviour obviously depends on the collation, would building an own collation which does not PADSPACE be an option? I have
That would be interesting, actually. I don't know what support there is for non-PADSPACE collations. Maybe bar knows (Cc:'ed)?
We don't have NO PAD collations yet. I don't remember that anybody ever asked for this before.
From a glance, this should not be too hard to implement.
You previously wrote you need utf8. So there are three options to make a new collation, based on one of these existing collations: - utf8_general_ci - utf8_unicode_ci - utf8_bin but with NO PAD characteristics. Suppose you need utf8_general_nopad_ci (i.e. based on utf8_general_ci) utf8_general_ci is implemented in strings/ctype-utf8.c A new collation handler should be added, similar to this one: static MY_COLLATION_HANDLER my_collation_utf8_general_ci_handler = { NULL, /* init */ my_strnncoll_utf8_general_ci, my_strnncollsp_utf8_general_ci, my_strnxfrm_unicode, my_strnxfrmlen_unicode, my_like_range_mb, my_wildcmp_utf8, my_strcasecmp_utf8, my_instr_mb, my_hash_sort_utf8, my_propagate_complex }; but these three virtual functions must be redefined to new similar functions that do not ignore trailing spaces: - my_strnncollsp_utf8_general_ci - this is used for BTREE indexes - my_hash_sort_utf8 - this is used for HASH indexes - my_strnxfrm_unicode - this is used for filesort (non-indexed ORDER BY) All other functions can be reused from the existing PAD SPACE collation. Then a new "struct charset_info_st" should be defined, similar to my_charset_utf8_general_ci. Sounds like a few hours of work. But then thorough testing will be needed, which will be the most time consuming part. Also, for consistency, it's worthy to implement at least utf8_nopad_bin and utf8_unicode_nopad_ci at once, and then eventually NO PAD collations for all other character sets.
- Kristian.
FYI, I have added a new task for this: https://jira.mariadb.org/browse/MDEV-9711 On 03/11/2016 03:20 PM, Alexander Barkov wrote:
Hello Binarus, Kristian,
On 03/11/2016 02:25 PM, Kristian Nielsen wrote:
Binarus <lists@binarus.de> writes:
"All MySQL collations are of type PADSPACE. This means that all CHAR, VARCHAR, and TEXT values in MySQL are compared without regard to any trailing spaces. “Comparison” in this context does not include the
Yes, I have always found this terminally stupid as well. But I think it comes from the SQL standard.
Probably this comes from here:
2) A <standard character set name> specifies the name of a character set that is defined by a national or international standard. The character repertoire of CS is defined by the standard defining the character set identified by that <standard character set name>. The default collation of the character set is defined by the order of the characters in the standard and has the PAD SPACE characteristic.
So we follow the PAD SPACE requirement for default collations here.
I guess the reasoning here is to treat CHAR and VARCHAR in a similar way by default. I agree with the standard here. We don't want different behavior for CHAR and VARCHAR.
Btw, we don't follow the requirement that the default collation must be defined according to "the order of the characters in the standard". Default collations are traditionally case insensitive in MariaDB/MySQL.
The only workaround I know of is to use VARBINARY instead of VARCHAR. I think it works much the same in most respects. But obviously some semantics is lost when the server no longer is aware of the character set used.
Correct.
Since the index behaviour obviously depends on the collation, would building an own collation which does not PADSPACE be an option? I have
That would be interesting, actually. I don't know what support there is for non-PADSPACE collations. Maybe bar knows (Cc:'ed)?
We don't have NO PAD collations yet. I don't remember that anybody ever asked for this before.
From a glance, this should not be too hard to implement.
You previously wrote you need utf8. So there are three options to make a new collation, based on one of these existing collations:
- utf8_general_ci - utf8_unicode_ci - utf8_bin
but with NO PAD characteristics.
Suppose you need utf8_general_nopad_ci (i.e. based on utf8_general_ci)
utf8_general_ci is implemented in strings/ctype-utf8.c
A new collation handler should be added, similar to this one:
static MY_COLLATION_HANDLER my_collation_utf8_general_ci_handler = { NULL, /* init */ my_strnncoll_utf8_general_ci, my_strnncollsp_utf8_general_ci, my_strnxfrm_unicode, my_strnxfrmlen_unicode, my_like_range_mb, my_wildcmp_utf8, my_strcasecmp_utf8, my_instr_mb, my_hash_sort_utf8, my_propagate_complex };
but these three virtual functions must be redefined to new similar functions that do not ignore trailing spaces:
- my_strnncollsp_utf8_general_ci - this is used for BTREE indexes - my_hash_sort_utf8 - this is used for HASH indexes - my_strnxfrm_unicode - this is used for filesort (non-indexed ORDER BY)
All other functions can be reused from the existing PAD SPACE collation.
Then a new "struct charset_info_st" should be defined, similar to my_charset_utf8_general_ci.
Sounds like a few hours of work.
But then thorough testing will be needed, which will be the most time consuming part.
Also, for consistency, it's worthy to implement at least utf8_nopad_bin and utf8_unicode_nopad_ci at once, and then eventually NO PAD collations for all other character sets.
- Kristian.
On 11.03.2016 15:45, Alexander Barkov wrote:
FYI, I have added a new task for this:
Thank you very much! Exactly what I hoped ...
On 11.03.2016 15:45, Alexander Barkov wrote:
FYI, I have added a new task for this:
Alexander, not sure if my question makes any sense, but what about the _cs collations? Like latin1_general_cs, latin2_czech_cs? Don't we like to have these, too? Regards, Binarus
On 03/11/2016 03:20 PM, Alexander Barkov wrote:
Hello Binarus, Kristian,
On 03/11/2016 02:25 PM, Kristian Nielsen wrote:
Binarus <lists@binarus.de> writes:
"All MySQL collations are of type PADSPACE. This means that all CHAR, VARCHAR, and TEXT values in MySQL are compared without regard to any trailing spaces. “Comparison” in this context does not include the
Yes, I have always found this terminally stupid as well. But I think it comes from the SQL standard.
Probably this comes from here:
2) A <standard character set name> specifies the name of a character set that is defined by a national or international standard. The character repertoire of CS is defined by the standard defining the character set identified by that <standard character set name>. The default collation of the character set is defined by the order of the characters in the standard and has the PAD SPACE characteristic.
So we follow the PAD SPACE requirement for default collations here.
I guess the reasoning here is to treat CHAR and VARCHAR in a similar way by default. I agree with the standard here. We don't want different behavior for CHAR and VARCHAR.
Btw, we don't follow the requirement that the default collation must be defined according to "the order of the characters in the standard". Default collations are traditionally case insensitive in MariaDB/MySQL.
The only workaround I know of is to use VARBINARY instead of VARCHAR. I think it works much the same in most respects. But obviously some semantics is lost when the server no longer is aware of the character set used.
Correct.
Since the index behaviour obviously depends on the collation, would building an own collation which does not PADSPACE be an option? I have
That would be interesting, actually. I don't know what support there is for non-PADSPACE collations. Maybe bar knows (Cc:'ed)?
We don't have NO PAD collations yet. I don't remember that anybody ever asked for this before.
From a glance, this should not be too hard to implement.
You previously wrote you need utf8. So there are three options to make a new collation, based on one of these existing collations:
- utf8_general_ci - utf8_unicode_ci - utf8_bin
but with NO PAD characteristics.
Suppose you need utf8_general_nopad_ci (i.e. based on utf8_general_ci)
utf8_general_ci is implemented in strings/ctype-utf8.c
A new collation handler should be added, similar to this one:
static MY_COLLATION_HANDLER my_collation_utf8_general_ci_handler = { NULL, /* init */ my_strnncoll_utf8_general_ci, my_strnncollsp_utf8_general_ci, my_strnxfrm_unicode, my_strnxfrmlen_unicode, my_like_range_mb, my_wildcmp_utf8, my_strcasecmp_utf8, my_instr_mb, my_hash_sort_utf8, my_propagate_complex };
but these three virtual functions must be redefined to new similar functions that do not ignore trailing spaces:
- my_strnncollsp_utf8_general_ci - this is used for BTREE indexes - my_hash_sort_utf8 - this is used for HASH indexes - my_strnxfrm_unicode - this is used for filesort (non-indexed ORDER BY)
All other functions can be reused from the existing PAD SPACE collation.
Then a new "struct charset_info_st" should be defined, similar to my_charset_utf8_general_ci.
Sounds like a few hours of work.
But then thorough testing will be needed, which will be the most time consuming part.
Also, for consistency, it's worthy to implement at least utf8_nopad_bin and utf8_unicode_nopad_ci at once, and then eventually NO PAD collations for all other character sets.
- Kristian.
Hi Binarus, On 03/11/2016 09:25 PM, Binarus wrote:
On 11.03.2016 15:45, Alexander Barkov wrote:
FYI, I have added a new task for this:
Alexander, not sure if my question makes any sense, but what about the _cs collations? Like latin1_general_cs, latin2_czech_cs? Don't we like to have these, too?
We do, eventually. I just wanted to split the task into these steps: 1. Default and _bin collations. This is what MDEV-9711 is about. 2. Unicode Collation Algorithm (UCA) based collations (e.g. utf8_unicode_nopad_ci) 3. The rest (not covered by the previous steps) This will include latin1_general_nopad_cs, latin2_czech_nopad_cs This is why asked in the previous letter if utf8_unicode_nopad_ci would work for you. Also, which MariaDB version do you need this for? Will 10.1 work?
Regards,
Binarus
On 03/11/2016 03:20 PM, Alexander Barkov wrote:
Hello Binarus, Kristian,
On 03/11/2016 02:25 PM, Kristian Nielsen wrote:
Binarus <lists@binarus.de> writes:
"All MySQL collations are of type PADSPACE. This means that all CHAR, VARCHAR, and TEXT values in MySQL are compared without regard to any trailing spaces. “Comparison” in this context does not include the
Yes, I have always found this terminally stupid as well. But I think it comes from the SQL standard.
Probably this comes from here:
2) A <standard character set name> specifies the name of a character set that is defined by a national or international standard. The character repertoire of CS is defined by the standard defining the character set identified by that <standard character set name>. The default collation of the character set is defined by the order of the characters in the standard and has the PAD SPACE characteristic.
So we follow the PAD SPACE requirement for default collations here.
I guess the reasoning here is to treat CHAR and VARCHAR in a similar way by default. I agree with the standard here. We don't want different behavior for CHAR and VARCHAR.
Btw, we don't follow the requirement that the default collation must be defined according to "the order of the characters in the standard". Default collations are traditionally case insensitive in MariaDB/MySQL.
The only workaround I know of is to use VARBINARY instead of VARCHAR. I think it works much the same in most respects. But obviously some semantics is lost when the server no longer is aware of the character set used.
Correct.
Since the index behaviour obviously depends on the collation, would building an own collation which does not PADSPACE be an option? I have
That would be interesting, actually. I don't know what support there is for non-PADSPACE collations. Maybe bar knows (Cc:'ed)?
We don't have NO PAD collations yet. I don't remember that anybody ever asked for this before.
From a glance, this should not be too hard to implement.
You previously wrote you need utf8. So there are three options to make a new collation, based on one of these existing collations:
- utf8_general_ci - utf8_unicode_ci - utf8_bin
but with NO PAD characteristics.
Suppose you need utf8_general_nopad_ci (i.e. based on utf8_general_ci)
utf8_general_ci is implemented in strings/ctype-utf8.c
A new collation handler should be added, similar to this one:
static MY_COLLATION_HANDLER my_collation_utf8_general_ci_handler = { NULL, /* init */ my_strnncoll_utf8_general_ci, my_strnncollsp_utf8_general_ci, my_strnxfrm_unicode, my_strnxfrmlen_unicode, my_like_range_mb, my_wildcmp_utf8, my_strcasecmp_utf8, my_instr_mb, my_hash_sort_utf8, my_propagate_complex };
but these three virtual functions must be redefined to new similar functions that do not ignore trailing spaces:
- my_strnncollsp_utf8_general_ci - this is used for BTREE indexes - my_hash_sort_utf8 - this is used for HASH indexes - my_strnxfrm_unicode - this is used for filesort (non-indexed ORDER BY)
All other functions can be reused from the existing PAD SPACE collation.
Then a new "struct charset_info_st" should be defined, similar to my_charset_utf8_general_ci.
Sounds like a few hours of work.
But then thorough testing will be needed, which will be the most time consuming part.
Also, for consistency, it's worthy to implement at least utf8_nopad_bin and utf8_unicode_nopad_ci at once, and then eventually NO PAD collations for all other character sets.
- Kristian.
_______________________________________________ Mailing list: https://launchpad.net/~maria-discuss Post to : maria-discuss@lists.launchpad.net Unsubscribe : https://launchpad.net/~maria-discuss More help : https://help.launchpad.net/ListHelp
I just wanted to split the task into these steps:
1. Default and _bin collations. This is what MDEV-9711 is about.
2. Unicode Collation Algorithm (UCA) based collations (e.g. utf8_unicode_nopad_ci)
3. The rest (not covered by the previous steps) This will include latin1_general_nopad_cs, latin2_czech_nopad_cs
This is why asked in the previous letter if utf8_unicode_nopad_ci would work for you. Also, which MariaDB version do you need this for? Will 10.1 work?
Thank you very much for this explanation. Regarding the version: Our O/S usually is Debian which generally uses software which is quite old. This means that I will have to compile MariaDB myself in every case. The current project where this new collations are needed will probably take about three months before being finished. It would be very nice if we could have the new collations until then. In the meantime, we'll continue development with MySQL and switch to MariaDB as soon as the new collations are available. _ci collations would be sufficient as a starting point. Nevertheless, I think there must be a _nopad_ collation for every collation which already exists. Personally, since we are using utf8 (and no other encoding) in every part of every project, I would prefer to have *all utf8* _nopad_ collations before having any other _nopad_ collation. But that's just my personal opinion. We will be grateful for any progress regardless of the order of things being implemented. Binarus
Thank you very much for this explanation.
Regarding the version: Our O/S usually is Debian which generally uses software which is quite old. This means that I will have to compile MariaDB myself in every case. The current project where this new collations are needed will probably take about three months before being finished. It would be very nice if we could have the new collations until then. In the meantime, we'll continue development with MySQL and switch to MariaDB as soon as the new collations are available.
_ci collations would be sufficient as a starting point. Nevertheless, I think there must be a _nopad_ collation for every collation which already exists. Personally, since we are using utf8 (and no other encoding) in every part of every project, I would prefer to have *all utf8* _nopad_ collations before having any other _nopad_ collation. But that's just my personal opinion. We will be grateful for any progress regardless of the order of things being implemented.
I just noticed that my MUA underlined tow lines of my text completely, and I would like to apologize for that. I really did not not want to shout ... The underlining obviously happens because I used some underscores in my text ...
On 11.03.2016 15:45, Alexander Barkov wrote:
FYI, I have added a new task for this:
Alexander, I couldn't resist taking a quick look into the sources. - I have found my_hash_sort_utf8 in strings/ctype-utf8.c and am convinced that the change is incredibly easy there. - I have found my_strnxfrm_unicode in the same file and will need more time to make my opinion of how difficult it will be (I don't know what a weight is, so I currently try to understand what the function does at all). - My main problem: I did not find my_strnncollsp_utf8_general_ci anywhere (nor in the same neither in any other file). Where is it? Furthermore, studying the code has led to some questions; for example, there already seems to be a #define which controls the padding-when-comparing mode, but only for the _cs collations? Should we continue our conversation on the developer mailing list? Regards, Binarus
Hi Binarus, On 03/11/2016 10:19 PM, Binarus wrote:
On 11.03.2016 15:45, Alexander Barkov wrote:
FYI, I have added a new task for this:
Alexander,
I couldn't resist taking a quick look into the sources.
- I have found my_hash_sort_utf8 in strings/ctype-utf8.c and am convinced that the change is incredibly easy there.
- I have found my_strnxfrm_unicode in the same file and will need more time to make my opinion of how difficult it will be (I don't know what a weight is, so I currently try to understand what the function does at all).
This function is used to create sort keys for non-indexed ORDER BY, for these cases: - ORDER BY on an expression - ORDER BY on a column that does not have an index The idea is exactly the same with the C function strxfrm. See "man strxfrm". The code implements non-indexed ORDER BY in filesort.cc as follows: 1. It calls *_strnxfrm_* functions for all records and converts CHAR/VARCHAR/TEXT values into their fixed length binary sortable keys. 2 . Then executes binary sorting on these keys. By the way, fixing this function might be tricky. Currently my_strnxfrm_unicode() pads the tail using weights of the SPACE character. The NO PAD version will need to pad the tail using a weight which is less than the weight of the smallest possible character. This should be easy for UCA bases collations (e.g. utf8_unicode_nopad_ci), because the smallest possible character in UCA collations is "U+0009 HORIZONTAL TABULATION", and its weight is 0x0201. So we can just pad the sort key using a smaller value 0x0200. But I'm not sure yet what to do with 8-bit collations, which usually use 0x00 as weight for the smallest character. So we don't have a smaller value. There are two options here: 1. Pad with 0x00. But this will mean that 'aaa<min>' and just 'aaa' will have unpredictable order when doing ORDER BY without an index (where <min> is the smallest possible character in the collation). As the smallest character in non-UCA collations is usually "U+0000 NULL", this will mean that 'aaa\0' and just 'aaa' will have unpredictable order. 2. Reserve extra bytes at the end of the key, to store the true length, so - 'aaa\0' will have the key '4141410004' - 'aaa' will have the key '4141410003', and therefore will always be sorted before 'aaa\0'. I'm inclined towards #2, to have consistent ORDER BY behavior with and without indexes.
- My main problem: I did not find my_strnncollsp_utf8_general_ci anywhere (nor in the same neither in any other file). Where is it?
The function name is just "my_strnncollsp_utf8".
Furthermore, studying the code has led to some questions; for example, there already seems to be a #define which controls the padding-when-comparing mode, but only for the _cs collations?
Can you please clarify which lines do you mean?
Should we continue our conversation on the developer mailing list?
Sure.
Regards,
Binarus
_______________________________________________ Mailing list: https://launchpad.net/~maria-discuss Post to : maria-discuss@lists.launchpad.net Unsubscribe : https://launchpad.net/~maria-discuss More help : https://help.launchpad.net/ListHelp
Hi, On 03/16/2016 09:59 AM, Alexander Barkov wrote: ...
- My main problem: I did not find my_strnncollsp_utf8_general_ci anywhere (nor in the same neither in any other file). Where is it?
The function name is just "my_strnncollsp_utf8".
A correction. In 10.0 and earlier versions the name was "my_strnncollsp_utf8". In 10.1 it was renamed to "my_strnncollsp_utf8_general_ci". It's defined in this piece of the code in ctype-utf8.c: #define MY_FUNCTION_NAME(x) my_ ## x ## _utf8_general_ci #define WEIGHT_ILSEQ(x) (0xFF0000 + (uchar) (x)) #define WEIGHT_MB1(x) my_weight_mb1_utf8_general_ci(x) #define WEIGHT_MB2(x,y) my_weight_mb2_utf8_general_ci(x,y) #define WEIGHT_MB3(x,y,z) my_weight_mb3_utf8_general_ci(x,y,z) #include "strcoll.ic" and the function body is defined in a shared file strcoll.ic, which is reused for many character sets to avoid duplicate code, here:
static int MY_FUNCTION_NAME(strnncollsp)(CHARSET_INFO *cs __attribute__((unused)), const uchar *a, size_t a_length, const uchar *b, size_t b_length, my_bool diff_if_only_endspace_difference __attribute__((unused))) {
Hi, You can add an md5 of the string and unique on that. You can do this with a generated column. Sent from my iPhone
On Mar 11, 2016, at 10:19 AM, Binarus <lists@binarus.de> wrote:
On 11.03.2016 15:45, Alexander Barkov wrote: FYI, I have added a new task for this:
Alexander,
I couldn't resist taking a quick look into the sources.
- I have found my_hash_sort_utf8 in strings/ctype-utf8.c and am convinced that the change is incredibly easy there.
- I have found my_strnxfrm_unicode in the same file and will need more time to make my opinion of how difficult it will be (I don't know what a weight is, so I currently try to understand what the function does at all).
- My main problem: I did not find my_strnncollsp_utf8_general_ci anywhere (nor in the same neither in any other file). Where is it?
Furthermore, studying the code has led to some questions; for example, there already seems to be a #define which controls the padding-when-comparing mode, but only for the _cs collations?
Should we continue our conversation on the developer mailing list?
Regards,
Binarus
_______________________________________________ Mailing list: https://launchpad.net/~maria-discuss Post to : maria-discuss@lists.launchpad.net Unsubscribe : https://launchpad.net/~maria-discuss More help : https://help.launchpad.net/ListHelp
Hello Kristian, hello Alexander, On 11.03.2016 12:20, Alexander Barkov wrote:
So we follow the PAD SPACE requirement for default collations here.
I think it's reasonable that the default collations are implemented according to the standard. Nevertheless, the standard enforces an exceptionally stupid policy here (IMHO), and therefore I think we are in urgent need of additional (non-default) collations which solve that problem (and what you have written below makes me hope ...).
The only workaround I know of is to use VARBINARY instead of VARCHAR. I think it works much the same in most respects. But obviously some semantics is lost when the server no longer is aware of the character set used.
Correct.
I'll go that way, sticking with MySQL, if the problem won't be solved by new collations in MariaDB.
Since the index behaviour obviously depends on the collation, would building an own collation which does not PADSPACE be an option? I have
That would be interesting, actually. I don't know what support there is for non-PADSPACE collations. Maybe bar knows (Cc:'ed)?
We don't have NO PAD collations yet.
Thanks for making this clear; I haven't found something about it in the documents. By the way, the document sections which deal with the char data types could eventually mention these pitfalls (like the MySQL documentation) :-)
I don't remember that anybody ever asked for this before.
I wonder why, given the dozens (or hundreds?) of threads and tutorials around the web which try to explain how to circumvent these issues (none of them provides a general solution besides switching to ...BINARY).
[...] Sounds like a few hours of work.
This is very good news. I will now try to give some arguments why this really should be done. 1) Avoid programming mistakes People might be programming for decades, but when not being database experts and thrown into a web development project, even those experienced programmers won't come to the idea that a database which returns trailing spaces in the result set of SELECT statements throws away these trailing spaces when building the index or when comparing (e.g. for evaluation the conditions of the WHERE CLAUSE). This is completely illogical and dangerous and makes debugging substantially more difficult. 2) Developers (and not DBMSs) should be in control of what happens to trailing spaces A DBMS must not silently change data, nor for comparing nor for other purposes. It is solely the decision of the application logic if trailing spaces are acceptable or desired in data values (possibly entered by users), but not the decision of the DBMS. Nothing else than the application logic has to decide what to do with trailing spaces (trim them, keep them, reject user input, return error messages etc.), and the DBMS has to store, retrieve, compare and index the results of the application logic as-is. Period. However, there is no argument against additional and optional functions or configuration flags of the DBMS which allow easy trimming or other (automatic and silent) transformation of the values and all sorts of weird compare modes, if these are optional. But in the first place, there must also be a mode / column type for every sort of data, notably for strings, in which data is stored, compared and indexed as-is. We already have the binary data types which in this respect behave like desired, but all string functions fail with binary data values or behave surprisingly / impractically, so nearly all string functions (which are very important in many applications) are lost for binary data type columns. 3) There are use cases for string data with trailing spaces I have seen some examples of trailing spaces, some of them useless, but some quite meaningful. In every case, the application logic relied on a string with trailing space being not equal to the same string without the trailing space. In some cases, the developers had not thought about the issues the trailing spaces could cause, i.e. their programs were faulty; in the other cases, they have put a lot of effort and used the weirdest tricks and index combinations to circumvent the problems. I can't remember a case where they have switched to ...BINARY data types. Two of the funniest cases (just because it's Friday - skip the next two sections if you don't want to laugh): Some application actually stored an internal status bit at the end of existing user input, probably because the programmers were in a hurry or had no administrative access to the MySQL server, i.e. they couldn't create a new status column in the respective table. Now how could you save a status bit within an existing string without making it visible to the users? The answer is of course: Append a space at the end of the string to set the status, remove it to clear the status. Of course, the respective column had to be UNIQUE in the sense that two identical user inputs should be allowed if the "status bit" differed ... Some other application generated session keys in string form and used a part of the ASCII character set for the keys (e.g. a-z, A-Z, 0-9, special chars like $% and so on, including the space). Of course, as chance would have it, there could be a trailing space at the end of the session key. I can't remember if this led to problems in this case, but theoretically, it surely could have. Unfortunately, my own use case is too difficult to explain here. To summarize, there are all sorts of serious pitfalls which might be very difficult to debug and very difficult to foresee if the DBMS ignores trailing spaces when storing, retrieving, comparing or indexing strings, so I would be very grateful, and I would immediately switch to MariaDB, if appropriate collations would be added there.
But then thorough testing will be needed, which will be the most time consuming part.
Probably, but I have a feeling that *not* including trailing spaces in comparisons / collations is much more difficult than including them. A trailing space is just a char as every other char, isn't it? So I suppose the structure of the respective functions for _nopad_ collations will be much less complicated than the structure of the existing functions. Given that, the testing eventually wouldn't consume too much time ...
Also, for consistency, it's worthy to implement at least utf8_nopad_bin and utf8_unicode_nopad_ci at once, and then eventually NO PAD collations for all other character sets.
Personally, I would be totally happy if I had the three new utf8 collations (I probably will never use anything else than utf8 for strings). Thank you very much! Binarus
Hi Binarus, Thanks for a detailed answer. It seems that NO PAD collations are going to be a very welcome feature. Please see my replies and questions below. On 03/11/2016 07:56 PM, Binarus wrote:
Hello Kristian, hello Alexander,
On 11.03.2016 12:20, Alexander Barkov wrote:
So we follow the PAD SPACE requirement for default collations here.
I think it's reasonable that the default collations are implemented according to the standard. Nevertheless, the standard enforces an exceptionally stupid policy here (IMHO),
Using PAD SPACE by default makes it safe to ALTER TABLE and switch: - from VARCHAR(N) to CHAR(M) - from CHAR(N) to VARCHAR(M) where M>=N. I think that was the main reason for make PAD SPACE by default. So I agree with the standard here.
and therefore I think we are in urgent need of additional (non-default) collations which solve that problem (and what you have written below makes me hope ...).
The only workaround I know of is to use VARBINARY instead of VARCHAR. I think it works much the same in most respects. But obviously some semantics is lost when the server no longer is aware of the character set used.
Correct.
I'll go that way, sticking with MySQL, if the problem won't be solved by new collations in MariaDB.
Since the index behaviour obviously depends on the collation, would building an own collation which does not PADSPACE be an option? I have
That would be interesting, actually. I don't know what support there is for non-PADSPACE collations. Maybe bar knows (Cc:'ed)?
We don't have NO PAD collations yet.
Thanks for making this clear; I haven't found something about it in the documents. By the way, the document sections which deal with the char data types could eventually mention these pitfalls (like the MySQL documentation) :-)
Sure, we'll mention this clearer when we add NO PAD collations.
I don't remember that anybody ever asked for this before.
I wonder why, given the dozens (or hundreds?) of threads and tutorials around the web which try to explain how to circumvent these issues (none of them provides a general solution besides switching to ...BINARY).
[...] Sounds like a few hours of work.
This is very good news. I will now try to give some arguments why this really should be done.
1) Avoid programming mistakes
People might be programming for decades, but when not being database experts and thrown into a web development project, even those experienced programmers won't come to the idea that a database which returns trailing spaces in the result set of SELECT statements throws away these trailing spaces when building the index or when comparing (e.g. for evaluation the conditions of the WHERE CLAUSE). This is completely illogical and dangerous and makes debugging substantially more difficult.
2) Developers (and not DBMSs) should be in control of what happens to trailing spaces
Exactly. The DBMS should supply a full variety of useful collations, so the users have enough flexibility to choose the one that suites best. So I like the idea of new NO PAD collations very much.
A DBMS must not silently change data, nor for comparing nor for other purposes. It is solely the decision of the application logic if trailing spaces are acceptable or desired in data values (possibly entered by users), but not the decision of the DBMS. Nothing else than the application logic has to decide what to do with trailing spaces (trim them, keep them, reject user input, return error messages etc.), and the DBMS has to store, retrieve, compare and index the results of the application logic as-is. Period.
However, there is no argument against additional and optional functions or configuration flags of the DBMS which allow easy trimming or other (automatic and silent) transformation of the values and all sorts of weird compare modes, if these are optional. But in the first place, there must also be a mode / column type for every sort of data, notably for strings, in which data is stored, compared and indexed as-is.
We already have the binary data types which in this respect behave like desired, but all string functions fail with binary data values or behave surprisingly / impractically, so nearly all string functions (which are very important in many applications) are lost for binary data type columns.
3) There are use cases for string data with trailing spaces
I have seen some examples of trailing spaces, some of them useless, but some quite meaningful. In every case, the application logic relied on a string with trailing space being not equal to the same string without the trailing space. In some cases, the developers had not thought about the issues the trailing spaces could cause, i.e. their programs were faulty; in the other cases, they have put a lot of effort and used the weirdest tricks and index combinations to circumvent the problems. I can't remember a case where they have switched to ...BINARY data types.
Two of the funniest cases (just because it's Friday - skip the next two sections if you don't want to laugh):
Some application actually stored an internal status bit at the end of existing user input, probably because the programmers were in a hurry or had no administrative access to the MySQL server, i.e. they couldn't create a new status column in the respective table. Now how could you save a status bit within an existing string without making it visible to the users? The answer is of course: Append a space at the end of the string to set the status, remove it to clear the status. Of course, the respective column had to be UNIQUE in the sense that two identical user inputs should be allowed if the "status bit" differed ...
Some other application generated session keys in string form and used a part of the ASCII character set for the keys (e.g. a-z, A-Z, 0-9, special chars like $% and so on, including the space). Of course, as chance would have it, there could be a trailing space at the end of the session key. I can't remember if this led to problems in this case, but theoretically, it surely could have.
Unfortunately, my own use case is too difficult to explain here.
Thanks for use examples. I liked the one about "status bit" very much :)
To summarize, there are all sorts of serious pitfalls which might be very difficult to debug and very difficult to foresee if the DBMS ignores trailing spaces when storing, retrieving, comparing or indexing strings, so I would be very grateful, and I would immediately switch to MariaDB, if appropriate collations would be added there.
But then thorough testing will be needed, which will be the most time consuming part.
Probably, but I have a feeling that *not* including trailing spaces in comparisons / collations is much more difficult than including them. A trailing space is just a char as every other char, isn't it? So I suppose the structure of the respective functions for _nopad_ collations will be much less complicated than the structure of the existing functions.
Yes. I think so. The idea is just to disable the code that ignores trailing spaces.
Given that, the testing eventually wouldn't consume too much time ...
We'll need to cover thoroughly many thing, including but not limited to: - DISTINCT - UNION - ORDER BY with indexes - ORDER BY without indexes - GROUP BY - Mixing NO PAD and PAD collations - Function output - GROUP_CONCAT(DISTINCT)
Also, for consistency, it's worthy to implement at least utf8_nopad_bin and utf8_unicode_nopad_ci at once, and then eventually NO PAD collations for all other character sets.
Personally, I would be totally happy if I had the three new utf8 collations (I probably will never use anything else than utf8 for strings).
We're interested in implementing NO PAD counterparts for the default and the _bin collations under terms of Google Summer Of Code 2016 (GSoC). By this will take a few months. I guess you need this more urgently. Question: would utf8_unicode_nopad_ci work for you for now? We could implement it separately, without having to wait until the end of GSoC.
Thank you very much!
Binarus
_______________________________________________ Mailing list: https://launchpad.net/~maria-discuss Post to : maria-discuss@lists.launchpad.net Unsubscribe : https://launchpad.net/~maria-discuss More help : https://help.launchpad.net/ListHelp
On 11/03/2016 17:56, Binarus wrote:
We don't have NO PAD collations yet. Thanks for making this clear; I haven't found something about it in the documents. By the way, the document sections which deal with the char data types could eventually mention these pitfalls (like the MySQL documentation) :-)
Thanks for this suggestion. I've started adding some details to the Knowledge Base. In general, if you come across areas of documentation that could be improved, you can let us know (there's also a docs mailing list), or feel free to make some changes yourself (the Knowledge base is a wiki, so anyone can edit).
This conversation made me curious. Which code points does a PADSPACE collation treat as trailing spaces? Would the imagined utf8_*_nopad_* collations treat all of these code points as significant? Tom On 3/11/16, 6:20 AM, "Maria-discuss on behalf of Alexander Barkov" <maria-discuss-bounces+fsb=thefsb.org@lists.launchpad.net on behalf of bar@mariadb.org> wrote:
Hello Binarus, Kristian,
On 03/11/2016 02:25 PM, Kristian Nielsen wrote:
Binarus <lists@binarus.de> writes:
"All MySQL collations are of type PADSPACE. This means that all CHAR, VARCHAR, and TEXT values in MySQL are compared without regard to any trailing spaces. “Comparison” in this context does not include the
Yes, I have always found this terminally stupid as well. But I think it comes from the SQL standard.
Probably this comes from here:
2) A <standard character set name> specifies the name of a character set that is defined by a national or international standard. The character repertoire of CS is defined by the standard defining the character set identified by that <standard character set name>. The default collation of the character set is defined by the order of the characters in the standard and has the PAD SPACE characteristic.
So we follow the PAD SPACE requirement for default collations here.
I guess the reasoning here is to treat CHAR and VARCHAR in a similar way by default. I agree with the standard here. We don't want different behavior for CHAR and VARCHAR.
Btw, we don't follow the requirement that the default collation must be defined according to "the order of the characters in the standard". Default collations are traditionally case insensitive in MariaDB/MySQL.
The only workaround I know of is to use VARBINARY instead of VARCHAR. I think it works much the same in most respects. But obviously some semantics is lost when the server no longer is aware of the character set used.
Correct.
Since the index behaviour obviously depends on the collation, would building an own collation which does not PADSPACE be an option? I have
That would be interesting, actually. I don't know what support there is for non-PADSPACE collations. Maybe bar knows (Cc:'ed)?
We don't have NO PAD collations yet. I don't remember that anybody ever asked for this before.
From a glance, this should not be too hard to implement.
You previously wrote you need utf8. So there are three options to make a new collation, based on one of these existing collations:
- utf8_general_ci - utf8_unicode_ci - utf8_bin
but with NO PAD characteristics.
Suppose you need utf8_general_nopad_ci (i.e. based on utf8_general_ci)
utf8_general_ci is implemented in strings/ctype-utf8.c
A new collation handler should be added, similar to this one:
static MY_COLLATION_HANDLER my_collation_utf8_general_ci_handler = { NULL, /* init */ my_strnncoll_utf8_general_ci, my_strnncollsp_utf8_general_ci, my_strnxfrm_unicode, my_strnxfrmlen_unicode, my_like_range_mb, my_wildcmp_utf8, my_strcasecmp_utf8, my_instr_mb, my_hash_sort_utf8, my_propagate_complex };
but these three virtual functions must be redefined to new similar functions that do not ignore trailing spaces:
- my_strnncollsp_utf8_general_ci - this is used for BTREE indexes - my_hash_sort_utf8 - this is used for HASH indexes - my_strnxfrm_unicode - this is used for filesort (non-indexed ORDER BY)
All other functions can be reused from the existing PAD SPACE collation.
Then a new "struct charset_info_st" should be defined, similar to my_charset_utf8_general_ci.
Sounds like a few hours of work.
But then thorough testing will be needed, which will be the most time consuming part.
Also, for consistency, it's worthy to implement at least utf8_nopad_bin and utf8_unicode_nopad_ci at once, and then eventually NO PAD collations for all other character sets.
- Kristian.
_______________________________________________ Mailing list: https://launchpad.net/~maria-discuss Post to : maria-discuss@lists.launchpad.net Unsubscribe : https://launchpad.net/~maria-discuss More help : https://help.launchpad.net/ListHelp
Hi Tom, On 03/18/2016 07:30 PM, Tom Worster wrote:
This conversation made me curious.
Which code points does a PADSPACE collation treat as trailing spaces?
Usually, just U+0020 SPACE. But this actually depends on an exact collation. This one treats U+00A0 NO-BREAK SPACE as equal to U+0020: http://collation-charts.org/mysql60/mysql604.cp1250_general_ci.html and thus ignores both trailing regular spaces and no-break spaces.
Would the imagined utf8_*_nopad_* collations treat all of these code points as significant?
Yes, this is the idea.
Tom
On 3/11/16, 6:20 AM, "Maria-discuss on behalf of Alexander Barkov" <maria-discuss-bounces+fsb=thefsb.org@lists.launchpad.net on behalf of bar@mariadb.org> wrote:
Hello Binarus, Kristian,
On 03/11/2016 02:25 PM, Kristian Nielsen wrote:
Binarus <lists@binarus.de> writes:
"All MySQL collations are of type PADSPACE. This means that all CHAR, VARCHAR, and TEXT values in MySQL are compared without regard to any trailing spaces. “Comparison” in this context does not include the
Yes, I have always found this terminally stupid as well. But I think it comes from the SQL standard.
Probably this comes from here:
2) A <standard character set name> specifies the name of a character set that is defined by a national or international standard. The character repertoire of CS is defined by the standard defining the character set identified by that <standard character set name>. The default collation of the character set is defined by the order of the characters in the standard and has the PAD SPACE characteristic.
So we follow the PAD SPACE requirement for default collations here.
I guess the reasoning here is to treat CHAR and VARCHAR in a similar way by default. I agree with the standard here. We don't want different behavior for CHAR and VARCHAR.
Btw, we don't follow the requirement that the default collation must be defined according to "the order of the characters in the standard". Default collations are traditionally case insensitive in MariaDB/MySQL.
The only workaround I know of is to use VARBINARY instead of VARCHAR. I think it works much the same in most respects. But obviously some semantics is lost when the server no longer is aware of the character set used.
Correct.
Since the index behaviour obviously depends on the collation, would building an own collation which does not PADSPACE be an option? I have
That would be interesting, actually. I don't know what support there is for non-PADSPACE collations. Maybe bar knows (Cc:'ed)?
We don't have NO PAD collations yet. I don't remember that anybody ever asked for this before.
From a glance, this should not be too hard to implement.
You previously wrote you need utf8. So there are three options to make a new collation, based on one of these existing collations:
- utf8_general_ci - utf8_unicode_ci - utf8_bin
but with NO PAD characteristics.
Suppose you need utf8_general_nopad_ci (i.e. based on utf8_general_ci)
utf8_general_ci is implemented in strings/ctype-utf8.c
A new collation handler should be added, similar to this one:
static MY_COLLATION_HANDLER my_collation_utf8_general_ci_handler = { NULL, /* init */ my_strnncoll_utf8_general_ci, my_strnncollsp_utf8_general_ci, my_strnxfrm_unicode, my_strnxfrmlen_unicode, my_like_range_mb, my_wildcmp_utf8, my_strcasecmp_utf8, my_instr_mb, my_hash_sort_utf8, my_propagate_complex };
but these three virtual functions must be redefined to new similar functions that do not ignore trailing spaces:
- my_strnncollsp_utf8_general_ci - this is used for BTREE indexes - my_hash_sort_utf8 - this is used for HASH indexes - my_strnxfrm_unicode - this is used for filesort (non-indexed ORDER BY)
All other functions can be reused from the existing PAD SPACE collation.
Then a new "struct charset_info_st" should be defined, similar to my_charset_utf8_general_ci.
Sounds like a few hours of work.
But then thorough testing will be needed, which will be the most time consuming part.
Also, for consistency, it's worthy to implement at least utf8_nopad_bin and utf8_unicode_nopad_ci at once, and then eventually NO PAD collations for all other character sets.
- Kristian.
_______________________________________________ Mailing list: https://launchpad.net/~maria-discuss Post to : maria-discuss@lists.launchpad.net Unsubscribe : https://launchpad.net/~maria-discuss More help : https://help.launchpad.net/ListHelp
participants (10)
-
Alexander Barkov
-
Binarus
-
Ian Gilfillan
-
Justin Swanhart
-
Kristian Nielsen
-
Peter Laursen
-
Reindl Harald
-
Rhys.Campbell@swisscom.com
-
Stephane VAROQUI
-
Tom Worster