Re: [Maria-developers] [Maria-discuss] Collations, trailing spaces and unique indexes

16 Mar 2016

      Alexander,

I'd like to continue our conversation from Maria-discuss.
...
...
- I have found my_strnxfrm_unicode in the same file and will need more time to make my opinion of how difficult it will be
(I don't know what a weight is, so I currently try to understand what the function does at all).
This function is used to create sort keys for non-indexed ORDER BY,
for these cases:
- ORDER BY on an expression
- ORDER BY on a column that does not have an index
The idea is exactly the same with the C function strxfrm.
See "man strxfrm".
The code implements non-indexed ORDER BY in filesort.cc
as follows:
1. It calls *_strnxfrm_* functions for all records and converts
CHAR/VARCHAR/TEXT values into their fixed length binary sortable keys.
2 . Then executes binary sorting on these keys.
By the way, fixing this function might be tricky.
Currently my_strnxfrm_unicode() pads the tail using weights of the SPACE
character.
The NO PAD version will need to pad the tail using a weight which is
less than the weight of the smallest possible character.
This should be easy for UCA bases collations (e.g.
utf8_unicode_nopad_ci), because the smallest possible
character in UCA collations is "U+0009 HORIZONTAL TABULATION",
and its weight is 0x0201. So we can just pad the sort key
using a smaller value 0x0200.
But I'm not sure yet what to do with 8-bit collations,
which usually use 0x00 as weight for the smallest character.
So we don't have a smaller value.
There are two options here:
1. Pad with 0x00. But this will mean that 'aaa<min>' and just 'aaa'
will  have unpredictable order when doing ORDER BY without an index
(where <min> is the smallest possible character in the collation).
As the smallest character in non-UCA collations is usually
"U+0000 NULL", this will mean that 'aaa\0' and just 'aaa'
will have unpredictable order.
2. Reserve extra bytes at the end of the key, to store the true length, so
- 'aaa\0' will have the key '4141410004'
- 'aaa'   will have the key '4141410003', and therefore will always
   be sorted before 'aaa\0'.
I'm inclined towards #2, to have consistent ORDER BY behavior
with and without indexes.
Wow, thanks for explaining. I now have learned that I probably can't assist you in solving that problem (I haven't been in unicode, weights, sorting and character sets very deep until now).

But if it helps, you eventually could break the problem into two steps:

1) Fix the problem with unique indexes

2) Fix ordering for any possible situation

The problems we have encountered in real life so far are all related to 1), i.e. we would be happy if unique indexes would consider 'a' and 'a ' being different values. In contrast, ordering is not that important for us. I don't know what others are thinking about that, though.
...
...
- My main problem: I did not find my_strnncollsp_utf8_general_ci anywhere (nor in the same neither in any other file). Where is it?
The function name is just "my_strnncollsp_utf8".
OK, thank you very much. I must have misunderstood something.
...
...
Furthermore, studying the code has led to some questions; for example, there already seems to be a #define which controls the padding-when-comparing mode, but only for the _cs collations?
Can you please clarify which lines do you mean?
I am referring to the file ctype-utf8.c, line 5698. There is:

#ifndef VARCHAR_WITH_DIFF_ENDSPACE_ARE_DIFFERENT_FOR_UNIQUE
  diff_if_only_endspace_difference= 0;
#endif

I haven't been able yet to thoroughly analyze where and how this is used, but at least somebody obviously already had some thoughts whether or not trailing spaces should make a difference in UNIQUEs :-)

Regards,

Binarus