Hi Binarus,

On 03/11/2016 10:19 PM, Binarus wrote:
On 11.03.2016 15:45, Alexander Barkov wrote:
FYI, I have added a new task for this:
Alexander,
I couldn't resist taking a quick look into the sources.
- I have found my_hash_sort_utf8 in strings/ctype-utf8.c and am convinced that the change is incredibly easy there.
- I have found my_strnxfrm_unicode in the same file and will need more time to form an opinion on how difficult it will be (I don't know what a weight is, so I am currently trying to understand what the function does at all).
This function is used to create sort keys for non-indexed ORDER BY, in these cases:
- ORDER BY on an expression
- ORDER BY on a column that does not have an index

The idea is exactly the same as with the C function strxfrm. See "man strxfrm".

The code implements non-indexed ORDER BY in filesort.cc as follows:
1. It calls *_strnxfrm_* functions for all records and converts CHAR/VARCHAR/TEXT values into their fixed-length binary sortable keys.
2. It then executes binary sorting on these keys.

By the way, fixing this function might be tricky. Currently my_strnxfrm_unicode() pads the tail using weights of the SPACE character. The NO PAD version will need to pad the tail using a weight which is less than the weight of the smallest possible character.

This should be easy for UCA-based collations (e.g. utf8_unicode_nopad_ci), because the smallest possible character in UCA collations is "U+0009 HORIZONTAL TABULATION", and its weight is 0x0201. So we can just pad the sort key using a smaller value, 0x0200.

But I'm not sure yet what to do with 8-bit collations, which usually use 0x00 as the weight for the smallest character. So we don't have a smaller value. There are two options here:

1. Pad with 0x00. But this will mean that 'aaa<min>' and just 'aaa' will have unpredictable order when doing ORDER BY without an index (where <min> is the smallest possible character in the collation). As the smallest character in non-UCA collations is usually "U+0000 NULL", this will mean that 'aaa\0' and just 'aaa' will have unpredictable order.

2. Reserve extra bytes at the end of the key to store the true length, so
   - 'aaa\0' will have the key '4141410004'
   - 'aaa' will have the key '4141410003', and therefore will always be sorted before 'aaa\0'.

I'm inclined towards #2, to have consistent ORDER BY behavior with and without indexes.
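To make option #2 concrete, here is a minimal sketch in C. It is not the MariaDB code: the helper make_nopad_key(), the constants KEY_CHARS and KEY_BYTES, and the identity weight table (weight == byte value) are all assumptions made purely for illustration. It builds a fixed-length key padded with 0x00 and appends the true length as a trailing byte, so that a plain memcmp() orders 'aaa' before 'aaa\0'.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical fixed key width: 4 weight bytes plus 1 trailing length byte. */
#define KEY_CHARS 4
#define KEY_BYTES (KEY_CHARS + 1)

/*
 * Hypothetical helper, not a MariaDB function.
 * Assumes an 8-bit collation whose weight for each byte is the byte itself,
 * so the smallest weight is 0x00 and nothing smaller exists for padding.
 */
static void make_nopad_key(const unsigned char *src, size_t len,
                           unsigned char *key)
{
    if (len > KEY_CHARS)
        len = KEY_CHARS;                      /* truncate to the key width */
    memcpy(key, src, len);                    /* weights of the real characters */
    memset(key + len, 0x00, KEY_CHARS - len); /* pad the tail with 0x00 */
    key[KEY_CHARS] = (unsigned char) len;     /* true length breaks ties */
}

/* Binary comparison of two keys, as a sort on the keys would do it. */
static int compare_keys(const unsigned char *a, const unsigned char *b)
{
    return memcmp(a, b, KEY_BYTES);
}
```

With these assumptions, the key for 'aaa' is 41 41 41 00 03 and the key for 'aaa\0' is 41 41 41 00 04. The first four bytes are identical, which is exactly the ambiguity of option #1; only the trailing length byte makes the order deterministic.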
- My main problem: I did not find my_strnncollsp_utf8_general_ci anywhere (neither in the same file nor in any other). Where is it?
The function name is just "my_strnncollsp_utf8".
Furthermore, studying the code has led to some questions; for example, there already seems to be a #define which controls the padding-when-comparing mode, but only for the _cs collations?
Can you please clarify which lines you mean?
Should we continue our conversation on the developer mailing list?
Sure.
Regards,
Binarus
_______________________________________________
Mailing list: https://launchpad.net/~maria-discuss
Post to     : maria-discuss@lists.launchpad.net
Unsubscribe : https://launchpad.net/~maria-discuss
More help   : https://help.launchpad.net/ListHelp