Re: [Maria-developers] 48261114b9d: MDEV-26572 Improve simple multibyte collation performance on the ASCII range
Hi, Alexander! please add more benchmark results to the MDEV, all the variants that you've tried and speedups (or slowdowns) for each. otherwise, basically, no comments, see below. the only real concern is to know exactly when it helps and when it doesn't. On Sep 11, Alexander Barkov wrote:
revision-id: 48261114b9d (mariadb-10.6.1-77-g48261114b9d) parent(s): 76149650764 author: Alexander Barkov committer: Alexander Barkov timestamp: 2021-09-09 15:20:57 +0400 message:
MDEV-26572 Improve simple multibyte collation performance on the ASCII range
diff --git a/strings/ctype-ascii.h b/strings/ctype-ascii.h new file mode 100644 index 00000000000..66c9cfb3c6d --- /dev/null +++ b/strings/ctype-ascii.h @@ -0,0 +1,189 @@ +#ifndef CTYPE_ASCII_H_INCLUDED +#define CTYPE_ASCII_H_INCLUDED
The convention is CTYPE_ASCII_INCLUDED (every file has .h, so it's omitted from the include guard)
+ +#include "myisampack.h" + +/* + Magic expression. It uses the fact that for any byte value X in + the range 0..31 (0x00..0x1F) the expression (X+31)*5 returns + the 7th bit (0x80) set only for the following six (out of 32) values: + 0x00, 0x1B, 0x1C, 0x1D, 0x1E, 0x1F. + These values correspond to offsets of non-letter characters + in the ASCII table: + + The following macro sets the bit 0x20 for the following characters: + ---------------- -------------------------------- + Magic bit 10000000000000000000000000011111 + ASCII 0x00..0x1F ................................ Control + ASCII 0x20..0x3F ................................ Punctuation, digits + ASCII 0x40..0x5F @ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_ + ASCII 0x60..0x7F `abcdefghijklmnopqrstuvwxyz{|}~. + ---------------- -------------------------------- + We shift the magic bit 0x80 right twice to make it 0x20. + So on the ranges [40..5F] and [60..7F] the expression + has the bit 0x20 set for all non-letter characters. + Note, other bits contain garbage. + + Requirements: + All bytes must be in the range [00..7F], + to avoid overflow and carry to the next byte. +*/ +#define MY_ASCII_20_IS_SET_IF_NOT_LETTER_MAGIC(i) \ + (((((i)+0x1F1F1F1F1F1F1F1FULL) & 0x1F1F1F1F1F1F1F1F) * 5) >> 2) + + +/* + The following macro returns the bit 0x20 set to: + - 1 for input bytes in the ranges [60..7F] or [E0..FF] + - 0 otherwise + Bytes in the ranges [40..7F] and [C0..FF] have the bit 0x40 set. + Bytes in the ranges [60..7F] smd [E0..FF] have the bit 0x20 set.
s/smd/and/
+ Hex BinHi BinLo + ---- -1-- ---- + 0x[4C]X .10. .... + 0x[5D]X .10. .... + 0x[6E]X .11. .... + 0x[7F]X .11. .... +*/ +#define MY_ASCII_20_IS_SET_IF_RANGE_60_7F_OR_E0_FF(i) (((i) >> 1) & ((i))) + + +/* + The following macro evaluates to exactly 0x20 for all + lower case ASCII letters [a-z], and to 0x00 otherwise: + + Value Range Character range Subrange + -------- -------- -------------------------------- ------- + 00000000 0x00..0x3F Control, punctuation, digits + 00100000 0x40..0x5F @ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_ letters A-Z + 00000000 0x40..0x5F @ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_ non-letters + 00100000 0x60..0x7F `abcdefghijklmnopqrstuvwxyz{|}~. letters a-z + 00000000 0x60..0x7F `abcdefghijklmnopqrstuvwxyz{|}~. non-letters + + Requirements: + All bytes must be in the range [00..7F]. + See the comments in MY_ASCII_20_IS_SET_IF_NOT_LETTER_MAGIC(). +*/ + +#define MY_ASCII_20_IF_IS_LOWER_LETTER(i) \ + (MY_ASCII_20_IS_SET_IF_RANGE_60_7F_OR_E0_FF(i) & \ + ~MY_ASCII_20_IS_SET_IF_NOT_LETTER_MAGIC(i) & \ + 0x2020202020202020) + +/* + Convert lower case ASCII letters to upper case by unsetting + the bit 0x20 with help of the magic expression. + + Requirements: + All bytes must be in the range [00..7F]. + See the comments in MY_ASCII_20_IS_SET_IF_NOT_LETTER_MAGIC() +*/ +#define MY_ASCII_TOUPPER_MAGIC(i) \ + (i ^ MY_ASCII_20_IF_IS_LOWER_LETTER(i)) + + +/* + Convert a string (consisting of 8 bytes stored in uint64) + to upper case algorithmically. + + Requirements: + All bytes must be in the range [00..0x7F]. + See the comments in MY_ASCII_20_IS_SET_IF_NOT_LETTER_MAGIC(). + The result on 8bit data is unpredictable!!! + The caller should make sure not to pass 8bit data. +*/ +static inline ulonglong my_ascii_to_upper_magic_uint64(ulonglong i) +{ + return MY_ASCII_TOUPPER_MAGIC(i); +} + + +/* + Check if: + - both strings "a" and "b" have at least 4 bytes, and + - both strings have only 7bit data. +*/ +static inline int +my_strcoll_ascii_4bytes_found(const uchar *a, const uchar *ae, + const uchar *b, const uchar *be) +{ + return a + 4 <= ae && b + 4 <= be && + (uint4korr(b) & 0x80808080) == 0 && + (uint4korr(a) & 0x80808080) == 0; +} + + +/* + Compare the leading four 7bit ASCII bytes in two strings case insensitively + by converting letters [a-z] to upper case [A-Z]. + + Requirements: + - The input strings must have at least four bytes, and + - The leading four bytes in both strings must be 7bit ASCII. + The caller must make sure to provide only strings that meet + these requirements. The result on 8-bit data is unpredictable + as 8-bit bytes may cause overflow in my_ascii_to_upper_magic_uint64(). + See comments above. +*/ +static inline int +my_strcoll_ascii_toupper_4bytes(const uchar *a, const uchar *b) +{ + ulonglong abn= (((ulonglong) mi_uint4korr(a)) << 32) | mi_uint4korr(b); + abn= my_ascii_to_upper_magic_uint64(abn); + if ((uint) (abn >> 32) == (uint32) abn)
s/uint/uint32/ ?
+ return 0; + return ((uint32) (abn >> 32)) < ((uint32) abn) ? -1 : + 1; +} + + +/* + Compare the leading eight 7bit ASCII bytes in two strings case insensitively + by converting letters [a-z] to upper case [A-Z]. + + Requirements: + - The input strings must have at least eight bytes, and + - The leading eight bytes in both strings must be 7bit ASCII. + See comments in my_strcoll_ascii_toupper_4bytes(). +*/ +static inline int +my_strcoll_ascii_toupper_8bytes(const uchar *a, const uchar *b) +{ + /* + TODO: + Try to get advantage of SIMD instructions by massive comparison + (16 bytes at a time) of characters against (x>='a' && x<='z') using: + - either explicit intrinsics
I thought your SIMD version was slower, wasn't it?
+ - or a loop that can get vectorized automatically by some compilers. + */ + ulonglong an= mi_uint8korr(a); + ulonglong bn= mi_uint8korr(b); + an= my_ascii_to_upper_magic_uint64(an); + bn= my_ascii_to_upper_magic_uint64(bn); + return an == bn ? 0 : an < bn ? -1 : +1; +} + + +/* + Compare the leading four 7bit ASCII bytes in two strings in binary style. +*/ +static inline int +my_strcoll_mb7_bin_4bytes(const uchar *a, const uchar *b) +{ + uint32 an= mi_uint4korr(a); + uint32 bn= mi_uint4korr(b); + return an == bn ? 0 : an < bn ? -1 : +1; +} + + +/* + Compare the leading four 7bit ASCII bytes in two strings in binary style. +*/ +static inline int +my_strcoll_mb7_bin_8bytes(const uchar *a, const uchar *b) +{ + ulonglong an= mi_uint8korr(a); + ulonglong bn= mi_uint8korr(b); + return an == bn ? 0 : an < bn ? -1 : +1; +} + +#endif /* CTYPE_ASCII_H_INCLUDED */ diff --git a/strings/strcoll.ic b/strings/strcoll.ic index 6aca0d0c460..72e0131e1c5 100644 --- a/strings/strcoll.ic +++ b/strings/strcoll.ic @@ -39,6 +41,42 @@ #endif
+/* + For binary collations: + - on 32bit platforms perform only 4 byte optimization + - on 64bit platforms perform both 4 byte and 8 byte optimization +*/ +#if defined(STRCOLL_MB7_BIN) +#define MY_STRCOLL_MB7_4BYTES(a,b) my_strcoll_mb7_bin_4bytes((a),(b)) +#if defined(__x86_64__)
as Wlad has already said, what about not gcc-on-intel 64bit platforms?
+#define STRCOLL_MB7_8BYTES +#define MY_STRCOLL_MB7_8BYTES(a,b) my_strcoll_mb7_bin_8bytes((a),(b)) +#endif +#endif + + +/* + For case insensitive collations with trivial mapping from [a-z] to [A-Z] + perform optimization only on 64 bit platforms. + There is no sense to perform my_ascii_to_upper_magic_uint64() based + optimization on 32bit platforms. The idea of this optimization + is that it handles 8bytes at a time, using a 64bit CPU register. + Enabling this optimization on 32bit platform may only slow things down. +*/ +#if defined(STRCOLL_MB7_TOUPPER) +#if defined(__x86_64__) +#define MY_STRCOLL_MB7_4BYTES(a,b) my_strcoll_ascii_toupper_4bytes((a),(b)) +#define MY_STRCOLL_MB7_8BYTES(a,b) my_strcoll_ascii_toupper_8bytes((a),(b)) +#endif /* Architecture test */ +#endif /* STRCOLL_MB7_TOUPPER */ + + +/* + A helper macro to shift two pointers forward, to the given amount. +*/ +#define MY_STRING_SHIFT_PTR_PTR(a,b,len) do { a+= len; b+= len; } while(0) + + /* Weight of an illegal byte, must follow these rules: 1. Must be greater than weight of any normal character in the collation.
Regards, Sergei VP of MariaDB Server Engineering and security@mariadb.org
Hello Sergei, Thanks for the review. A new patch is here: https://github.com/MariaDB/server/commit/0629711db43ec489a360d8f689b72fac66a... See also comments below: On 9/11/21 5:46 PM, Sergei Golubchik wrote:
Hi, Alexander!
please add more benchmark results to the MDEV, all the variants that you've tried and speedups (or slowdowns) for each.
otherwise, basically, no comments, see below. the only real concern is to know exactly when it helps and when it doesn't.
I added benchmark results to the MDEV. <cut>
--- /dev/null +++ b/strings/ctype-ascii.h @@ -0,0 +1,189 @@ +#ifndef CTYPE_ASCII_H_INCLUDED +#define CTYPE_ASCII_H_INCLUDED
The convention is CTYPE_ASCII_INCLUDED (every file has .h, so it's omitted from the include guard)
Fixed. <cut>
+/* + The following macro returns the bit 0x20 set to: + - 1 for input bytes in the ranges [60..7F] or [E0..FF] + - 0 otherwise + Bytes in the ranges [40..7F] and [C0..FF] have the bit 0x40 set. + Bytes in the ranges [60..7F] smd [E0..FF] have the bit 0x20 set.
s/smd/and/
Fixed. <cut>
+static inline int +my_strcoll_ascii_toupper_4bytes(const uchar *a, const uchar *b) +{ + ulonglong abn= (((ulonglong) mi_uint4korr(a)) << 32) | mi_uint4korr(b); + abn= my_ascii_to_upper_magic_uint64(abn); + if ((uint) (abn >> 32) == (uint32) abn)
s/uint/uint32/ ?
Fixed. <cut>
+ /* + TODO: + Try to get advantage of SIMD instructions by massive comparison + (16 bytes at a time) of characters against (x>='a' && x<='z') using: + - either explicit intrinsics
I thought your SIMD version was slower, wasn't it?
On quick tests SIMD appeared to be 1.5 times faster on pure ASCII (comparing to the current patch version). I decided not to include SIMD at this point yet. Enabling SIMD needs some further experimenting: 1. How to enable SIMD properly: - during compile time through cmake only - or additionally during run time depending on whether the processor really supports the SIMD commands used. 2. It should be more efficient to handle both ASCII and multi-byte characters by SIMD operations, instead of just catching ASCII substrings. This at least should be possible for _bin collations. Not sure about _ci. <cut>
+#if defined(STRCOLL_MB7_BIN) +#define MY_STRCOLL_MB7_4BYTES(a,b) my_strcoll_mb7_bin_4bytes((a),(b)) +#if defined(__x86_64__)
as Wlad has already said, what about not gcc-on-intel 64bit platforms?
Fixed to: #if SIZEOF_VOIDP == 8 .. #endif Thanks to Wlad for the suggestion.
participants (2)
-
Alexander Barkov
-
Sergei Golubchik