[Maria-discuss] Limited Unicode Support?
Hello,

I hope this is the proper mailing list to ask such questions, I apologise if it isn't.

I am having some problems with unusual Unicode characters in my MariaDB database.

$ mariadb --version
mariadb Ver 15.1 Distrib 10.3.17-MariaDB, for debian-linux-gnu (x86_64) using readline 5.2
$ sudo ./mariadb.php
[sudo] Passwort für bjoern:
Query: INSERT INTO `test` SET `string` = '🙋 Huhu. wie geht es dir?'
Inserted: '🙋 Huhu. wie geht es dir?'
Returned: '???? Huhu. wie geht es dir?'

SHOW VARIABLES LIKE 'character%':
character_set_client utf8
character_set_connection utf8
character_set_database utf8
character_set_filesystem binary
character_set_results utf8
character_set_server latin1
character_set_system utf8
character_sets_dir /usr/share/mysql/charsets/

As you can see here, MariaDB does not take the character '🙋' ( https://www.fileformat.info/info/unicode/char/1f64b/index.htm ) and instead replaces it with four question marks and I have no idea why.

I've attached the PHP code for the example.

I would be most grateful for any suggestion.

Regards, Björn Keil
On 10.10.19 15:52, Björn Keil wrote:
SHOW VARIABLES LIKE 'character%': character_set_client utf8 character_set_connection utf8 character_set_database utf8 character_set_filesystem binary character_set_results utf8 character_set_server latin1 character_set_system utf8 character_sets_dir /usr/share/mysql/charsets/
As you can see here, MariaDB does not take the character '🙋' ( https://www.fileformat.info/info/unicode/char/1f64b/index.htm ) and instead replaces it with four question marks and I have no idea why.
'🙋' (U+1F64B) lies outside the Basic Multilingual Plane, in one of the supplementary Unicode planes, and so needs 4 bytes when using UTF-8 encoding. The classic "utf8" charset only supports Unicode code points that use up to 3 bytes when UTF-8 encoded. For emoticons etc. you need to use the newer utf8mb4 character set instead.

--
Hartmut Holzgraefe - Principal Support Engineer (EMEA)
MariaDB Corporation - http://www.mariadb.com/
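The byte counts mentioned in this reply can be checked outside the database. The following is only a minimal Python sketch: the helper name `fits_in_utf8mb3` is made up for illustration, and it models nothing more than the 3-byte limit described above, not MariaDB's actual conversion code.

```python
def utf8_length(ch: str) -> int:
    """Number of bytes needed to encode one character in UTF-8."""
    return len(ch.encode("utf-8"))


def fits_in_utf8mb3(text: str) -> bool:
    """True if every character needs at most 3 bytes in UTF-8.

    Hypothetical helper: it mirrors the limit of the classic "utf8"
    charset as described in this thread, nothing more.
    """
    return all(utf8_length(ch) <= 3 for ch in text)


emoji = "\U0001F64B"  # the character from the original report

print(emoji.encode("utf-8").hex(" "))                          # f0 9f 99 8b
print(utf8_length("ö"), utf8_length("€"), utf8_length(emoji))  # 2 3 4
print(fits_in_utf8mb3("Huhu. wie geht es dir?"))               # True
print(fits_in_utf8mb3(emoji + " Huhu. wie geht es dir?"))      # False
```

A check like this could be used to find out whether existing application data would be affected by the limit at all.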
Hi Björn,

🙋 is encoded in 4 bytes in UTF-8 (0xF0 0x9F 0x99 0x8B). "utf8" in MariaDB is a UTF-8 encoding limited to 3 bytes per character. You have to configure the charset "utf8mb4", which provides full UTF-8 support. https://jira.mariadb.org/browse/MDEV-8334 in 10.5 is the first step to make utf8mb4 the default for 'utf8'.

Regards, Diego.
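Switching to utf8mb4, as suggested in this thread, usually touches the database default, the existing tables and the client connection. The sketch below only builds the SQL statements and does not connect to a server. The function name, the database name `testdb` and the choice of the `utf8mb4_unicode_ci` collation are assumptions made for illustration; only the table name `test` comes from the original post.

```python
def utf8mb4_migration_sql(database, tables, collation="utf8mb4_unicode_ci"):
    """Build SQL statements for a utf8 -> utf8mb4 switch (illustrative sketch)."""

    def quote(name):
        # Backtick-quote an identifier, doubling any embedded backticks.
        return "`" + name.replace("`", "``") + "`"

    statements = [
        # Default charset for tables created in this database from now on.
        f"ALTER DATABASE {quote(database)} "
        f"CHARACTER SET utf8mb4 COLLATE {collation}",
    ]
    for table in tables:
        # Converts the existing character columns and the table default.
        statements.append(
            f"ALTER TABLE {quote(table)} "
            f"CONVERT TO CHARACTER SET utf8mb4 COLLATE {collation}"
        )
    # The connection has to use utf8mb4 as well; this is a per-connection setting.
    statements.append("SET NAMES utf8mb4")
    return statements


# Hypothetical database name; "test" is the table from the original post.
for statement in utf8mb4_migration_sql("testdb", ["test"]):
    print(statement + ";")
```

Since `SET NAMES` applies per connection, the application's connection settings need the same change. Trying the statements on a copy of the data first seems prudent, because converting large tables can take a while.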
Thanks for the replies. I've tried simply replacing all occurrences of "utf8" in my example with "utf8mb4", and it works. Inconveniently, this will require major conversions and downtime for my application, but at least I know what I must do to make it work.

However, the "mb4" sounds a little suspicious. While there are no sufficiently high-numbered Unicode code points yet that would make such a measure necessary, the UTF-8 encoding allows for characters up to seven bytes long, if I am not mistaken. Does utf8mb4 allow for characters longer than four bytes if and when the time comes?
Hi Björn,

The time for more than 4 bytes in UTF-8 will never come. Even if emojis expand so much that more than 1112064 "characters" are needed, the new encoding will not be called UTF-8 anymore, and I doubt it will even be called Unicode.

UTF-8 does not go up to 7 bytes. While the encoding scheme with leading/trailing bytes could allow for more than 4 bytes, this was explicitly clarified and forbidden in RFC 3629 https://tools.ietf.org/html/rfc3629#section-4 , along with the encoding of "surrogate" code points from UTF-16, so basically UTF-8 can encode everything that UTF-16 can, and not more than that.

The utf8mb4 story is this: there was a discussion, IIRC during MySQL 5.5 development, about whether to keep using the UTF8 name or to create a new name for the Unicode (2.0+) conforming charset. As you noticed, the traditional MySQL version of UTF8 is crippled. On the other hand, reusing a name for something different could lead to compatibility problems with existing applications. The conservative decision was to give the real (in the Unicode sense) UTF8 a new name. The "utf8mb4" name is not pretty and is confusing, but no compatibility problems were reported.
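The limits described in this reply (four bytes at most, no surrogates, 1112064 encodable code points) can be checked with Python's built-in UTF-8 codec. This is a small sketch of the arithmetic and of what a strict codec accepts; the constant names are chosen here for illustration.

```python
SURROGATES = range(0xD800, 0xE000)  # reserved for UTF-16, 2048 code points
MAX_CODE_POINT = 0x10FFFF           # highest Unicode code point

# All code points minus the surrogates = number of encodable "characters".
print((MAX_CODE_POINT + 1) - len(SURROGATES))    # 1112064

# Even the highest code point fits in 4 bytes.
print(len(chr(MAX_CODE_POINT).encode("utf-8")))  # 4

# A surrogate code point cannot be encoded as UTF-8.
try:
    chr(0xD800).encode("utf-8")
except UnicodeEncodeError as exc:
    print("encode rejected:", exc.reason)

# A 5-byte sequence of the kind the older scheme allowed is not valid UTF-8.
try:
    b"\xf8\x88\x80\x80\x80".decode("utf-8")
except UnicodeDecodeError as exc:
    print("decode rejected:", exc.reason)
```

The count follows from 0x110000 = 1114112 code points in total, less the 2048 surrogates.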
Participants (4):
- Björn Keil
- Diego Dupin
- Hartmut Holzgraefe
- Vladislav Vaintroub