[Maria-developers] Fwd: [JIRA] (MDEV-6566) Different INSERT behaviour on bad bytes with and without character set conversion

10 Oct 2014

      Hi Sergei,

I suggest fixing this starting from 10.1.

-------- Original Message --------
Subject: [JIRA] (MDEV-6566) Different INSERT behaviour on bad bytes with 
and without character set conversion
Date: Thu, 9 Oct 2014 21:03:51 +0300 (EEST)
From: Sergei Golubchik (JIRA) <jira@mariadb.atlassian.net>
To: bar@mariadb.org

      [ 
https://mariadb.atlassian.net/browse/MDEV-6566?page=com.atlassian.jira.plugi... 
]

Sergei Golubchik updated MDEV-6566:
-----------------------------------
     Fix Version/s:     (was: 5.5.40)
                    5.5.41
...
Different INSERT behaviour on bad bytes with and without character set conversion
---------------------------------------------------------------------------------
                Key: MDEV-6566
                URL: https://mariadb.atlassian.net/browse/MDEV-6566
            Project: MariaDB Server
         Issue Type: Bug
   Affects Versions: 5.3.12, 5.5.39, 10.0.13
           Reporter: Alexander Barkov
           Assignee: Alexander Barkov
            Fix For: 10.0.15, 5.5.41
If I run this SQL script in a utf8 client:
{noformat}
SET NAMES utf8;
DROP TABLE IF EXISTS t1;
CREATE TABLE t1 (
  a VARCHAR(10) CHARACTER SET ucs2,
  b VARCHAR(10) CHARACTER SET utf8
);
INSERT INTO t1 SELECT 'a 😁 b', 'a 😁 b';
SHOW WARNINGS;
SELECT * FROM t1;
{noformat}
It displays the following warnings:
{noformat}
+---------+------+----------------------------------------------------------------------+
| Level   | Code | Message                                                              |
+---------+------+----------------------------------------------------------------------+
| Warning | 1300 | Invalid utf8 character string: '\xF0\x9F\x98\x81 b'                  |
| Warning | 1300 | Invalid utf8 character string: '\xF0\x9F\x98\x81 b'                  |
| Warning | 1366 | Incorrect string value: '\xF0\x9F\x98\x81 b' for column 'a' at row 1 |
| Warning | 1265 | Data truncated for column 'b' at row 1                               |
+---------+------+----------------------------------------------------------------------+
{noformat}
with the following results:
{noformat}
+----------+------+
| a        | b    |
+----------+------+
| a ???? b | a    |
+----------+------+
{noformat}
Note that the character '😁' is "U+1F601 GRINNING FACE WITH SMILING EYES". It lies outside the BMP, which is the only range that the MariaDB character sets "utf8" and "ucs2" support.
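As a quick check of the point above, here is a minimal sketch in Python (used only for illustration; it is not part of the server code). It confirms that the code point is above U+FFFF and that its UTF-8 encoding is the four-byte sequence reported in the warnings. The variable names are hypothetical.

```python
ch = "\U0001F601"  # the character used in the INSERT above

code_point = ord(ch)
utf8_bytes = ch.encode("utf-8")

# The BMP covers U+0000..U+FFFF; anything above it needs 4 bytes in UTF-8,
# which the 3-byte "utf8" and the 2-byte "ucs2" character sets cannot store.
outside_bmp = code_point > 0xFFFF

print(hex(code_point), outside_bmp, utf8_bytes)
```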
The two columns worked differently:
- The ucs2 column correctly treated '😁' as an invalid sequence of four bytes,
converted it into four question marks, and appended the character 'b' after them.
- The utf8 column just stopped at this character and lost the trailing 'b'.
The warnings are also different for the two columns.
As "utf8" and "ucs2" are equivalent character sets that support exactly the
same character range and repertoire (just with different encodings),
it would be fair to expect the same behaviour from the two columns.
The column "a" exhibits the better behaviour: it preserves as much
information as possible.
The column "b" should be fixed to work in the same way as "a".
The reason for the difference is that the conversion code that is activated
for the column "a" (with conversion) uses an mb_wc()..wc_mb() loop, while the
code activated for the column "b" (without conversion) copies only a well-formed
prefix.
The code without conversion should not stop at a bad byte sequence; it should
try to copy as much data as possible, like the conversion code does.
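
The difference between the two copy strategies can be sketched as follows. This is an illustrative Python model only, not the server code: the function names are hypothetical, and the well-formedness check is a simplified stand-in for the 3-byte "utf8" rules (it does not reject overlong or surrogate sequences).

```python
def mb3_char_len(data: bytes, pos: int) -> int:
    """Return the length of a 1..3-byte UTF-8 character at pos, or 0 if bad.

    Simplified, hypothetical check: 4-byte lead bytes (0xF0 and above) and
    stray continuation bytes are treated as bad.
    """
    b0 = data[pos]
    if b0 < 0x80:
        return 1
    if 0xC2 <= b0 <= 0xDF:
        need = 2
    elif 0xE0 <= b0 <= 0xEF:
        need = 3
    else:
        return 0
    tail = data[pos + 1:pos + need]
    if len(tail) != need - 1 or any((b & 0xC0) != 0x80 for b in tail):
        return 0
    return need


def copy_well_formed_prefix(data: bytes) -> bytes:
    """Model of the no-conversion path: stop at the first bad byte."""
    out = bytearray()
    pos = 0
    while pos < len(data):
        n = mb3_char_len(data, pos)
        if n == 0:
            break  # everything after the bad byte is lost
        out += data[pos:pos + n]
        pos += n
    return bytes(out)


def copy_replacing_bad_bytes(data: bytes) -> bytes:
    """Model of the desired behaviour: replace each bad byte with '?'."""
    out = bytearray()
    pos = 0
    while pos < len(data):
        n = mb3_char_len(data, pos)
        if n == 0:
            out += b"?"  # one question mark per bad byte, then keep going
            pos += 1
        else:
            out += data[pos:pos + n]
            pos += n
    return bytes(out)


src = "a \U0001F601 b".encode("utf-8")
prefix_result = copy_well_formed_prefix(src)
replaced_result = copy_replacing_bad_bytes(src)
print(prefix_result, replaced_result)
```

Under these assumptions the first function keeps only the text before the bad sequence, as column "b" does today, while the second produces four question marks followed by the trailing text, as column "a" does.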