[MariaDB developers] Re: 8d51c6d234b: MDEV-30164 System variable for default collations

10 Jul 2023

      Hi Sergei,

Thanks for the review. Here's a new version:

https://github.com/MariaDB/server/commit/5102d86beae6b7464b1271081f8003ae81850a5b

See comments below:

On 7/9/23 6:56 PM, Sergei Golubchik wrote:
> Hi, Alexander,
> 
> On Jul 07, Alexander Barkov wrote:
>> On 6/24/23 12:20 AM, Sergei Golubchik wrote:
>>>
>>>> diff --git a/sql/lex_charset.cc b/sql/lex_charset.cc
>>>> --- a/sql/lex_charset.cc
>>>> +++ b/sql/lex_charset.cc
>>>> @@ -447,6 +451,17 @@ Lex_exact_charset_opt_extended_collate::find_default_collation() const
>>>>
>>>> +CHARSET_INFO *
>>>> +Lex_exact_charset_opt_extended_collate::
>>>> +  find_mapped_default_collation(Charset_collation_map_st::Used *used,
>>>> +                                const Charset_collation_map_st &map) const
>>>> +{
>>>> +  CHARSET_INFO *cs= find_compiled_default_collation();
>>>> +  if (!cs)
>>>> +    return nullptr;
>>>> +  return map.get_collation_for_charset(used, cs);
>>>
>>> This seems a bit redundant. find_compiled_default_collation() will
>>> do get_charset_by_csname(m_ci, MY_CS_PRIMARY) and only then you'll
>>> be able to search, because you map default compiled collation id to
>>> use specified collation id.
> ...
>> And by the way, the situation when find_compiled_default_collation()
>> is really called is very rare itself:
>>
>>       CREATE TABLE t1 (a CHAR(10) COLLATE DEFAULT)
>>         CHARACTER SET utf8mb4 COLLATE utf8mb4_bin;
> 
> Indeed, this is quite uncommon. Ok.
> 
>>> but conceptually you need to map a charset to a collation, not a
>>> collation to a collation. If you'd map, say, cs->cs_name.str (just
>>> the pointer, not the string value) or, say, cs->cset, then you'd be
>>> mapping a charset to a collation. and here you'd be able to do just
>>>
>>>      return map.get_collation_for_charset(used, m_ci)
>>>
>>> with find_compiled_default_collation() being done inside
>>> get_collation_for_charset() and only when Elem_st wasn't found
>>
>> cs->cset is a pointer to MY_CHARSET_HANDLER. It cannot be used here,
>> because the same handler is shared between multiple character sets.
> 
> Oops. Sorry. I thought for some reason that cs->cset describes the
> charset. But now I see that all charset methods get CHARSET_INFO as an
> argument and use tables from there. Indeed, cs->cset can be the same for
> many charsets (and it is).
> 
>> cs->cs_name.str could work. However, this may not be reliable
>> with user-defined collations.
 >
 > What do you mean?
 >

cs->cs_name.str used to be a different pointer in collations
of the same character set until recent (every collation had
its own character set name copy).

A few months ago Monty made an optimization so now
all collations of the same character set point to
exactly the same string. However I'm not 100% sure
it works as expected with user collations defined
in Index.xml

I'd like to avoid starting using cs->cs_name.str as as a
replacement for a character set ID.

>> Taking into account that the situation when redundant
>> lookup is done is very rare, I propose not to change anything here now.
>> Instead, eventually we need to fix the collation library itself,
>> to avoid lookup inside get_charset_by_csname(m_ci, MY_CS_PRIMARY).
> 
> I mean, logically there should be a structure that defines a
> character set. And a structure that defines a collation.
> Former doesn't exists, the latter is confusingly called CHARSET_INFO.
> 
> Eventually there should be a clear distinction in the code just like
> you've recently made a clear distinction in the SQL.
> For now, perhaps, cs->cs_name.str could be such a "charset identifier".

I agree, there should be a separate structure for a character set.
Let's return to this question later.

For now we agreed that the situation when 
find_compiled_default_collation() is called when
it needs to do get_charset_by_csname(m_ci, MY_CS_PRIMARY)
is rare. So the current way should not bring any
performance problems.

>>>> diff --git a/sql/sql_parse.cc b/sql/sql_parse.cc
>>>> --- a/sql/sql_parse.cc
>>>> +++ b/sql/sql_parse.cc
>>>> @@ -6049,6 +6049,27 @@ mysql_execute_command(THD *thd, bool is_called_from_prepared_stmt)
>>>>         }
>>>>         thd->reset_kill_query();
>>>>       }
>>>> +
>>>> +    /*
>>>> +      If a non-default collation (in @@character_set_collations)
>>>> +      was used during the statement, the mysqlbinlog output for
>>>> +      the current statement will contain a sequence like this:
>>>> +
>>>> +        SET character_set_collations='utf8mb3=utf8mb3_bin';
>>>> +        INSERT INTO t1 VALUES (_utf8mb3'test');
>>>> +        COMMIT;
>>>> +
>>>> +      The statment (INSERT in this example) is already in binlog at this point, and the
>>>> +      and the "SET character_set_collations" is written inside a
>>>> +      Q_CHARACTER_SET_COLLATIONS chunk in its log entry header.
>>>> +      The flag CHARACTER_SET_COLLATIONS_USED is not needed any more.
>>>> +
>>>> +      Let's suppress the flag to avoid a Q_CHARACTER_SET_COLLATIONS chunk
>>>> +      inside the COMMIT log entry header - it would be useless and would
>>>> +      only waste space in the binary log.
>>>> +    */
>>>> +    thd->used&= ~THD::CHARACTER_SET_COLLATIONS_USED;
>>>
>>> Wouldn't that logic apply to other THD::xxx_USED flags?
>>> May be you should reset thd->used=0 here?
>>
>> Sounds reasonable. I tried thd->used=0.
>>
>> rpl.rpl_hrtime failed with multiple chunks like this:
>>
>> -SET TIMESTAMP=1293832861.123456/*!*/;
>> +SET TIMESTAMP=1293832861/*!*/;
>>    COMMIT
>>
>> Do you know why binlog always puts "SET TIMESTAMP" before "COMMIT"?
> 
> because a timestamp is a always present for any event, it's the first 4
> bytes of the common event header, mandatory for all event types.
> In fact, event type itself is only in the 5th byte, after the timestamp.
> 
> But the fractional part if printed only for Query_log_event
> and only if the Q_HRNOW field in the Query_log_event was present.
> 
> I'd think that thd->used can be done not even before COMMIT, where you
> put it, but directly after MYSQL_BIN_LOG::write. 'used' flags affect the
> next binlog event, so it's reasonable to reset them after the event is
> written.
> 
>> This change would not be comfortable for me.
>> I'd prefer if "SET TIMESTAMP" was not printed at all when it's not needed,
>> instead of removing only the fractional part.
>>
>> Also these tests:
>> binlog.binlog_mysqlbinlog_row
>> binlog.binlog_mysqlbinlog_row_myisam
>> failed with one chunk:
>>
>>    SET TIMESTAMP=1000000000/*!*/;
>> -SET @@session.time_zone='SYSTEM'/*!*/;
>>    COMMIT
> 
> Yes, same.
> 
>> What do you suggest?
>>
>> I'd solve it in a separate task. It seems it needs further
>> investigations, not related to this MDEV.
> 
> I'd suggest to reset thd->used after the event was written. This should
> keep flags from leaking into the next event.
> Your thd->used&= ~THD::CHARACTER_SET_COLLATIONS_USED won't be needed.

I changed
   thd->used&= ~THD::CHARACTER_SET_COLLATIONS_USED;
to
   thd->used=0;

and re-recorded the three mentioned tests.

> 
>>>> diff --git a/sql/sql_prepare.cc b/sql/sql_prepare.cc
>>>> --- a/sql/sql_prepare.cc
>>>> +++ b/sql/sql_prepare.cc
>>>> @@ -3538,6 +3544,13 @@ static void mysql_stmt_execute_common(THD *thd,
>>>>       DBUG_VOID_RETURN;
>>>>     }
>>>>
>>>> +  if (stmt->prepare_time_charset_collation_map_version() !=
>>>> +      thd->variables.character_set_collations.version())
>>>> +  {
>>>> +    if (stmt->reprepare())
>>>> +      DBUG_VOID_RETURN;
>>>> +  }
>>>
>>> why do you reprepare here and not where reprepare() is normally
>>> happens, in the Prepared_statement::execute_loop()?
>>
>> Can you please suggest a better place for this code?
> 
> I mean, Prepared_statement already knows how to reprepare, it's already
> implemented and is done in Prepared_statement::execute_loop().
> 
> Why do you need to add another reprepare in a different place?

As discussed on Slack, I moved this code inside 
Prepare_statement::execute(). A change in
the @@charseter_set_collations is now
reported via Reprepare_observer::report_error().
to the upper level methods, so no changes in
these methods are needed:

- execute_loop()
- execute_bulk_loop()

This way makes the patch for sql_prepare.cc much smaller,
but adds some DA calls. As we agreed, these extra DA calls should
not be a problem, because @@charseter_set_collations changes
rarely.

> 
>>>> @@ -5220,6 +5252,13 @@ bool Prepared_statement::execute(String *expanded_query, bool open_cursor)
>>>>         MYSQL_QUERY_EXEC_START(thd->query(), thd->thread_id, thd->get_db(),
>>>>                                &thd->security_ctx->priv_user[0],
>>>>                                (char *) thd->security_ctx->host_or_ip, 1);
>>>> +      /*
>>>> +        If PREPARE used @@character_set_collations,
>>>> +        then we need to make sure binary log writes
>>>> +        the map in the event header.
>>>> +      */
>>>
>>> why is it different from other THD::xx_USED flags?
>>
>> CHARACTER_SET_COLLATIONS_USED in most cases is set during parsing only,
>> and is not set on later stages afterwards.
>>
>> Other XXX_USED flags are set at later stages, either during fix_fields(),
>> or during evaluation. We don't need to care about copying them
>> from m_prepare_time_thd_used_flags,
>> because the flags will be set during EXECUTE (and in many cases they are
>> not even set during PREPARE).
> 
> Still, you could've done
> 
>    thd->used|= m_prepare_time_thd_used_flags;
> 
> couldn't you?
> 
> It wouldn't beg the question of how CHARACTER_SET_COLLATIONS_USED is
> different. And in the future if there will be more parsing-time flags,
> it'll still work

Yes, it worked fine. MTR found no unexpected side effects.

> 
>>>> diff --git a/sql/sql_yacc.yy b/sql/sql_yacc.yy
>>>> --- a/sql/sql_yacc.yy
>>>> +++ b/sql/sql_yacc.yy
>>>> @@ -9686,6 +9713,9 @@ column_default_non_parenthesized_expr:
>>>>             }
>>>>           | CONVERT_SYM '(' expr USING charset_name ')'
>>>>             {
>>>> +            Character_set_collations_used used(thd);
>>>> +            $5= thd->variables.character_set_collations.
>>>> +                                 get_collation_for_charset(&used, $5);
>>>
>>> this is rather awkward pattern that you use everywhere, with
>>>
>>>    Character_set_collations_used used(thd);
>>>
>>> why not to pass thd as an argument instead?
>>
>> charset_collations.h and lex_charset.h use
>> Charset_collation_map_st::Used. They neither know what its descendant
>> Character_set_collations_used is, nor know what THD is.
>>
>> I would like not to make charset_collations.h and lex_charset.h depend
>> on the entire sql_class.h.
>>
>> As a solution, we can move the part of THD where the data type used_t
>> and the enum with XXX_USED flags are defined into a separate class in
>> a separate header:
> 
> yes, that sounds reasonable

Done.

>>>> diff --git a/sql/simple_tokenizer.h b/sql/simple_tokenizer.h
>>>> --- /dev/null
>>>> +++ b/sql/simple_tokenizer.h
>> <cut>
>>>> +class Simple_tokenizer
>>>> +{
>>>> +  const char *m_ptr;
>>>> +  const char *m_end;
>>>> +public:
>>>> +  Simple_tokenizer(const char *str, size_t length)
>>>> +   :m_ptr(str), m_end(str + length)
>>>> +  { }
> ...
>> - This charset-unaware (ASCII-only) tokenizer can be reused for a
>> number of system variables (sql_mode, optimizer_switch,
>> log_slow_filter, myisam_recover_options,
>> slave_transaction_retry_errors).
> 
> could you add an MDEV for that please?
> perhaps even consider doing it after this one.

Sure, will do.

> 
>>>> diff --git a/sql/charset_collations.h b/sql/charset_collations.h
>>>> --- /dev/null
>>>> +++ b/sql/charset_collations.h
>> <cut>
>>>> +
>>>> +  struct Elem_st
>>>> +  {
>>>> +  protected:
>>>> +    CHARSET_INFO *m_charset;
>>>> +    CHARSET_INFO *m_collation;
>>>
>>> What is the difference between charset and collation?
>>> I was under impression that any CHARSET_INFO describes a collation,
>>> there is no special structure just for a charset. Am I wrong?
>>
>> You're not wrong.
>>
>> Just conceptually we map from a charset to a collation here. Hence the
>> names.
>>
>> I can rename to:
>>
>>      CHARSET_INFO *m_from_charset;
>>      CHARSET_INFO *m_to_collation;
>>
>> to make it more clear.

I made it this way:

     CHARSET_INFO *m_from; // From a character set
     CHARSET_INFO *m_to;   // To a collation

> 
> Thanks. Much better.
> 
> Regards,
> Sergei
> VP of MariaDB Server Engineering
> and security@mariadb.org
>