10 Jul
2023
10 Jul
'23
9:42 a.m.
Hi Sergei, Thanks for the review. Here's a new version: https://github.com/MariaDB/server/commit/5102d86beae6b7464b1271081f8003ae81850a5b See comments below: On 7/9/23 6:56 PM, Sergei Golubchik wrote: > Hi, Alexander, > > On Jul 07, Alexander Barkov wrote: >> On 6/24/23 12:20 AM, Sergei Golubchik wrote: >>> >>>> diff --git a/sql/lex_charset.cc b/sql/lex_charset.cc >>>> --- a/sql/lex_charset.cc >>>> +++ b/sql/lex_charset.cc >>>> @@ -447,6 +451,17 @@ Lex_exact_charset_opt_extended_collate::find_default_collation() const >>>> >>>> +CHARSET_INFO * >>>> +Lex_exact_charset_opt_extended_collate:: >>>> + find_mapped_default_collation(Charset_collation_map_st::Used *used, >>>> + const Charset_collation_map_st &map) const >>>> +{ >>>> + CHARSET_INFO *cs= find_compiled_default_collation(); >>>> + if (!cs) >>>> + return nullptr; >>>> + return map.get_collation_for_charset(used, cs); >>> >>> This seems a bit redundant. find_compiled_default_collation() will >>> do get_charset_by_csname(m_ci, MY_CS_PRIMARY) and only then you'll >>> be able to search, because you map default compiled collation id to >>> use specified collation id. > ... >> And by the way, the situation when find_compiled_default_collation() >> is really called is very rare itself: >> >> CREATE TABLE t1 (a CHAR(10) COLLATE DEFAULT) >> CHARACTER SET utf8mb4 COLLATE utf8mb4_bin; > > Indeed, this is quite uncommon. Ok. > >>> but conceptually you need to map a charset to a collation, not a >>> collation to a collation. If you'd map, say, cs->cs_name.str (just >>> the pointer, not the string value) or, say, cs->cset, then you'd be >>> mapping a charset to a collation. and here you'd be able to do just >>> >>> return map.get_collation_for_charset(used, m_ci) >>> >>> with find_compiled_default_collation() being done inside >>> get_collation_for_charset() and only when Elem_st wasn't found >> >> cs->cset is a pointer to MY_CHARSET_HANDLER. It cannot be used here, >> because the same handler is shared between multiple character sets. > > Oops. Sorry. I thought for some reason that cs->cset describes the > charset. But now I see that all charset methods get CHARSET_INFO as an > argument and use tables from there. Indeed, cs->cset can be the same for > many charsets (and it is). > >> cs->cs_name.str could work. However, this may not be reliable >> with user-defined collations. > > What do you mean? > cs->cs_name.str used to be a different pointer in collations of the same character set until recent (every collation had its own character set name copy). A few months ago Monty made an optimization so now all collations of the same character set point to exactly the same string. However I'm not 100% sure it works as expected with user collations defined in Index.xml I'd like to avoid starting using cs->cs_name.str as as a replacement for a character set ID. >> Taking into account that the situation when redundant >> lookup is done is very rare, I propose not to change anything here now. >> Instead, eventually we need to fix the collation library itself, >> to avoid lookup inside get_charset_by_csname(m_ci, MY_CS_PRIMARY). > > I mean, logically there should be a structure that defines a > character set. And a structure that defines a collation. > Former doesn't exists, the latter is confusingly called CHARSET_INFO. > > Eventually there should be a clear distinction in the code just like > you've recently made a clear distinction in the SQL. > For now, perhaps, cs->cs_name.str could be such a "charset identifier". I agree, there should be a separate structure for a character set. Let's return to this question later. For now we agreed that the situation when find_compiled_default_collation() is called when it needs to do get_charset_by_csname(m_ci, MY_CS_PRIMARY) is rare. So the current way should not bring any performance problems. >>>> diff --git a/sql/sql_parse.cc b/sql/sql_parse.cc >>>> --- a/sql/sql_parse.cc >>>> +++ b/sql/sql_parse.cc >>>> @@ -6049,6 +6049,27 @@ mysql_execute_command(THD *thd, bool is_called_from_prepared_stmt) >>>> } >>>> thd->reset_kill_query(); >>>> } >>>> + >>>> + /* >>>> + If a non-default collation (in @@character_set_collations) >>>> + was used during the statement, the mysqlbinlog output for >>>> + the current statement will contain a sequence like this: >>>> + >>>> + SET character_set_collations='utf8mb3=utf8mb3_bin'; >>>> + INSERT INTO t1 VALUES (_utf8mb3'test'); >>>> + COMMIT; >>>> + >>>> + The statment (INSERT in this example) is already in binlog at this point, and the >>>> + and the "SET character_set_collations" is written inside a >>>> + Q_CHARACTER_SET_COLLATIONS chunk in its log entry header. >>>> + The flag CHARACTER_SET_COLLATIONS_USED is not needed any more. >>>> + >>>> + Let's suppress the flag to avoid a Q_CHARACTER_SET_COLLATIONS chunk >>>> + inside the COMMIT log entry header - it would be useless and would >>>> + only waste space in the binary log. >>>> + */ >>>> + thd->used&= ~THD::CHARACTER_SET_COLLATIONS_USED; >>> >>> Wouldn't that logic apply to other THD::xxx_USED flags? >>> May be you should reset thd->used=0 here? >> >> Sounds reasonable. I tried thd->used=0. >> >> rpl.rpl_hrtime failed with multiple chunks like this: >> >> -SET TIMESTAMP=1293832861.123456/*!*/; >> +SET TIMESTAMP=1293832861/*!*/; >> COMMIT >> >> Do you know why binlog always puts "SET TIMESTAMP" before "COMMIT"? > > because a timestamp is a always present for any event, it's the first 4 > bytes of the common event header, mandatory for all event types. > In fact, event type itself is only in the 5th byte, after the timestamp. > > But the fractional part if printed only for Query_log_event > and only if the Q_HRNOW field in the Query_log_event was present. > > I'd think that thd->used can be done not even before COMMIT, where you > put it, but directly after MYSQL_BIN_LOG::write. 'used' flags affect the > next binlog event, so it's reasonable to reset them after the event is > written. > >> This change would not be comfortable for me. >> I'd prefer if "SET TIMESTAMP" was not printed at all when it's not needed, >> instead of removing only the fractional part. >> >> Also these tests: >> binlog.binlog_mysqlbinlog_row >> binlog.binlog_mysqlbinlog_row_myisam >> failed with one chunk: >> >> SET TIMESTAMP=1000000000/*!*/; >> -SET @@session.time_zone='SYSTEM'/*!*/; >> COMMIT > > Yes, same. > >> What do you suggest? >> >> I'd solve it in a separate task. It seems it needs further >> investigations, not related to this MDEV. > > I'd suggest to reset thd->used after the event was written. This should > keep flags from leaking into the next event. > Your thd->used&= ~THD::CHARACTER_SET_COLLATIONS_USED won't be needed. I changed thd->used&= ~THD::CHARACTER_SET_COLLATIONS_USED; to thd->used=0; and re-recorded the three mentioned tests. > >>>> diff --git a/sql/sql_prepare.cc b/sql/sql_prepare.cc >>>> --- a/sql/sql_prepare.cc >>>> +++ b/sql/sql_prepare.cc >>>> @@ -3538,6 +3544,13 @@ static void mysql_stmt_execute_common(THD *thd, >>>> DBUG_VOID_RETURN; >>>> } >>>> >>>> + if (stmt->prepare_time_charset_collation_map_version() != >>>> + thd->variables.character_set_collations.version()) >>>> + { >>>> + if (stmt->reprepare()) >>>> + DBUG_VOID_RETURN; >>>> + } >>> >>> why do you reprepare here and not where reprepare() is normally >>> happens, in the Prepared_statement::execute_loop()? >> >> Can you please suggest a better place for this code? > > I mean, Prepared_statement already knows how to reprepare, it's already > implemented and is done in Prepared_statement::execute_loop(). > > Why do you need to add another reprepare in a different place? As discussed on Slack, I moved this code inside Prepare_statement::execute(). A change in the @@charseter_set_collations is now reported via Reprepare_observer::report_error(). to the upper level methods, so no changes in these methods are needed: - execute_loop() - execute_bulk_loop() This way makes the patch for sql_prepare.cc much smaller, but adds some DA calls. As we agreed, these extra DA calls should not be a problem, because @@charseter_set_collations changes rarely. > >>>> @@ -5220,6 +5252,13 @@ bool Prepared_statement::execute(String *expanded_query, bool open_cursor) >>>> MYSQL_QUERY_EXEC_START(thd->query(), thd->thread_id, thd->get_db(), >>>> &thd->security_ctx->priv_user[0], >>>> (char *) thd->security_ctx->host_or_ip, 1); >>>> + /* >>>> + If PREPARE used @@character_set_collations, >>>> + then we need to make sure binary log writes >>>> + the map in the event header. >>>> + */ >>> >>> why is it different from other THD::xx_USED flags? >> >> CHARACTER_SET_COLLATIONS_USED in most cases is set during parsing only, >> and is not set on later stages afterwards. >> >> Other XXX_USED flags are set at later stages, either during fix_fields(), >> or during evaluation. We don't need to care about copying them >> from m_prepare_time_thd_used_flags, >> because the flags will be set during EXECUTE (and in many cases they are >> not even set during PREPARE). > > Still, you could've done > > thd->used|= m_prepare_time_thd_used_flags; > > couldn't you? > > It wouldn't beg the question of how CHARACTER_SET_COLLATIONS_USED is > different. And in the future if there will be more parsing-time flags, > it'll still work Yes, it worked fine. MTR found no unexpected side effects. > >>>> diff --git a/sql/sql_yacc.yy b/sql/sql_yacc.yy >>>> --- a/sql/sql_yacc.yy >>>> +++ b/sql/sql_yacc.yy >>>> @@ -9686,6 +9713,9 @@ column_default_non_parenthesized_expr: >>>> } >>>> | CONVERT_SYM '(' expr USING charset_name ')' >>>> { >>>> + Character_set_collations_used used(thd); >>>> + $5= thd->variables.character_set_collations. >>>> + get_collation_for_charset(&used, $5); >>> >>> this is rather awkward pattern that you use everywhere, with >>> >>> Character_set_collations_used used(thd); >>> >>> why not to pass thd as an argument instead? >> >> charset_collations.h and lex_charset.h use >> Charset_collation_map_st::Used. They neither know what its descendant >> Character_set_collations_used is, nor know what THD is. >> >> I would like not to make charset_collations.h and lex_charset.h depend >> on the entire sql_class.h. >> >> As a solution, we can move the part of THD where the data type used_t >> and the enum with XXX_USED flags are defined into a separate class in >> a separate header: > > yes, that sounds reasonable Done. >>>> diff --git a/sql/simple_tokenizer.h b/sql/simple_tokenizer.h >>>> --- /dev/null >>>> +++ b/sql/simple_tokenizer.h >> <cut> >>>> +class Simple_tokenizer >>>> +{ >>>> + const char *m_ptr; >>>> + const char *m_end; >>>> +public: >>>> + Simple_tokenizer(const char *str, size_t length) >>>> + :m_ptr(str), m_end(str + length) >>>> + { } > ... >> - This charset-unaware (ASCII-only) tokenizer can be reused for a >> number of system variables (sql_mode, optimizer_switch, >> log_slow_filter, myisam_recover_options, >> slave_transaction_retry_errors). > > could you add an MDEV for that please? > perhaps even consider doing it after this one. Sure, will do. > >>>> diff --git a/sql/charset_collations.h b/sql/charset_collations.h >>>> --- /dev/null >>>> +++ b/sql/charset_collations.h >> <cut> >>>> + >>>> + struct Elem_st >>>> + { >>>> + protected: >>>> + CHARSET_INFO *m_charset; >>>> + CHARSET_INFO *m_collation; >>> >>> What is the difference between charset and collation? >>> I was under impression that any CHARSET_INFO describes a collation, >>> there is no special structure just for a charset. Am I wrong? >> >> You're not wrong. >> >> Just conceptually we map from a charset to a collation here. Hence the >> names. >> >> I can rename to: >> >> CHARSET_INFO *m_from_charset; >> CHARSET_INFO *m_to_collation; >> >> to make it more clear. I made it this way: CHARSET_INFO *m_from; // From a character set CHARSET_INFO *m_to; // To a collation > > Thanks. Much better. > > Regards, > Sergei > VP of MariaDB Server Engineering > and security@mariadb.org >