Hi, Alexander! See below. My main concern is the stability of the charset API. I'd like it to be moved into a service, as plugins surely need it (not "will need", various plugins need it even now). But you change it quite often, so it would make a rather volatile service :(
diff --git a/sql/sql_string.cc b/sql/sql_string.cc index 9fb462e..a0b6395 100644 --- a/sql/sql_string.cc +++ b/sql/sql_string.cc @@ -922,8 +922,8 @@ String *copy_if_not_alloced(String *to,String *from,uint32 from_length) my_charset_same(from_cs, to_cs)) { m_cannot_convert_error_pos= NULL; - return to_cs->cset->copy_abort(to_cs, to, to_length, from, from_length, - nchars, this); + return to_cs->cset->copy_fix(to_cs, to, to_length, from, from_length, + nchars, this);
Now you don't use copy_abort() anywhere. and you've added copy_abort with the comment "A preparatory patch for MDEV-6566" I believe you need to remove it now, may be revert the whole commit b1b6101a
} else { diff --git a/include/m_ctype.h b/include/m_ctype.h index f08efb4..4fa8779 100644 --- a/include/m_ctype.h +++ b/include/m_ctype.h @@ -444,7 +444,55 @@ struct my_charset_handler_st size_t (*scan)(CHARSET_INFO *, const char *b, const char *e, int sq);
- /* Copying routines */ + /* String copying routines and helpers for them */ + /* + charlen() - calculate length of the left-most character in bytes. + @param cs Character set + @param str The beginning of the string + @param end The end of the string + + @return MY_CS_ILSEQ if a bad byte sequence was found. + @return MY_CS_TOOSMALLN(x) if the string ended unexpectedly. + @return a positive number in the range 1..mbmaxlen, + if a valid character was found. + */ + int (*charlen)(CHARSET_INFO *cs, const uchar *str, const uchar *end);
Why couldn't you use mbcharlen for that? I mean, why not to add validity checks to mbcharlen? or mb_wc
+ /* + well_formed_char_length() - returns character length of a string. + + @param cs Character set + @param str The beginning of the string + @param end The end of the string + @param nchars Not more than "nchars" left-most characters are checked. + @param status[OUT] Additional statistics is returned here. + "status" can be uninitialized before the call, + and it is fully initialized after the call. + + status->m_source_end_pos is set to the position where reading stopped. + + If a bad byte sequence is found, the function returns immediately and + status->m_well_formed_error_pos is set to the position where a bad byte + sequence was found. + + status->m_well_formed_error_pos is set to NULL if no bad bytes were found. + If status->m_well_formed_error_pos is NULL after the call, that means: + - either the function reached the end of the string, + - or all "nchars" characters were read. + The caller can check status->m_source_end_pos to detect which of these two + happened. + */ + size_t (*well_formed_char_length)(CHARSET_INFO *cs, + const char *str, const char *end, + size_t nchars, + MY_STRCOPY_STATUS *status);
What's the difference with numchars?
+ + /* + copy_fix() - copy a string like copy_abort(), but fix bad bytes to '?'. + */ + size_t (*copy_fix)(CHARSET_INFO *, + char *dst, size_t dst_length, + const char *src, size_t src_length, + size_t nchars, MY_STRCOPY_STATUS *status); /* copy_abort() - copy a string, abort if a bad byte sequence was found. Not more than "nchars" characters are copied. diff --git a/strings/ctype-simple.c b/strings/ctype-simple.c index b010c52..35ae191 100644 --- a/strings/ctype-simple.c +++ b/strings/ctype-simple.c @@ -1108,6 +1115,19 @@ size_t my_well_formed_len_8bit(CHARSET_INFO *cs __attribute__((unused)), }
+size_t +my_well_formed_char_length_8bit(CHARSET_INFO *cs __attribute__((unused)),
Try to put the type and the function name on one line. As I've noticed yesterday, git diff doesn't recognizes it as a function name otherwise
+ const char *start, const char *end, + size_t nchars, MY_STRCOPY_STATUS *status) +{ + size_t nbytes= (size_t) (end - start); + size_t res= MY_MIN(nbytes, nchars); + status->m_well_formed_error_pos= NULL; + status->m_source_end_pos= start + res; + return res; +}
Regards, Sergei