[Maria-developers] Review request MDEV-6274 Collation usage statistics (for feedback plugin)
Hi, Alexander! I don't particularly like the idea of the COLLATIONPROPERTY function, The standard way of getting object metadata is to query information schema. Introducing "property" function has long term implications. What would be the way to get sysvar data? select VARIABLE_VALUE from I_S.GLOBAL_VARIABLES where VARIABLE_NAME=xxx or select VARIABLEPROPERTY('xxx', 'Value') and the help text? select VARIABLEPROPERTY('xxx', 'Help') Storage engine of a table: select TABLEPROPERTY('xxx', 'Engine') See - you're introducing a new way to access object metadata (and, may be, data too - "property" is kind of ambigous). And what might look like making sense now in 10.0 for collations, isn't necessarily reasonable long term. As far as collations are concerned, I would rather see this data in the information_schema. In the new table or in the I_S.COLLATIONS - I don't know what's better, see also MDEV-6138.
=== modified file 'include/my_sys.h' --- include/my_sys.h 2014-04-15 07:29:57 +0000 +++ include/my_sys.h 2014-05-28 14:01:40 +0000 @@ -242,6 +242,11 @@ extern MYSQL_PLUGIN_IMPORT CHARSET_INFO extern MYSQL_PLUGIN_IMPORT CHARSET_INFO *all_charsets[MY_ALL_CHARSETS_SIZE]; extern struct charset_info_st compiled_charsets[];
+/* Collation properties and use statistics */ +extern my_bool my_collation_is_known_id(uint id); +extern ulonglong my_collation_statistics_get_use_count(uint id); +extern const char *my_collation_get_tailoring(uint id); + /* statistics */ extern ulong my_file_opened,my_stream_opened, my_tmp_file_created; extern ulong my_file_total_opened;
=== modified file 'mysys/charset.c' --- mysys/charset.c 2013-11-08 22:20:07 +0000 +++ mysys/charset.c 2014-05-29 05:20:11 +0000 @@ -483,6 +483,50 @@ void add_compiled_collation(struct chars static my_pthread_once_t charsets_initialized= MY_PTHREAD_ONCE_INIT; static my_pthread_once_t charsets_template= MY_PTHREAD_ONCE_INIT;
+typedef struct +{ + ulonglong use_count; +} MY_COLLATION_STATISTICS; + + +static MY_COLLATION_STATISTICS my_collation_statistics[MY_ALL_CHARSETS_SIZE]; + + +my_bool my_collation_is_known_id(uint id) +{ + return id > 0 && id < array_elements(all_charsets) && all_charsets[id] ? + TRUE : FALSE; +}
Hmm. I see this function is mostly used in asserts, which is great. But you also use it in COLLATIONPROPERTY function, and I'm not sure it's correct. One can specify an id which is not loaded yet. COLLATIONPROPERTY should not fail because of that. I believe this function should be strictly for asserts, perhaps, even, as void assert_collation_id_is_known(uint id) { DBUG_ASSERT(id > 0 && id < array_elements(all_charsets) && all_charsets[id]); }
+/* + Collation use statistics functions do not lock + counters to avoid mutex contention. This can lose + some counter increments with high thread concurrency. + But this should be Ok, as we don't need exact numbers. +*/ +static inline void my_collation_statistics_inc_use_count(uint id) +{ + DBUG_ASSERT(my_collation_is_known_id(id)); + my_collation_statistics[id].use_count++; +} + + +ulonglong my_collation_statistics_get_use_count(uint id) +{ + DBUG_ASSERT(my_collation_is_known_id(id)); + return my_collation_statistics[id].use_count;
Alternatively, you can define it to return 0 for unknown ids. Or -1. Would be a bit easier to use in fill_collation_statistics()
+} + + === modified file 'plugin/feedback/utils.cc' --- plugin/feedback/utils.cc 2013-11-20 11:05:39 +0000 +++ plugin/feedback/utils.cc 2014-05-28 10:34:11 +0000 @@ -383,6 +383,25 @@ int fill_misc_data(THD *thd, TABLE_LIST return 0; }
+int fill_collation_statistics(THD *thd, TABLE_LIST *tables) +{ + TABLE *table= tables->table; + for (uint id= 1; id < MY_ALL_CHARSETS_SIZE; id++) + { + ulonglong count; + if (my_collation_is_known_id(id) && + (count= my_collation_statistics_get_use_count(id))) + { + char name[MY_CS_NAME_SIZE + 32]; + size_t namelen= my_snprintf(name, sizeof(name), + "Collation_usecount %s",
Let's rather call it "Collation %s used", to be in line with plugin rows in the feedback table. Having simply "%s used" would've been even more consistent, but we don't want to have plugins and collations in the same namespace. I wish I had plugins prefixed with the "Plugin", but it's too late to change.
+ get_charset_name(id)); + INSERT2(name, namelen, (count, UNSIGNED)); + } + } + return 0; +};
Nothing to comment about new added function - the code is straightforward. Two only issues to discuss about it - whether we want it at all. And, if yes, whether arguments should be strings or, perhaps, the second should be a keyword. Or, may be, the first one too. Regards, Sergei
Hi Sergei, Thanks for review. Please see comments inline: On 07/03/2014 01:47 PM, Sergei Golubchik wrote:
Hi, Alexander!
I don't particularly like the idea of the COLLATIONPROPERTY function, The standard way of getting object metadata is to query information schema. Introducing "property" function has long term implications. What would be the way to get sysvar data?
select VARIABLE_VALUE from I_S.GLOBAL_VARIABLES where VARIABLE_NAME=xxx
or
select VARIABLEPROPERTY('xxx', 'Value')
and the help text?
select VARIABLEPROPERTY('xxx', 'Help')
Storage engine of a table:
select TABLEPROPERTY('xxx', 'Engine')
See - you're introducing a new way to access object metadata (and, may be, data too - "property" is kind of ambigous). And what might look like making sense now in 10.0 for collations, isn't necessarily reasonable long term.
As far as collations are concerned, I would rather see this data in the information_schema. In the new table or in the I_S.COLLATIONS - I don't know what's better, see also MDEV-6138.
I'm thinking of adding many collation properties in the future (e.g. tailoring, case/accent/kana sensitivity, Unicode version, comparison strength, sort strength, various Unicode Collation Algorithm options). So I didn't want to flood the standard table INFORMATION_SCHEMA.COLLATIONS with all these properties. Note, the "tailoring" property can be particular very long. If we add tailoring as a column to I_S.COLLATIONS, then "SELECT * FROM I_S.COLLATIONS" will be very unreadable. It should be fine to add a new table. But I have some concerns: 1. I don't like mixing usage statistics and static properties into the same table. Should we add two new tables with about these structures: CREATE TABLE COLLATION_STATISTICS ( ID BIGINT NOT NULL, USE_COUNT BIGINT NOT NULL ); CREATE TABLE COLLATION_PROPERTIES ( ID BIGINT NOT NULL, TAILORING TEXT CHARACTER SET UTF8MB4 -- things like Unicode version, sensitivity to be added here later ); ? 2. To search by name, one will need a join by ID, e.g. using a NATURAL JOIN: SELECT TAILORING FROM INFORMATION_SCHEMA.COLLATIONS NATURAL JOIN INFORMATION_SCHEMA.COLLATION_PROPERTIES WHERE COLLATION_NAME='utf8_polish_ci'; A little bit cumbersome. But should probably work. What else we can do? Duplicating COLLATION_NAME into the new tables is not a good idea. Do you agree? 3. Performance should be better with a function. With an I_S table, we'll have to populate the table every time with all rows, but in fact the caller in many cases may want only one row (e.g. searching by name or ID). But performance is rather not too much important here. See also comments below.
=== modified file 'include/my_sys.h' --- include/my_sys.h 2014-04-15 07:29:57 +0000 +++ include/my_sys.h 2014-05-28 14:01:40 +0000 @@ -242,6 +242,11 @@ extern MYSQL_PLUGIN_IMPORT CHARSET_INFO extern MYSQL_PLUGIN_IMPORT CHARSET_INFO *all_charsets[MY_ALL_CHARSETS_SIZE]; extern struct charset_info_st compiled_charsets[];
+/* Collation properties and use statistics */ +extern my_bool my_collation_is_known_id(uint id); +extern ulonglong my_collation_statistics_get_use_count(uint id); +extern const char *my_collation_get_tailoring(uint id); + /* statistics */ extern ulong my_file_opened,my_stream_opened, my_tmp_file_created; extern ulong my_file_total_opened;
=== modified file 'mysys/charset.c' --- mysys/charset.c 2013-11-08 22:20:07 +0000 +++ mysys/charset.c 2014-05-29 05:20:11 +0000 @@ -483,6 +483,50 @@ void add_compiled_collation(struct chars static my_pthread_once_t charsets_initialized= MY_PTHREAD_ONCE_INIT; static my_pthread_once_t charsets_template= MY_PTHREAD_ONCE_INIT;
+typedef struct +{ + ulonglong use_count; +} MY_COLLATION_STATISTICS; + + +static MY_COLLATION_STATISTICS my_collation_statistics[MY_ALL_CHARSETS_SIZE]; + + +my_bool my_collation_is_known_id(uint id) +{ + return id > 0 && id < array_elements(all_charsets) && all_charsets[id] ? + TRUE : FALSE; +}
Hmm. I see this function is mostly used in asserts, which is great. But you also use it in COLLATIONPROPERTY function, and I'm not sure it's correct. One can specify an id which is not loaded yet. COLLATIONPROPERTY should not fail because of that.
It should not fail. all_charsets[id] is not NULL for not loaded (but known) collations. It's allocated and have valid structure fields that store collation name, character set name, tailoring and some others. So a not loaded collation should not erroneously be treated as unknown, if you mean this.
I believe this function should be strictly for asserts, perhaps, even, as
void assert_collation_id_is_known(uint id) { DBUG_ASSERT(id > 0 && id < array_elements(all_charsets) && all_charsets[id]); }
+/* + Collation use statistics functions do not lock + counters to avoid mutex contention. This can lose + some counter increments with high thread concurrency. + But this should be Ok, as we don't need exact numbers. +*/ +static inline void my_collation_statistics_inc_use_count(uint id) +{ + DBUG_ASSERT(my_collation_is_known_id(id)); + my_collation_statistics[id].use_count++; +} + + +ulonglong my_collation_statistics_get_use_count(uint id) +{ + DBUG_ASSERT(my_collation_is_known_id(id)); + return my_collation_statistics[id].use_count;
Alternatively, you can define it to return 0 for unknown ids. Or -1. Would be a bit easier to use in fill_collation_statistics()
Not sure. I think my way is more future prove. First, you check if a collation is known and thus can be queried. Second, on success you query for its particular property.
+} + + === modified file 'plugin/feedback/utils.cc' --- plugin/feedback/utils.cc 2013-11-20 11:05:39 +0000 +++ plugin/feedback/utils.cc 2014-05-28 10:34:11 +0000 @@ -383,6 +383,25 @@ int fill_misc_data(THD *thd, TABLE_LIST return 0; }
+int fill_collation_statistics(THD *thd, TABLE_LIST *tables) +{ + TABLE *table= tables->table; + for (uint id= 1; id < MY_ALL_CHARSETS_SIZE; id++) + { + ulonglong count; + if (my_collation_is_known_id(id) && + (count= my_collation_statistics_get_use_count(id))) + { + char name[MY_CS_NAME_SIZE + 32]; + size_t namelen= my_snprintf(name, sizeof(name), + "Collation_usecount %s",
Let's rather call it "Collation %s used", to be in line with plugin rows in the feedback table. Having simply "%s used" would've been even more consistent, but we don't want to have plugins and collations in the same namespace. I wish I had plugins prefixed with the "Plugin", but it's too late to change.
Collation names have different lengths, so putting collation names as the last part of the value name would bring more readable results: +---------------------------------------+----+ | Collation used utf8_unicode_ci | 10 | | Collation used utf8_bin | 20 | | Collation used utf8mb4_unicode_520_ci | 30 | +---------------------------------------+----+ Compare to this: it looks messy: +---------------------------------------+----+ | Collation utf8_unicode_ci used | 10 | | Collation utf8_bin used | 20 | | Collation utf8mb4_unicode_520_ci used | 30 | +---------------------------------------+----+ Note, if we add some more statistics in the future, the difference in readability will be even stronger: This is easy to read: +---------------------------------------+----+ | Collation used utf8_unicode_ci | 10 | | Collation used utf8_bin | 20 | | Collation used utf8mb4_unicode_520_ci | 30 | | Collation xxx utf8_unicode_ci | 10 | | Collation xxx utf8_bin | 20 | | Collation xxx utf8mb4_unicode_520_ci | 30 | +---------------------------------------+----+ This is not: +---------------------------------------+----+ | Collation utf8_unicode_ci used | 10 | | Collation utf8_bin used | 20 | | Collation utf8mb4_unicode_520_ci used | 30 | | Collation utf8_unicode_ci xxx | 10 | | Collation utf8_bin xxx | 20 | | Collation utf8mb4_unicode_520_ci xxx | 30 | +---------------------------------------+----+ Finding a particular line in the output by eyes is much easier when the name is last.
+ get_charset_name(id)); + INSERT2(name, namelen, (count, UNSIGNED)); + } + } + return 0; +};
Nothing to comment about new added function - the code is straightforward. Two only issues to discuss about it - whether we want it at all.
Both function and new I_S tables have their own cons and pros. I don't know which is better :)
And, if yes, whether arguments should be strings or, perhaps, the second should be a keyword. Or, may be, the first one too.
Do you mean: SELECT COLLATIONPROPERTY('utf8_bin', TAILORING); or even: SELECT COLLATIONPROPERTY(utf8_bin, TAILORING); ?
Regards, Sergei
Hi, Alexander! On Jul 24, Alexander Barkov wrote:
On 07/03/2014 01:47 PM, Sergei Golubchik wrote:
Hi, Alexander!
I don't particularly like the idea of the COLLATIONPROPERTY function, The standard way of getting object metadata is to query information schema. Introducing "property" function has long term implications. What would be the way to get sysvar data?
select VARIABLE_VALUE from I_S.GLOBAL_VARIABLES where VARIABLE_NAME=xxx
or
select VARIABLEPROPERTY('xxx', 'Value')
and the help text?
select VARIABLEPROPERTY('xxx', 'Help')
Storage engine of a table:
select TABLEPROPERTY('xxx', 'Engine')
See - you're introducing a new way to access object metadata (and, may be, data too - "property" is kind of ambigous). And what might look like making sense now in 10.0 for collations, isn't necessarily reasonable long term.
As far as collations are concerned, I would rather see this data in the information_schema. In the new table or in the I_S.COLLATIONS - I don't know what's better, see also MDEV-6138.
I'm thinking of adding many collation properties in the future (e.g. tailoring, case/accent/kana sensitivity, Unicode version, comparison strength, sort strength, various Unicode Collation Algorithm options).
So I didn't want to flood the standard table INFORMATION_SCHEMA.COLLATIONS with all these properties.
Ah, yes, you're right. INFORMATION_SCHEMA.COLLATIONS is the standard one. Indeed, it's better to add a new table then.
It should be fine to add a new table. But I have some concerns:
1. I don't like mixing usage statistics and static properties into the same table.
Should we add two new tables with about these structures:
CREATE TABLE COLLATION_STATISTICS ( ID BIGINT NOT NULL, USE_COUNT BIGINT NOT NULL );
CREATE TABLE COLLATION_PROPERTIES ( ID BIGINT NOT NULL, TAILORING TEXT CHARACTER SET UTF8MB4 -- things like Unicode version, sensitivity to be added here later );
I'm fine with one table, actually.
2. To search by name, one will need a join by ID, e.g. using a NATURAL JOIN:
SELECT TAILORING FROM INFORMATION_SCHEMA.COLLATIONS NATURAL JOIN INFORMATION_SCHEMA.COLLATION_PROPERTIES WHERE COLLATION_NAME='utf8_polish_ci';
A little bit cumbersome. But should probably work.
Yes, ok.
3. Performance should be better with a function. With an I_S table, we'll have to populate the table every time with all rows, but in fact the caller in many cases may want only one row (e.g. searching by name or ID). But performance is rather not too much important here.
There's condition pushdown for I_S tables. It's only used for table-related I_S tables at the moment, but there's no reason why it cannot be used for collations, if this will become an issue.
=== modified file 'mysys/charset.c' --- mysys/charset.c 2013-11-08 22:20:07 +0000 +++ mysys/charset.c 2014-05-29 05:20:11 +0000 ... +my_bool my_collation_is_known_id(uint id) +{ + return id > 0 && id < array_elements(all_charsets) && all_charsets[id] ? + TRUE : FALSE; +}
Hmm. I see this function is mostly used in asserts, which is great. But you also use it in COLLATIONPROPERTY function, and I'm not sure it's correct. One can specify an id which is not loaded yet. COLLATIONPROPERTY should not fail because of that.
It should not fail. ... So a not loaded collation should not erroneously be treated as unknown, if you mean this.
Yes.
+/* + Collation use statistics functions do not lock + counters to avoid mutex contention. This can lose + some counter increments with high thread concurrency. + But this should be Ok, as we don't need exact numbers. +*/ +static inline void my_collation_statistics_inc_use_count(uint id) +{ + DBUG_ASSERT(my_collation_is_known_id(id)); + my_collation_statistics[id].use_count++; +} + + +ulonglong my_collation_statistics_get_use_count(uint id) +{ + DBUG_ASSERT(my_collation_is_known_id(id)); + return my_collation_statistics[id].use_count;
Alternatively, you can define it to return 0 for unknown ids. Or -1. Would be a bit easier to use in fill_collation_statistics()
Not sure. I think my way is more future prove. First, you check if a collation is known and thus can be queried. Second, on success you query for its particular property.
sure, that's up to you. that was just a suggestion, based on how you use this function in fill_collation_statistics().
+ "Collation_usecount %s",
Let's rather call it "Collation %s used", to be in line with plugin rows in the feedback table. Having simply "%s used" would've been even more consistent, but we don't want to have plugins and collations in the same namespace. I wish I had plugins prefixed with the "Plugin", but it's too late to change.
Collation names have different lengths, so putting collation names as the last part of the value name would bring more readable results:
+---------------------------------------+----+ | Collation used utf8_unicode_ci | 10 | | Collation used utf8_bin | 20 | | Collation used utf8mb4_unicode_520_ci | 30 | +---------------------------------------+----+
Fine. Readability is not the goal, the result is not really for a human consumption - the script will read and parse it. But I don't see why we need to make it intentionally unreadable :) So, "Collation used %s" is good Regards, Sergei
participants (2)
-
Alexander Barkov
-
Sergei Golubchik