Hi!
"Igor" == Igor Babaev <igor@askmonty.org> writes:
<cut>
ANALYZE TABLE t1 PERSISTENT; -- meaning all columns and indexes ANALYZE TABLE t1 PERSISTENT FOR COLUMNS (c,e,d) INDEXES (idx2, idx4);
<cut> Igor> The global server option --stat-tables says now whether the statistics Igor> tables have to be taken into account (without this option the server Igor> ignores them altogether) Igor> There is a system variable (global/session) that controls how the Igor> statistics table are used by queries/statements. The variable is called Igor> optimizer_use_stat_tables. Igor> There are 4 possible values for this variable: Igor> - 'never' Igor> - 'complementary' Igor> - 'preferably' Igor> - 'exclusively'. Why two variables? It should be enough to just have one variable 'use_stat_tables' or optimize_use_stat_tables. Igor> The default value for the variable now is 'never'. Igor> If we start the server with --stat-tables disabled the only meaningful Igor> value for the variable is 'never'. Other values are ignored Igor> Q1. Should we issue a warning with the attempt to reset this value Igor> when --stat-tables is disabled? Having one variable would solve the above question and make usage easier. Igor> Let me first tell you how the setting of this variable affects Igor> regular queries (SELECT, UPDATE etc). Igor> If the variable is set to: Igor> 1. case 'never': Igor> then no statistics from stat tables is used Igor> 2. case 'complementary' Igor> then the statistical value for a particular attribute is used Igor> from persistent tables ONLY when there is no corresponding Igor> statistics from other sources Igor> 3. case 'preferably' Igor> then the statistical value for a particular attribute is used Igor> from other sources ONLY when there is no corresponding Igor> statistics from persistent tables Igor> 4. case 'exclusively' Igor> the only statistics from persistent tables is used, if there is none Igor> on some attributes no statistics is used For what cases would we ever need 4) ? I think this is option is more dangerous than any of the other options because if you forget this one on, almost all queries will probably be table scans and most things will stop working. Having 1-3 is good enough for me. Igor> The variable setting is also affects how ANALYZE works. Igor> The current syntax of the ANALYZE command is Igor> ANALYZE TABLE t1, ...; Igor> An currently if @@optimizer_use_stat_tables is set to Igor> 1. 'never' Igor> then it works as before (e.g. in 5.5) Igor> 2. 'complementary' Igor> then it works as before + persistent statistics Igor> for table_stat and column_stat is collected Igor> 3. 'preferably' Igor> then it works as before + persistent statistics Igor> for table_stat, column_stat and index_stat is collected Igor> 4. 'exclusively' Igor> only persistent statistics for table_stat, column_stat Igor> and index_stat is collected I think 4) is not needed as we can do this if you use the proposed new analyze syntax. Igor> Q2. Is this semantics of ANALYZE is natural? Igor> Collection of persistent statistics with ANALYZE is now engine Igor> independent. So it's not a cheap and fast process. Igor> In many cases it is enough to collect/update statistics only some Igor> columns or even for some indexes. Igor> That's why I suggested to extend the syntax of the analyze. Igor> Let's choose something that is close to what SergeiG suggested. Igor> There are some questions here. Igor> Q3. If I need statistics for all columns and some indexes of table t1 Igor> and everything on table t2. Igor> can I use: Igor> ANALYZE TABLE t1 COLUMN[S] ALL INDEX[ES] (idx1,...), t2; ? For the first version, lets only do this extended syntax for a table at a time. It should preferably work like it works for CHECK TABLE / REPAIR. In other words: The options should be valid for all tables. (In this case the only thing that makes sences is to to either only do columns or indexes or both). Igor> Q4. Do we really need the keyword PERSISTENT here? Igor> If so what is its semantics? I don't mind the words persistent. I kind of like it as it's giving the user more information of what is going (ie, that the result will be stored once and for all). Igor> Q5. Isn't 'FOR' in the syntax of ANALYZE confusing in the sense that Igor> it commands to collect statistics for column_stat/index_stat, Igor> but not for table_stat? Maybe it's better to omit it? Comparing: ANALYZE TABLE t1 PERSISTENT COLUMNS (c,e,d) INDEXES (idx2, idx4); ANALYZE TABLE t1 PERSISTENT FOR COLUMNS (c,e,d) INDEXES (idx2, idx4); The later is more natural english, but I agree that it's not clear that the command also does table statistics. However as ANALYZE table does generally do table statistics, it's not totally wrong. I tried to think of a new syntax but didn't come up with anything better than: ANALYZE TABLE t1 PERSISTENT INCLUDING COLUMNS (c,e,d) INDEXES (idx2, idx4); but it's harder to remember than simple using FOR Igor> I do not plan to extend the syntax of ANALYZE to collect/update Igor> statistics only on some attributes of the columns (like collect Igor> statistics on min/max values, but not on the number of distinct values). Igor> Although I admit that collecting statistics on the number of distinct Igor> values is more expensive I don'tthink such an extension of ANALYZE is Igor> badly needed now. What statistics are you collecting for columns? I thought that the only usable value would be distinct values. Can we somehow cheaply get an approximation for distinct values? For example, assume we use the following algorithm: - Set up dynamic arrays to hold up 1000 distinct values per column - Start filling the column. - When one array fills up, remmeber the number of rows involved and reset the array. - The approximated number of distinct values could then be approximated as: total_fills * 1000 + current_number_in_array This would be close to correct for: - Data that is clustered - Columns with less than 1000 distinct values - Tables with very few similar values The worst case scenario is if the same values repeats uniquely every 1000 rows but that is not very likely. Even then, it would not bee too bad to get a larger value for distinct as there would probably be a disk io involved for every value we would need to read (as the values would be far appart on the disk and could not be read on the same fetch). Igor> I plan to add a possibility to collect statistics from random sample Igor> records. I plan to control the process of sampling only through system Igor> variables. ok. Igor> How does all above comply with persistent statistics of InnoDB from Igor> mysql-5.6? Almost perfectly. They store their persistent statistics Igor> in mysql.innodb_table_stat and mysql.innodb_index_stat. So we do not Igor> conflict here. The collection and the usage of their persistent Igor> statistics is going through their own system variables. Again - no Igor> conflicts. Igor> I will appreciate any comments on this post and, of course, Igor> I would like to have some answers for the above questions from Igor> SergeiG. Hope the above answers most/all of your questions? Don't worry about the syntax. We can always fix that later when we got some feedback from users! Regards, Monty