Re: [Maria-developers] Support of the extended syntax for ANALYZE
Hi, Igor! On May 10, Igor Babaev wrote:
Serg,
Here's the patch we talked about on IRC today. ... +ANALYZE TABLE t1(COLUMN(), INDEX()); +ANALYZE TABLE t1(COLUMN(c,e,b), INDEX(idx2,idx4)); +ANALYZE TABLE t1(COLUMN(*), INDEX(*));
I see what you want to do. I think it'll be more SQL-like, if it'll be more verbose, for example ANALYZE TABLE t1 FOR COLUMNS (c,e,d) INDEXES (idx2, idx4); I would prefer to drop parentheses too, but then the grammar will become ambiguous :( And instead of double (*) it could simply be ANALYZE TABLE t1 FULL; or ALL; it'd be similar to CHECK TABLE t1 EXTENDED; REPAIR TABLE t1 QUICK; etc Alternatively, it could have an explicit "PERSISTENT" keyword: ANALYZE TABLE t1 PERSISTENT; -- meaning all columns and indexes ANALYZE TABLE t1 PERSISTENT FOR COLUMNS (c,e,d) INDEXES (idx2, idx4); That's probably the best one. Reasonably SQL-like. And the most common use case (analyze all) is short, it's clearly the main, default behavior. And the clause to specify columns and indexes extends the default syntax. Regards, Sergei
On 05/19/2012 04:22 AM, Sergei Golubchik wrote:
Hi, Igor!
On May 10, Igor Babaev wrote:
Serg,
Here's the patch we talked about on IRC today. ... +ANALYZE TABLE t1(COLUMN(), INDEX()); +ANALYZE TABLE t1(COLUMN(c,e,b), INDEX(idx2,idx4)); +ANALYZE TABLE t1(COLUMN(*), INDEX(*));
I see what you want to do.
I think it'll be more SQL-like, if it'll be more verbose, for example
ANALYZE TABLE t1 FOR COLUMNS (c,e,d) INDEXES (idx2, idx4);
I would prefer to drop parentheses too, but then the grammar will become ambiguous :( And instead of double (*) it could simply be
ANALYZE TABLE t1 FULL; or ALL;
it'd be similar to
CHECK TABLE t1 EXTENDED; REPAIR TABLE t1 QUICK; etc
Alternatively, it could have an explicit "PERSISTENT" keyword:
ANALYZE TABLE t1 PERSISTENT; -- meaning all columns and indexes ANALYZE TABLE t1 PERSISTENT FOR COLUMNS (c,e,d) INDEXES (idx2, idx4);
We usually don't use key words in plural. So maybe you meant: ANALYZE TABLE t1 PERSISTENT FOR COLUMN (c,e,d) INDEX (idx2, idx4); ? Regards, Igor.
That's probably the best one. Reasonably SQL-like. And the most common use case (analyze all) is short, it's clearly the main, default behavior. And the clause to specify columns and indexes extends the default syntax.
Regards, Sergei
" We usually don't use key words in plural." So what about SHOW COLUMNS? -- Peter On Sat, May 19, 2012 at 7:41 PM, Igor Babaev <igor@askmonty.org> wrote:
On 05/19/2012 04:22 AM, Sergei Golubchik wrote:
Hi, Igor!
On May 10, Igor Babaev wrote:
Serg,
Here's the patch we talked about on IRC today. ... +ANALYZE TABLE t1(COLUMN(), INDEX()); +ANALYZE TABLE t1(COLUMN(c,e,b), INDEX(idx2,idx4)); +ANALYZE TABLE t1(COLUMN(*), INDEX(*));
I see what you want to do.
I think it'll be more SQL-like, if it'll be more verbose, for example
ANALYZE TABLE t1 FOR COLUMNS (c,e,d) INDEXES (idx2, idx4);
I would prefer to drop parentheses too, but then the grammar will become ambiguous :( And instead of double (*) it could simply be
ANALYZE TABLE t1 FULL; or ALL;
it'd be similar to
CHECK TABLE t1 EXTENDED; REPAIR TABLE t1 QUICK; etc
Alternatively, it could have an explicit "PERSISTENT" keyword:
ANALYZE TABLE t1 PERSISTENT; -- meaning all columns and indexes ANALYZE TABLE t1 PERSISTENT FOR COLUMNS (c,e,d) INDEXES (idx2, idx4);
We usually don't use key words in plural. So maybe you meant: ANALYZE TABLE t1 PERSISTENT FOR COLUMN (c,e,d) INDEX (idx2, idx4); ?
Regards, Igor.
That's probably the best one. Reasonably SQL-like. And the most common use case (analyze all) is short, it's clearly the main, default behavior. And the clause to specify columns and indexes extends the default syntax.
Regards, Sergei
_______________________________________________ Mailing list: https://launchpad.net/~maria-developers Post to : maria-developers@lists.launchpad.net Unsubscribe : https://launchpad.net/~maria-developers More help : https://help.launchpad.net/ListHelp
Hi, Igor! On May 19, Igor Babaev wrote:
On 05/19/2012 04:22 AM, Sergei Golubchik wrote:
On May 10, Igor Babaev wrote: Alternatively, it could have an explicit "PERSISTENT" keyword:
ANALYZE TABLE t1 PERSISTENT; -- meaning all columns and indexes ANALYZE TABLE t1 PERSISTENT FOR COLUMNS (c,e,d) INDEXES (idx2, idx4);
We usually don't use key words in plural. So maybe you meant: ANALYZE TABLE t1 PERSISTENT FOR COLUMN (c,e,d) INDEX (idx2, idx4); ?
I've checked before suggesting this syntax. Both COLUMNS and INDEXES already exist as tokens and used somewhere in the grammar. For "INDEXES" you can use the existing keys_or_index rule: keys_or_index: KEYS {} | INDEX_SYM {} | INDEXES {} ; it's typically used in places like that. There is no corresponding column_or_columns, but feel free to create it, if you'd like. Regards, Sergei
Serg, All, On 05/19/2012 10:51 AM, Sergei Golubchik wrote:
Hi, Igor!
On May 19, Igor Babaev wrote:
On 05/19/2012 04:22 AM, Sergei Golubchik wrote:
On May 10, Igor Babaev wrote: Alternatively, it could have an explicit "PERSISTENT" keyword:
ANALYZE TABLE t1 PERSISTENT; -- meaning all columns and indexes ANALYZE TABLE t1 PERSISTENT FOR COLUMNS (c,e,d) INDEXES (idx2, idx4);
We usually don't use key words in plural. So maybe you meant: ANALYZE TABLE t1 PERSISTENT FOR COLUMN (c,e,d) INDEX (idx2, idx4); ?
I've checked before suggesting this syntax. Both COLUMNS and INDEXES already exist as tokens and used somewhere in the grammar.
For "INDEXES" you can use the existing keys_or_index rule:
keys_or_index: KEYS {} | INDEX_SYM {} | INDEXES {} ;
it's typically used in places like that. There is no corresponding column_or_columns, but feel free to create it, if you'd like.
Ok, Basically I'm positive about your suggestions. There are some details I'm doubting about, but I'll dwell on them later. Let's talk about the semantics of the extended ANALYZE first. We need to agree upon it because now I plan moving from a 'proof of concept' for the feature to a 'real' implementation. For this I have to give you more context. The engine persistent independent statistics is stored in 3 tables of the system mysql database: table_stat, column_stat, index_stat. The particular contents of these tables is not important for the further discussion here, it's enough to know that: - table_stat contains some statistics on the table in general - column_stat contains statistics on table columns - index_stat contains statistics on index prefixes. The global server option --stat-tables says now whether the statistics tables have to be taken into account (without this option the server ignores them altogether) There is a system variable (global/session) that controls how the statistics table are used by queries/statements. The variable is called optimizer_use_stat_tables. There are 4 possible values for this variable: - 'never' - 'complementary' - 'preferably' - 'exclusively'. The default value for the variable now is 'never'. If we start the server with --stat-tables disabled the only meaningful value for the variable is 'never'. Other values are ignored Q1. Should we issue a warning with the attempt to reset this value when --stat-tables is disabled? Let me first tell you how the setting of this variable affects regular queries (SELECT, UPDATE etc). If the variable is set to: 1. case 'never': then no statistics from stat tables is used 2. case 'complementary' then the statistical value for a particular attribute is used from persistent tables ONLY when there is no corresponding statistics from other sources 3. case 'preferably' then the statistical value for a particular attribute is used from other sources ONLY when there is no corresponding statistics from persistent tables 4. case 'exclusively' the only statistics from persistent tables is used, if there is none on some attributes no statistics is used The variable setting is also affects how ANALYZE works. The current syntax of the ANALYZE command is ANALYZE TABLE t1, ...; An currently if @@optimizer_use_stat_tables is set to 1. 'never' then it works as before (e.g. in 5.5) 2. 'complementary' then it works as before + persistent statistics for table_stat and column_stat is collected 3. 'preferably' then it works as before + persistent statistics for table_stat, column_stat and index_stat is collected 4. 'exclusively' only persistent statistics for table_stat, column_stat and index_stat is collected Q2. Is this semantics of ANALYZE is natural? Collection of persistent statistics with ANALYZE is now engine independent. So it's not a cheap and fast process. In many cases it is enough to collect/update statistics only some columns or even for some indexes. That's why I suggested to extend the syntax of the analyze. Let's choose something that is close to what SergeiG suggested. There are some questions here. Q3. If I need statistics for all columns and some indexes of table t1 and everything on table t2. can I use: ANALYZE TABLE t1 COLUMN[S] ALL INDEX[ES] (idx1,...), t2; ? Q4. Do we really need the keyword PERSISTENT here? If so what is its semantics? Q5. Isn't 'FOR' in the syntax of ANALYZE confusing in the sense that it commands to collect statistics for column_stat/index_stat, but not for table_stat? Maybe it's better to omit it? I do not plan to extend the syntax of ANALYZE to collect/update statistics only on some attributes of the columns (like collect statistics on min/max values, but not on the number of distinct values). Although I admit that collecting statistics on the number of distinct values is more expensive I don'tthink such an extension of ANALYZE is badly needed now. I plan to add a possibility to collect statistics from random sample records. I plan to control the process of sampling only through system variables. How does all above comply with persistent statistics of InnoDB from mysql-5.6? Almost perfectly. They store their persistent statistics in mysql.innodb_table_stat and mysql.innodb_index_stat. So we do not conflict here. The collection and the usage of their persistent statistics is going through their own system variables. Again - no conflicts. I will appreciate any comments on this post and, of course, I would like to have some answers for the above questions from SergeiG. Regards, Igor. then any particular satistics from
Regards, Sergei
Hi!
"Igor" == Igor Babaev <igor@askmonty.org> writes:
<cut>
ANALYZE TABLE t1 PERSISTENT; -- meaning all columns and indexes ANALYZE TABLE t1 PERSISTENT FOR COLUMNS (c,e,d) INDEXES (idx2, idx4);
<cut> Igor> The global server option --stat-tables says now whether the statistics Igor> tables have to be taken into account (without this option the server Igor> ignores them altogether) Igor> There is a system variable (global/session) that controls how the Igor> statistics table are used by queries/statements. The variable is called Igor> optimizer_use_stat_tables. Igor> There are 4 possible values for this variable: Igor> - 'never' Igor> - 'complementary' Igor> - 'preferably' Igor> - 'exclusively'. Why two variables? It should be enough to just have one variable 'use_stat_tables' or optimize_use_stat_tables. Igor> The default value for the variable now is 'never'. Igor> If we start the server with --stat-tables disabled the only meaningful Igor> value for the variable is 'never'. Other values are ignored Igor> Q1. Should we issue a warning with the attempt to reset this value Igor> when --stat-tables is disabled? Having one variable would solve the above question and make usage easier. Igor> Let me first tell you how the setting of this variable affects Igor> regular queries (SELECT, UPDATE etc). Igor> If the variable is set to: Igor> 1. case 'never': Igor> then no statistics from stat tables is used Igor> 2. case 'complementary' Igor> then the statistical value for a particular attribute is used Igor> from persistent tables ONLY when there is no corresponding Igor> statistics from other sources Igor> 3. case 'preferably' Igor> then the statistical value for a particular attribute is used Igor> from other sources ONLY when there is no corresponding Igor> statistics from persistent tables Igor> 4. case 'exclusively' Igor> the only statistics from persistent tables is used, if there is none Igor> on some attributes no statistics is used For what cases would we ever need 4) ? I think this is option is more dangerous than any of the other options because if you forget this one on, almost all queries will probably be table scans and most things will stop working. Having 1-3 is good enough for me. Igor> The variable setting is also affects how ANALYZE works. Igor> The current syntax of the ANALYZE command is Igor> ANALYZE TABLE t1, ...; Igor> An currently if @@optimizer_use_stat_tables is set to Igor> 1. 'never' Igor> then it works as before (e.g. in 5.5) Igor> 2. 'complementary' Igor> then it works as before + persistent statistics Igor> for table_stat and column_stat is collected Igor> 3. 'preferably' Igor> then it works as before + persistent statistics Igor> for table_stat, column_stat and index_stat is collected Igor> 4. 'exclusively' Igor> only persistent statistics for table_stat, column_stat Igor> and index_stat is collected I think 4) is not needed as we can do this if you use the proposed new analyze syntax. Igor> Q2. Is this semantics of ANALYZE is natural? Igor> Collection of persistent statistics with ANALYZE is now engine Igor> independent. So it's not a cheap and fast process. Igor> In many cases it is enough to collect/update statistics only some Igor> columns or even for some indexes. Igor> That's why I suggested to extend the syntax of the analyze. Igor> Let's choose something that is close to what SergeiG suggested. Igor> There are some questions here. Igor> Q3. If I need statistics for all columns and some indexes of table t1 Igor> and everything on table t2. Igor> can I use: Igor> ANALYZE TABLE t1 COLUMN[S] ALL INDEX[ES] (idx1,...), t2; ? For the first version, lets only do this extended syntax for a table at a time. It should preferably work like it works for CHECK TABLE / REPAIR. In other words: The options should be valid for all tables. (In this case the only thing that makes sences is to to either only do columns or indexes or both). Igor> Q4. Do we really need the keyword PERSISTENT here? Igor> If so what is its semantics? I don't mind the words persistent. I kind of like it as it's giving the user more information of what is going (ie, that the result will be stored once and for all). Igor> Q5. Isn't 'FOR' in the syntax of ANALYZE confusing in the sense that Igor> it commands to collect statistics for column_stat/index_stat, Igor> but not for table_stat? Maybe it's better to omit it? Comparing: ANALYZE TABLE t1 PERSISTENT COLUMNS (c,e,d) INDEXES (idx2, idx4); ANALYZE TABLE t1 PERSISTENT FOR COLUMNS (c,e,d) INDEXES (idx2, idx4); The later is more natural english, but I agree that it's not clear that the command also does table statistics. However as ANALYZE table does generally do table statistics, it's not totally wrong. I tried to think of a new syntax but didn't come up with anything better than: ANALYZE TABLE t1 PERSISTENT INCLUDING COLUMNS (c,e,d) INDEXES (idx2, idx4); but it's harder to remember than simple using FOR Igor> I do not plan to extend the syntax of ANALYZE to collect/update Igor> statistics only on some attributes of the columns (like collect Igor> statistics on min/max values, but not on the number of distinct values). Igor> Although I admit that collecting statistics on the number of distinct Igor> values is more expensive I don'tthink such an extension of ANALYZE is Igor> badly needed now. What statistics are you collecting for columns? I thought that the only usable value would be distinct values. Can we somehow cheaply get an approximation for distinct values? For example, assume we use the following algorithm: - Set up dynamic arrays to hold up 1000 distinct values per column - Start filling the column. - When one array fills up, remmeber the number of rows involved and reset the array. - The approximated number of distinct values could then be approximated as: total_fills * 1000 + current_number_in_array This would be close to correct for: - Data that is clustered - Columns with less than 1000 distinct values - Tables with very few similar values The worst case scenario is if the same values repeats uniquely every 1000 rows but that is not very likely. Even then, it would not bee too bad to get a larger value for distinct as there would probably be a disk io involved for every value we would need to read (as the values would be far appart on the disk and could not be read on the same fetch). Igor> I plan to add a possibility to collect statistics from random sample Igor> records. I plan to control the process of sampling only through system Igor> variables. ok. Igor> How does all above comply with persistent statistics of InnoDB from Igor> mysql-5.6? Almost perfectly. They store their persistent statistics Igor> in mysql.innodb_table_stat and mysql.innodb_index_stat. So we do not Igor> conflict here. The collection and the usage of their persistent Igor> statistics is going through their own system variables. Again - no Igor> conflicts. Igor> I will appreciate any comments on this post and, of course, Igor> I would like to have some answers for the above questions from Igor> SergeiG. Hope the above answers most/all of your questions? Don't worry about the syntax. We can always fix that later when we got some feedback from users! Regards, Monty
Hi, Igor! Here're my replies. I've seen Monty's replies, and in many (not all) cases my reply will be similar On May 19, Igor Babaev wrote:
On 05/19/2012 10:51 AM, Sergei Golubchik wrote:
ANALYZE TABLE t1 PERSISTENT FOR COLUMNS (c,e,d) INDEXES (idx2, idx4);
The global server option --stat-tables says now whether the statistics tables have to be taken into account (without this option the server ignores them altogether)
There is a system variable (global/session) that controls how the statistics table are used by queries/statements. The variable is called optimizer_use_stat_tables.
My first thought was - why do you need two variables, instead of one? I see that Monty wondered about it too.
There are 4 possible values for this variable: - 'never' - 'complementary' - 'preferably' - 'exclusively'.
Unlike Monty, I have no opinion about 'exclusively' mode. I'm fine with it being present. I'm fine with it being absent too.
Q1. Should we issue a warning with the attempt to reset this value when --stat-tables is disabled?
Agree with Monty. Having one variable solves this question.
The variable setting is also affects how ANALYZE works. The current syntax of the ANALYZE command is
ANALYZE TABLE t1, ...;
An currently if @@optimizer_use_stat_tables is set to 1. 'never' then it works as before (e.g. in 5.5) 2. 'complementary' then it works as before + persistent statistics for table_stat and column_stat is collected 3. 'preferably' then it works as before + persistent statistics for table_stat, column_stat and index_stat is collected 4. 'exclusively' only persistent statistics for table_stat, column_stat and index_stat is collected
Q2. Is this semantics of ANALYZE is natural?
I don't think I like it. I'd prefer @@optimizer_use_stat_tables to affect *only* the optimizer. And ANALYZE is only affected by various optional ANALYZE clauses, and not by @@optimizer_use_stat_tables.
Q3. If I need statistics for all columns and some indexes of table t1 and everything on table t2. can I use: ANALYZE TABLE t1 COLUMN[S] ALL INDEX[ES] (idx1,...), t2; ?
Note that I intentionally avoided commas in the new syntax. It allows to write ANALYZE TABLE t1 PERSISTENT FOR COLUMNS ALL INDEXES (idx1,...), t2 PERSISTENT FOR COLUMN (a) INDEXES (i1, i2), t3, t4 PERSISTENT; That is PERSISTENT FOR COLUMNS ... INDEXES ... clauses apply to the table.
Q4. Do we really need the keyword PERSISTENT here? If so what is its semantics?
The idea is to distinguish between ANALYZE TABLE t1; and ANALYZE TABLE t1 PERSISTENT; the first collects no persistent statistics, the second - all of it. But also it's generally in the spirit of SQL standard. The standard syntax is very verbose and has quite a few clauses that are redundant, but make SQL more readable.
Q5. Isn't 'FOR' in the syntax of ANALYZE confusing in the sense that it commands to collect statistics for column_stat/index_stat, but not for table_stat? Maybe it's better to omit it?
I agree in a sense that with this syntax there's no way *not* to collect table statistics. Is it a problem? I think not. I mean - when you collect any column/index statistics, if you collect table statistics at the same time it will add no overhead, right? So generally users will never want to avoid collecting it. And in rare cases when one wants to play with query plans, table statistics can always be deleted from the table_stat. I agree that FOR is completely redundant. It's like "INTO" in INSERT INTO table ... I'm fine if you make FOR optional (just like INTO is).
I plan to add a possibility to collect statistics from random sample records. I plan to control the process of sampling only through system variables.
Uhm. Arguable. Perhaps it's ok, or perhaps extending ANALYZE syntax would be better. I cannot comment until I see what exactly you want to control. Hope that helps. Regards, Sergei
participants (4)
-
Igor Babaev
-
Michael Widenius
-
Peter Laursen
-
Sergei Golubchik