[Maria-developers] Performance problem: need index-based access when have only non-sargable conditions
Hi! I was asked about this performance problem: consider a query select * from tbl where tbl.key like '%foo%' Table records are big, much larger than the length of the key. LIKE condition is very selective, however its pattern starts with '%', so we can't construct range access (and they are really searching for matches in word suffixes/infixes, so fulltext index isn't an option). The desired execution strategy is: - Scan the index on tbl.key and collect index tuples that match the LIKE condition. - For those index tuples, retrieve full records. One can simulate it by rewriting the query as a self join: select * from tbl A, tbl B where B.primary_key = A.primary_key AND B.key LIKE '%foo%'; However, in this particular case they have a problem doing such rewrites because the queries are generated by some ORM tool. Proposed solution from our side ------------------------------- It is not difficult to add an execution strategy like this get_next_record() { read index tuple; if (condition satisified) { read full record } } (with optional rowid-ordered retrieval for full records). The difficult part is to find a way to decided whether this strategy should be used. We have no meaningful selectivity estimates for 'column LIKE '%foo%'. The only way we could achieve this is by having the user manually specify condition selectivities. I can see one way to achieve this without getting into painful syntax modifications: select * from tbl where selectivity(tbl.key like '%foo%', 0.05) here 'selectivity(X,y)' will evaluate to value of X, but will also inform the optimizer that P(X is true)=y. Any thoughts about this? BR Sergey -- Sergey Petrunia, Software Developer Monty Program AB, http://askmonty.org Blog: http://s.petrunia.net/blog
Hi, Sergey! On Oct 10, Sergey Petrunya wrote:
I was asked about this performance problem: consider a query
(so was I, so was Igor, and may be others :)
select * from tbl where tbl.key like '%foo%'
Table records are big, much larger than the length of the key. LIKE condition is very selective, however its pattern starts with '%', so we can't construct range access (and they are really searching for matches in word suffixes/infixes, so fulltext index isn't an option).
The desired execution strategy is:
- Scan the index on tbl.key and collect index tuples that match the LIKE condition. - For those index tuples, retrieve full records.
One can simulate it by rewriting the query as a self join:
select * from tbl A, tbl B where B.primary_key = A.primary_key AND B.key LIKE '%foo%';
However, in this particular case they have a problem doing such rewrites because the queries are generated by some ORM tool.
Proposed solution from our side ------------------------------- It is not difficult to add an execution strategy like this
get_next_record() { read index tuple; if (condition satisified) { read full record } } (with optional rowid-ordered retrieval for full records).
The difficult part is to find a way to decided whether this strategy should be used. We have no meaningful selectivity estimates for 'column LIKE '%foo%'.
The only way we could achieve this is by having the user manually specify condition selectivities. I can see one way to achieve this without getting into painful syntax modifications:
select * from tbl where selectivity(tbl.key like '%foo%', 0.05)
here 'selectivity(X,y)' will evaluate to value of X, but will also inform the optimizer that P(X is true)=y.
I don't like the manual selectivity() "function". Igor was mentioning random dives to estimate the selectivity. That could work. Another thought - for InnoDB (and other engines that use primary key as a rowid and convert table scan to an index scan) your execution strategy is practically never worse than reading full rows right away. That is for these engines we can always use your execution strategy without any selectivity estimates. Additionally, we could've make FORCE INDEX to force this strategy too - not nice, but at least it would avoid syntax extensions. Besides, think of it, it would be the "natural" behavior, as the user has already tried FORCE INDEX to achieve this effect, that is, FORCE INDEX is expected to work here. So, the minimal change could be - always use this strategy for InnoDB (and some other engines), never use it automatically for MyISAM (and some other engines), always use it if FORCE INDEX was specified. Regards, Sergei
participants (2)
-
Sergei Golubchik
-
Sergey Petrunya