[Maria-developers] Cost analysis: optimizer statistics vs real dataset properties
So, we have done: alter table name add index(n_name); and now we have: explain extended select sql_calc_found_rows s_name, s_address from nation straight_join supplier where s_suppkey in (select ps_suppkey from partsupp where ps_partkey in (select p_partkey from part where p_name like 'forest%') and ps_availqty > (select 0.5 * sum(l_quantity) from lineitem where l_partkey = ps_partkey and l_suppkey = ps_suppkey and l_shipdate >= date('1994-01-01') and l_shipdate < date('1994-01-01') + interval '1' year )) and s_nationkey = n_nationkey and n_name = 'canada' order by s_name limit 10; +--+------------------+-----------+------++-------------------+-------+---------------------------------------+----+---------------------------------------------------------+ |id|select_type |table |type ||key |key_len|ref |rows|Extra | +--+------------------+-----------+------++-------------------+-------+---------------------------------------+----+---------------------------------------------------------+ | 1|PRIMARY |nation |ref ||n_name |26 |const | 1|Using where; Using index; Using temporary; Using filesort| | 1|PRIMARY |supplier |ref ||i_s_nationkey |5 |nation.n_nationkey | 251| | | 1|PRIMARY |<subquery2>|eq_ref||distinct_key |4 |func | 1| | | 2|MATERIALIZED |part |range ||p_name |58 |NULL |2387|Using where; Using index | | 2|MATERIALIZED |partsupp |ref ||PRIMARY |4 |part.p_partkey | 1|Using where | | 4|DEPENDENT SUBQUERY|lineitem |ref ||i_l_suppkey_partkey|10 |partsupp.ps_partkey,partsupp.ps_suppkey| 3|Using where | +--+------------------+-----------+------++-------------------+-------+---------------------------------------+----+---------------------------------------------------------+ === nation: === 1 row scanned, for real OK. - "using where" doesn't filter anything out. === supplier === - dataset has 412 rows for CANADA - rec_per_key: 10L suppliers / 25 countries=400 === <subquery2> == - eq_ref, we know that there's always match - "using where" won't filter anything out. === part === - dataset/where match 2378 rows for (p_name like 'forest%') OK. - "using where" won't filter anything out. === partsupp === - rec_per_key: partsupp has 800K rows and 200K distinct ps_partkey values, which gives rec_per_key=4 (while EXPLAIN shows 1!) - select count(*) from partsupp where ps_partkey in (select p_partkey from part where p_name like 'forest%'); gives 9552 rows. (rec_per_key=4) * 2378 = 9512 , close to 9552. - "using where" won't filter anything out (NOT TAKING scalar-subquery into account) +--+-------------+--------+------++-------------------+-------+---------------------------------------+----+--------+---------------------------------------------------------+ |id|select_type |table |type ||key |key_len|ref |rows|filtered|Extra | +--+-------------+--------+------++-------------------+-------+---------------------------------------+----+--------+---------------------------------------------------------+ | 1|PRIMARY |nation |ref ||n_name |26 |const | 1| 100.00|Using where; Using index; Using temporary; Using filesort| | 1|PRIMARY |supplier|ref ||i_s_nationkey |5 |nation.n_nationkey | 251| 100.00| | | 1|PRIMARY |partsupp|ref ||i_ps_suppkey |4 |supplier.s_suppkey | 34| 100.00|Using where | | 1|PRIMARY |part |eq_ref||PRIMARY |4 |partsupp.ps_partkey | 1| 100.00|Using where; FirstMatch(supplier) | | 4|DEP. SUBQUERY|lineitem|ref ||i_l_suppkey_partkey|10 |partsupp.ps_partkey,partsupp.ps_suppkey| 3| 100.00|Using where | +--+-------------+--------+------++-------------------+-------+---------------------------------------+----+--------+---------------------------------------------------------+ 5 rows in set, 3 warnings (0.01 sec) === nation: === like above, 1 row. OK. - "using where" won't filter anything out. === supplier: === see above, 251 rows, ~ok 412 real... === partsupp: === - 800K rows and 10K distinct ps_suppkey, which gives rec_per_key=80 (EXPLAIN shows 34) - as for data that we will hit: select count(*) from nation, supplier, partsupp where s_suppkey=ps_suppkey and s_nationkey=n_nationkey and n_name='canada'; gives 32690 rows. 32690 / 412 = 80, matches rec_per_key. - "using where" won't filter anything out (NOT TAKING scalar-subquery into account) === part: === - eq_ref, so exactly 1 match. We know DBT-3 dataset is such that it always has one. - "using where" is "p_name like forest"! it will filter stuff out! ^^^^^^^^ but does it matter.. subquery is attached to ps? ********* Some benchmarking ************* Execution times: - FirstMatch is ~15 sec - Materialization: ~5 sec. With scalar-subquery predicate replaced with TRUE: - Materialization: ~0.20 sec. - FirstMatch: 0.19 sec. Number of times the scalar subquery was executed: - DEP. SUBQUERY: ... 24055 times - MATERIALIZATION: 9555 times If we take the cost of scalar-subquery predicate evaluation into account, the difference should be around: (24055 - 9555) * 5 = 72K which is sufficient. The problem is, Materialization executes the scalar-subquery fewer times than FirstMatch. And you expect that taking into account scalar-subquery execution cost will cause Materialization not to be used?? BR Sergei -- Sergei Petrunia, Software Developer Monty Program AB, http://askmonty.org Blog: http://s.petrunia.net/blog
participants (1)
-
Sergei Petrunia