[Maria-developers] Benchmarking index_merge sort_intersect
Hi! Please find below results of my attempts to figure out if/when one should "index_merge_sort_intersection=on" setting. General setup information: * Experiment was done on my laptop * I picked "on time flight statistics" dataset, because - it has lots of columns, so one can't be expected to have all possible composite indexes - it allows for meaningful intersection queries - the data is real and I don't expect it to be extremely correlated (although some aiports/airlines are probably worse than others, so it's not totally uncrellated, either). * Storage engine is XtraDB * I loaded data for Q1 2009, which is 1.5M rows, and ontime.ibd file is 704M: MariaDB [ontime]> select count(*), min(FlightDate), max(FlightDate) from ontime; +----------+-----------------+-----------------+ | count(*) | min(FlightDate) | max(FlightDate) | +----------+-----------------+-----------------+ | 1578171 | 2009-01-01 | 2009-03-31 | +----------+-----------------+-----------------+ 1 row in set (15.92 sec) MariaDB [ontime]> set optimizer_switch='index_merge_sort_intersection=on'; MariaDB [ontime]> explain select count(*) from ontime where depdelay > 30 and OriginState IN ('WA', 'CA'); +----+-------------+--------+-------+----------------------+----------------------+---------+------+-------+---------------------------------------------------------+ | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra | +----+-------------+--------+-------+----------------------+----------------------+---------+------+-------+---------------------------------------------------------+ | 1 | SIMPLE | ontime | range | OriginState,DepDelay | DepDelay,OriginState | 5,3 | NULL | 41832 | Using sort_intersect(DepDelay,OriginState); Using where | +----+-------------+--------+-------+----------------------+----------------------+---------+------+-------+---------------------------------------------------------+ 1 row in set (0.01 sec) ## (Run query several times until query time stabilizes) MariaDB [ontime]> select count(*) from ontime where depdelay > 30 and OriginState IN ('WA', 'CA'); +----------+ | count(*) | +----------+ | 18824 | +----------+ 1 row in set (2.37 sec) MariaDB [ontime]> set optimizer_switch='index_merge_sort_intersection=off'; MariaDB [ontime]> explain select count(*) from ontime where depdelay > 30 and OriginState IN ('WA', 'CA'); +----+-------------+--------+-------+----------------------+----------+---------+------+--------+-----------------------------------------------+ | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra | +----+-------------+--------+-------+----------------------+----------+---------+------+--------+-----------------------------------------------+ | 1 | SIMPLE | ontime | range | OriginState,DepDelay | DepDelay | 5 | NULL | 201894 | Using index condition; Using where; Using MRR | +----+-------------+--------+-------+----------------------+----------+---------+------+--------+-----------------------------------------------+ 1 row in set (0.01 sec) ## (Run query several times until query time stabilizes) MariaDB [ontime]> select count(*) from ontime where depdelay > 30 and OriginState IN ('WA', 'CA'); +----------+ | count(*) | +----------+ | 18824 | +----------+ 1 row in set (18.27 sec) ## For comparison, without indexes whatsoever: MariaDB [ontime]> select count(*) from ontime ignore index (depdelay, originstate) where depdelay > 30 and OriginState IN ('WA', 'CA'); +----------+ | count(*) | +----------+ | 18824 | +----------+ 1 row in set (15.50 sec) "iostat -x" showed disk utilization as 0% during all of the above listed queries, so all accessed data was either in cache or in buffer pool. For an idea about predicate selectivities and correlation: ROWS FRACTION whole table 1578171 100.0% depdelay>30 151806 9.6% OriginState IN ('WA', 'CA') 206576 13.1% both 18824 1.1% Conclusions: - sort-intersect gives about 7x improvement - Igor's statement that sort-intersect can't be useful when the load is not IO-bound is not supported by the experiment. My next step was to try with narrower ranges. I changed the predicates to - OriginState IN ('IL', 'VW') (100K rows instead of 200K we had with WA+CA) - depdelay > 60 (75 K rows instead of 150K we had with depdelay>30) MariaDB [ontime]> set optimizer_switch='index_merge_sort_intersection=on'; MariaDB [ontime]> explain select count(*) from ontime where depdelay > 60 and OriginState IN ('IL', 'VT'); +----+-------------+--------+-------+----------------------+----------------------+---------+------+-------+---------------------------------------------------------+ | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra | +----+-------------+--------+-------+----------------------+----------------------+---------+------+-------+---------------------------------------------------------+ | 1 | SIMPLE | ontime | range | OriginState,DepDelay | DepDelay,OriginState | 5,3 | NULL | 10105 | Using sort_intersect(DepDelay,OriginState); Using where | +----+-------------+--------+-------+----------------------+----------------------+---------+------+-------+---------------------------------------------------------+ 1 row in set (0.02 sec) # Run a few times to stabilize MariaDB [ontime]> select count(*) from ontime where depdelay > 60 and OriginState IN ('IL', 'VT'); +----------+ | count(*) | +----------+ | 6283 | +----------+ 1 row in set (0.99 sec) MariaDB [ontime]> set optimizer_switch='index_merge_sort_intersection=off'; Query OK, 0 rows affected (0.00 sec) MariaDB [ontime]> explain select count(*) from ontime where depdelay > 60 and OriginState IN ('IL', 'VT'); +----+-------------+--------+-------+----------------------+----------+---------+------+-------+-----------------------------------------------+ | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra | +----+-------------+--------+-------+----------------------+----------+---------+------+-------+-----------------------------------------------+ | 1 | SIMPLE | ontime | range | OriginState,DepDelay | DepDelay | 5 | NULL | 99636 | Using index condition; Using where; Using MRR | +----+-------------+--------+-------+----------------------+----------+---------+------+-------+-----------------------------------------------+ 1 row in set (0.02 sec) # Run a few times to stabilize MariaDB [ontime]> select count(*) from ontime where depdelay > 60 and OriginState IN ('IL', 'VT'); +----------+ | count(*) | +----------+ | 6283 | +----------+ 1 row in set (8.63 sec) MariaDB [ontime]> select count(*) from ontime ignore index (OriginState,DepDelay) where depdelay > 60 and OriginState IN ('IL', 'VT'); +----------+ | count(*) | +----------+ | 6283 | +----------+ 1 row in set (14.00 sec) The picture is similar: sort-intersect is still about 8 times better than single-index range access plan. I'm not sure about the conclusions. I suspect that there is some threshold (#records to be read through the most selective index) after which sort-intersect becomes worse than using one of the indexes. I wasn't able to hit it, though. Then I've tried changing the OriginState predicate to being simple equality, to see if we could compare againist ref access: MariaDB [ontime]> set optimizer_switch='index_merge_sort_intersection=on'; MariaDB [ontime]> explain select count(*) from ontime where depdelay > 60 and OriginState='IL'; +----+-------------+--------+-------+----------------------+----------------------+---------+------+-------+---------------------------------------------------------+ | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra | +----+-------------+--------+-------+----------------------+----------------------+---------+------+-------+---------------------------------------------------------+ | 1 | SIMPLE | ontime | range | OriginState,DepDelay | DepDelay,OriginState | 5,3 | NULL | 10043 | Using sort_intersect(DepDelay,OriginState); Using where | +----+-------------+--------+-------+----------------------+----------------------+---------+------+-------+---------------------------------------------------------+ MariaDB [ontime]> select count(*) from ontime where depdelay > 60 and OriginState='IL'; +----------+ | count(*) | +----------+ | 6215 | +----------+ 1 row in set (1.00 sec) MariaDB [ontime]> set optimizer_switch='index_merge_sort_intersection=off'; MariaDB [ontime]> explain select count(*) from ontime where depdelay > 60 and OriginState='IL'; +----+-------------+--------+------+----------------------+-------------+---------+-------+--------+------------------------------------+ | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra | +----+-------------+--------+------+----------------------+-------------+---------+-------+--------+------------------------------------+ | 1 | SIMPLE | ontime | ref | OriginState,DepDelay | OriginState | 3 | const | 201600 | Using index condition; Using where | +----+-------------+--------+------+----------------------+-------------+---------+-------+--------+------------------------------------+ 1 row in set (0.01 sec) MariaDB [ontime]> select count(*) from ontime where depdelay > 60 and OriginState='IL'; +----------+ | count(*) | +----------+ | 6215 | +----------+ 1 row in set (0.95 sec) And the benefit was gone. sort_intersect happens to be just as fast as ref access. I've changed OriginState='IL' to OriginState IN ('IL','ZZ'), where 'ZZ' is the non existant state code: MariaDB [ontime]> explain select count(*) from ontime where depdelay > 60 and OriginState IN ('IL','ZZ'); +----+-------------+--------+-------+----------------------+----------+---------+------+-------+-----------------------------------------------+ | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra | +----+-------------+--------+-------+----------------------+----------+---------+------+-------+-----------------------------------------------+ | 1 | SIMPLE | ontime | range | OriginState,DepDelay | DepDelay | 5 | NULL | 99636 | Using index condition; Using where; Using MRR | +----+-------------+--------+-------+----------------------+----------+---------+------+-------+-----------------------------------------------+ The used index has changed, MariaDB [ontime]> select count(*) from ontime where depdelay > 60 and OriginState IN ('IL','ZZ'); +----------+ | count(*) | +----------+ | 6215 | +----------+ 1 row in set (9.10 sec) And the query time went up greatly. Using FORCE INDEX to use the index that ref access was using improved the situation slightly: MariaDB [ontime]> explain select count(*) from ontime force index(originstate) where depdelay > 60 and OriginState IN ('IL','ZZ'); +----+-------------+--------+-------+---------------+-------------+---------+------+--------+-----------------------------------------------+ | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra | +----+-------------+--------+-------+---------------+-------------+---------+------+--------+-----------------------------------------------+ | 1 | SIMPLE | ontime | range | OriginState | OriginState | 3 | NULL | 201601 | Using index condition; Using where; Using MRR | +----+-------------+--------+-------+---------------+-------------+---------+------+--------+-----------------------------------------------+ MariaDB [ontime]> select count(*) from ontime force index(originstate) where depdelay > 60 and OriginState IN ('IL','ZZ'); +----------+ | count(*) | +----------+ | 6215 | +----------+ 1 row in set (2.69 sec) Disabling MRR further improved the situation: MariaDB [ontime]> set optimizer_use_mrr='disable'; Query OK, 0 rows affected (0.00 sec) MariaDB [ontime]> explain select count(*) from ontime force index(originstate) where depdelay > 60 and OriginState IN ('IL','ZZ'); +----+-------------+--------+-------+---------------+-------------+---------+------+--------+------------------------------------+ | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra | +----+-------------+--------+-------+---------------+-------------+---------+------+--------+------------------------------------+ | 1 | SIMPLE | ontime | range | OriginState | OriginState | 3 | NULL | 201601 | Using index condition; Using where | +----+-------------+--------+-------+---------------+-------------+---------+------+--------+------------------------------------+ 1 row in set (0.01 sec) MariaDB [ontime]> select count(*) from ontime force index(originstate) where depdelay > 60 and OriginState IN ('IL','ZZ'); +----------+ | count(*) | +----------+ | 6215 | +----------+ 1 row in set (1.23 sec) But it seems we can't put all the blame on MRR. The supposedly best query plan without MRR is just as bad as it was with MRR: MariaDB [ontime]> explain select count(*) from ontime where depdelay > 60 and OriginState IN ('IL','ZZ'); +----+-------------+--------+-------+----------------------+----------+---------+------+-------+------------------------------------+ | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra | +----+-------------+--------+-------+----------------------+----------+---------+------+-------+------------------------------------+ | 1 | SIMPLE | ontime | range | OriginState,DepDelay | DepDelay | 5 | NULL | 99636 | Using index condition; Using where | +----+-------------+--------+-------+----------------------+----------+---------+------+-------+------------------------------------+ 1 row in set (0.02 sec) MariaDB [ontime]> select count(*) from ontime where depdelay > 60 and OriginState IN ('IL','ZZ'); +----------+ | count(*) | +----------+ | 6215 | +----------+ 1 row in set (8.96 sec) In any case, sort_intersect doesn't care about whether one of the predicates is an equality or IN: MariaDB [ontime]> set optimizer_switch='index_merge_sort_intersection=on'; Query OK, 0 rows affected (0.00 sec) MariaDB [ontime]> explain select count(*) from ontime where depdelay > 60 and OriginState IN ('IL','ZZ'); +----+-------------+--------+-------+----------------------+----------------------+---------+------+-------+---------------------------------------------------------+ | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra | +----+-------------+--------+-------+----------------------+----------------------+---------+------+-------+---------------------------------------------------------+ | 1 | SIMPLE | ontime | range | OriginState,DepDelay | DepDelay,OriginState | 5,3 | NULL | 10043 | Using sort_intersect(DepDelay,OriginState); Using where | +----+-------------+--------+-------+----------------------+----------------------+---------+------+-------+---------------------------------------------------------+ 1 row in set (0.02 sec) MariaDB [ontime]> select count(*) from ontime where depdelay > 60 and OriginState IN ('IL','ZZ'); +----------+ | count(*) | +----------+ | 6215 | +----------+ 1 row in set (0.97 sec) Conclusions: I'm not sure if the "failure" to demonstrate speedup against ref access is a genuine symptom (I think - not). Perhaps, enabling all of our improvements at once (BKA, DS-MRR, sort-intersect) will produce more coherent results, because all of these strategies use rowid-ordered disk-sweep reads, which allows to make fairer comparison between them. BR Sergey -- Sergey Petrunia, Software Developer Monty Program AB, http://askmonty.org Blog: http://s.petrunia.net/blog
Per Philip's request, details to replicate the dataset: == Server settings == [mysqld] innodb_file_per_table=1 innodb_file_format=barracuda innodb_log_file_size=100M == DDL == I based on Percona's DDL http://www.mysqlperformanceblog.com/2009/10/02/analyzing-air-traffic-perform... and added a couple of indexes. CREATE TABLE `ontime` ( `Year` year(4) DEFAULT NULL, `Quarter` tinyint(4) DEFAULT NULL, `Month` tinyint(4) DEFAULT NULL, `DayofMonth` tinyint(4) DEFAULT NULL, `DayOfWeek` tinyint(4) DEFAULT NULL, `FlightDate` date DEFAULT NULL, `UniqueCarrier` char(7) DEFAULT NULL, `AirlineID` int(11) DEFAULT NULL, `Carrier` char(2) DEFAULT NULL, `TailNum` varchar(50) DEFAULT NULL, `FlightNum` varchar(10) DEFAULT NULL, `Origin` char(5) DEFAULT NULL, `OriginCityName` varchar(100) DEFAULT NULL, `OriginState` char(2) DEFAULT NULL, `OriginStateFips` varchar(10) DEFAULT NULL, `OriginStateName` varchar(100) DEFAULT NULL, `OriginWac` int(11) DEFAULT NULL, `Dest` char(5) DEFAULT NULL, `DestCityName` varchar(100) DEFAULT NULL, `DestState` char(2) DEFAULT NULL, `DestStateFips` varchar(10) DEFAULT NULL, `DestStateName` varchar(100) DEFAULT NULL, `DestWac` int(11) DEFAULT NULL, `CRSDepTime` int(11) DEFAULT NULL, `DepTime` int(11) DEFAULT NULL, `DepDelay` int(11) DEFAULT NULL, `DepDelayMinutes` int(11) DEFAULT NULL, `DepDel15` int(11) DEFAULT NULL, `DepartureDelayGroups` int(11) DEFAULT NULL, `DepTimeBlk` varchar(20) DEFAULT NULL, `TaxiOut` int(11) DEFAULT NULL, `WheelsOff` int(11) DEFAULT NULL, `WheelsOn` int(11) DEFAULT NULL, `TaxiIn` int(11) DEFAULT NULL, `CRSArrTime` int(11) DEFAULT NULL, `ArrTime` int(11) DEFAULT NULL, `ArrDelay` int(11) DEFAULT NULL, `ArrDelayMinutes` int(11) DEFAULT NULL, `ArrDel15` int(11) DEFAULT NULL, `ArrivalDelayGroups` int(11) DEFAULT NULL, `ArrTimeBlk` varchar(20) DEFAULT NULL, `Cancelled` tinyint(4) DEFAULT NULL, `CancellationCode` char(1) DEFAULT NULL, `Diverted` tinyint(4) DEFAULT NULL, `CRSElapsedTime` int(11) DEFAULT NULL, `ActualElapsedTime` int(11) DEFAULT NULL, `AirTime` int(11) DEFAULT NULL, `Flights` int(11) DEFAULT NULL, `Distance` int(11) DEFAULT NULL, `DistanceGroup` tinyint(4) DEFAULT NULL, `CarrierDelay` int(11) DEFAULT NULL, `WeatherDelay` int(11) DEFAULT NULL, `NASDelay` int(11) DEFAULT NULL, `SecurityDelay` int(11) DEFAULT NULL, `LateAircraftDelay` int(11) DEFAULT NULL, `FirstDepTime` varchar(10) DEFAULT NULL, `TotalAddGTime` varchar(10) DEFAULT NULL, `LongestAddGTime` varchar(10) DEFAULT NULL, `DivAirportLandings` varchar(10) DEFAULT NULL, `DivReachedDest` varchar(10) DEFAULT NULL, `DivActualElapsedTime` varchar(10) DEFAULT NULL, `DivArrDelay` varchar(10) DEFAULT NULL, `DivDistance` varchar(10) DEFAULT NULL, `Div1Airport` varchar(10) DEFAULT NULL, `Div1WheelsOn` varchar(10) DEFAULT NULL, `Div1TotalGTime` varchar(10) DEFAULT NULL, `Div1LongestGTime` varchar(10) DEFAULT NULL, `Div1WheelsOff` varchar(10) DEFAULT NULL, `Div1TailNum` varchar(10) DEFAULT NULL, `Div2Airport` varchar(10) DEFAULT NULL, `Div2WheelsOn` varchar(10) DEFAULT NULL, `Div2TotalGTime` varchar(10) DEFAULT NULL, `Div2LongestGTime` varchar(10) DEFAULT NULL, `Div2WheelsOff` varchar(10) DEFAULT NULL, `Div2TailNum` varchar(10) DEFAULT NULL, `Div3Airport` varchar(10) DEFAULT NULL, `Div3WheelsOn` varchar(10) DEFAULT NULL, `Div3TotalGTime` varchar(10) DEFAULT NULL, `Div3LongestGTime` varchar(10) DEFAULT NULL, `Div3WheelsOff` varchar(10) DEFAULT NULL, `Div3TailNum` varchar(10) DEFAULT NULL, `Div4Airport` varchar(10) DEFAULT NULL, `Div4WheelsOn` varchar(10) DEFAULT NULL, `Div4TotalGTime` varchar(10) DEFAULT NULL, `Div4LongestGTime` varchar(10) DEFAULT NULL, `Div4WheelsOff` varchar(10) DEFAULT NULL, `Div4TailNum` varchar(10) DEFAULT NULL, `Div5Airport` varchar(10) DEFAULT NULL, `Div5WheelsOn` varchar(10) DEFAULT NULL, `Div5TotalGTime` varchar(10) DEFAULT NULL, `Div5LongestGTime` varchar(10) DEFAULT NULL, `Div5WheelsOff` varchar(10) DEFAULT NULL, `Div5TailNum` varchar(10) DEFAULT NULL, KEY `AirlineID` (`AirlineID`), KEY `OriginState` (`OriginState`), KEY `Origin` (`Origin`), KEY `DepDelay` (`DepDelay`), KEY `DepDelayMinutes` (`DepDelayMinutes`), KEY `ArrDelay` (`ArrDelay`), KEY `ArrDelayMinutes` (`ArrDelayMinutes`) ) ENGINE=InnoDB DEFAULT CHARSET=latin1 == The dataset === You should get the same* by doing the following: for i in `seq 1 3` ; do wget http://www.transtats.bts.gov/Download/On_Time_On_Time_Performance_2009_$i.zi... ; done LOAD DATA INFILE 'On_Time_On_Time_Performance_2009_1.csv' INTO TABLE ontime FIELDS TERMINATED BY ',' ENCLOSED BY '\"' IGNORE 1 LINES; LOAD DATA INFILE 'On_Time_On_Time_Performance_2009_2.csv' INTO TABLE ontime FIELDS TERMINATED BY ',' ENCLOSED BY '\"' IGNORE 1 LINES; LOAD DATA INFILE 'On_Time_On_Time_Performance_2009_3.csv' INTO TABLE ontime FIELDS TERMINATED BY ',' ENCLOSED BY '\"' IGNORE 1 LINES; There will be warnings during LOAD DATA statements, which I have ignored (because they seemed to relate to missing data for columns that I didn't care about). (*) - I loaded data before creating indexes. BR Sergey -- Sergey Petrunia, Software Developer Monty Program AB, http://askmonty.org Blog: http://s.petrunia.net/blog
Hello Philip, One more note about correlated results: I agree that it's true that the optimizer can't make a good choice if the data is correlated. But since this is intersection, we could take this from other end: when doing intersection scan, calculate utility of using extra indexes on the fly. If we scan some amount of rows and see that using extra indexes doesn't provide any benefit, we could fall back to "unary intersection" which would be very similar to what range+DS-MRR do. The question is, is the issue of slowdowns because of correlations is such a big problem that we need to drop other things and start working on a solution like the above... BR Sergey -- Sergey Petrunia, Software Developer Monty Program AB, http://askmonty.org Blog: http://s.petrunia.net/blog
participants (1)
-
Sergey Petrunya