----------------------------------------------------------------------- WORKLOG TASK -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=- TASK...........: index_merge: non-ROR intersection CREATION DATE..: Thu, 21 May 2009, 21:32 SUPERVISOR.....: Knielsen IMPLEMENTOR....: COPIES TO......: Psergey CATEGORY.......: Server-RawIdeaBin TASK ID........: 21 (http://askmonty.org/worklog/?tid=21) VERSION........: Server-9.x STATUS.........: Un-Assigned PRIORITY.......: 60 WORKED HOURS...: 0 ESTIMATE.......: 0 (hours remain) ORIG. ESTIMATE.: 0 PROGRESS NOTES: -=-=(Monty - Wed, 03 Jun 2009, 01:05)=-=- High Level Description modified. --- /tmp/wklog.21.old.29638 2009-06-03 01:05:01.000000000 +0300 +++ /tmp/wklog.21.new.29638 2009-06-03 01:05:01.000000000 +0300 @@ -3,5 +3,15 @@ constructed for equality conditions (t.keypart1=const1 AND t.keypart2=const2 AND ... ) and the equalities should cover all index components. +For example, assuming that key1 has 2 parts and key2 has 1 part. + +The current optimization works with: + +WHERE key1_part1=1 AND key1_part2=2 OR key2_part1=3 + +but not with: + +WHERE key1_part1=1 OR key2_part1=3 + This WL entry is to lift this limitation by developing algorithms that do -intersection on non-ROR scans. +intersection on non-ROR (rowid ordered retrieval) scans. -=-=(Guest - Tue, 26 May 2009, 14:04)=-=- High-Level Specification modified. --- /tmp/wklog.21.old.1802 2009-05-26 14:04:57.000000000 +0300 +++ /tmp/wklog.21.new.1802 2009-05-26 14:04:57.000000000 +0300 @@ -1,4 +1,3 @@ - <contents> 1. Execution 1.1 Temptable @@ -30,6 +29,8 @@ 1.1 Temptable ------------- +[ This is our strategy of choice at the moment] + Use a temporary heap-grow-out-to-myisam table with a primary key: create table temp_table ( @@ -168,3 +169,8 @@ a subset of columns covered by all other indexes. = (TODO any other rules?) +- Correlation across selectivities. If there is a condition + + "cond(key1) AND cond(key2) AND ... AND cond(keyN)", + + can we consider satisfaction of AND-parts to be independent? -=-=(Psergey - Thu, 21 May 2009, 21:33)=-=- High-Level Specification modified. --- /tmp/wklog.21.old.25705 2009-05-21 21:33:02.000000000 +0300 +++ /tmp/wklog.21.new.25705 2009-05-21 21:33:02.000000000 +0300 @@ -1 +1,170 @@ +<contents> +1. Execution +1.1 Temptable +1.1.1 Improvement +1.2 Produce/merge sorted streams +1.3 Extend Unique class to handle intersection +1.4 Strategies that do not seem to be useful +1.4.1 Remove matches after having produced an ordered stream +1.4.2 Sparse rowid bitmaps +2. Optimization + +</contents> + +1. Execution +============ + +The primary task is to find means to compute an intersection of N unordered +streams. Besides general memory/cpu cost of computation, we consider: + +- whether the produced rowid stream is ordered. If it is, it can be piped + into index_merge/intersect (as opposed to sort-intersect) + +- whether the strategy can take advantage of the fact that some input streams + are already rowid-ordered + +- startup cost (cost of producing the first output record) + +We see the following possible strategies: + +1.1 Temptable +------------- +Use a temporary heap-grow-out-to-myisam table with a primary key: + +create table temp_table ( + rowid binary($rowid_size), + count n, + primary key(rowid); +); + +Then use this algorithm: + + i1= {index with the least E(#records)}; + + for each record R in range_scan(i1) + temp_table.insert(R.rowid, count=1); + + for each index idx except i1 + { + for each R record in scan(idx) // (INNER-LOOP) + { + if (temp_table has R) + temptable[R].count++; + } + } + + // The following loop can do ordered or unordered scan + // if we want it to be ordered scan, we probably better arrange so that + // 'count' column is part of the index. + for each record R in temp_table + { + if (R.count == number_of_streams) + emit(R.rowid); + } + +The algorithm has an option to emit an ordered rowid stream. + +In the above form, the cost to produce the first record is high. It's easy to +adjust the algorithm to make it low - we'll need to just start scanning all +indexes at once, and finish as soon as we got a full match, i.e. the + + temptable[R].count++ + +operation resulted in the counter being equal to the number of merged scans. + +1.1.1 Improvement +~~~~~~~~~~~~~~~~~ +When running INNER-LOOP, we could count how many times we've done the +"count++" operation. If it has been done #records-in-temptable times, that +means that all further records will not have matches and we can finish the +scan, i.e. break out of the INNER-LOOP. + +1.2 Produce/merge sorted streams +-------------------------------- +For each of the merged scan, use filesort-like action to end up with an +ordered stream of rowids. Then merge the ordered streams. + +By filesort-like action we mean + - Run over index, collect rowids in a buffer. + - When the buffer is full, sort it and dump into a temporary file. +After the above we'll end up with a number of sorted buffers on disk. We can +use mergebuff() function (it is part of filesort's functions) to produce one +ordered sequence (i.e. array, which may be partially on disk) of rowids. + +Merging of ordered streams with help of priority queue is already implemented +in QUICK_ROR_INTERSECT_SELECT. We'll need to substitute the + + child_quick->get_next() + +call with a call to read rowid from an ordered sequence. + +1.3 Extend Unique class to handle intersection +---------------------------------------------- +There is no point to use Unique object as a device that accumulates rowids of +a single scan then produces them in sorted order. One could do the same faster +with accumulating an array of rowids and then sorting it. + +It's possible to use Unique object to collect/merge data from all scans though. +The idea is as follows: + +- Unique should store <rowid, n_scans> pairs +- Duplicates are pairs with the same rowid +- Unique should try to avoid creating duplicates: + - don't add a duplicate into the in-memory part, instead combine two elements + together by adding their n_scans elements. + - combine duplicates when it sees them in Unique.get() call +- The data we get from Unique.get() should be filtered, all records that have + n_scans != number_of_scans_being_merged should be discarded. + +If we're lucky to have started and finished a scan on some index (denote it +as S) without flushing the Unique in the process, then: +- there is no point in adding any new records into the Unique because their + absence in the Unique means that they don't have match in S and hence will + not get into the result of intersection. +- we need to only update the counters to be able to tell if the elements that + are already in the Unique will have matches in all scans. + +1.4 Strategies that do not seem to be useful +-------------------------------------------- + +keeping them here so we don't consider them over and over + +1.4.1 Remove matches after having produced an ordered stream +------------------------------------------------------------ +We can dump everything into a rowid stream and get it sorted. Then we read it, +and if we see a rowid repeated $n_merged_scans times, it belongs to the +intersection (pass to output), otherwise it doesn't (skip). +This doesn't have any advantages over the produce/merge sorted streams +approach. + +1.4.2 Sparse rowid bitmaps +-------------------------- +Use Falcon-style rowid bitmaps. The problem with that is that Falcon's +bitmaps assume there will always be enough memory to accommodate them. + +PostgreSQL makes bitmaps "loose" when they exceed certain size by remembering +disk pages, not ids of individual records. It's hard for us to do something +similar because our rowids are opaque entities whose meaning depends on the +storage engines. + +This seems to require too much change to be worth it. + +2. Optimization +=============== + +SEL_TREE objects already represent intersections. The problems with +optimizations are: + +- Cost formula(s) +- When N keys/conditions are present: + + "cond(key1) AND cond(key2) AND ... AND cond(keyN)", + + somehow avoid considering (2^n - n) possible options. + +- Avoid producing (or even considering) apparently suboptimal plans: + = Don't generate a merge of indexes (I_1, ... I_n) where columns of I_n are + a subset of columns covered by all other indexes. + = (TODO any other rules?) + DESCRIPTION: At the moment index_merge supports intersection only for rowid-ordered streams. This translates into a limitation that index_merge/intersect can only be constructed for equality conditions (t.keypart1=const1 AND t.keypart2=const2 AND ... ) and the equalities should cover all index components. For example, assuming that key1 has 2 parts and key2 has 1 part. The current optimization works with: WHERE key1_part1=1 AND key1_part2=2 OR key2_part1=3 but not with: WHERE key1_part1=1 OR key2_part1=3 This WL entry is to lift this limitation by developing algorithms that do intersection on non-ROR (rowid ordered retrieval) scans. HIGH-LEVEL SPECIFICATION: <contents> 1. Execution 1.1 Temptable 1.1.1 Improvement 1.2 Produce/merge sorted streams 1.3 Extend Unique class to handle intersection 1.4 Strategies that do not seem to be useful 1.4.1 Remove matches after having produced an ordered stream 1.4.2 Sparse rowid bitmaps 2. Optimization </contents> 1. Execution ============ The primary task is to find means to compute an intersection of N unordered streams. Besides general memory/cpu cost of computation, we consider: - whether the produced rowid stream is ordered. If it is, it can be piped into index_merge/intersect (as opposed to sort-intersect) - whether the strategy can take advantage of the fact that some input streams are already rowid-ordered - startup cost (cost of producing the first output record) We see the following possible strategies: 1.1 Temptable ------------- [ This is our strategy of choice at the moment] Use a temporary heap-grow-out-to-myisam table with a primary key: create table temp_table ( rowid binary($rowid_size), count n, primary key(rowid); ); Then use this algorithm: i1= {index with the least E(#records)}; for each record R in range_scan(i1) temp_table.insert(R.rowid, count=1); for each index idx except i1 { for each R record in scan(idx) // (INNER-LOOP) { if (temp_table has R) temptable[R].count++; } } // The following loop can do ordered or unordered scan // if we want it to be ordered scan, we probably better arrange so that // 'count' column is part of the index. for each record R in temp_table { if (R.count == number_of_streams) emit(R.rowid); } The algorithm has an option to emit an ordered rowid stream. In the above form, the cost to produce the first record is high. It's easy to adjust the algorithm to make it low - we'll need to just start scanning all indexes at once, and finish as soon as we got a full match, i.e. the temptable[R].count++ operation resulted in the counter being equal to the number of merged scans. 1.1.1 Improvement ~~~~~~~~~~~~~~~~~ When running INNER-LOOP, we could count how many times we've done the "count++" operation. If it has been done #records-in-temptable times, that means that all further records will not have matches and we can finish the scan, i.e. break out of the INNER-LOOP. 1.2 Produce/merge sorted streams -------------------------------- For each of the merged scan, use filesort-like action to end up with an ordered stream of rowids. Then merge the ordered streams. By filesort-like action we mean - Run over index, collect rowids in a buffer. - When the buffer is full, sort it and dump into a temporary file. After the above we'll end up with a number of sorted buffers on disk. We can use mergebuff() function (it is part of filesort's functions) to produce one ordered sequence (i.e. array, which may be partially on disk) of rowids. Merging of ordered streams with help of priority queue is already implemented in QUICK_ROR_INTERSECT_SELECT. We'll need to substitute the child_quick->get_next() call with a call to read rowid from an ordered sequence. 1.3 Extend Unique class to handle intersection ---------------------------------------------- There is no point to use Unique object as a device that accumulates rowids of a single scan then produces them in sorted order. One could do the same faster with accumulating an array of rowids and then sorting it. It's possible to use Unique object to collect/merge data from all scans though. The idea is as follows: - Unique should store <rowid, n_scans> pairs - Duplicates are pairs with the same rowid - Unique should try to avoid creating duplicates: - don't add a duplicate into the in-memory part, instead combine two elements together by adding their n_scans elements. - combine duplicates when it sees them in Unique.get() call - The data we get from Unique.get() should be filtered, all records that have n_scans != number_of_scans_being_merged should be discarded. If we're lucky to have started and finished a scan on some index (denote it as S) without flushing the Unique in the process, then: - there is no point in adding any new records into the Unique because their absence in the Unique means that they don't have match in S and hence will not get into the result of intersection. - we need to only update the counters to be able to tell if the elements that are already in the Unique will have matches in all scans. 1.4 Strategies that do not seem to be useful -------------------------------------------- keeping them here so we don't consider them over and over 1.4.1 Remove matches after having produced an ordered stream ------------------------------------------------------------ We can dump everything into a rowid stream and get it sorted. Then we read it, and if we see a rowid repeated $n_merged_scans times, it belongs to the intersection (pass to output), otherwise it doesn't (skip). This doesn't have any advantages over the produce/merge sorted streams approach. 1.4.2 Sparse rowid bitmaps -------------------------- Use Falcon-style rowid bitmaps. The problem with that is that Falcon's bitmaps assume there will always be enough memory to accommodate them. PostgreSQL makes bitmaps "loose" when they exceed certain size by remembering disk pages, not ids of individual records. It's hard for us to do something similar because our rowids are opaque entities whose meaning depends on the storage engines. This seems to require too much change to be worth it. 2. Optimization =============== SEL_TREE objects already represent intersections. The problems with optimizations are: - Cost formula(s) - When N keys/conditions are present: "cond(key1) AND cond(key2) AND ... AND cond(keyN)", somehow avoid considering (2^n - n) possible options. - Avoid producing (or even considering) apparently suboptimal plans: = Don't generate a merge of indexes (I_1, ... I_n) where columns of I_n are a subset of columns covered by all other indexes. = (TODO any other rules?) - Correlation across selectivities. If there is a condition "cond(key1) AND cond(key2) AND ... AND cond(keyN)", can we consider satisfaction of AND-parts to be independent? ESTIMATED WORK TIME ESTIMATED COMPLETION DATE ----------------------------------------------------------------------- WorkLog (v3.5.9)