[Maria-developers] Accessing read records during join for condition pushdown
Hello list (first email here, hopefully sending it to the right address), I have been trying to solve this problem for a while now, and I am stuck. I wonder if what I am trying to do is even possible. I am implementing a new storage engine which handles engine condition pushdown. Now, when MariaDB is working on a JOIN using more than one table (for simplicity, let's assume 2), how can I access the read records for table 1 when working on a condition pushdown for table 2? When joining like this: table1.field1 = table2.field1, I can see 2 Item objects of type FIELD in the condition. However, I can't find where the read records for table1 are so I can effectively filter records in table2 by those values. Thanks a lot for your help, [A close up of a sign Description automatically generated] Eduardo Berrocal García de Carellán Senior Software Engineer Intel Corporation | intel.com<http://intel.com/>
My own digging here and there reveals me that the data I am looking for resides inside a JOIN_CACHE object inside a JOIN_TAB. However, it seems that the data can't be accessed unless some fields and methods are moved from protected to public in the class. Am I right here? Is there other (easier) way to accomplish what I am trying to do? (to recap: access "on the fly" records during a JOIN so that I can use that data for condition pushdown). Thanks, From: Maria-developers <maria-developers-bounces+eduardo.berrocal=intel.com@lists.launchpad.net> On Behalf Of Berrocal, Eduardo Sent: Tuesday, July 6, 2021 4:13 PM To: maria-developers@lists.launchpad.net Subject: [Maria-developers] Accessing read records during join for condition pushdown Hello list (first email here, hopefully sending it to the right address), I have been trying to solve this problem for a while now, and I am stuck. I wonder if what I am trying to do is even possible. I am implementing a new storage engine which handles engine condition pushdown. Now, when MariaDB is working on a JOIN using more than one table (for simplicity, let's assume 2), how can I access the read records for table 1 when working on a condition pushdown for table 2? When joining like this: table1.field1 = table2.field1, I can see 2 Item objects of type FIELD in the condition. However, I can't find where the read records for table1 are so I can effectively filter records in table2 by those values. Thanks a lot for your help, [A close up of a sign Description automatically generated] Eduardo Berrocal García de Carellán Senior Software Engineer Intel Corporation | intel.com<http://intel.com/>
Hi, Berrocal,! On Jul 06, Berrocal, Eduardo wrote:
Hello list (first email here, hopefully sending it to the right address), I have been trying to solve this problem for a while now, and I am stuck. I wonder if what I am trying to do is even possible.
I am implementing a new storage engine which handles engine condition pushdown. Now, when MariaDB is working on a JOIN using more than one table (for simplicity, let's assume 2), how can I access the read records for table 1 when working on a condition pushdown for table 2? When joining like this: table1.field1 = table2.field1, I can see 2 Item objects of type FIELD in the condition. However, I can't find where the read records for table1 are so I can effectively filter records in table2 by those values.
if f1 is Item_field for table1.field1, then generally f1->field->table->record[0] is the current record for the table1, and f1->field->ptr is the pointer to the field1 value in that currect record
Thanks a lot for your help,
Eduardo Berrocal García de Carellán Senior Software Engineer Intel Corporation | intel.com<http://intel.com/>
Regards, Sergei VP of MariaDB Server Engineering and security@mariadb.org
Hello Sergei, Thanks a lot for your response. I have tried to use that route but without success. The problem is that I can only access the last record read from table1. Let say that, during the JOIN, 3 rows are read from table1. These 3 rows will be compared against all rows from table2. I want to avoid a full scan of table2 by passing the value for the field from those 3 rows down to the storage engine. When condition pushdown is called on table2, and I try to access those 3 rows, I can only access the last one read through f1->field->ptr (or f1->table->record[0]). For what I understand, f1->table does not store an array of read records during JOIN. Am I right? That is stored in the JOIN_CACHE object. Please correct me if I am wrong here. Thanks again for your help, Eduardo Berrocal García de Carellán Senior Software Engineer Intel Corporation | intel.com -----Original Message----- From: Sergei Golubchik <serg@mariadb.org> Sent: Thursday, July 8, 2021 12:43 AM To: Berrocal, Eduardo <eduardo.berrocal@intel.com> Cc: maria-developers@lists.launchpad.net Subject: Re: [Maria-developers] Accessing read records during join for condition pushdown Hi, Berrocal,! On Jul 06, Berrocal, Eduardo wrote:
Hello list (first email here, hopefully sending it to the right address), I have been trying to solve this problem for a while now, and I am stuck. I wonder if what I am trying to do is even possible.
I am implementing a new storage engine which handles engine condition pushdown. Now, when MariaDB is working on a JOIN using more than one table (for simplicity, let's assume 2), how can I access the read records for table 1 when working on a condition pushdown for table 2? When joining like this: table1.field1 = table2.field1, I can see 2 Item objects of type FIELD in the condition. However, I can't find where the read records for table1 are so I can effectively filter records in table2 by those values.
if f1 is Item_field for table1.field1, then generally f1->field->table->record[0] is the current record for the table1, and f1->field->ptr is the pointer to the field1 value in that currect record
Thanks a lot for your help,
Eduardo Berrocal García de Carellán Senior Software Engineer Intel Corporation | intel.com<http://intel.com/>
Regards, Sergei VP of MariaDB Server Engineering and security@mariadb.org
Hi Eduardo, On Thu, Jul 08, 2021 at 06:43:05PM +0000, Berrocal, Eduardo wrote:
I have tried to use that route but without success. The problem is that I can only access the last record read from table1.
Yes, this is expected.
Let say that, during the JOIN, 3 rows are read from table1. These 3 rows will be compared against all rows from table2. I want to avoid a full scan of table2 by passing the value for the field from those 3 rows down to the storage engine.
Looks reasonable.
When condition pushdown is called on table2, and I try to access those 3 rows, I can only access the last one read through f1->field->ptr (or f1->table->record[0]).
For what I understand, f1->table does not store an array of read records during JOIN. Am I right? That is stored in the JOIN_CACHE object. Please correct me if I am wrong here.
You're correct. Other storage engines have only requested pushdown of basic conditions, so the code at SQL layer side doesn't handle the case where the Storage Engine would want to check the conditions that refer to the contents of the join buffer (aka JOIN_CACHE structure). Note that join buffer can get complex for multi-table joins. Consider a query: select * from t1, t2, t3 where t1.col1=t3.col1 and t2.col2=t3.col2 And a query plan (one gets something like this by default): +------+-------------+-------+------+---------------+------+---------+------+------+--------------------------------------------------------+ | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra | +------+-------------+-------+------+---------------+------+---------+------+------+--------------------------------------------------------+ | 1 | SIMPLE | t1 | ALL | NULL | NULL | NULL | NULL | 10 | | | 1 | SIMPLE | t2 | ALL | NULL | NULL | NULL | NULL | 20 | Using join buffer (flat, BNL join) | | 1 | SIMPLE | t3 | ALL | NULL | NULL | NULL | NULL | 40 | Using where; Using join buffer (incremental, BNL join) | +------+-------------+-------+------+---------------+------+---------+------+------+--------------------------------------------------------+ Suppose t3 is your storage engine table. you would want to push down the whole t1.col1=t3.col1 and t2.col2=t3.col2 When the join buffer is "incremental" (like in the above EXPLAIN), it will actually be two join buffers linked through pointers. Traversing these might be complicated. (whether join buffer is incremental is controlled by this setting: https://mariadb.com/kb/en/server-system-variables/#join_cache_level) A question: do you intend to push down and check conditions in arbitrary form? In case they are just equalities, it might make sense for storage engine to pretend having an index and then Batched Key Access optimization will be used, and the SQL layer will provide batches of keys to lookup. BR Sergei -- Sergei Petrunia, Software Developer MariaDB Corporation | Skype: sergefp | Blog: http://petrunia.net
Hi Sergei, Thanks for the info. Very useful. Regarding your question: "A question: do you intend to push down and check conditions in arbitrary form? In case they are just equalities, it might make sense for storage engine to pretend having an index and then Batched Key Access optimization will be used, and the SQL layer will provide batches of keys to lookup." The answer is yes. I would like to check more than just equalities, if possible. The Batched key access is something worth looking into, in case that that can be used as a way to speed up at least conditions with equalities. A follow up question for you: When you say that traversing the linked join buffers is difficult... how difficult exactly? Like a linked list, a tree... ? I still think it is worth exploring in our case given that the speedup can be huge if we can avoid a full scan of a table with TBs of data. Eduardo Berrocal García de Carellán Senior Software Engineer Intel Corporation | intel.com -----Original Message----- From: Sergey Petrunia <sergey@mariadb.com> Sent: Monday, July 12, 2021 2:49 AM To: Berrocal, Eduardo <eduardo.berrocal@intel.com> Cc: Sergei Golubchik <serg@mariadb.org>; maria-developers@lists.launchpad.net Subject: Re: [Maria-developers] Accessing read records during join for condition pushdown Hi Eduardo, On Thu, Jul 08, 2021 at 06:43:05PM +0000, Berrocal, Eduardo wrote:
I have tried to use that route but without success. The problem is that I can only access the last record read from table1.
Yes, this is expected.
Let say that, during the JOIN, 3 rows are read from table1. These 3 rows will be compared against all rows from table2. I want to avoid a full scan of table2 by passing the value for the field from those 3 rows down to the storage engine.
Looks reasonable.
When condition pushdown is called on table2, and I try to access those 3 rows, I can only access the last one read through f1->field->ptr (or f1->table->record[0]).
For what I understand, f1->table does not store an array of read records during JOIN. Am I right? That is stored in the JOIN_CACHE object. Please correct me if I am wrong here.
You're correct. Other storage engines have only requested pushdown of basic conditions, so the code at SQL layer side doesn't handle the case where the Storage Engine would want to check the conditions that refer to the contents of the join buffer (aka JOIN_CACHE structure). Note that join buffer can get complex for multi-table joins. Consider a query: select * from t1, t2, t3 where t1.col1=t3.col1 and t2.col2=t3.col2 And a query plan (one gets something like this by default): +------+-------------+-------+------+---------------+------+---------+------+------+--------------------------------------------------------+ | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra | +------+-------------+-------+------+---------------+------+---------+------+------+--------------------------------------------------------+ | 1 | SIMPLE | t1 | ALL | NULL | NULL | NULL | NULL | 10 | | | 1 | SIMPLE | t2 | ALL | NULL | NULL | NULL | NULL | 20 | Using join buffer (flat, BNL join) | | 1 | SIMPLE | t3 | ALL | NULL | NULL | NULL | NULL | 40 | Using where; Using join buffer (incremental, BNL join) | +------+-------------+-------+------+---------------+------+---------+------+------+--------------------------------------------------------+ Suppose t3 is your storage engine table. you would want to push down the whole t1.col1=t3.col1 and t2.col2=t3.col2 When the join buffer is "incremental" (like in the above EXPLAIN), it will actually be two join buffers linked through pointers. Traversing these might be complicated. (whether join buffer is incremental is controlled by this setting: https://mariadb.com/kb/en/server-system-variables/#join_cache_level) A question: do you intend to push down and check conditions in arbitrary form? In case they are just equalities, it might make sense for storage engine to pretend having an index and then Batched Key Access optimization will be used, and the SQL layer will provide batches of keys to lookup. BR Sergei -- Sergei Petrunia, Software Developer MariaDB Corporation | Skype: sergefp | Blog: http://petrunia.net
Hi Eduardo, On Mon, Jul 12, 2021 at 11:30:35PM +0000, Berrocal, Eduardo wrote:
"A question: do you intend to push down and check conditions in arbitrary form? In case they are just equalities, it might make sense for storage engine to pretend having an index and then Batched Key Access optimization will be used, and the SQL layer will provide batches of keys to lookup."
The answer is yes. I would like to check more than just equalities, if possible. The Batched key access is something worth looking into, in case that that can be used as a way to speed up at least conditions with equalities.
A follow up question for you: When you say that traversing the linked join buffers is difficult... how difficult exactly? Like a linked list, a tree... ? I still think it is worth exploring in our case given that the speedup can be huge if we can avoid a full scan of a table with TBs of data.
Iterating the join buffer shouldn't be hard. It is just a memory area which is filled with variable-size records which have certain fields. Debugging a query that uses join buffers, one can see that the iteration over contents of the join buffer is done in JOIN_CACHE::join_matching_records() using these calls: prepare_look_for_matches(); while (get_next_candidate_for_match()) { read_next_candidate_for_match(); } The code in JOIN_CACHE::join_matching_records() is complex, because it does other things besides just iterating the cache: * Checking of "first match" flag for queries which only need one matching row from the second table. * Handling outer JOINs, where we apply the restriction from ON expression always, and apply the restriction from WHERE expression only after we've figured out if LEFT JOIN has a match. AFAIU, you should ignore these inside the storage engine (produce all matches, don't check the Item_func_trig_cond items). This will make your copy of join_matching_records simpler. BR Sergei -- Sergei Petrunia, Software Developer MariaDB Corporation | Skype: sergefp | Blog: http://petrunia.net
Hello Sergei, Thanks a lot for your response. The way you showed seems to be the way to go. There is just one problem: all those functions are protected inside the JOIN_CACHE (and JOIN_CACHE_*) Classes. Which seems to suggest that code changes to the MariaDB core code is a necessity in this case. Cheers, Eduardo Berrocal García de Carellán Senior Software Engineer Intel Corporation | intel.com -----Original Message----- From: Sergey Petrunia <sergey@mariadb.com> Sent: Wednesday, July 14, 2021 2:26 AM To: Berrocal, Eduardo <eduardo.berrocal@intel.com> Cc: Sergei Golubchik <serg@mariadb.org>; maria-developers@lists.launchpad.net Subject: Re: [Maria-developers] Accessing read records during join for condition pushdown Hi Eduardo, On Mon, Jul 12, 2021 at 11:30:35PM +0000, Berrocal, Eduardo wrote:
"A question: do you intend to push down and check conditions in arbitrary form? In case they are just equalities, it might make sense for storage engine to pretend having an index and then Batched Key Access optimization will be used, and the SQL layer will provide batches of keys to lookup."
The answer is yes. I would like to check more than just equalities, if possible. The Batched key access is something worth looking into, in case that that can be used as a way to speed up at least conditions with equalities.
A follow up question for you: When you say that traversing the linked join buffers is difficult... how difficult exactly? Like a linked list, a tree... ? I still think it is worth exploring in our case given that the speedup can be huge if we can avoid a full scan of a table with TBs of data.
Iterating the join buffer shouldn't be hard. It is just a memory area which is filled with variable-size records which have certain fields. Debugging a query that uses join buffers, one can see that the iteration over contents of the join buffer is done in JOIN_CACHE::join_matching_records() using these calls: prepare_look_for_matches(); while (get_next_candidate_for_match()) { read_next_candidate_for_match(); } The code in JOIN_CACHE::join_matching_records() is complex, because it does other things besides just iterating the cache: * Checking of "first match" flag for queries which only need one matching row from the second table. * Handling outer JOINs, where we apply the restriction from ON expression always, and apply the restriction from WHERE expression only after we've figured out if LEFT JOIN has a match. AFAIU, you should ignore these inside the storage engine (produce all matches, don't check the Item_func_trig_cond items). This will make your copy of join_matching_records simpler. BR Sergei -- Sergei Petrunia, Software Developer MariaDB Corporation | Skype: sergefp | Blog: http://petrunia.net
Hi Eduardo, On Thu, Jul 22, 2021 at 12:28:09AM +0000, Berrocal, Eduardo wrote:
Thanks a lot for your response. The way you showed seems to be the way to go. There is just one problem: all those functions are protected inside the JOIN_CACHE (and JOIN_CACHE_*) Classes. Which seems to suggest that code changes to the MariaDB core code is a necessity in this case.
Hi Eduardo, (Adding Igor Babaev, the author of this code, to CC:) == Short == * I think, making the iteration methods public is not a big deal. This would allow to iterate JOIN_CACHE_BNL. * It seems, one can't iterate JOIN_CACHE_BNLH as easily. == Long == === Making iteration methods public === In my opinion, we can accept a patch that makes JOIN_CACHE iteration methods public. This will allow one to iterate over records in JOIN_CACHE_BNL. (One can check the join cache type by calling JOIN_CACHE::get_join_alg() which is public, and check for linked join cache(s) by traversing JOIN_CACHE::prev_cache which is also public) === On iterating JOIN_CACHE_BNLH === JOIN_CACHE_BNLH is the "hashed" variant of the join buffer. The join buffer is not an array but rather a hash table with the key being the value of some column(s) in the join buffer. When one calls JOIN_CACHE_BNLH::prepare_look_for_matches(), that function calls get_matching_chain_by_join_key(), which has an "implicit" parameter - the value of the join column. Then, *next_candidate_for_match() functions iterate only over the rows with the matching hash value. This means the interface won't work if one just wants to examine the contents of the join buffer. Possible ways out: A. Ingore this. The default @@join_cache_level value is 2, looking it up here: https://mariadb.com/kb/en/server-system-variables/#join_cache_level one can see that this means hashed join buffers are not used by default. 2. Develop a method to iterate over all values in the join buffer. This should be doable.
Eduardo Berrocal García de Carellán Senior Software Engineer
Intel Corporation | intel.com
-----Original Message----- From: Sergey Petrunia <sergey@mariadb.com> Sent: Wednesday, July 14, 2021 2:26 AM To: Berrocal, Eduardo <eduardo.berrocal@intel.com> Cc: Sergei Golubchik <serg@mariadb.org>; maria-developers@lists.launchpad.net Subject: Re: [Maria-developers] Accessing read records during join for condition pushdown
Hi Eduardo,
On Mon, Jul 12, 2021 at 11:30:35PM +0000, Berrocal, Eduardo wrote:
"A question: do you intend to push down and check conditions in arbitrary form? In case they are just equalities, it might make sense for storage engine to pretend having an index and then Batched Key Access optimization will be used, and the SQL layer will provide batches of keys to lookup."
The answer is yes. I would like to check more than just equalities, if possible. The Batched key access is something worth looking into, in case that that can be used as a way to speed up at least conditions with equalities.
A follow up question for you: When you say that traversing the linked join buffers is difficult... how difficult exactly? Like a linked list, a tree... ? I still think it is worth exploring in our case given that the speedup can be huge if we can avoid a full scan of a table with TBs of data.
Iterating the join buffer shouldn't be hard. It is just a memory area which is filled with variable-size records which have certain fields.
Debugging a query that uses join buffers, one can see that the iteration over contents of the join buffer is done in JOIN_CACHE::join_matching_records() using these calls:
prepare_look_for_matches(); while (get_next_candidate_for_match()) { read_next_candidate_for_match(); }
The code in JOIN_CACHE::join_matching_records() is complex, because it does other things besides just iterating the cache:
* Checking of "first match" flag for queries which only need one matching row from the second table.
* Handling outer JOINs, where we apply the restriction from ON expression always, and apply the restriction from WHERE expression only after we've figured out if LEFT JOIN has a match.
AFAIU, you should ignore these inside the storage engine (produce all matches, don't check the Item_func_trig_cond items). This will make your copy of join_matching_records simpler.
BR Sergei -- Sergei Petrunia, Software Developer MariaDB Corporation | Skype: sergefp | Blog: http://petrunia.net
-- BR Sergei -- Sergei Petrunia, Software Developer MariaDB Corporation | Skype: sergefp | Blog: http://petrunia.net
Sergei, I created a pull request to add the iteration methods in JOIN_CACHE_BNL public. Here it is: https://github.com/MariaDB/server/pull/1878 Thanks, Eduardo Berrocal García de Carellán Senior Software Engineer Intel Corporation | intel.com -----Original Message----- From: Sergey Petrunia <sergey@mariadb.com> Sent: Thursday, July 22, 2021 3:28 AM To: Berrocal, Eduardo <eduardo.berrocal@intel.com> Cc: Sergei Golubchik <serg@mariadb.org>; maria-developers@lists.launchpad.net; Igor Babaev <igor@mariadb.com> Subject: Re: [Maria-developers] Accessing read records during join for condition pushdown Hi Eduardo, On Thu, Jul 22, 2021 at 12:28:09AM +0000, Berrocal, Eduardo wrote:
Thanks a lot for your response. The way you showed seems to be the way to go. There is just one problem: all those functions are protected inside the JOIN_CACHE (and JOIN_CACHE_*) Classes. Which seems to suggest that code changes to the MariaDB core code is a necessity in this case.
Hi Eduardo, (Adding Igor Babaev, the author of this code, to CC:) == Short == * I think, making the iteration methods public is not a big deal. This would allow to iterate JOIN_CACHE_BNL. * It seems, one can't iterate JOIN_CACHE_BNLH as easily. == Long == === Making iteration methods public === In my opinion, we can accept a patch that makes JOIN_CACHE iteration methods public. This will allow one to iterate over records in JOIN_CACHE_BNL. (One can check the join cache type by calling JOIN_CACHE::get_join_alg() which is public, and check for linked join cache(s) by traversing JOIN_CACHE::prev_cache which is also public) === On iterating JOIN_CACHE_BNLH === JOIN_CACHE_BNLH is the "hashed" variant of the join buffer. The join buffer is not an array but rather a hash table with the key being the value of some column(s) in the join buffer. When one calls JOIN_CACHE_BNLH::prepare_look_for_matches(), that function calls get_matching_chain_by_join_key(), which has an "implicit" parameter - the value of the join column. Then, *next_candidate_for_match() functions iterate only over the rows with the matching hash value. This means the interface won't work if one just wants to examine the contents of the join buffer. Possible ways out: A. Ingore this. The default @@join_cache_level value is 2, looking it up here: https://mariadb.com/kb/en/server-system-variables/#join_cache_level one can see that this means hashed join buffers are not used by default. 2. Develop a method to iterate over all values in the join buffer. This should be doable.
Eduardo Berrocal García de Carellán Senior Software Engineer
Intel Corporation | intel.com
-----Original Message----- From: Sergey Petrunia <sergey@mariadb.com> Sent: Wednesday, July 14, 2021 2:26 AM To: Berrocal, Eduardo <eduardo.berrocal@intel.com> Cc: Sergei Golubchik <serg@mariadb.org>; maria-developers@lists.launchpad.net Subject: Re: [Maria-developers] Accessing read records during join for condition pushdown
Hi Eduardo,
On Mon, Jul 12, 2021 at 11:30:35PM +0000, Berrocal, Eduardo wrote:
"A question: do you intend to push down and check conditions in arbitrary form? In case they are just equalities, it might make sense for storage engine to pretend having an index and then Batched Key Access optimization will be used, and the SQL layer will provide batches of keys to lookup."
The answer is yes. I would like to check more than just equalities, if possible. The Batched key access is something worth looking into, in case that that can be used as a way to speed up at least conditions with equalities.
A follow up question for you: When you say that traversing the linked join buffers is difficult... how difficult exactly? Like a linked list, a tree... ? I still think it is worth exploring in our case given that the speedup can be huge if we can avoid a full scan of a table with TBs of data.
Iterating the join buffer shouldn't be hard. It is just a memory area which is filled with variable-size records which have certain fields.
Debugging a query that uses join buffers, one can see that the iteration over contents of the join buffer is done in JOIN_CACHE::join_matching_records() using these calls:
prepare_look_for_matches(); while (get_next_candidate_for_match()) { read_next_candidate_for_match(); }
The code in JOIN_CACHE::join_matching_records() is complex, because it does other things besides just iterating the cache:
* Checking of "first match" flag for queries which only need one matching row from the second table.
* Handling outer JOINs, where we apply the restriction from ON expression always, and apply the restriction from WHERE expression only after we've figured out if LEFT JOIN has a match.
AFAIU, you should ignore these inside the storage engine (produce all matches, don't check the Item_func_trig_cond items). This will make your copy of join_matching_records simpler.
BR Sergei -- Sergei Petrunia, Software Developer MariaDB Corporation | Skype: sergefp | Blog: http://petrunia.net
-- BR Sergei -- Sergei Petrunia, Software Developer MariaDB Corporation | Skype: sergefp | Blog: http://petrunia.net
participants (3)
-
Berrocal, Eduardo
-
Sergei Golubchik
-
Sergey Petrunia