[Maria-developers] Extending storage engine API for random-row extraction for histogram collection (and others)
Hi!

Here is my proposal on extending the storage engine API to provide functionality for retrieving random rows from tables (those that have indexes). The storage engines for which I plan to implement this are: MyISAM, Aria, InnoDB. Possibly RocksDB, TokuDB.

--- a/sql/handler.h
+++ b/sql/handler.h
@@ -2927,7 +2927,7 @@ class handler :public Sql_alloc
   /** Length of ref (1-8 or the clustered key length) */
   uint ref_length;
   FT_INFO *ft_handler;
-  enum init_stat { NONE=0, INDEX, RND };
+  enum init_stat { NONE=0, INDEX, RND, RANDOM };
   init_stat inited, pre_inited;
........
+  virtual int ha_random_sample_init() __attribute__((warn_unused_result))
+  {
+    DBUG_ENTER("ha_random_sample_init");
+    inited= RANDOM;
+    DBUG_RETURN(random_sample_init());
+  }
+  virtual int ha_random_sample(uint inx,
+                               key_range *min_key,
+                               key_range *max_key)
+    __attribute__((warn_unused_result))
+  {
+    DBUG_ENTER("ha_random_sample");
+    DBUG_ASSERT(inited == RANDOM);
+    DBUG_RETURN(random_sample(inx, min_key, max_key));
+  }
+  virtual int ha_random_sample_end() __attribute__((warn_unused_result))
+  {
+    DBUG_ENTER("ha_random_sample_end");
+    inited= NONE;
+    DBUG_RETURN(random_sample_end());
+  }

This is the default implementation for a storage engine which does not support it:

+  virtual int random_sample_init() { return 0; }
+  virtual int random_sample(uint idx, key_range *min_key, key_range *max_key)
+  {
+    return HA_ERR_WRONG_COMMAND;
+  }
+  virtual int random_sample_end() { return 0; }

Alternative ideas: random_sample_init() takes the idx as a parameter and random_sample() just fetches a row from the range using the index previously specified. The range can be left unspecified with nulls to provide a fetch from the full table range. I don't know enough about storage engine internals to know whether declaring the index in the init function, rather than in the "sample" function, is better. Maybe I am complicating it too much and a simple random_sample() function is sufficient, kind of how ha_records_in_range does it.

Thoughts?
Vicențiu
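P.S. To make the intended use concrete, here is a rough sketch of how a caller (e.g. histogram collection) might drive the proposed API. The loop shape, the error handling and the assumption that the sampled row ends up in table->record[0] are illustrative only, not part of the patch:

  /* Sketch of a caller driving the proposed API - not actual server code. */
  int collect_sample(handler *h, uint index_no, ha_rows sample_count)
  {
    int err;
    if ((err= h->ha_random_sample_init()))
      return err;
    for (ha_rows i= 0; i < sample_count; i++)
    {
      /* NULL key ranges = sample from the whole table/index range. */
      if ((err= h->ha_random_sample(index_no, NULL, NULL)))
        break;
      /* Assumed: the sampled row is now in table->record[0]; feed it */
      /* to the histogram/statistics collector here.                  */
    }
    int end_err= h->ha_random_sample_end();
    return err ? err : end_err;
  }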
On Tue, Dec 11, 2018 at 12:34:12AM +0200, Vicențiu Ciorbaru wrote:
Hi!
Here is my proposal on extending the storage engine API to provide functionality for retrieving random rows from tables (those that have indexes). The storage engines for which I plan to implement this are: MyISAM, Aria, InnoDB. Possibly RocksDB, TokuDB.
Observations:

- As far as I understand, random skip scan is not possible with this API? (Which is probably fine, as we expect that sampling will only retrieve a small fraction of table rows, which means the difference between a forward-walking skip scan and genuinely random probing is negligible.)
- Can the scan return the same row twice?
- Do we want/need a concept of a random "seed" which will cause the same rows to be returned on the same table?
--- a/sql/handler.h
+++ b/sql/handler.h
@@ -2927,7 +2927,7 @@ class handler :public Sql_alloc
   /** Length of ref (1-8 or the clustered key length) */
   uint ref_length;
   FT_INFO *ft_handler;
-  enum init_stat { NONE=0, INDEX, RND };
+  enum init_stat { NONE=0, INDEX, RND, RANDOM };
   init_stat inited, pre_inited;
........
+  virtual int ha_random_sample_init() __attribute__((warn_unused_result))
+  {
+    DBUG_ENTER("ha_random_sample_init");
+    inited= RANDOM;
+    DBUG_RETURN(random_sample_init());
+  }
+  virtual int ha_random_sample(uint inx,
+                               key_range *min_key,
+                               key_range *max_key)
+    __attribute__((warn_unused_result))
+  {
+    DBUG_ENTER("ha_random_sample");
+    DBUG_ASSERT(inited == RANDOM);
+    DBUG_RETURN(random_sample(inx, min_key, max_key));
+  }
+  virtual int ha_random_sample_end() __attribute__((warn_unused_result))
+  {
+    DBUG_ENTER("ha_random_sample_end");
+    inited= NONE;
+    DBUG_RETURN(random_sample_end());
+  }
+
This is the default implementation for a storage engine which does not support it:
+  virtual int random_sample_init() { return 0; }
+  virtual int random_sample(uint idx, key_range *min_key, key_range *max_key)
+  {
+    return HA_ERR_WRONG_COMMAND;
+  }
+  virtual int random_sample_end() { return 0; }
Alternative ideas: random_sample_init() takes the idx as a parameter and random_sample() just fetches a row from the range using the index previously specified. The range can be left unspecified with nulls to provide a fetch from the full table range. I don't know enough about storage engine internals to know whether declaring the index in the init function, rather than in the "sample" function, is better. Maybe I am complicating it too much and a simple random_sample() function is sufficient, kind of how ha_records_in_range does it.
Thoughts? Vicențiu
--
BR
 Sergei
--
Sergei Petrunia, Software Developer
MariaDB Corporation | Skype: sergefp | Blog: http://s.petrunia.net/blog
On Tue, 11 Dec 2018 at 12:43 Sergey Petrunia <sergey@mariadb.com> wrote:
On Tue, Dec 11, 2018 at 12:34:12AM +0200, Vicențiu Ciorbaru wrote:
Hi!
Here is my proposal on extending the storage engine API to provide functionality for retrieving random rows from tables (those that have indexes). The storage engines for which I plan to implement this are: MyISAM, Aria, InnoDB. Possibly RocksDB, TokuDB.
Observations:
- as far as I understand, random skip scan is not possible with this API? (which is probably fine as we expect that sampling will only retrieve a small fraction of table rows, which means the difference between a forward-walking skip scan and genuinely random probing is negligible).
One could add a flag to the init function to specify this, but it feels a bit too much, hence I left it out.
- Can the scan return the same row twice?
The way I see it now, it is implementation-defined within the storage engine. I'd vote to keep it like that. I don't see a real use case for ensuring no row duplication inside the storage engine. One can keep a Unique object if this is really problematic.
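To illustrate that caller-side option, here is a rough sketch of skipping already-seen rows by remembering their positions; std::set is used here only as a stand-in for the server's Unique machinery, and the helper name is made up for the example:

  /* Sketch only: reject rows that were already sampled, using          */
  /* handler::position(), which stores the row's position in handler::ref */
  /* (ref_length bytes).                                                 */
  #include <set>
  #include <string>

  bool row_seen_before(handler *h, const uchar *record,
                       std::set<std::string> *seen)
  {
    h->position(record);                    /* fill h->ref for this row   */
    std::string key((const char *) h->ref, h->ref_length);
    return !seen->insert(key).second;       /* true => duplicate, skip it */
  }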
- Do we want/need a concept of random "seed" which will cause the same rows to be returned on the same table?
I've thought about this. I think it would be useful and it could be passed via the init function, but I couldn't find similar examples in the API, so I kept it out. Should I add it?
--- a/sql/handler.h
+++ b/sql/handler.h
@@ -2927,7 +2927,7 @@ class handler :public Sql_alloc
   /** Length of ref (1-8 or the clustered key length) */
   uint ref_length;
   FT_INFO *ft_handler;
-  enum init_stat { NONE=0, INDEX, RND };
+  enum init_stat { NONE=0, INDEX, RND, RANDOM };
   init_stat inited, pre_inited;
........
+  virtual int ha_random_sample_init() __attribute__((warn_unused_result))
+  {
+    DBUG_ENTER("ha_random_sample_init");
+    inited= RANDOM;
+    DBUG_RETURN(random_sample_init());
+  }
+  virtual int ha_random_sample(uint inx,
+                               key_range *min_key,
+                               key_range *max_key)
+    __attribute__((warn_unused_result))
+  {
+    DBUG_ENTER("ha_random_sample");
+    DBUG_ASSERT(inited == RANDOM);
+    DBUG_RETURN(random_sample(inx, min_key, max_key));
+  }
+  virtual int ha_random_sample_end() __attribute__((warn_unused_result))
+  {
+    DBUG_ENTER("ha_random_sample_end");
+    inited= NONE;
+    DBUG_RETURN(random_sample_end());
+  }
+
This is the default implementation for a storage engine which does not support it:
+  virtual int random_sample_init() { return 0; }
+  virtual int random_sample(uint idx, key_range *min_key, key_range *max_key)
+  {
+    return HA_ERR_WRONG_COMMAND;
+  }
+  virtual int random_sample_end() { return 0; }
Alternative ideas: random_sample_init() takes the idx as a parameter and random_sample() just fetches a row from the range using the index previously specified. The range can be left unspecified with nulls to provide a fetch from the full table range. I don't know enough about storage engine internals to know whether declaring the index in the init function, rather than in the "sample" function, is better. Maybe I am complicating it too much and a simple random_sample() function is sufficient, kind of how ha_records_in_range does it.
Thoughts? Vicențiu
--
BR
 Sergei
--
Sergei Petrunia, Software Developer
MariaDB Corporation | Skype: sergefp | Blog: http://s.petrunia.net/blog
Hi, Vicențiu!

On Dec 11, Vicențiu Ciorbaru wrote:
Hi!
Here is my proposal on extending the storage engine API to provide functionality for retrieving random rows from tables (those that have indexes). The storage engines for which I plan to implement this are: MyISAM, Aria, InnoDB. Possibly RocksDB, TokuDB. ... Maybe I am complicating it too much and a simple random_sample() function is sufficient, kind of how ha_records_in_range does it.
Yeah. My first thought was something like index_random(). That is, index_init(), index_random(), index_random(), ..., index_end(). index_init() does whatever it always needs to do to prepare for using an index, and index_random() gets one random row.

Not like records_in_range, which takes the index number and does _init/_end internally, because records_in_range is typically done only once or just a few times, while index_random() is more like index_next(): it will be called repeatedly.

But then I was thinking, why do you need to specify an index at all? Shouldn't it be just "get me a random row"? Index or whatever - that's an engine implementation detail. For example, MyISAM with fixed-size rows can just read from lseek(floor((file_size/row_size)*rand())*row_size).

Sergey had a good point about being able to repeat exactly the same sampling by specifying the seed. This should be solved if the engine will get its random numbers from the thd_rnd service.

Regards,
Sergei
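P.S. To illustrate the fixed-size-row case, a minimal sketch (not actual MyISAM code; the file handle, the helper name and the rnd callback are made up for the example) of how an engine with equal-sized rows could turn one random number into a row read:

  #include <unistd.h>   /* lseek, pread */
  #include <math.h>     /* floor */

  /* rnd stands in for a per-THD generator returning a double in [0,1), */
  /* e.g. whatever the thd_rnd service exposes, so that sampling can be */
  /* repeated with the same seed.                                       */
  int read_random_fixed_size_row(int fd, size_t row_size,
                                 double (*rnd)(void), unsigned char *buf)
  {
    off_t file_size= lseek(fd, 0, SEEK_END);
    if (file_size < (off_t) row_size)
      return -1;                                      /* empty data file */
    off_t rows= file_size / (off_t) row_size;
    off_t pick= (off_t) floor((double) rows * rnd()); /* 0 .. rows-1 */
    ssize_t res= pread(fd, buf, row_size, pick * (off_t) row_size);
    return res == (ssize_t) row_size ? 0 : -1;        /* 0 = row in buf */
  }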
On Tue, 11 Dec 2018 at 14:33 Sergei Golubchik <serg@mariadb.org> wrote:
Hi, Vicențiu!
On Dec 11, Vicențiu Ciorbaru wrote:
Hi!
Here is my proposal on extending the storage engine API to provide functionality for retrieving random rows from tables (those that have indexes). The storage engines for which I plan to implement this are: MyISAM, Aria, InnoDB. Possibly RocksDB, TokuDB. ... Maybe I am complicating it too much and a simple random_sample() function is sufficient, kind of how ha_records_in_range does it.
Yeah. My first thought was something like index_random(). That is, index_init(), index_random(), index_random(), ..., index_end().
index_init() does whatever it always needs to do to prepare for using an index, and index_random() gets one random row.
Not like records_in_range, which takes the index number and does _init/_end internally, because records_in_range is typically done only once or just a few times, while index_random() is more like index_next(): it will be called repeatedly.
But then I was thinking, why do you need to specify an index at all? Shouldn't it be just "get me a random row"? Index or whatever - that's an engine implementation detail. For example, MyISAM with fixed-size rows can just read from lseek(floor((file_size/row_size)*rand())*row_size).
I agree that the need for an index seems a bit much. My reasoning was that I wanted to allow random sampling on a particular range. This could help, for example, when one wants to collect histograms for a multi-distribution dataset, to get the individual distributions (if the indexed column is able to separate them).

A more generic idea would be if one could pass some conditions for random row retrieval to the storage engine, but it feels like this would complicate storage engine implementations by quite a bit.

For the first iteration, after considering your input, I'd go with "init function", "get random row", "end function", without imposing an index, but somehow passing a (COND or similar) arg to the init function. Sounds reasonable, or too much?

Vicențiu

Sergey had a good point about being able to repeat exactly the same
sampling by specifying the seed. This should be solved if the engine will get its random numbers from the thd_rnd service.
Regards, Sergei
Hi, Vicențiu!

On Dec 11, Vicențiu Ciorbaru wrote:
On Tue, 11 Dec 2018 at 14:33 Sergei Golubchik <serg@mariadb.org> wrote:
But then I was thinking, why do you need to specify an index at all? Shouldn't it be just "get me a random row"? Index or whatever - that's an engine implementation detail. For example, MyISAM with fixed-size rows can just read from lseek(floor((file_size/row_size)*rand())*row_size).
I agree that the need for an index seems a bit much. My reasoning was that I wanted to allow random sampling on a particular range. This could help for example when one wants to collect histograms for a multi-distribution dataset, to get individual distributions (if the indexed column is able to separate them).
A more generic idea would be if one could pass some conditions for random row retrieval to the storage engine, but it feels like this would complicate storage engine implementation by quite a bit.
For the first iteration, after considering your input, I'd go with "init function", "get random row", "end function", without imposing an index, but somehow passing a (COND or similar) arg to the init function.
For the first iteration I'd go without a condition. You probably shouldn't add an API that you won't use, and in the first iteration you won't use it, right? It can be added later when needed.

Regards,
Sergei
Hi Sergei!
For the first iteration, after considering your input, I'd go with
"init function", "get random row", "end function", without imposing an index, but somehow passing a (COND or similar) arg to the init function.
For the first iteration I'd go without a condition. You probably shouldn't add an API that you won't use, and in the first iteration you won't use it, right? It can be added later when needed.
That's my dilemma: do we want to fiddle with the API frequently as more development work happens? If that's the case, I'll use the simplest possible format that covers my use case for the time being.

Vicențiu
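P.S. For reference, a rough draft of what that simplest first-iteration shape could look like in handler.h - no index, no condition, names kept from the earlier patch. This is a sketch of the idea, not a committed interface:

  /* Defaults for engines that do not support random sampling: */
  virtual int random_sample_init() { return 0; }
  virtual int random_sample()      { return HA_ERR_WRONG_COMMAND; }
  virtual int random_sample_end()  { return 0; }

  /* ha_ wrappers called by the server layer; the sampled row is   */
  /* assumed to land in table->record[0] like other handler reads. */
  int ha_random_sample_init() __attribute__((warn_unused_result))
  {
    DBUG_ENTER("ha_random_sample_init");
    inited= RANDOM;
    DBUG_RETURN(random_sample_init());
  }
  int ha_random_sample() __attribute__((warn_unused_result))
  {
    DBUG_ENTER("ha_random_sample");
    DBUG_ASSERT(inited == RANDOM);
    DBUG_RETURN(random_sample());
  }
  int ha_random_sample_end() __attribute__((warn_unused_result))
  {
    DBUG_ENTER("ha_random_sample_end");
    inited= NONE;
    DBUG_RETURN(random_sample_end());
  }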
participants (3):
- Sergei Golubchik
- Sergey Petrunia
- Vicențiu Ciorbaru