[Maria-developers] Extending storage engine API for random-row extraction for histogram collection (and others)
Hi!

Here is my proposal on extending the storage engine API to provide functionality for retrieving random rows from tables (those that have indexes). The storage engines for which I plan to implement this are: MyISAM, Aria, InnoDB. Possibly RocksDB, TokuDB.

--- a/sql/handler.h
+++ b/sql/handler.h
@@ -2927,7 +2927,7 @@ class handler :public Sql_alloc
   /** Length of ref (1-8 or the clustered key length) */
   uint ref_length;
   FT_INFO *ft_handler;
-  enum init_stat { NONE=0, INDEX, RND };
+  enum init_stat { NONE=0, INDEX, RND, RANDOM };
   init_stat inited, pre_inited;
........
+  virtual int ha_random_sample_init() __attribute__((warn_unused_result))
+  {
+    DBUG_ENTER("ha_random_sample_init");
+    inited= RANDOM;
+    DBUG_RETURN(random_sample_init());
+  }
+  virtual int ha_random_sample(uint inx,
+                               key_range *min_key,
+                               key_range *max_key)
+    __attribute__((warn_unused_result))
+  {
+    DBUG_ENTER("ha_random_sample");
+    DBUG_ASSERT(inited == RANDOM);
+    DBUG_RETURN(random_sample(inx, min_key, max_key));
+  }
+  virtual int ha_random_sample_end() __attribute__((warn_unused_result))
+  {
+    DBUG_ENTER("ha_random_sample_end");
+    inited= NONE;
+    DBUG_RETURN(random_sample_end());
+  }

This is the default implementation for a storage engine which does not support it:

+  virtual int random_sample_init() { return 0; }
+  virtual int random_sample(uint idx, key_range *min_key, key_range *max_key)
+  {
+    return HA_ERR_WRONG_COMMAND;
+  }
+  virtual int random_sample_end() { return 0; }

Alternative ideas: random_sample_init() takes the idx as a parameter and random_sample() just fetches a row from the range using the index previously specified. The range can be left unspecified with nulls to provide a fetch from the full table range. I don't know enough about storage engine internals to know whether declaring the index in the init function, rather than in the "sample" function, is better. Maybe I am complicating it too much and a simple random_sample() function is sufficient, kind of how ha_records_in_range does it.

Thoughts?
Vicențiu
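P.S. To make the intended use concrete, here is a rough sketch of how a caller (e.g. histogram collection) might drive the proposed API. The loop shape, the error handling and the assumption that the sampled row ends up in table->record[0] are illustrative only, not part of the patch:

  /* Sketch of a caller driving the proposed API - not actual server code. */
  int collect_sample(handler *h, uint index_no, ha_rows sample_count)
  {
    int err;
    if ((err= h->ha_random_sample_init()))
      return err;
    for (ha_rows i= 0; i < sample_count; i++)
    {
      /* NULL key ranges = sample from the whole table/index range. */
      if ((err= h->ha_random_sample(index_no, NULL, NULL)))
        break;
      /* Assumed: the sampled row is now in table->record[0]; feed it */
      /* to the histogram/statistics collector here.                  */
    }
    int end_err= h->ha_random_sample_end();
    return err ? err : end_err;
  }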
On Tue, Dec 11, 2018 at 12:34:12AM +0200, Vicențiu Ciorbaru wrote:
Hi!
Here is my proposal on extending the storage engine API to provide functionality for retrieving random rows from tables (those that have indexes). The storage engines for which I plan to implement this are: MyISAM, Aria, InnoDB. Possibly RocksDB, TokuDB.
Observations:

- As far as I understand, random skip scan is not possible with this API? (Which is probably fine, as we expect that sampling will only retrieve a small fraction of table rows, which means the difference between a forward-walking skip scan and genuinely random probing is negligible.)
- Can the scan return the same row twice?
- Do we want/need a concept of a random "seed" which will cause the same rows to be returned on the same table?
--- a/sql/handler.h
+++ b/sql/handler.h
@@ -2927,7 +2927,7 @@ class handler :public Sql_alloc
   /** Length of ref (1-8 or the clustered key length) */
   uint ref_length;
   FT_INFO *ft_handler;
-  enum init_stat { NONE=0, INDEX, RND };
+  enum init_stat { NONE=0, INDEX, RND, RANDOM };
   init_stat inited, pre_inited;
........
+  virtual int ha_random_sample_init() __attribute__((warn_unused_result))
+  {
+    DBUG_ENTER("ha_random_sample_init");
+    inited= RANDOM;
+    DBUG_RETURN(random_sample_init());
+  }
+  virtual int ha_random_sample(uint inx,
+                               key_range *min_key,
+                               key_range *max_key)
+    __attribute__((warn_unused_result))
+  {
+    DBUG_ENTER("ha_random_sample");
+    DBUG_ASSERT(inited == RANDOM);
+    DBUG_RETURN(random_sample(inx, min_key, max_key));
+  }
+  virtual int ha_random_sample_end() __attribute__((warn_unused_result))
+  {
+    DBUG_ENTER("ha_random_sample_end");
+    inited= NONE;
+    DBUG_RETURN(random_sample_end());
+  }
+
This is the default implementation for a storage engine which does not support it:
+  virtual int random_sample_init() { return 0; }
+  virtual int random_sample(uint idx, key_range *min_key, key_range *max_key)
+  {
+    return HA_ERR_WRONG_COMMAND;
+  }
+  virtual int random_sample_end() { return 0; }
Alternative ideas: random_sample_init() takes the idx as a parameter and random_sample() just fetches a row from the range using the index previously specified. The range can be left unspecified with nulls to provide a fetch from the full table range. I don't know enough about storage engine internals to know whether declaring the index in the init function, rather than in the "sample" function, is better. Maybe I am complicating it too much and a simple random_sample() function is sufficient, kind of how ha_records_in_range does it.
Thoughts? Vicențiu
--
BR
 Sergei
--
Sergei Petrunia, Software Developer
MariaDB Corporation | Skype: sergefp | Blog: http://s.petrunia.net/blog
On Tue, 11 Dec 2018 at 12:43 Sergey Petrunia <sergey@mariadb.com> wrote:
On Tue, Dec 11, 2018 at 12:34:12AM +0200, Vicențiu Ciorbaru wrote:
Hi!
Here is my proposal on extending the storage engine API to provide functionality for retrieving random rows from tables (those that have indexes). The storage engines for which I plan to implement this are: MyISAM, Aria, InnoDB. Possibly RocksDB, TokuDB.
Observations:
- as far as I understand, random skip scan is not possible with this API? (which is probably fine as we expect that sampling will only retrieve a small fraction of table rows, which means the difference between a forward-walking skip scan and genuinely random probing is negligible).
One could add a flag to the init function to specify this, but it feels a bit too much, hence I left it out.
- Can the scan return the same row twice?
The way I see it now, it is implementation-defined within the storage engine. I'd vote to keep it like that. I don't see a real use case for ensuring no row duplication inside the storage engine. One can keep a Unique object if this is really problematic.
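To illustrate that caller-side option, here is a rough sketch of skipping already-seen rows by remembering their positions; std::set is used here only as a stand-in for the server's Unique machinery, and the helper name is made up for the example:

  /* Sketch only: reject rows that were already sampled, using          */
  /* handler::position(), which stores the row's position in handler::ref */
  /* (ref_length bytes).                                                 */
  #include <set>
  #include <string>

  bool row_seen_before(handler *h, const uchar *record,
                       std::set<std::string> *seen)
  {
    h->position(record);                    /* fill h->ref for this row   */
    std::string key((const char *) h->ref, h->ref_length);
    return !seen->insert(key).second;       /* true => duplicate, skip it */
  }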
- Do we want/need a concept of random "seed" which will cause the same rows to be returned on the same table?
I've thought about this. I think it would be useful and it could be passed via the init function, but I couldn't find similar examples in the API, so I kept it out. Should I add it?
--- a/sql/handler.h
+++ b/sql/handler.h
@@ -2927,7 +2927,7 @@ class handler :public Sql_alloc
   /** Length of ref (1-8 or the clustered key length) */
   uint ref_length;
   FT_INFO *ft_handler;
-  enum init_stat { NONE=0, INDEX, RND };
+  enum init_stat { NONE=0, INDEX, RND, RANDOM };
   init_stat inited, pre_inited;
........
+  virtual int ha_random_sample_init() __attribute__((warn_unused_result))
+  {
+    DBUG_ENTER("ha_random_sample_init");
+    inited= RANDOM;
+    DBUG_RETURN(random_sample_init());
+  }
+  virtual int ha_random_sample(uint inx,
+                               key_range *min_key,
+                               key_range *max_key)
+    __attribute__((warn_unused_result))
+  {
+    DBUG_ENTER("ha_random_sample");
+    DBUG_ASSERT(inited == RANDOM);
+    DBUG_RETURN(random_sample(inx, min_key, max_key));
+  }
+  virtual int ha_random_sample_end() __attribute__((warn_unused_result))
+  {
+    DBUG_ENTER("ha_random_sample_end");
+    inited= NONE;
+    DBUG_RETURN(random_sample_end());
+  }
+
This is the default implementation for a storage engine which does not support it:
+  virtual int random_sample_init() { return 0; }
+  virtual int random_sample(uint idx, key_range *min_key, key_range *max_key)
+  {
+    return HA_ERR_WRONG_COMMAND;
+  }
+  virtual int random_sample_end() { return 0; }
Alternative ideas: random_sample_init() takes the idx as a parameter and random_sample() just fetches a row from the range using the index previously specified. The range can be left unspecified with nulls to provide a fetch from the full table range. I don't know enough about storage engine internals to know whether declaring the index in the init function, rather than in the "sample" function, is better. Maybe I am complicating it too much and a simple random_sample() function is sufficient, kind of how ha_records_in_range does it.
Thoughts? Vicențiu
--
BR
 Sergei
--
Sergei Petrunia, Software Developer
MariaDB Corporation | Skype: sergefp | Blog: http://s.petrunia.net/blog
Hi, Vicențiu!

On Dec 11, Vicențiu Ciorbaru wrote:
Hi!
Here is my proposal on extending the storage engine API to provide functionality for retrieving random rows from tables (those that have indexes). The storage engines for which I plan to implement this are: MyISAM, Aria, InnoDB. Possibly RocksDB, TokuDB. ... Maybe I am complicating it too much and a simple random_sample() function is sufficient, kind of how ha_records_in_range does it.
Yeah. My first thought was something like index_random(). That is, index_init(), index_random(), index_random(), ..., index_end(). index_init() does whatever it always needs to do to prepare for using an index, and index_random() gets one random row.

Not like records_in_range, which takes the index number and does _init/_end internally, because records_in_range is typically done only once or just a few times, while index_random() is more like index_next(): it will be called repeatedly.

But then I was thinking, why do you need to specify an index at all? Shouldn't it be just "get me a random row"? Index or whatever - that's an engine implementation detail. For example, MyISAM with fixed-size rows can just read from lseek(floor((file_size/row_size)*rand())*row_size).

Sergey had a good point about being able to repeat exactly the same sampling by specifying the seed. This should be solved if the engine will get its random numbers from the thd_rnd service.

Regards,
Sergei
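P.S. To illustrate the fixed-size-row case, a minimal sketch (not actual MyISAM code; the file handle, the helper name and the rnd callback are made up for the example) of how an engine with equal-sized rows could turn one random number into a row read:

  #include <unistd.h>   /* lseek, pread */
  #include <math.h>     /* floor */

  /* rnd stands in for a per-THD generator returning a double in [0,1), */
  /* e.g. whatever the thd_rnd service exposes, so that sampling can be */
  /* repeated with the same seed.                                       */
  int read_random_fixed_size_row(int fd, size_t row_size,
                                 double (*rnd)(void), unsigned char *buf)
  {
    off_t file_size= lseek(fd, 0, SEEK_END);
    if (file_size < (off_t) row_size)
      return -1;                                      /* empty data file */
    off_t rows= file_size / (off_t) row_size;
    off_t pick= (off_t) floor((double) rows * rnd()); /* 0 .. rows-1 */
    ssize_t res= pread(fd, buf, row_size, pick * (off_t) row_size);
    return res == (ssize_t) row_size ? 0 : -1;        /* 0 = row in buf */
  }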
On Tue, 11 Dec 2018 at 14:33 Sergei Golubchik <serg@mariadb.org> wrote:
Hi, Vicențiu!
On Dec 11, Vicențiu Ciorbaru wrote:
Hi!
Here is my proposal on extending the storage engine API to provide functionality for retrieving random rows from tables (those that have indexes). The storage engines for which I plan to implement this are: MyISAM, Aria, InnoDB. Possibly RocksDB, TokuDB. ... Maybe I am complicating it too much and a simple random_sample() function is sufficient, kind of how ha_records_in_range does it.
Yeah. My first thought was something like index_random(). That is, index_init(), index_random(), index_random(), ..., index_end().
index_init() does whatever it always needs to do to prepare for using an index, and index_random() gets one random row.
Not like records_in_range, which takes the index number and does _init/_end internally, because records_in_range is typically done only once or just a few times, while index_random() is more like index_next(): it will be called repeatedly.
But then I was thinking, why do you need to specify an index at all? Shouldn't it be just "get me a random row"? Index or whatever - that's an engine implementation detail. For example, MyISAM with fixed-size rows can just read from lseek(floor((file_size/row_size)*rand())*row_size).
I agree that the need for an index seems a bit much. My reasoning was that I wanted to allow random sampling on a particular range. This could help, for example, when one wants to collect histograms for a multi-distribution dataset, to get the individual distributions (if the indexed column is able to separate them).

A more generic idea would be if one could pass some conditions for random row retrieval to the storage engine, but it feels like this would complicate storage engine implementations by quite a bit.

For the first iteration, after considering your input, I'd go with "init function", "get random row", "end function", without imposing an index, but somehow passing a (COND or similar) arg to the init function. Sounds reasonable, or too much?

Vicențiu

Sergey had a good point about being able to repeat exactly the same
sampling by specifying the seed. This should be solved if the engine will get its random numbers from the thd_rnd service.
Regards, Sergei
Hi, Vicențiu!

On Dec 11, Vicențiu Ciorbaru wrote:
On Tue, 11 Dec 2018 at 14:33 Sergei Golubchik <serg@mariadb.org> wrote:
But then I was thinking, why do you need to specify an index at all? Shouldn't it be just "get me a random row"? Index or whatever - that's an engine implementation detail. For example, MyISAM with fixed-size rows can just read from lseek(floor((file_size/row_size)*rand())*row_size).
I agree that the need for an index seems a bit much. My reasoning was that I wanted to allow random sampling on a particular range. This could help for example when one wants to collect histograms for a multi-distribution dataset, to get individual distributions (if the indexed column is able to separate them).
A more generic idea would be if one could pass some conditions for random row retrieval to the storage engine, but it feels like this would complicate storage engine implementation by quite a bit.
For the first iteration, after considering your input, I'd go with "init function", "get random row", "end function", without imposing an index, but somehow passing a (COND or similar) arg to the init function.
For the first iteration I'd go without a condition. You probably shouldn't add an API that you won't use, and in the first iteration you won't use it, right? It can be added later when needed.

Regards,
Sergei
Hi Sergei!
For the first iteration, after considering your input, I'd go with
"init function", "get random row", "end function", without imposing an index, but somehow passing a (COND or similar) arg to the init function.
For the first iteration I'd go without a condition. You probably shouldn't add an API that you won't use, and in the first iteration you won't use it, right? It can be added later when needed.
That's my dilemma: do we want to fiddle with the API frequently as more development work happens? If that's the case, I'll use the simplest possible format that covers my use case for the time being.

Vicențiu
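P.S. For reference, a rough draft of what that simplest first-iteration shape could look like in handler.h - no index, no condition, names kept from the earlier patch. This is a sketch of the idea, not a committed interface:

  /* Defaults for engines that do not support random sampling: */
  virtual int random_sample_init() { return 0; }
  virtual int random_sample()      { return HA_ERR_WRONG_COMMAND; }
  virtual int random_sample_end()  { return 0; }

  /* ha_ wrappers called by the server layer; the sampled row is   */
  /* assumed to land in table->record[0] like other handler reads. */
  int ha_random_sample_init() __attribute__((warn_unused_result))
  {
    DBUG_ENTER("ha_random_sample_init");
    inited= RANDOM;
    DBUG_RETURN(random_sample_init());
  }
  int ha_random_sample() __attribute__((warn_unused_result))
  {
    DBUG_ENTER("ha_random_sample");
    DBUG_ASSERT(inited == RANDOM);
    DBUG_RETURN(random_sample());
  }
  int ha_random_sample_end() __attribute__((warn_unused_result))
  {
    DBUG_ENTER("ha_random_sample_end");
    inited= NONE;
    DBUG_RETURN(random_sample_end());
  }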
participants (3):
- Sergei Golubchik
- Sergey Petrunia
- Vicențiu Ciorbaru