Hi, Vicențiu!
On Dec 11, Vicențiu Ciorbaru wrote:
> Hi!
>
> Here is my proposal on extending the storage engine API to provide a
> functionality for retrieving random rows from tables (those that have
> indexes). The storage engines for which I plan to implement this are:
> MyISAM, Aria, Innodb. Possibly RocksDB, TokuDB.
...
> Maybe I am complicating it too much and a simple random_sample()
> function is sufficient, kind of how ha_records_in_range does it.
Yeah. My first thought was something like index_random(). That is,
index_init(), index_random(), index_random(), ..., index_end().
index_init() does whatever it always needs to do to prepare using an
index. and index_random() gets one random row.
Not like records_in_range, that takes the index number and does
_init/_end internally, because records_in_range is typically done only
once or just a few times, while index_random() is more like
index_next(), will be repeated continuously.
But then I was thinking, why do you need to specify an index at all?
Shouldn't it be just "get me a random row"? Index or whatever - that's
engine implementation detail. For example, MyISAM with a fixed-size rows
can just read from lseek(floor((file_size/row_size)*rand())*row_size).
I agree that the need for an index seems a bit much. My reasoning was that I wanted to allow random sampling on a particular range. This could help for example when one wants to collect histograms for a multi-distribution dataset, to get individual distributions (if the indexed column is able to separate them).
A more generic idea would be if one could pass some conditions for random row retrieval to the storage engine, but it feels like this would complicate storage engine implementation by quite a bit.
For the first iteration, after considering your input, I'd go with "init function", "get random row", "end function", without imposing an index, but somehow passing a (COND or similar) arg to the init function.
Sounds reasonable, or too much?
Vicențiu
Sergey had a good point about being able to repeat exactly the same
sampling by specifying the seed. This should be solved if the engine
will get its random numbers from the thd_rnd service.
Regards,
Sergei