Hi, MariaDB community!
I'm implementing the patch
for Using JSON as the on-disk format for MariaDB
histograms(https://jira.mariadb.org/browse/MDEV-21130). as part of GSoC
'21 and hoping to get insights and ideas from the community on how the
JSON format for histograms should be structured.
We are planning to change the columns of `mysql.column_stats` into something like:
CREATE TABLE `column_stats` (
`db_name` varchar(64) COLLATE utf8mb3_bin NOT NULL,
`table_name` varchar(64) COLLATE utf8mb3_bin NOT NULL,
`column_name` varchar(64) COLLATE utf8mb3_bin NOT NULL,
`histogram` blob
);
The histogram column above will then be responsible for storing the JSON document for that row.
Currently,
the JSON structure we are looking at is based on the existing columns
in `column_stats` table as well MySQL's JSON histogram
implementation(https://dev.mysql.com/doc/refman/8.0/en/optimizer-statistics.html).
Here's an example:
{
"min_value": 0,
"max_value": 0,
"hist_type": "singleton",
"bucket": [
[1,0.3333333333333333],
[2, 0.6666666666666666],
[3,1]
],
"nulls_ratios": 0,
"last_updated": "2021-06-02 13:32:40.000000",
"sampling_rate": 1,
}
- The min_value and max_values retain the meanings described at (https://mariadb.com/kb/en/mysqlcolumn_stats-table/).
- hist_type can be one of `SINGLETON` (where the number of buckets is
>= the number of distinct values in the specified column) or
`EQUI-HEIGHT` (where the number of buckets is < the number of distinct values found in the column).
- nulls_ratio is a value
between 0.0 and 1.0 indicating the fraction of column values that are
SQL NULL values. If 0, the column contains no NULL values.
- `last_updated` indicates when the histogram was generated, as a UTC value in YYYY-MM-DD hh:mm:ss.uuuuuu format.
-
`sampling_rate` is also a value between 0.0 and 1.0 indicating the fraction of column values that were sampled to create the histogram. A
value of 1 means that all of the data was read (no sampling).