[Maria-developers] Request for Discussion: Structure for JSON histogram objects
Hi, MariaDB community! I'm implementing the patch for Using JSON as the on-disk format for MariaDB histograms(https://jira.mariadb.org/browse/MDEV-21130). as part of GSoC '21 and hoping to get insights and ideas from the community on how the JSON format for histograms should be structured. We are planning to change the columns of `mysql.column_stats` into something like: CREATE TABLE `column_stats` ( `db_name` varchar(64) COLLATE utf8mb3_bin NOT NULL, `table_name` varchar(64) COLLATE utf8mb3_bin NOT NULL, `column_name` varchar(64) COLLATE utf8mb3_bin NOT NULL, `histogram` blob ); The histogram column above will then be responsible for storing the JSON document for that row. Currently, the JSON structure we are looking at is based on the existing columns in `column_stats` table as well MySQL's JSON histogram implementation( https://dev.mysql.com/doc/refman/8.0/en/optimizer-statistics.html). Here's an example: { "min_value": 0, "max_value": 0, "hist_type": "singleton", "bucket": [ [1,0.3333333333333333], [2, 0.6666666666666666], [3,1] ], "nulls_ratios": 0, "last_updated": "2021-06-02 13:32:40.000000", "sampling_rate": 1, } - The min_value and max_values retain the meanings described at ( https://mariadb.com/kb/en/mysqlcolumn_stats-table/). - hist_type can be one of `SINGLETON` (where the number of buckets is >= the number of distinct values in the specified column) or `EQUI-HEIGHT` (where the number of buckets is < the number of distinct values found in the column). - nulls_ratio is a value between 0.0 and 1.0 indicating the fraction of column values that are SQL NULL values. If 0, the column contains no NULL values. - `last_updated` indicates when the histogram was generated, as a UTC value in YYYY-MM-DD hh:mm:ss.uuuuuu format. - `sampling_rate` is also a value between 0.0 and 1.0 indicating the fraction of column values that were sampled to create the histogram. A value of 1 means that all of the data was read (no sampling).
Just to clarify: we are looking at two options currently. Please find their description below. == Current mysql.column_stats == Currently, MariaDB defines mysql.column_stats as: CREATE TABLE column_stats ( db_name varchar(64) COLLATE utf8mb3_bin NOT NULL, table_name varchar(64) COLLATE utf8mb3_bin NOT NULL, column_name varchar(64) COLLATE utf8mb3_bin NOT NULL, min_value varbinary(255) DEFAULT NULL, max_value varbinary(255) DEFAULT NULL, nulls_ratio decimal(12,4) DEFAULT NULL, avg_length decimal(12,4) DEFAULT NULL, avg_frequency decimal(12,4) DEFAULT NULL, hist_size tinyint(3) unsigned DEFAULT NULL, hist_type enum('SINGLE_PREC_HB','DOUBLE_PREC_HB') COLLATE utf8mb3_bin DEFAULT NULL, histogram varbinary(255) DEFAULT NULL, PRIMARY KEY (db_name,table_name,column_name) ); It seems there is some redundancy here: the histogram column is of variable length, but there is also hist_size column which also stores length of the histogram (?). == Variant #1: Add JSON support to the current scheme * Keep all the same fields, but add another histogram type: JSON (more on the * name choice in the next email) * Change the definition of the 'histogram' column so it is not limited to 255 bytes. CREATE TABLE column_stats ( ... -- the same columns as before ... hist_type enum('SINGLE_PREC_HB','DOUBLE_PREC_HB', 'JSON') COLLATE utf8mb3_bin DEFAULT NULL, histogram blob DEFAULT NULL, PRIMARY KEY (db_name,table_name,column_name) ); Advantages: * The definition is close to the current definition * The data from the older version can be kept as-is: no migration or conversion is necessary * The format is backward-compatible: old server binary can read the records stored in the new format (we'll still need to verify this) Disadvantages: - depending on value of hist_type, histogram column holds either JSON or binary data, which is not a good data modeling practice. - JSON histogram holds real bucket bounds. The column's min_value and max_value will be stored twice: First, in min_value and max_value columns, and then in the first/last bound in the histogram. On the other hand, separate min_value and max_value columns allow storing min/max statistics without collecting the histogram. == Variant #2: Change to JSON-only == New mysql.column_stats will look like this: CREATE TABLE column_stats ( db_name varchar(64) COLLATE utf8mb3_bin NOT NULL, table_name varchar(64) COLLATE utf8mb3_bin NOT NULL, column_name varchar(64) COLLATE utf8mb3_bin NOT NULL, -- perhaps some columns like avg_length, avg_frequency histogram JSON, -- All histograms are in JSON PRIMARY KEY (db_name,table_name,column_name) ); The histogram will be always stored in JSON. Old histograms will be converted to the new format. Advantages - straightforward data modeling. - no "baggage" (Do we need old-format histograms, long term?) Disadvantages - This is not compatible with the old data format. - If we chose to support legacy histograms, we'll need to convert them into JSON (This should be fairly easy to with DECODE_HISTOGRAM function is useful for that) Any thoughts on this anyone? BR Sergei -- Sergei Petrunia, Software Developer MariaDB Corporation | Skype: sergefp | Blog: http://petrunia.net
Hi, Sergey! One point here, that we've learned the hard way on privilege tables, store the current server version too. So that when (not if!) you'll change the structure in the future, you won't need to resort to various heuristics when interpreting a row. On Jun 04, Sergey Petrunia wrote:
Regards, Sergei VP of MariaDB Server Engineering and security@mariadb.org
participants (3)
-
Michael Okoko
-
Sergei Golubchik
-
Sergey Petrunia