
Dear Sergei Golubchik,

That's a good question! I would like to discuss the VIDEX architecture, which is inspired by industry practice at ByteDance, particularly in its cloud environment. You may also refer to the VIDEX arXiv paper, which presents further design considerations [1].

VIDEX aims to simulate the query plans of a target instance (holding a large volume of data) in the absence of the actual data. We categorize the information that the upper-layer optimizer requests from the underlying engine into two types:

1. **Simple statistics**: These can be obtained straightforwardly and efficiently from the target instance (e.g., mysql.innodb_index_stats, information_schema.TABLES), and include values such as table_rows, AVG_ROW_LENGTH, and clustered_index_size.
2. **Complex statistics**: These cannot be obtained directly and usually require estimation, primarily multi-column NDV (number of distinct values) estimation needed for records_per_key, and multi-column cardinality estimation needed for records_in_range. Since what-if analysis may create indexes on any combination of columns (up to 2^N combinations), we must rely on heuristic or AI-boosted algorithmic models for estimation rather than enumerating and pre-storing all combinations.

## Q1: Why are AI-boosted cardinality and NDV algorithms required?

Simple statistics are logged directly into system tables. Accurately estimating multi-column joint distributions, however, poses significant challenges: rule-based or heuristic methods often produce inaccurate results (e.g., PostgreSQL's native implementation). Much recent research [2] therefore employs query-driven or data-driven AI models to improve the precision of NDV and cardinality estimates. The advantage of AI models is that they learn data distributions from data or queries without actually creating indexes or collecting statistics, providing more generalized and accurate estimates of cardinality and NDV.
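To make the limitation of heuristic methods concrete, here is a minimal, self-contained sketch (illustrative only, not VIDEX's actual code or API) of the common independence-assumption estimate for multi-column NDV, NDV(a, b) ≈ min(NDV(a) × NDV(b), row_count), compared against the true joint NDV on correlated columns:

```python
# Illustrative sketch: heuristic multi-column NDV under the column-independence
# assumption vs. the true joint NDV. Names and data here are hypothetical.

def ndv(values):
    """Single-column NDV (number of distinct values)."""
    return len(set(values))

def ndv_independent(col_a, col_b):
    """Heuristic joint NDV assuming the columns are independent:
    NDV(a, b) ~= min(NDV(a) * NDV(b), row_count)."""
    return min(ndv(col_a) * ndv(col_b), len(col_a))

def ndv_exact(col_a, col_b):
    """Exact joint NDV -- requires scanning the full data, which is
    unavailable in a what-if (virtual index) setting."""
    return len(set(zip(col_a, col_b)))

# Perfectly correlated columns: each city determines its country.
city    = ["Paris", "Lyon", "Berlin", "Munich", "Paris", "Lyon"] * 100
country = ["FR",    "FR",   "DE",     "DE",     "FR",    "FR"]   * 100

print(ndv_independent(city, country))  # 4 * 2 = 8 (overestimate)
print(ndv_exact(city, country))        # only 4 distinct (city, country) pairs
```

On this correlated pair the heuristic overestimates by 2x, and the gap grows with the number of correlated columns; learned estimators aim to capture exactly such joint structure without scanning the data.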
## Q2: Why separate a RESTful Statistic server from the VIDEX engine side?

Although much research has been conducted, most of it is not practically applicable in production. For instance, the PostgreSQL cardinality-hacking approach [3] requires researchers to write fixed-order cardinality results to a file, which the PostgreSQL optimizer then reads; this is not conducive to runtime operation. Moreover, AI researchers face substantial hurdles when integrating AI models into database engines: on one hand, the engines expose complex low-level interfaces; on the other, AI models frequently require Python and GPU environments.

We advocate a separate RESTful Statistic server that facilitates the seamless integration of various models (AI and non-AI) into the query planning of MariaDB/MySQL. This approach has proven genuinely practical: validated through extensive use in ByteDance's cloud service, VIDEX processes over 5,000 index recommendation tasks and tens of thousands of slow SQL queries daily. Furthermore, we have developed multiple NDV and cardinality algorithms, accepted by VLDB and SIGMOD [4][5][6], and successfully integrated them into VIDEX.

## Q3: How to Balance Estimation Accuracy and Architectural Simplicity

In the current implementation, we forward all requests (whether simple or complex) to the server for simplicity. However, this requires an additional server module. Ideally, we believe VIDEX can strike a balance between estimation accuracy and architectural simplicity:

1. **Heuristic estimation mode**: All statistical data is stored in the database's statistic tables (e.g., table_rows, cluster_size, single-column NDV, single-column histograms). The VIDEX plugin uses simple assumptions (e.g., independence between columns, uniform distribution within a single column) to provide estimates for records_in_range and records_per_key.
2.
**AI-boosted estimation mode**: To enhance accuracy, complex multi-column joint NDV and cardinality estimates are forwarded to a RESTful statistic server, which can be developed in Python, C, or Java and deployed in CPU or GPU environments.

[1] VIDEX design paper: Rong Kang, Shuai Wang, et al. VIDEX: A Disaggregated and Extensible Virtual Index for the Cloud and AI Era. https://arxiv.org/abs/2503.23776
[2] AI models in DB optimization: Yuxing Han et al. Cardinality Estimation in DBMS: A Comprehensive Benchmark Evaluation. Proc. VLDB 2021.
[3] PG cardinality hacking: https://github.com/Nathaniel-Han/End-to-End-CardEst-Benchmark/
[4] VIDEX NDV paper 1: Xianghong Xu, Tieying Zhang, et al. AdaNDV: Adaptive Number of Distinct Value Estimation via Learning to Select and Fuse Estimators. VLDB 2025 (accepted).
[5] VIDEX NDV paper 2: Xianghong Xu, Xiao He, et al. PLM4NDV: Minimizing Data Access for Number of Distinct Values Estimation with Pre-trained Language Models. SIGMOD 2025 (accepted).
[6] VIDEX cardinality paper: Peizhi Wu, Rong Kang, et al. Data-Agnostic Cardinality Learning from Imperfect Workloads. VLDB 2025 (accepted).

I hope this addresses your concerns. If you have any further questions, please feel free to continue the discussion.

Best,
Rong, ByteBrain Team, ByteDance