Request to Adapt VIDEX (Virtual Index for what-if Analysis) for MariaDB

Dear MariaDB Community, I hope this message finds you well. We'd like to introduce VIDEX (Virtual Index for MySQL) and humbly seek your guidance and assistance in adapting it for MariaDB compatibility. We've developed a virtual index storage engine (plugin) called VIDEX. It can simulate the impact of indexes on query plan costs without requiring real data. VIDEX forwards the statistical information, ndv (records_per_key), and cardinality information (records_in_range) needed for making query plans via a RESTful approach, with responses provided by an "algorithm service." This disaggregated design is particularly well-suited for integrating various AI-boosted cardinality and ndv algorithms. VIDEX has been successfully deployed at ByteDance, serving thousands of MySQL instances daily and over millions of SQL queries for index optimization tasks. We implemented VIDEX based on MySQL 8.0 and now, thanks to Federico Razzoli for his suggestion (LinkedIn post 1 <https://www.linkedin.com/feed/update/urn:li:activity:7307955728613351424/>, LinkedIn post 2 <https://www.linkedin.com/posts/federicorazzoli_github-bytedancevidex-virtual-index-for-activity-7308254768194342912-owBu/>, , Github Issue <https://github.com/bytedance/videx/issues/1>), we'd like to adapt it to MariaDB. However, we've discovered significant differences between MySQL 8.0 and the latest MariaDB codebase. Therefore, we're seeking help from the MariaDB community, and would greatly appreciate collaboration with experienced developers to modify the code. We're very willing to cooperate in this effort. In the future, we also hope to have the opportunity to add VIDEX to the official MariaDB plugin list. *About VIDEX* VIDEX GitHub repository: https://github.com/bytedance/videx/ Virtual index technology is already available in PostgreSQL, Oracle, and IBM DB2, and would provide MariaDB users with valuable capabilities for query optimization and database research. VIDEX offers key features that would benefit MariaDB users: - What-if analysis for index strategies without production impact - Support for integration of custom or AI-based estimation algorithms - Accurate simulation of query plans, verified with complex benchmarks like TPC-H and JOB If you're interested in collaborating or have guidance to offer on adapting our code to MariaDB, we'd greatly appreciate hearing from you. Thank you for your consideration. Best regards, Rong ByteBrain Team, Bytedance

Hello, A MariaDB query optimizer member here. I think it would be nice to get VIDEX for MariaDB. As far I'm informed, the basics of interaction between the query optimizer and storage engine are the same between MariaDB and MySQL: - ha_engine::info() call - ha_engine::records_in_range() call - extended keys feature Execution part (index/table read methods) is also very similar although there are some cosmetic differences. There are bigger differences in how partitioning is handled, data dictionary and column histograms. I think it's feasible to work around those. To sum up, at first glance adoption to MariaDB seems doable. I'd like to study VINDEX further and see where/how we MariaDB devs could lend a hand. Best Regards, -- Sergei Petrunia Query Optimizer team MariaDB Corporation On Wed, Apr 16, 2025 at 09:41:16PM -0500, 康荣 via discuss wrote:
Dear MariaDB Community,
I hope this message finds you well. We'd like to introduce VIDEX (Virtual Index for MySQL) and humbly seek your guidance and assistance in adapting it for MariaDB compatibility.
We've developed a virtual index storage engine (plugin) called VIDEX. It can simulate the impact of indexes on query plan costs without requiring real data. VIDEX forwards the statistical information, ndv (records_per_key), and cardinality information (records_in_range) needed for making query plans via a RESTful approach, with responses provided by an "algorithm service." This disaggregated design is particularly well-suited for integrating various AI-boosted cardinality and ndv algorithms. VIDEX has been successfully deployed at ByteDance, serving thousands of MySQL instances daily and over millions of SQL queries for index optimization tasks.
We implemented VIDEX based on MySQL 8.0 and now, thanks to Federico Razzoli for his suggestion (LinkedIn post 1, LinkedIn post 2, , Github Issue), we'd like to adapt it to MariaDB. However, we've discovered significant differences between MySQL 8.0 and the latest MariaDB codebase. Therefore, we're seeking help from the MariaDB community, and would greatly appreciate collaboration with experienced developers to modify the code. We're very willing to cooperate in this effort. In the future, we also hope to have the opportunity to add VIDEX to the official MariaDB plugin list.
About VIDEX VIDEX GitHub repository: https://github.com/bytedance/videx/
Virtual index technology is already available in PostgreSQL, Oracle, and IBM DB2, and would provide MariaDB users with valuable capabilities for query optimization and database research. VIDEX offers key features that would benefit MariaDB users:
• What-if analysis for index strategies without production impact • Support for integration of custom or AI-based estimation algorithms • Accurate simulation of query plans, verified with complex benchmarks like TPC-H and JOB
If you're interested in collaborating or have guidance to offer on adapting our code to MariaDB, we'd greatly appreciate hearing from you.
Thank you for your consideration.
Best regards, Rong ByteBrain Team, Bytedance
_______________________________________________ discuss mailing list -- discuss@lists.mariadb.org To unsubscribe send an email to discuss-leave@lists.mariadb.org

Sergey Petrunia wrote:
Hello,
A MariaDB query optimizer member here.
I think it would be nice to get VIDEX for MariaDB.
As far I'm informed, the basics of interaction between the query optimizer and storage engine are the same between MariaDB and MySQL:
- ha_engine::info() call - ha_engine::records_in_range() call - extended keys feature
Execution part (index/table read methods) is also very similar although there are some cosmetic differences.
There are bigger differences in how partitioning is handled, data dictionary and column histograms. I think it's feasible to work around those.
To sum up, at first glance adoption to MariaDB seems doable. I'd like to study VINDEX further and see where/how we MariaDB devs could lend a hand.
Best Regards,
Dear Sergei Petrunia, I'm delighted to see your interest in adapting VIDEX for MariaDB and am very willing to collaborate on this effort. Regarding the interaction between the query optimizer and storage engine, your understanding is indeed comprehensive and accurate. Additionally, there might be some configurations related to index status, such as `handlerton->flags`. Furthermore, VIDEX primarily focuses on query optimizer-related operations, thus skipping all interfaces related to index queries (like `index_first`, `index_read`), which I believe could be skipped similarly in MariaDB. I'm very open to discussing the best collaboration approach. Given your expertise in the MariaDB Query Optimizer team and your deep understanding of its internal interfaces, I would be especially grateful and it would be extremely helpful if you could consider contributing to the MariaDB plugin part in this adaptation. Your insights into MariaDB's architecture would be invaluable for efficiently modifying the VIDEX plugin to work seamlessly with MariaDB. Besides the MariaDB plugin part, I can take responsibility for any necessary modifications to the VIDEX Statistics Server and the existing MySQL plugin. Our goal is to have the VIDEX statistics server provide a unified interface that's compatible with both MariaDB and MySQL, rather than forcing MariaDB to adapt to the existing MySQL interface. Of course, I could also handle validation work on various benchmarks (TPC-H, JOB, etc.) to ensure everything works properly. Thank you again for your interest and support. I look forward to further discussing the details with you. Best regards, Rong ByteBrain Team, Bytedance

Hi, Rong, Could you please explain why did you chose this particular architecture? What is the use case? You refer to PostgreSQL, Oracle, and IBM DB2, but neither of them involves taking via REST API to a separate server and neither mentions "AI-boosted cardinality and ndv algorithms". So, what is "AI-boosted cardinality and ndv algorithms" and why a separate server? Regards, Sergei Chief Architect, MariaDB Server and security@mariadb.org On Apr 16, Rong Kang via discuss wrote:
We've developed a virtual index storage engine (plugin) called VIDEX. It can simulate the impact of indexes on query plan costs without requiring real data. VIDEX forwards the statistical information, ndv (records_per_key), and cardinality information (records_in_range) needed for making query plans via a RESTful approach, with responses provided by an "algorithm service." This disaggregated design is particularly well-suited for integrating various AI-boosted cardinality and ndv algorithms. VIDEX has been successfully deployed at ByteDance, serving thousands of MySQL instances daily and over millions of SQL queries for index optimization tasks.
*About VIDEX* VIDEX GitHub repository: https://github.com/bytedance/videx/
Virtual index technology is already available in PostgreSQL, Oracle, and IBM DB2, and would provide MariaDB users with valuable capabilities for query optimization and database research. VIDEX offers key features that would benefit MariaDB users:
- What-if analysis for index strategies without production impact - Support for integration of custom or AI-based estimation algorithms - Accurate simulation of query plans, verified with complex benchmarks like TPC-H and JOB
If you're interested in collaborating or have guidance to offer on adapting our code to MariaDB, we'd greatly appreciate hearing from you.
Thank you for your consideration.

Dear Sergei Golubchik, That's a good question! I would like to discuss the VIDEX architecture, which is inspired by industry practices at ByteDance, particularly in its cloud environment. Additionally, you may refer to the VIDEX arXiv paper, which presents further design considerations [1]. VIDEX aims to simulate the query plans of a target instance (with large volume data) in the absence of actual data. We categorize the information requested by the upper-layer optimizer from the underlying engine into two types: 1. **Simple statistic data**: This can be straightforwardly and efficiently obtained from the target instance (e.g., mysql.innodb_index_stats, information_schema.TABLES, etc.). This includes data such as table_rows, AVG_ROW_LENGTH, and clustered_index_size. 2. **Complex statistic data**: These include complex statistics that cannot be directly obtained and usually require estimation, primarily multi-column NDV (number of distinct values) estimation needed for records_per_key, and multi-column cardinality estimation needed for records_in_range. Considering that what-if analysis might create indexes for any combination of columns (up to 2^N combinations), we must employ heuristic or AI-boosted algorithmic models for estimation, rather than enumerating and pre-storing all combinations. ## Q1: Why are AI-boosted cardinality and NDV algorithms required? Simple statistics are directly logged into system tables. However, accurately estimating multi-column joint distributions poses significant challenges. Rule-based or heuristic methods often result in inaccuracies (e.g. PostgreSQL's native implementation). Most of these research papers [2] employ query-driven or data-driven AI models to enhance the precision of NDV and cardinality estimates. The advantage of AI models is that they learn data distributions from data or queries without actually creating indexes or running statistics, providing more generalized and accurate estimates of cardinality and NDV. ## Q2: Why separate a RESTful Statistic server from the VIDEX engine side? Although much research has been conducted, most of it is not practically applicable in production. For instance, the Postgres cardinality hacking approach [3] requires researchers to write fixed-order cardinality results to a file, which the Postgres optimizer then reads. This method is not conducive to runtime operations. Moreover, AI researchers encounter substantial challenges when integrating AI models into database engines. On one hand, these engines must manage complex underlying interfaces; on the other, AI models frequently necessitate Python and GPU environments. We advocate for a seperate RESTful Statistic server that facilitates the seamless integration of various models (AI and non-AI) into the query plans of MariaDB/MySQL. This method has proven to be genuinely practical. Validated through extensive use at ByteDance's cloud service, VIDEX processes over 5,000 index recommendation tasks and tens of thousands of slow SQL queries daily. Furthermore, we have developed and successfully integrated multiple NDV and cardinality algorithms into VIDEX, accepted by VLDB and SIGMOD [4][5][6]. ## Q3: How to Balance Estimation Accuracy and Architectural Simplicity In the current implementation, we forward all requests (whatever simple or complex) to the server for simplicity. However, this involved an additional server module. Ideally, we believe VIDEX can strike a balance between estimation accuracy and architectural simplicity: 1. **Heuristic estimation mode**: All statistical data is stored in the database's statistic tables (e.g., table_rows, cluster_size, single-column NDV, single-column histograms). The VIDEX plugin uses simple assumptions (e.g., independence between columns, uniform distribution across a single column) to provide estimates for records_in_range and records_per_key. 2. **AI-boosted estimation mode**: To enhance accuracy, complex multi-column joint NDV and cardinality estimates are forwarded to a RESTful statistic server, which can be developed using Python, C, or Java, and deployed on CPU or GPU environments. [1] VIDEX design paper: Rong Kang, Shuai Wang, et al. VIDEX: A Disaggregated and Extensible Virtual Index for the Cloud and AI Era, https://arxiv.org/abs/2503.23776 [2] AI models in DB optimization: Yuxing Han et al., “Cardinality estimation in DBMS: a comprehensive benchmark evaluation,” Proc. VLDB 2021. [3] PG cardinality hacking: https://github.com/Nathaniel-Han/End-to-End-CardEst-Benchmark/ [3] VIDEX NDV paper 1: Xianghong Xu, Tieying Zhang, et al, AdaNDV: Adaptive Number of Distinct Value Estimation via Learning to Select and Fuse Estimators, VLDB, 2025 (accepted) [4] VIDEX NDV paper 2: Xianghong Xu, Xiao He, et al, PLM4NDV: Minimizing Data Access for Number of Distinct Values Estimation with Pre-trained Language Models, SIGMOD, 2025 (accepted) [5] VIDEX Cardinality paper 3: Peizhi Wu, Rong Kang, et al. Data-Agnostic Cardinality Learning from Imperfect Workloads, VLDB, 2025 (accepted) I hope this addresses your concerns. If you have any further questions, please feel free to continue the discussion. Best, Rong, ByteBrain Team, ByteDance
participants (4)
-
Rong Kang
-
Sergei Golubchik
-
Sergey Petrunia
-
康荣