Request to Adapt VIDEX (Virtual Index for what-if Analysis) for MariaDB

Dear MariaDB Community,

I hope this message finds you well. We'd like to introduce VIDEX (Virtual Index for MySQL) and humbly seek your guidance and assistance in adapting it for MariaDB compatibility.

We've developed a virtual index storage engine (plugin) called VIDEX. It can simulate the impact of indexes on query plan costs without requiring real data. VIDEX forwards the statistical information, NDV (records_per_key), and cardinality information (records_in_range) needed for building query plans via a RESTful approach, with responses provided by an "algorithm service." This disaggregated design is particularly well suited to integrating various AI-boosted cardinality and NDV algorithms. VIDEX has been successfully deployed at ByteDance, serving thousands of MySQL instances and millions of SQL queries daily for index optimization tasks.

We implemented VIDEX on MySQL 8.0 and now, thanks to a suggestion from Federico Razzoli (LinkedIn post 1 <https://www.linkedin.com/feed/update/urn:li:activity:7307955728613351424/>, LinkedIn post 2 <https://www.linkedin.com/posts/federicorazzoli_github-bytedancevidex-virtual-index-for-activity-7308254768194342912-owBu/>, GitHub issue <https://github.com/bytedance/videx/issues/1>), we'd like to adapt it to MariaDB. However, we've discovered significant differences between MySQL 8.0 and the latest MariaDB codebase. We're therefore seeking help from the MariaDB community and would greatly appreciate collaboration with experienced developers to modify the code; we're very willing to cooperate in this effort. In the future, we also hope to have the opportunity to add VIDEX to the official MariaDB plugin list.

*About VIDEX*

VIDEX GitHub repository: https://github.com/bytedance/videx/

Virtual index technology is already available in PostgreSQL, Oracle, and IBM DB2, and would provide MariaDB users with valuable capabilities for query optimization and database research. VIDEX offers key features that would benefit MariaDB users:

- What-if analysis for index strategies without production impact
- Support for integration of custom or AI-based estimation algorithms
- Accurate simulation of query plans, verified with complex benchmarks like TPC-H and JOB

If you're interested in collaborating or have guidance to offer on adapting our code to MariaDB, we'd greatly appreciate hearing from you. Thank you for your consideration.

Best regards,
Rong
ByteBrain Team, ByteDance

Hello,

A MariaDB query optimizer member here. I think it would be nice to get VIDEX for MariaDB.

As far as I'm informed, the basics of the interaction between the query optimizer and the storage engine are the same in MariaDB and MySQL:

- the ha_engine::info() call
- the ha_engine::records_in_range() call
- the extended keys feature

The execution part (index/table read methods) is also very similar, although there are some cosmetic differences. There are bigger differences in how partitioning is handled, in the data dictionary, and in column histograms. I think it's feasible to work around those.

To sum up, at first glance adaptation to MariaDB seems doable. I'd like to study VIDEX further and see where/how we MariaDB devs could lend a hand.

Best Regards,
--
Sergei Petrunia
Query Optimizer team
MariaDB Corporation

On Wed, Apr 16, 2025 at 09:41:16PM -0500, 康荣 via discuss wrote:

Dear Sergei Petrunia,

I'm delighted to see your interest in adapting VIDEX for MariaDB and am very willing to collaborate on this effort.

Regarding the interaction between the query optimizer and storage engine, your understanding is indeed comprehensive and accurate. Additionally, there might be some configuration related to index status, such as `handlerton->flags`. Furthermore, VIDEX primarily focuses on query optimizer-related operations, and thus skips all interfaces related to index queries (like `index_first`, `index_read`), which I believe could be skipped similarly in MariaDB.

I'm very open to discussing the best collaboration approach. Given your expertise on the MariaDB Query Optimizer team and your deep understanding of its internal interfaces, I would be especially grateful if you could consider contributing to the MariaDB plugin part of this adaptation. Your insights into MariaDB's architecture would be invaluable for efficiently modifying the VIDEX plugin to work seamlessly with MariaDB.

Besides the MariaDB plugin part, I can take responsibility for any necessary modifications to the VIDEX Statistics Server and the existing MySQL plugin. Our goal is to have the VIDEX statistics server provide a unified interface that's compatible with both MariaDB and MySQL, rather than forcing MariaDB to adapt to the existing MySQL interface. Of course, I could also handle validation work on various benchmarks (TPC-H, JOB, etc.) to ensure everything works properly.

Thank you again for your interest and support. I look forward to further discussing the details with you.

Best regards,
Rong
ByteBrain Team, ByteDance

Hi, Rong,

Could you please explain why you chose this particular architecture? What is the use case?

You refer to PostgreSQL, Oracle, and IBM DB2, but none of them involves talking to a separate server over a REST API, and none mentions "AI-boosted cardinality and ndv algorithms". So, what are "AI-boosted cardinality and ndv algorithms", and why a separate server?

Regards,
Sergei
Chief Architect, MariaDB Server
and security@mariadb.org

On Apr 16, Rong Kang via discuss wrote:

Dear Sergei Golubchik,

That's a good question! I would like to discuss the VIDEX architecture, which is inspired by industry practices at ByteDance, particularly in its cloud environment. Additionally, you may refer to the VIDEX arXiv paper, which presents further design considerations [1].

VIDEX aims to simulate the query plans of a target instance (with a large volume of data) in the absence of actual data. We categorize the information requested by the upper-layer optimizer from the underlying engine into two types:

1. **Simple statistic data**: This can be straightforwardly and efficiently obtained from the target instance (e.g., mysql.innodb_index_stats, information_schema.TABLES, etc.). It includes data such as table_rows, AVG_ROW_LENGTH, and clustered_index_size.
2. **Complex statistic data**: This includes statistics that cannot be directly obtained and usually require estimation, primarily the multi-column NDV (number of distinct values) estimation needed for records_per_key, and the multi-column cardinality estimation needed for records_in_range. Considering that what-if analysis might create indexes for any combination of columns (up to 2^N combinations), we must employ heuristic or AI-boosted algorithmic models for estimation, rather than enumerating and pre-storing all combinations.

## Q1: Why are AI-boosted cardinality and NDV algorithms required?

Simple statistics are directly logged into system tables. However, accurately estimating multi-column joint distributions poses significant challenges: rule-based or heuristic methods often result in inaccuracies (e.g., PostgreSQL's native implementation). Much recent research [2] employs query-driven or data-driven AI models to enhance the precision of NDV and cardinality estimates. The advantage of AI models is that they learn data distributions from data or queries without actually creating indexes or running statistics, providing more generalized and accurate estimates of cardinality and NDV.
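To make the first category concrete, the simple statistics can be pulled from standard MySQL system tables with plain SQL. Below is an illustrative sketch (the dict layout and the helper function are hypothetical; the queries themselves use standard `information_schema.TABLES` and `mysql.innodb_index_stats` columns, and the `%s` placeholder style assumes a DB-API driver such as PyMySQL):

```python
# Sketch: SQL to fetch "simple statistic data" from a target MySQL instance.
# Connection handling is elided; a real collector would also fetch
# per-index n_diff_pfx* rows for single-column NDV.
SIMPLE_STATS_QUERIES = {
    "table_stats": (
        "SELECT TABLE_ROWS, AVG_ROW_LENGTH, DATA_LENGTH "
        "FROM information_schema.TABLES "
        "WHERE TABLE_SCHEMA = %s AND TABLE_NAME = %s"
    ),
    "clustered_index_pages": (
        "SELECT stat_value FROM mysql.innodb_index_stats "
        "WHERE database_name = %s AND table_name = %s "
        "AND index_name = 'PRIMARY' AND stat_name = 'size'"
    ),
}

def fetch_simple_stats(cursor, schema, table):
    """Run each query with (schema, table) bound and keep the first row."""
    out = {}
    for name, sql in SIMPLE_STATS_QUERIES.items():
        cursor.execute(sql, (schema, table))
        out[name] = cursor.fetchone()
    return out
```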
## Q2: Why separate a RESTful statistic server from the VIDEX engine side?

Although much research has been conducted, most of it is not practically applicable in production. For instance, the Postgres cardinality hacking approach [3] requires researchers to write fixed-order cardinality results to a file, which the Postgres optimizer then reads. This method is not conducive to runtime operation. Moreover, AI researchers encounter substantial challenges when integrating AI models into database engines: on one hand, these engines have complex underlying interfaces; on the other, AI models frequently require Python and GPU environments.

We advocate for a separate RESTful statistic server that facilitates the seamless integration of various models (AI and non-AI) into the query plans of MariaDB/MySQL. This method has proven to be genuinely practical. Validated through extensive use in ByteDance's cloud service, VIDEX processes over 5,000 index recommendation tasks and tens of thousands of slow SQL queries daily. Furthermore, we have developed and successfully integrated multiple NDV and cardinality algorithms into VIDEX, accepted by VLDB and SIGMOD [4][5][6].

## Q3: How to balance estimation accuracy and architectural simplicity

In the current implementation, we forward all requests (whether simple or complex) to the server for simplicity. However, this involves an additional server module. Ideally, we believe VIDEX can strike a balance between estimation accuracy and architectural simplicity:

1. **Heuristic estimation mode**: All statistical data is stored in the database's statistic tables (e.g., table_rows, cluster_size, single-column NDV, single-column histograms). The VIDEX plugin uses simple assumptions (e.g., independence between columns, uniform distribution within a single column) to provide estimates for records_in_range and records_per_key.
2. **AI-boosted estimation mode**: To enhance accuracy, complex multi-column joint NDV and cardinality estimates are forwarded to a RESTful statistic server, which can be developed in Python, C, or Java and deployed in CPU or GPU environments.

[1] VIDEX design paper: Rong Kang, Shuai Wang, et al. VIDEX: A Disaggregated and Extensible Virtual Index for the Cloud and AI Era. https://arxiv.org/abs/2503.23776
[2] AI models in DB optimization: Yuxing Han et al. Cardinality Estimation in DBMS: A Comprehensive Benchmark Evaluation. Proc. VLDB 2021.
[3] PG cardinality hacking: https://github.com/Nathaniel-Han/End-to-End-CardEst-Benchmark/
[4] VIDEX NDV paper 1: Xianghong Xu, Tieying Zhang, et al. AdaNDV: Adaptive Number of Distinct Value Estimation via Learning to Select and Fuse Estimators. VLDB 2025 (accepted).
[5] VIDEX NDV paper 2: Xianghong Xu, Xiao He, et al. PLM4NDV: Minimizing Data Access for Number of Distinct Values Estimation with Pre-trained Language Models. SIGMOD 2025 (accepted).
[6] VIDEX cardinality paper: Peizhi Wu, Rong Kang, et al. Data-Agnostic Cardinality Learning from Imperfect Workloads. VLDB 2025 (accepted).

I hope this addresses your concerns. If you have any further questions, please feel free to continue the discussion.

Best,
Rong, ByteBrain Team, ByteDance
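For illustration, the "algorithm service" side of this disaggregated design can be sketched in a few lines of Python. The endpoint names, JSON request schema, and toy statistics below are hypothetical, not VIDEX's actual wire protocol; the responses use naive heuristics of the kind described under the heuristic estimation mode:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Toy statistics for one table; a real deployment would load these from
# the target instance or from a trained model.
STATS = {"customer": {"table_rows": 150_000, "ndv": {"c_custkey": 150_000}}}

def estimate(kind, table, request):
    """Return a row-count estimate for one optimizer callback."""
    stats = STATS.get(table, {"table_rows": 0, "ndv": {}})
    rows = stats["table_rows"]
    if kind == "records_per_key":
        # rows / NDV of the column, i.e. the average group size.
        return max(1, rows // stats["ndv"].get(request.get("column"), 1))
    if kind == "records_in_range":
        # Crude uniform assumption: caller supplies the key-range fraction.
        return max(1, int(rows * request.get("fraction", 0.1)))
    return rows  # e.g. an info()-style request for table_rows

class StatisticHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        body = json.loads(self.rfile.read(int(self.headers["Content-Length"])))
        value = estimate(self.path.lstrip("/"), body.get("table"), body)
        payload = json.dumps({"value": value}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(payload)

if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 5001), StatisticHandler).serve_forever()
```

The point of the split is visible in `estimate()`: swapping the naive heuristics for an AI model changes only this function, not the storage engine.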

Dear Sergey Petrunia and Sergei Golubchik,

I hope this message finds you well. I'm reaching out as a gentle follow-up to our previous discussion about adapting VIDEX for MariaDB compatibility. As we continue our preparation work, I wanted to seek your advice on any specific technical considerations or design adjustments we should address when adapting VIDEX to work with MariaDB's architecture.

I look forward to any guidance you might offer when you have a moment to respond. Thank you for your time.

Best regards,
Rong
ByteBrain Team, ByteDance

Dear Rong,

I've spent some time looking at Videx. I got it installed in a test environment, but I'd like to spend some more time to try it out and understand it better. I think I should be able to do this by the end of this week. Let me get back on this by then.

BR
Sergei Petrunia
--

On Mon, Apr 28, 2025 at 8:31 PM Rong Kang via discuss <discuss@lists.mariadb.org> wrote:

Dear Sergei Petrunia,

Thank you for your kind help and the update on your progress. I'm glad to hear things are moving forward. If you have any questions regarding the design or implementation of VIDEX, I'd be happy to assist. Additionally, I would greatly appreciate any suggestions for refactoring or optimizing the VIDEX plugin to better align with MariaDB.

Thank you again for your support.

Best regards,
Rong
ByteBrain Team, ByteDance

Hi Rong,

I am VP of product management @ MariaDB. I have a few follow-up questions for you:

1. What types of AI models have you seen work well in estimating cardinality and NDV, and how have you seen them work in production? I think what-if scenarios, set up in the correct way, could help AI models align to relatively well-optimized outputs, since data can be created to assist in AI training.
2. What are the limitations of VIDEX simulations in representing real-world database workloads that you have seen? What have you seen work in ensuring that AI models trained on VIDEX data generalize well to production environments?
3. Has ByteDance explored using LLMs or other natural language processing techniques in conjunction with VIDEX for database optimization tasks? If so, what were the outcomes, and are there recommendations for incorporating such models into an index automation system?

Thanks
Adam

Hi Adam,

Thank you for your thoughtful questions. I'm pleased to share our real-world experience at ByteDance, where VIDEX currently powers our internal index recommendation service, processing thousands of optimization tasks daily. We're also planning to launch it on our public cloud (https://www.volcengine.com/) within the next 1-2 quarters. Regarding your specific inquiries:

1. **AI models for cardinality and NDV estimation in production:** NDV estimation approaches fall into two categories: sampling-based and dataless. When partial data access is available, many classical NDV algorithms exist, though they typically excel only with specific distributions [1]. VIDEX employs AdaNDV (our work, accepted by VLDB'25 [1]), an adaptive approach that combines multiple NDV algorithms for optimal results. For users with strict privacy requirements or those needing rapid recommendations (<5s), we deploy PLM4NDV, our dataless solution accepted by SIGMOD'25 [2], which ranks among the leading approaches in this domain.

Cardinality estimation methods are generally classified as query-driven or data-driven. Data-driven methods (such as Naru and DeepDB) provide superior single-table cardinality accuracy but require greater preprocessing resources. For privacy-conscious cloud users or environments where full scans are restricted, query-driven methods like MSCN [4] are more appropriate. Our query-driven approach GRASP [3] has achieved state-of-the-art results and has been accepted at VLDB'25. For index recommendation in practice, we provide further details in point #2 below.

2. **Limitations and production generalization:** The primary challenge for VIDEX in production environments is accurately modeling multi-column joint distributions with limited data access. Our approach varies according to customer requirements:

- With sampling permission, we gather data via PK-based sampling and utilize AdaNDV for NDV estimation. We construct histograms for single-column cardinality estimation and employ correlation coefficients for multi-column cardinality.
- In zero-sampling scenarios, we rely on our pre-trained models (PLM4NDV and a dataless CardEst method).

Our testing across 5,000+ index recommendation tasks demonstrates that these approaches consistently outperform traditional sampling-based recommendations.

3. **Natural language models with VIDEX:** Regarding NDV, PLM4NDV (our SIGMOD 2025 paper) leverages pre-trained language models to extract semantic schema information without accessing actual data. This approach is particularly valuable in cloud environments where data access is restricted. Our models are pre-trained on thousands of public schema datasets, making them immediately applicable to new business scenarios without additional training. In terms of cardinality, we've achieved promising results using language models for entirely dataless cardinality estimation.

Thank you again for your interest. I welcome any additional questions regarding our research technology or business implementations.

[1] AdaNDV (our NDV work, VLDB 2025): Xu, X., Zhang, T., He, X., Li, H., Kang, R., Wang, S., ... & Chen, J. (2025). AdaNDV: Adaptive Number of Distinct Value Estimation via Learning to Select and Fuse Estimators.
[2] PLM4NDV (our language-model-based NDV work, SIGMOD 2025): Xu, X., He, X., Zhang, T., Zhang, L., Shi, R., & Chen, J. PLM4NDV: Minimizing Data Access for Number of Distinct Values Estimation with Pre-trained Language Models.
[3] GRASP (our query-driven cardinality work, VLDB 2025): Peizhi Wu, Rong Kang, Tieying Zhang, Jianjun Chen, Ryan Marcus, Zachary G. Ives. Data-Agnostic Cardinality Learning from Imperfect Workloads.
[4] MSCN: A. Kipf, T. Kipf, B. Radke, V. Leis, P. Boncz, and A. Kemper. "Learned Cardinalities: Estimating Correlated Joins with Deep Learning," Dec. 18, 2018. arXiv:1809.00677.

Best regards,
Rong
ByteBrain Team, ByteDance
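As a concrete instance of the sampling-based NDV estimation discussed above, here is the classic GEE estimator, one of the textbook baselines that adaptive methods such as AdaNDV select among and fuse. This is a sketch of that baseline only, not AdaNDV itself:

```python
import math
from collections import Counter

def gee_ndv(sample, table_rows):
    """GEE estimator (Charikar et al.): values seen exactly once in the
    sample are scaled up by sqrt(N/n); values seen twice or more are
    assumed to have been fully observed and are counted as-is."""
    n = len(sample)
    # f[i] = number of distinct values appearing exactly i times in the sample
    f = Counter(Counter(sample).values())
    f1 = f.get(1, 0)
    seen_often = sum(count for times, count in f.items() if times >= 2)
    return math.sqrt(table_rows / n) * f1 + seen_often
```

For example, a 4-row sample of all-distinct values from a 100-row table yields sqrt(100/4) * 4 = 20 estimated distinct values.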

Thank you Rong, great info! Will circle back if I have more questions. Appreciate the deep dive response!

Hi Rong,

Short: I've found a limitation in VIDEX which isn't mentioned on its homepage. Are you aware of this? The good part here is that when one hits the limitation the server returns an error (instead of silently producing a wrong query plan).

Long: I was looking at VIDEX and got this question: the MariaDB (and MySQL) optimizer may read individual table rows for constant tables, and further optimization makes use of the column values it has read. How does VIDEX deal with this situation? It looks like it doesn't.

A basic example using the test dataset. Assume VIDEX is set up as shown in README.md. Let's first try a query using a real table:

```sql
mysql> use tpch_tiny;
Database changed
mysql> explain select count(*) from customer,orders where c_custkey=o_custkey and c_custkey=150000;
+----+-------------+-------+------------+------+---------------+------+---------+------+------+----------+--------------------------------+
| id | select_type | table | partitions | type | possible_keys | key  | key_len | ref  | rows | filtered | Extra                          |
+----+-------------+-------+------------+------+---------------+------+---------+------+------+----------+--------------------------------+
|  1 | SIMPLE      | NULL  | NULL       | NULL | NULL          | NULL | NULL    | NULL | NULL |     NULL | no matching row in const table |
+----+-------------+-------+------------+------+---------------+------+---------+------+------+----------+--------------------------------+
1 row in set, 1 warning (0.00 sec)
```

Then, let's try VIDEX:

```sql
mysql> use videx_tpch_tiny;
Database changed
mysql> explain select count(*) from customer,orders where c_custkey=o_custkey and c_custkey=150000;
ERROR 1031 (HY000): Table storage engine for 'customer' doesn't have this option
```

One can say this query is not very meaningful, but that's just a basic example; one may construct more meaningful ones. Another example similar to the above:

```sql
mysql> explain select * from customer where c_custkey= (select max(c_custkey) from customer);
ERROR 1031 (HY000): Table storage engine for 'customer' doesn't have this option
```

BR
Sergei
--
Sergei Petrunia, Software Developer
MariaDB Corporation | Blog: http://petrunia.net

Hi Sergei:

Great observation! We've encountered this same issue internally. We'll add this limitation to the VIDEX README. Thank you.

We think this happens because MariaDB/MySQL don't completely separate query optimization from execution. When dealing with equality conditions on PK/UK columns, MySQL/MariaDB calls the index_init/index_read functions, which attempt to access actual data. For now, VIDEX simply returns `HA_ERR_WRONG_COMMAND`, resulting in the 1031 error.

We've explored several potential solutions:

1. Skip direct index execution by setting `subquery_to_derived=on`; it works on MySQL 8.0, e.g.:

```sql
SET optimizer_switch='subquery_to_derived=off';
explain select count(*) from customer,orders where c_custkey=o_custkey and c_custkey=150000;
-- (1031, "Table storage engine for 'customer' doesn't have this option")

SET optimizer_switch='subquery_to_derived=on';
explain select count(*) from customer,orders where c_custkey=o_custkey and c_custkey=150000;
+----+-------------+----------+------------+-------+---------------+------------+---------+-------+------+----------+-------------+
| id | select_type | table    | partitions | type  | possible_keys | key        | key_len | ref   | rows | filtered | Extra       |
+----+-------------+----------+------------+-------+---------------+------------+---------+-------+------+----------+-------------+
|  1 | SIMPLE      | customer | <null>     | const | PRIMARY       | PRIMARY    | 4       | const |    1 |    100.0 | Using index |
|  1 | SIMPLE      | orders   | <null>     | ref   | ORDERS_FK1    | ORDERS_FK1 | 4       | const |    1 |    100.0 | Using index |
+----+-------------+----------+------------+-------+---------------+------------+---------+-------+------+----------+-------------+
```

2. Keep the error as is; it's somewhat reasonable for index recommendation, since queries that can use a PK/UK often don't need additional indexes.
3. Let index_read directly return 0, indicating to the optimizer that no matching rows exist.
4. Provide virtual data in the buffer based on input conditions (this may involve complex implementation work).

We would appreciate your thoughts on the best solution path, for both MariaDB and MySQL.

Best,
Rong

Hi Rong,

I've discovered one possible issue. In MySQL, the records_in_range() call is defined as:

```cpp
virtual ha_rows records_in_range(uint inx [[maybe_unused]],
                                 key_range *min_key [[maybe_unused]],
                                 key_range *max_key [[maybe_unused]])
```

In MariaDB, it is:

```cpp
virtual ha_rows records_in_range(uint inx, const key_range *min_key,
                                 const key_range *max_key, page_range *res)
```

Note the page_range parameter. It is an "OUT" parameter, defined as:

```cpp
/* Store first and last leaf page accessed by records_in_range */
typedef struct st_page_range
{
  ulonglong first_page;
  ulonglong last_page;
} page_range;
```

The storage engine provides the SQL layer with information about how "far apart" the range endpoints are in the index. The SQL layer uses that information when computing the overall costs in handler::multi_range_read_info_const().

The commit that added this in MariaDB:
http://github.com/mariadb/server/commit/f36ca142f7fa59598e34e842b60372b96306...

I'm not sure how big the impact of this is.

BR
Sergei
--
Sergei Petrunia, Software Developer
MariaDB Corporation | Blog: http://petrunia.net
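To illustrate why the optimizer might care about this OUT parameter, here is a toy model of how a leaf-page span could feed into a range-scan cost. This is purely illustrative, under assumed names and a made-up cost shape, and is not MariaDB's actual formula in handler::multi_range_read_info_const():

```python
def range_io_cost(rows, first_page, last_page, rows_per_page=100):
    """Toy model: the number of leaf pages a range scan reads is bounded
    both by the density of matching rows (rows / rows_per_page) and by
    the physical span the engine reported via page_range. A tight span
    (first_page close to last_page) caps the cost from above."""
    pages_spanned = last_page - first_page + 1
    pages_by_density = max(1, rows // rows_per_page)
    return min(pages_spanned, pages_by_density)
```

Under this model, 1000 matching rows squeezed into pages 10..12 cost 3 page reads, while the same 1000 rows estimated without span information would cost 10; that is the kind of refinement the extra parameter makes possible.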

Hi Rong, It might not be important for the first test implementation, but I'm wondering, does VIDEX intend to handle multi-column hypothetical indexes? I see that videx_build_env.py only collects single-column histograms. If one has histograms on columns A, B, and then tries to produce a multi-column records_in_range estimate for INDEX(A, B), the results may be way off. (and, as a consequence, the page_range problem of records_in_range() would be unimportant in comparison). Is there any way to collect more statistics than what videx_build_env.py does? p.s. I've got a couple more things I'd like to check. I hope I'll get back with those soon. BR Sergei -- Sergei Petrunia, Software Developer MariaDB Corporation | Blog: http://petrunia.net

Hi Sergei:

You mention a very important algorithmic task: "multi-column cardinality estimation". This is a challenge that all what-if analysis databases need to address.

In the current VIDEX open-source version, we released a simple solution that assumes independence between columns, so the combined selectivity is the product of the per-column selectivities: Card(AB) = total_rows * (Card(A)/total_rows) * (Card(B)/total_rows). This performs well on benchmarks like TPC-H but tends to under-estimate in more complex scenarios.

In ByteDance's production environment, when sampling is permitted, we pre-collect up to 100k rows covering all columns that appear in query conditions, and estimate joint cardinality based on this sample. Without sampling, we use a pre-trained language model: faster but coarser. We are currently preparing a paper on this work and will release it in the future.

Nevertheless, existing methods still don't perfectly solve this problem. That's why we've opened the algorithm interfaces and welcome research contributions.

Best,
Rong
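The independence baseline described above can be written out in a few lines. This is an illustrative sketch of the assumption only (VIDEX's actual implementation lives in its statistics server):

```python
def independent_cardinality(total_rows, *column_cards):
    """Multi-column cardinality under the independence assumption:
    multiply the per-column selectivities Card(X)/N, then scale back to
    rows, i.e. Card(A,B) = N * (Card(A)/N) * (Card(B)/N)."""
    selectivity = 1.0
    for card in column_cards:
        selectivity *= card / total_rows
    return total_rows * selectivity
```

For example, with 1000 rows where condition A matches 100 rows and condition B matches 50, the combined estimate is 1000 * 0.1 * 0.05 = 5 rows; any positive correlation between A and B makes this an under-estimate, which is the failure mode noted above.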

Hi Sergei:

About the `page_range` parameter: I'm trying to understand the rationale behind this addition. As I understand it, records_in_range() calculates how many rows match the given condition; if so, the existing parameters (index, min_key, max_key) seem sufficient for cardinality estimation.

If my understanding is correct, `page_range` isn't information provided to the storage engine, but rather data that the engine populates and returns to the optimizer? For example, when the engine reports that first_page and last_page are the same, the index cost might be calculated as slightly lower?

This is quite an interesting feature. For our first phase, I think we could temporarily ignore it without causing significant estimation errors. After the first version, I'd like to create examples where page_range is effective and enhance VIDEX's support for page_range. What do you think? :)

Best,
Rong
participants (6)
- Adam Luciano
- Rong Kang
- Sergei Golubchik
- Sergei Petrunia
- Sergey Petrunia
- 康荣 (Rong Kang)