Request to Adapt VIDEX (Virtual Index for what-if Analysis) for MariaDB

Dear MariaDB Community,

I hope this message finds you well. We'd like to introduce VIDEX (Virtual Index for MySQL) and humbly seek your guidance and assistance in adapting it for MariaDB compatibility.

We've developed a virtual index storage engine (plugin) called VIDEX. It can simulate the impact of indexes on query plan costs without requiring real data. VIDEX forwards the statistical information, NDV (records_per_key), and cardinality information (records_in_range) needed for building query plans via a RESTful approach, with responses provided by an "algorithm service." This disaggregated design is particularly well suited to integrating various AI-boosted cardinality and NDV algorithms. VIDEX has been successfully deployed at ByteDance, serving thousands of MySQL instances and millions of SQL queries daily for index optimization tasks.

We implemented VIDEX on MySQL 8.0 and now, thanks to a suggestion from Federico Razzoli (LinkedIn post 1 <https://www.linkedin.com/feed/update/urn:li:activity:7307955728613351424/>, LinkedIn post 2 <https://www.linkedin.com/posts/federicorazzoli_github-bytedancevidex-virtual-index-for-activity-7308254768194342912-owBu/>, GitHub issue <https://github.com/bytedance/videx/issues/1>), we'd like to adapt it to MariaDB. However, we've discovered significant differences between MySQL 8.0 and the latest MariaDB codebase. We're therefore seeking help from the MariaDB community and would greatly appreciate collaboration with experienced developers to modify the code; we're very willing to cooperate in this effort. In the future, we also hope to have the opportunity to add VIDEX to the official MariaDB plugin list.

*About VIDEX*

VIDEX GitHub repository: https://github.com/bytedance/videx/

Virtual index technology is already available in PostgreSQL, Oracle, and IBM DB2, and would provide MariaDB users with valuable capabilities for query optimization and database research. VIDEX offers key features that would benefit MariaDB users:

- What-if analysis for index strategies without production impact
- Support for integration of custom or AI-based estimation algorithms
- Accurate simulation of query plans, verified with complex benchmarks like TPC-H and JOB

If you're interested in collaborating or have guidance to offer on adapting our code to MariaDB, we'd greatly appreciate hearing from you. Thank you for your consideration.

Best regards,
Rong
ByteBrain Team, ByteDance

Hello,

A MariaDB query optimizer member here. I think it would be nice to get VIDEX for MariaDB.

As far as I'm informed, the basics of the interaction between the query optimizer and the storage engine are the same in MariaDB and MySQL:

- the ha_engine::info() call
- the ha_engine::records_in_range() call
- the extended keys feature

The execution part (index/table read methods) is also very similar, although there are some cosmetic differences. There are bigger differences in how partitioning is handled, in the data dictionary, and in column histograms. I think it's feasible to work around those.

To sum up, at first glance adaptation to MariaDB seems doable. I'd like to study VIDEX further and see where/how we MariaDB devs could lend a hand.

Best Regards,
--
Sergei Petrunia
Query Optimizer team
MariaDB Corporation

On Wed, Apr 16, 2025 at 09:41:16PM -0500, 康荣 via discuss wrote:

Dear Sergei Petrunia,

I'm delighted to see your interest in adapting VIDEX for MariaDB and am very willing to collaborate on this effort.

Regarding the interaction between the query optimizer and storage engine, your understanding is indeed comprehensive and accurate. Additionally, there might be some configuration related to index status, such as `handlerton->flags`. Furthermore, VIDEX primarily focuses on query optimizer-related operations, and thus skips all interfaces related to index queries (like `index_first`, `index_read`), which I believe could be skipped similarly in MariaDB.

I'm very open to discussing the best collaboration approach. Given your expertise on the MariaDB Query Optimizer team and your deep understanding of its internal interfaces, I would be especially grateful if you could consider contributing to the MariaDB plugin part of this adaptation. Your insights into MariaDB's architecture would be invaluable for efficiently modifying the VIDEX plugin to work seamlessly with MariaDB.

Besides the MariaDB plugin part, I can take responsibility for any necessary modifications to the VIDEX Statistics Server and the existing MySQL plugin. Our goal is to have the VIDEX statistics server provide a unified interface that's compatible with both MariaDB and MySQL, rather than forcing MariaDB to adapt to the existing MySQL interface. Of course, I could also handle validation work on various benchmarks (TPC-H, JOB, etc.) to ensure everything works properly.

Thank you again for your interest and support. I look forward to further discussing the details with you.

Best regards,
Rong
ByteBrain Team, ByteDance

Hi, Rong,

Could you please explain why you chose this particular architecture? What is the use case?

You refer to PostgreSQL, Oracle, and IBM DB2, but none of them involves talking to a separate server over a REST API, and none mentions "AI-boosted cardinality and ndv algorithms". So, what are "AI-boosted cardinality and ndv algorithms", and why a separate server?

Regards,
Sergei
Chief Architect, MariaDB Server
and security@mariadb.org

On Apr 16, Rong Kang via discuss wrote:

Dear Sergei Golubchik,

That's a good question! I would like to discuss the VIDEX architecture, which is inspired by industry practices at ByteDance, particularly in its cloud environment. Additionally, you may refer to the VIDEX arXiv paper, which presents further design considerations [1].

VIDEX aims to simulate the query plans of a target instance (with a large volume of data) in the absence of actual data. We categorize the information requested by the upper-layer optimizer from the underlying engine into two types:

1. **Simple statistic data**: This can be straightforwardly and efficiently obtained from the target instance (e.g., mysql.innodb_index_stats, information_schema.TABLES, etc.). It includes data such as table_rows, AVG_ROW_LENGTH, and clustered_index_size.
2. **Complex statistic data**: This includes statistics that cannot be directly obtained and usually require estimation, primarily the multi-column NDV (number of distinct values) estimation needed for records_per_key, and the multi-column cardinality estimation needed for records_in_range. Considering that what-if analysis might create indexes for any combination of columns (up to 2^N combinations), we must employ heuristic or AI-boosted algorithmic models for estimation, rather than enumerating and pre-storing all combinations.

## Q1: Why are AI-boosted cardinality and NDV algorithms required?

Simple statistics are directly logged into system tables. However, accurately estimating multi-column joint distributions poses significant challenges: rule-based or heuristic methods often result in inaccuracies (e.g., PostgreSQL's native implementation). Much recent research [2] employs query-driven or data-driven AI models to enhance the precision of NDV and cardinality estimates. The advantage of AI models is that they learn data distributions from data or queries without actually creating indexes or running statistics, providing more generalized and accurate estimates of cardinality and NDV.
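To make the first category concrete, the simple statistics can be pulled from standard MySQL system tables with plain SQL. Below is an illustrative sketch (the dict layout and the helper function are hypothetical; the queries themselves use standard `information_schema.TABLES` and `mysql.innodb_index_stats` columns, and the `%s` placeholder style assumes a DB-API driver such as PyMySQL):

```python
# Sketch: SQL to fetch "simple statistic data" from a target MySQL instance.
# Connection handling is elided; a real collector would also fetch
# per-index n_diff_pfx* rows for single-column NDV.
SIMPLE_STATS_QUERIES = {
    "table_stats": (
        "SELECT TABLE_ROWS, AVG_ROW_LENGTH, DATA_LENGTH "
        "FROM information_schema.TABLES "
        "WHERE TABLE_SCHEMA = %s AND TABLE_NAME = %s"
    ),
    "clustered_index_pages": (
        "SELECT stat_value FROM mysql.innodb_index_stats "
        "WHERE database_name = %s AND table_name = %s "
        "AND index_name = 'PRIMARY' AND stat_name = 'size'"
    ),
}

def fetch_simple_stats(cursor, schema, table):
    """Run each query with (schema, table) bound and keep the first row."""
    out = {}
    for name, sql in SIMPLE_STATS_QUERIES.items():
        cursor.execute(sql, (schema, table))
        out[name] = cursor.fetchone()
    return out
```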
## Q2: Why separate a RESTful statistic server from the VIDEX engine side?

Although much research has been conducted, most of it is not practically applicable in production. For instance, the Postgres cardinality hacking approach [3] requires researchers to write fixed-order cardinality results to a file, which the Postgres optimizer then reads. This method is not conducive to runtime operation. Moreover, AI researchers encounter substantial challenges when integrating AI models into database engines: on one hand, these engines have complex underlying interfaces; on the other, AI models frequently require Python and GPU environments.

We advocate for a separate RESTful statistic server that facilitates the seamless integration of various models (AI and non-AI) into the query plans of MariaDB/MySQL. This method has proven to be genuinely practical. Validated through extensive use in ByteDance's cloud service, VIDEX processes over 5,000 index recommendation tasks and tens of thousands of slow SQL queries daily. Furthermore, we have developed and successfully integrated multiple NDV and cardinality algorithms into VIDEX, accepted by VLDB and SIGMOD [4][5][6].

## Q3: How to balance estimation accuracy and architectural simplicity

In the current implementation, we forward all requests (whether simple or complex) to the server for simplicity. However, this involves an additional server module. Ideally, we believe VIDEX can strike a balance between estimation accuracy and architectural simplicity:

1. **Heuristic estimation mode**: All statistical data is stored in the database's statistic tables (e.g., table_rows, cluster_size, single-column NDV, single-column histograms). The VIDEX plugin uses simple assumptions (e.g., independence between columns, uniform distribution within a single column) to provide estimates for records_in_range and records_per_key.
2. **AI-boosted estimation mode**: To enhance accuracy, complex multi-column joint NDV and cardinality estimates are forwarded to a RESTful statistic server, which can be developed in Python, C, or Java and deployed in CPU or GPU environments.

[1] VIDEX design paper: Rong Kang, Shuai Wang, et al. VIDEX: A Disaggregated and Extensible Virtual Index for the Cloud and AI Era. https://arxiv.org/abs/2503.23776
[2] AI models in DB optimization: Yuxing Han et al. Cardinality Estimation in DBMS: A Comprehensive Benchmark Evaluation. Proc. VLDB 2021.
[3] PG cardinality hacking: https://github.com/Nathaniel-Han/End-to-End-CardEst-Benchmark/
[4] VIDEX NDV paper 1: Xianghong Xu, Tieying Zhang, et al. AdaNDV: Adaptive Number of Distinct Value Estimation via Learning to Select and Fuse Estimators. VLDB 2025 (accepted).
[5] VIDEX NDV paper 2: Xianghong Xu, Xiao He, et al. PLM4NDV: Minimizing Data Access for Number of Distinct Values Estimation with Pre-trained Language Models. SIGMOD 2025 (accepted).
[6] VIDEX cardinality paper: Peizhi Wu, Rong Kang, et al. Data-Agnostic Cardinality Learning from Imperfect Workloads. VLDB 2025 (accepted).

I hope this addresses your concerns. If you have any further questions, please feel free to continue the discussion.

Best,
Rong, ByteBrain Team, ByteDance
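For illustration, the "algorithm service" side of this disaggregated design can be sketched in a few lines of Python. The endpoint names, JSON request schema, and toy statistics below are hypothetical, not VIDEX's actual wire protocol; the responses use naive heuristics of the kind described under the heuristic estimation mode:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Toy statistics for one table; a real deployment would load these from
# the target instance or from a trained model.
STATS = {"customer": {"table_rows": 150_000, "ndv": {"c_custkey": 150_000}}}

def estimate(kind, table, request):
    """Return a row-count estimate for one optimizer callback."""
    stats = STATS.get(table, {"table_rows": 0, "ndv": {}})
    rows = stats["table_rows"]
    if kind == "records_per_key":
        # rows / NDV of the column, i.e. the average group size.
        return max(1, rows // stats["ndv"].get(request.get("column"), 1))
    if kind == "records_in_range":
        # Crude uniform assumption: caller supplies the key-range fraction.
        return max(1, int(rows * request.get("fraction", 0.1)))
    return rows  # e.g. an info()-style request for table_rows

class StatisticHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        body = json.loads(self.rfile.read(int(self.headers["Content-Length"])))
        value = estimate(self.path.lstrip("/"), body.get("table"), body)
        payload = json.dumps({"value": value}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(payload)

if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 5001), StatisticHandler).serve_forever()
```

The point of the split is visible in `estimate()`: swapping the naive heuristics for an AI model changes only this function, not the storage engine.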

Dear Sergey Petrunia and Sergei Golubchik,

I hope this message finds you well. I'm reaching out as a gentle follow-up to our previous discussion about adapting VIDEX for MariaDB compatibility. As we continue our preparation work, I wanted to seek your advice on any specific technical considerations or design adjustments we should address when adapting VIDEX to work with MariaDB's architecture.

I look forward to any guidance you might offer when you have a moment to respond. Thank you for your time.

Best regards,
Rong
ByteBrain Team, ByteDance

Dear Rong,

I've spent some time looking at Videx. I got it installed in a test environment, but I'd like to spend some more time to try it out and understand it better. I think I should be able to do this by the end of this week. Let me get back on this by then.

BR
Sergei Petrunia
--

On Mon, Apr 28, 2025 at 8:31 PM Rong Kang via discuss <discuss@lists.mariadb.org> wrote:

Dear Sergei Petrunia,

Thank you for your kind help and the update on your progress. I'm glad to hear things are moving forward. If you have any questions regarding the design or implementation of VIDEX, I'd be happy to assist. Additionally, I would greatly appreciate any suggestions for refactoring or optimizing the VIDEX plugin to better align with MariaDB.

Thank you again for your support.

Best regards,
Rong
ByteBrain Team, ByteDance

Hi Rong,

I am VP of product management @ MariaDB. I have a few follow-up questions for you:

1. What types of AI models have you seen work well in estimating cardinality and NDV, and how have you seen them work in production? I think what-if scenarios, set up in the correct way, could help AI models align to relatively well-optimized outputs, since data can be created to assist in AI training.
2. What are the limitations of VIDEX simulations in representing real-world database workloads that you have seen? What have you seen work in ensuring that AI models trained on VIDEX data generalize well to production environments?
3. Has ByteDance explored using LLMs or other natural language processing techniques in conjunction with VIDEX for database optimization tasks? If so, what were the outcomes, and are there recommendations for incorporating such models into an index automation system?

Thanks
Adam

Hi Adam,

Thank you for your thoughtful questions. I'm pleased to share our real-world experience at ByteDance, where VIDEX currently powers our internal index recommendation service, processing thousands of optimization tasks daily. We're also planning to launch it on our public cloud (https://www.volcengine.com/) within the next 1-2 quarters. Regarding your specific inquiries:

1. **AI models for cardinality and NDV estimation in production:** NDV estimation approaches fall into two categories: sampling-based and dataless. When partial data access is available, many classical NDV algorithms exist, though they typically excel only with specific distributions [1]. VIDEX employs AdaNDV (our work, accepted by VLDB'25 [1]), an adaptive approach that combines multiple NDV algorithms for optimal results. For users with strict privacy requirements or those needing rapid recommendations (<5s), we deploy PLM4NDV, our dataless solution accepted by SIGMOD'25 [2], which ranks among the leading approaches in this domain.

Cardinality estimation methods are generally classified as query-driven or data-driven. Data-driven methods (such as Naru and DeepDB) provide superior single-table cardinality accuracy but require greater preprocessing resources. For privacy-conscious cloud users or environments where full scans are restricted, query-driven methods like MSCN [4] are more appropriate. Our query-driven approach GRASP [3] has achieved state-of-the-art results and has been accepted at VLDB'25. For index recommendation in practice, we provide further details in point #2 below.

2. **Limitations and production generalization:** The primary challenge for VIDEX in production environments is accurately modeling multi-column joint distributions with limited data access. Our approach varies according to customer requirements:

- With sampling permission, we gather data via PK-based sampling and utilize AdaNDV for NDV estimation. We construct histograms for single-column cardinality estimation and employ correlation coefficients for multi-column cardinality.
- In zero-sampling scenarios, we rely on our pre-trained models (PLM4NDV and a dataless CardEst method).

Our testing across 5,000+ index recommendation tasks demonstrates that these approaches consistently outperform traditional sampling-based recommendations.

3. **Natural language models with VIDEX:** Regarding NDV, PLM4NDV (our SIGMOD 2025 paper) leverages pre-trained language models to extract semantic schema information without accessing actual data. This approach is particularly valuable in cloud environments where data access is restricted. Our models are pre-trained on thousands of public schema datasets, making them immediately applicable to new business scenarios without additional training. In terms of cardinality, we've achieved promising results using language models for entirely dataless cardinality estimation.

Thank you again for your interest. I welcome any additional questions regarding our research technology or business implementations.

[1] AdaNDV (our NDV work, VLDB 2025): Xu, X., Zhang, T., He, X., Li, H., Kang, R., Wang, S., ... & Chen, J. (2025). AdaNDV: Adaptive Number of Distinct Value Estimation via Learning to Select and Fuse Estimators.
[2] PLM4NDV (our language-model-based NDV work, SIGMOD 2025): Xu, X., He, X., Zhang, T., Zhang, L., Shi, R., & Chen, J. PLM4NDV: Minimizing Data Access for Number of Distinct Values Estimation with Pre-trained Language Models.
[3] GRASP (our query-driven cardinality work, VLDB 2025): Peizhi Wu, Rong Kang, Tieying Zhang, Jianjun Chen, Ryan Marcus, Zachary G. Ives. Data-Agnostic Cardinality Learning from Imperfect Workloads.
[4] MSCN: A. Kipf, T. Kipf, B. Radke, V. Leis, P. Boncz, and A. Kemper. "Learned Cardinalities: Estimating Correlated Joins with Deep Learning," Dec. 18, 2018. arXiv:1809.00677.

Best regards,
Rong
ByteBrain Team, ByteDance
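As a concrete instance of the sampling-based NDV estimation discussed above, here is the classic GEE estimator, one of the textbook baselines that adaptive methods such as AdaNDV select among and fuse. This is a sketch of that baseline only, not AdaNDV itself:

```python
import math
from collections import Counter

def gee_ndv(sample, table_rows):
    """GEE estimator (Charikar et al.): values seen exactly once in the
    sample are scaled up by sqrt(N/n); values seen twice or more are
    assumed to have been fully observed and are counted as-is."""
    n = len(sample)
    # f[i] = number of distinct values appearing exactly i times in the sample
    f = Counter(Counter(sample).values())
    f1 = f.get(1, 0)
    seen_often = sum(count for times, count in f.items() if times >= 2)
    return math.sqrt(table_rows / n) * f1 + seen_often
```

For example, a 4-row sample of all-distinct values from a 100-row table yields sqrt(100/4) * 4 = 20 estimated distinct values.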

Thank you Rong, great info! Will circle back if I have more questions. Appreciate the deep dive response!

Hi Rong,

Short: I've found a limitation in VIDEX which isn't mentioned on its homepage. Are you aware of this? The good part here is that when one hits the limitation the server returns an error (instead of silently producing a wrong query plan).

Long: I was looking at VIDEX and got this question: the MariaDB (and MySQL) optimizer may read individual table rows for constant tables, and further optimization makes use of the column values it has read. How does VIDEX deal with this situation? It looks like it doesn't.

A basic example using the test dataset. Assume VIDEX is set up as shown in README.md. Let's first try a query using a real table:

```sql
mysql> use tpch_tiny;
Database changed
mysql> explain select count(*) from customer,orders where c_custkey=o_custkey and c_custkey=150000;
+----+-------------+-------+------------+------+---------------+------+---------+------+------+----------+--------------------------------+
| id | select_type | table | partitions | type | possible_keys | key  | key_len | ref  | rows | filtered | Extra                          |
+----+-------------+-------+------------+------+---------------+------+---------+------+------+----------+--------------------------------+
|  1 | SIMPLE      | NULL  | NULL       | NULL | NULL          | NULL | NULL    | NULL | NULL |     NULL | no matching row in const table |
+----+-------------+-------+------------+------+---------------+------+---------+------+------+----------+--------------------------------+
1 row in set, 1 warning (0.00 sec)
```

Then, let's try VIDEX:

```sql
mysql> use videx_tpch_tiny;
Database changed
mysql> explain select count(*) from customer,orders where c_custkey=o_custkey and c_custkey=150000;
ERROR 1031 (HY000): Table storage engine for 'customer' doesn't have this option
```

One can say this query is not very meaningful, but that's just a basic example; one may construct more meaningful ones. Another example similar to the above:

```sql
mysql> explain select * from customer where c_custkey= (select max(c_custkey) from customer);
ERROR 1031 (HY000): Table storage engine for 'customer' doesn't have this option
```

BR
Sergei
--
Sergei Petrunia, Software Developer
MariaDB Corporation | Blog: http://petrunia.net

Hi Sergei:

Great observation! We've encountered this same issue internally. We'll add this limitation to the VIDEX README. Thank you.

We think this happens because MariaDB/MySQL don't completely separate query optimization from execution. When dealing with equality conditions on PK/UK columns, MySQL/MariaDB calls the index_init/index_read functions, which attempt to access actual data. For now, VIDEX simply returns `HA_ERR_WRONG_COMMAND`, resulting in the 1031 error.

We've explored several potential solutions:

1. Skip direct index execution by setting `subquery_to_derived=on`; it works on MySQL 8.0, e.g.:

```sql
SET optimizer_switch='subquery_to_derived=off';
explain select count(*) from customer,orders where c_custkey=o_custkey and c_custkey=150000;
-- (1031, "Table storage engine for 'customer' doesn't have this option")

SET optimizer_switch='subquery_to_derived=on';
explain select count(*) from customer,orders where c_custkey=o_custkey and c_custkey=150000;
+----+-------------+----------+------------+-------+---------------+------------+---------+-------+------+----------+-------------+
| id | select_type | table    | partitions | type  | possible_keys | key        | key_len | ref   | rows | filtered | Extra       |
+----+-------------+----------+------------+-------+---------------+------------+---------+-------+------+----------+-------------+
|  1 | SIMPLE      | customer | <null>     | const | PRIMARY       | PRIMARY    | 4       | const |    1 |    100.0 | Using index |
|  1 | SIMPLE      | orders   | <null>     | ref   | ORDERS_FK1    | ORDERS_FK1 | 4       | const |    1 |    100.0 | Using index |
+----+-------------+----------+------------+-------+---------------+------------+---------+-------+------+----------+-------------+
```

2. Keep the error as is; it's somewhat reasonable for index recommendation, since queries that can use a PK/UK often don't need additional indexes.
3. Let index_read directly return 0, indicating to the optimizer that no matching rows exist.
4. Provide virtual data in the buffer based on input conditions (this may involve complex implementation work).

We would appreciate your thoughts on the best solution path, for both MariaDB and MySQL.

Best,
Rong

Hi Rong,

I've discovered one possible issue. In MySQL, the records_in_range() call is defined as:

```cpp
virtual ha_rows records_in_range(uint inx [[maybe_unused]],
                                 key_range *min_key [[maybe_unused]],
                                 key_range *max_key [[maybe_unused]])
```

In MariaDB, it is:

```cpp
virtual ha_rows records_in_range(uint inx, const key_range *min_key,
                                 const key_range *max_key, page_range *res)
```

Note the page_range parameter. It is an "OUT" parameter, defined as:

```cpp
/* Store first and last leaf page accessed by records_in_range */
typedef struct st_page_range
{
  ulonglong first_page;
  ulonglong last_page;
} page_range;
```

The storage engine provides the SQL layer with information about how "far apart" the range endpoints are in the index. The SQL layer uses that information when computing the overall costs in handler::multi_range_read_info_const().

The commit that added this in MariaDB:
http://github.com/mariadb/server/commit/f36ca142f7fa59598e34e842b60372b96306...

I'm not sure how big the impact of this is.

BR
Sergei
--
Sergei Petrunia, Software Developer
MariaDB Corporation | Blog: http://petrunia.net
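To illustrate why the optimizer might care about this OUT parameter, here is a toy model of how a leaf-page span could feed into a range-scan cost. This is purely illustrative, under assumed names and a made-up cost shape, and is not MariaDB's actual formula in handler::multi_range_read_info_const():

```python
def range_io_cost(rows, first_page, last_page, rows_per_page=100):
    """Toy model: the number of leaf pages a range scan reads is bounded
    both by the density of matching rows (rows / rows_per_page) and by
    the physical span the engine reported via page_range. A tight span
    (first_page close to last_page) caps the cost from above."""
    pages_spanned = last_page - first_page + 1
    pages_by_density = max(1, rows // rows_per_page)
    return min(pages_spanned, pages_by_density)
```

Under this model, 1000 matching rows squeezed into pages 10..12 cost 3 page reads, while the same 1000 rows estimated without span information would cost 10; that is the kind of refinement the extra parameter makes possible.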

Hi Rong, It might not be important for the first test implementation, but I'm wondering, does VIDEX intend to handle multi-column hypothetical indexes? I see that videx_build_env.py only collects single-column histograms. If one has histograms on columns A, B, and then tries to produce a multi-column records_in_range estimate for INDEX(A, B), the results may be way off. (and, as a consequence, the page_range problem of records_in_range() would be unimportant in comparison). Is there any way to collect more statistics than what videx_build_env.py does? p.s. I've got a couple more things I'd like to check. I hope I'll get back with those soon. BR Sergei -- Sergei Petrunia, Software Developer MariaDB Corporation | Blog: http://petrunia.net

Hi Sergei:

You mention a very important algorithmic task: "multi-column cardinality estimation". This is a challenge that all what-if analysis databases need to address.

In the current VIDEX open-source version, we released a simple solution that assumes independence between columns, so the combined selectivity is the product of the per-column selectivities: Card(AB) = total_rows * (Card(A)/total_rows) * (Card(B)/total_rows). This performs well on benchmarks like TPC-H but tends to under-estimate in more complex scenarios.

In ByteDance's production environment, when sampling is permitted, we pre-collect up to 100k rows covering all columns that appear in query conditions, and estimate joint cardinality based on this sample. Without sampling, we use a pre-trained language model: faster but coarser. We are currently preparing a paper on this work and will release it in the future.

Nevertheless, existing methods still don't perfectly solve this problem. That's why we've opened the algorithm interfaces and welcome research contributions.

Best,
Rong
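The independence baseline described above can be written out in a few lines. This is an illustrative sketch of the assumption only (VIDEX's actual implementation lives in its statistics server):

```python
def independent_cardinality(total_rows, *column_cards):
    """Multi-column cardinality under the independence assumption:
    multiply the per-column selectivities Card(X)/N, then scale back to
    rows, i.e. Card(A,B) = N * (Card(A)/N) * (Card(B)/N)."""
    selectivity = 1.0
    for card in column_cards:
        selectivity *= card / total_rows
    return total_rows * selectivity
```

For example, with 1000 rows where condition A matches 100 rows and condition B matches 50, the combined estimate is 1000 * 0.1 * 0.05 = 5 rows; any positive correlation between A and B makes this an under-estimate, which is the failure mode noted above.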

Hi Sergei:

About the `page_range` parameter: I'm trying to understand the rationale behind this addition. As I understand it, records_in_range() calculates how many rows match the given condition; if so, the existing parameters (index, min_key, max_key) seem sufficient for cardinality estimation.

If my understanding is correct, `page_range` isn't information provided to the storage engine, but rather data that the engine populates and returns to the optimizer? For example, when the engine reports that first_page and last_page are the same, the index cost might be calculated as slightly lower?

This is quite an interesting feature. For our first phase, I think we could temporarily ignore it without causing significant estimation errors. After the first version, I'd like to create examples where page_range is effective and enhance VIDEX's support for page_range. What do you think? :)

Best,
Rong
participants (6)
- Adam Luciano
- Rong Kang
- Sergei Golubchik
- Sergei Petrunia
- Sergey Petrunia
- 康荣 (Rong Kang)