
Hi Adam,

Thank you for your thoughtful questions. I'm pleased to share our real-world experience at ByteDance, where VIDEX currently powers our internal index recommendation service and processes thousands of optimization tasks daily. We also plan to launch it on our public cloud (https://www.volcengine.com/) within the next 1-2 quarters. Regarding your specific inquiries:

1. **AI models for cardinality and NDV estimation in production:**

   NDV estimation approaches fall into two categories: sampling-based and dataless. When partial data access is available, many classical NDV estimators exist, though each typically excels only under specific data distributions [1]. VIDEX employs AdaNDV (our work accepted at VLDB 2025 [1]), an adaptive approach that learns to select and fuse multiple NDV estimators (a rough sketch of one such classical estimator follows point 2 below). For users with strict privacy requirements, or those needing rapid recommendations (<5s), we deploy PLM4NDV, our dataless solution accepted at SIGMOD 2025 [2], which ranks among the leading approaches in this domain.

   Cardinality estimation methods are generally classified as query-driven or data-driven. Data-driven methods (such as Naru and DeepDB) provide superior single-table accuracy but require more preprocessing resources. For privacy-conscious cloud users, or environments where full scans are restricted, query-driven methods like MSCN [4] are more appropriate. Our query-driven approach GRASP [3] has achieved state-of-the-art results and has been accepted at VLDB 2025. For how this plays out in index recommendation in practice, see point 2 below.

2. **Limitations and production generalization:**

   The primary challenge for VIDEX in production environments is accurately modeling multi-column join distributions with limited data access. Our approach varies according to customer requirements:

   - With sampling permission, we gather data via PK-based sampling and use AdaNDV for NDV estimation. We build histograms for single-column cardinality estimates and apply correlation coefficients for multi-column cardinality (see the corresponding sketch below this list).
   - In zero-sampling scenarios, we rely on our pre-trained models (PLM4NDV and a dataless CardEst method). Our testing across 5,000+ index recommendation tasks demonstrates that these approaches consistently outperform traditional sampling-based recommendations.
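To make the sampling-based NDV side of point 1 concrete, here is a minimal sketch of one classical estimator of the kind AdaNDV selects and fuses: the GEE estimator of Charikar et al. The function name and the toy usage are my own illustration, not part of VIDEX's API.

```python
from collections import Counter

def gee_ndv_estimate(sample, table_rows):
    """Classical GEE estimator (Charikar et al., 2000) -- illustrative only.

    sample      -- list of sampled column values (size n)
    table_rows  -- total number of rows N in the table

    GEE scales the count of values seen exactly once by sqrt(N/n)
    and adds the values seen two or more times as-is:
        D_hat = sqrt(N/n) * f1 + sum_{i>=2} f_i
    """
    n = len(sample)
    if n == 0:
        return 0.0
    # f[i] = number of distinct values occurring exactly i times in the sample
    freq_of_freq = Counter(Counter(sample).values())
    f1 = freq_of_freq.get(1, 0)
    f_rest = sum(count for mult, count in freq_of_freq.items() if mult >= 2)
    return (table_rows / n) ** 0.5 * f1 + f_rest


# Toy usage: a skewed column sampled from a 1M-row table
sample = ["a", "a", "a", "b", "b", "c", "d", "e"]
print(gee_ndv_estimate(sample, table_rows=1_000_000))
```

Estimators of this kind are accurate under some distributions and badly biased under others, which is exactly the gap AdaNDV addresses by learning which estimators to trust for a given sample [1].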
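For point 2's histogram-plus-correlation bullet, here is a simplified sketch of combining per-column equi-depth histogram selectivities with a correlation coefficient for a two-column conjunctive predicate. The linear blend between the independence estimate and the fully-correlated bound, and all names here, are an illustrative assumption of mine rather than VIDEX's actual formula.

```python
import numpy as np

def equi_depth_selectivity(column_sample, low, high, n_buckets=32):
    """Estimate P(low <= col <= high) from an equi-depth histogram
    built over a (PK-based) sample of the column."""
    edges = np.quantile(column_sample, np.linspace(0.0, 1.0, n_buckets + 1))
    covered = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        if hi < low or lo > high:              # bucket misses the predicate range
            continue
        width = max(hi - lo, 1e-12)
        overlap = min(hi, high) - max(lo, low)
        covered += max(overlap, 0.0) / width   # fraction of this bucket covered
    return covered / n_buckets                 # each bucket holds ~1/n_buckets of rows


def combined_selectivity(sel_a, sel_b, corr):
    """Blend the independence estimate with the fully-correlated bound,
    weighted by |corr| (Pearson correlation measured on the sample).
    This linear blend is an illustrative heuristic."""
    independent = sel_a * sel_b
    fully_correlated = min(sel_a, sel_b)
    w = min(abs(corr), 1.0)
    return (1.0 - w) * independent + w * fully_correlated


# Toy usage on two correlated columns sampled from a table
rng = np.random.default_rng(0)
col_a = rng.normal(size=10_000)
col_b = 0.8 * col_a + 0.2 * rng.normal(size=10_000)

sel_a = equi_depth_selectivity(col_a, 0.0, 1.0)
sel_b = equi_depth_selectivity(col_b, 0.0, 1.0)
corr = np.corrcoef(col_a, col_b)[0, 1]
print(sel_a, sel_b, combined_selectivity(sel_a, sel_b, corr))
```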
3. **Natural language models with VIDEX:**

   Regarding NDV, PLM4NDV (our SIGMOD 2025 paper [2]) leverages pre-trained language models to extract semantic schema information without accessing actual data. This approach is particularly valuable in cloud environments where data access is restricted. Our models are pre-trained on thousands of public schema datasets, making them immediately applicable to new business scenarios without additional training. In terms of cardinality, we've achieved promising results using language models for entirely dataless cardinality estimation.

Thank you again for your interest. I welcome any additional questions regarding our research, technology, or business implementations.

- [1] AdaNDV (our NDV work, VLDB 2025): Xu, X., Zhang, T., He, X., Li, H., Kang, R., Wang, S., ... & Chen, J. (2025). AdaNDV: Adaptive Number of Distinct Value Estimation via Learning to Select and Fuse Estimators.
- [2] PLM4NDV (our language-model-based NDV work, SIGMOD 2025): Xu, X., He, X., Zhang, T., Zhang, L., Shi, R., & Chen, J. (2025). PLM4NDV: Minimizing Data Access for Number of Distinct Values Estimation with Pre-trained Language Models.
- [3] GRASP (our query-driven cardinality work, VLDB 2025): Wu, P., Kang, R., Zhang, T., Chen, J., Marcus, R., & Ives, Z. G. (2025). Data-Agnostic Cardinality Learning from Imperfect Workloads.
- [4] MSCN: Kipf, A., Kipf, T., Radke, B., Leis, V., Boncz, P., & Kemper, A. (2018). Learned Cardinalities: Estimating Correlated Joins with Deep Learning. arXiv:1809.00677. doi: 10.48550/arXiv.1809.00677.

Best regards,
Rong
ByteBrain Team, ByteDance