[MariaDB discuss] Re: VIDEX: multi-column hypothetical indexes?

6 May 2025

Hi Sergei:

You raise a very important algorithmic problem: multi-column cardinality estimation. Every database that supports what-if analysis has to address it.

In the current open-source version of VIDEX, we released a simple solution that assumes independence between columns, so Card(AB) = total_row * (Card(A)/total_row) * (Card(B)/total_row). This performs well on benchmarks like TPC-H but tends to underestimate in more complex scenarios, for example when the columns are correlated.
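To make the independence assumption concrete, here is a minimal sketch in Python. It is not VIDEX's actual interface: the function name and parameters are hypothetical. It treats each Card(X)/total_row as a per-column selectivity, multiplies the selectivities, and scales the product by the table's row count to get an estimated number of matching rows.

```python
def independent_joint_cardinality(total_rows, per_column_cards):
    """Estimate rows matching all predicates, assuming independent columns.

    total_rows       -- number of rows in the table
    per_column_cards -- estimated matching rows for each single-column
                        predicate (hypothetical inputs, for illustration)
    """
    if total_rows <= 0:
        return 0.0
    selectivity = 1.0
    for card in per_column_cards:
        # Each ratio is the fraction of rows passing one predicate.
        selectivity *= card / total_rows
    # Scale the joint selectivity back to a row count.
    return total_rows * selectivity


# Example: 1000 rows, predicate on A matches 100, predicate on B matches 50.
estimate = independent_joint_cardinality(1000, [100, 50])
```

If A and B are strongly correlated (say, every row matching the predicate on B also matches the one on A), the true count would be 50 rather than roughly 5, which is the kind of underestimate described above.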

In ByteDance's production environment, when sampling is permitted, we pre-collect up to 100k rows covering all columns that appear in query conditions, and estimate joint cardinality from this sample. When sampling is not permitted, we use a pre-trained language model instead, which is faster but coarser. We are currently preparing a paper on this work and will release it in the future.
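As a rough illustration of the sample-based idea (not our production code; the function name, the row representation, and the predicate format are all assumptions made for this sketch), joint cardinality can be estimated by counting the sampled rows that satisfy every predicate and scaling that fraction to the full table:

```python
def sample_joint_cardinality(sample_rows, total_rows, predicates):
    """Estimate rows matching all predicates from a row sample.

    sample_rows -- list of dicts, one per sampled row (column -> value)
    total_rows  -- number of rows in the full table
    predicates  -- dict mapping column name -> callable(value) -> bool
    """
    if not sample_rows:
        return 0.0
    matches = sum(
        1
        for row in sample_rows
        if all(pred(row[col]) for col, pred in predicates.items())
    )
    # Fraction of the sample that matches, scaled to the table size.
    return total_rows * matches / len(sample_rows)


# Two perfectly correlated columns: b always equals a.
sample = [{"a": i % 10, "b": i % 10} for i in range(100)]
estimate = sample_joint_cardinality(
    sample,
    total_rows=1000,
    predicates={"a": lambda v: v == 3, "b": lambda v: v == 3},
)
```

Because the predicates are evaluated together on the same rows, correlation between columns is captured directly. The price is that very selective predicates may match few or no sampled rows, so the estimate becomes noisy.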

Nevertheless, existing methods still don't solve this problem perfectly. That's why we've opened up the algorithm interfaces and welcome research contributions.

Best,
Rong

Rong Kang