Vector indexing questions for deep-image-96-angular dataset

Greetings! I am doing some benchmarking on MariaDB and got to the deep-image-96-angular dataset from https://github.com/erikbern/ann-benchmarks. This dataset has 9.99 million vectors of dimension 96. I am running the vector index create with M=24,ef_construction = 200 (similar parameters as pgvector in ANN-Benchmark). I wanted to check and confirm a couple of observations based on the progress of the index create at present - MariaDB [(none)]> show processlist; +----+--------+-----------+------+---------+------+-------------------+----------------------------------------------------------+----------+ | Id | User | Host | db | Command | Time | State | Info | Progress | +----+--------+-----------+------+---------+------+-------------------+----------------------------------------------------------+----------+ | 4 | ubuntu | localhost | ann | Query | 4092 | copy to tmp table | ALTER TABLE t1 ADD VECTOR INDEX (v) M=24 DISTANCE=cosine | 9.636 | MariaDB [(none)]> select trx_rows_modified from information_schema.innodb_trx; +-------------------+ | trx_rows_modified | +-------------------+ | 125215100 | +-------------------+ 1 row in set (0.004 sec) -rw-rw---- 1 ubuntu ubuntu 40420507648 Feb 19 06:20 undo002 Can you please confirm that the undo log reaching 40GB+ is expected for a progress of 9%? Did I miss something? I have configured the hnsw cache size and buffer pool to 16GB each. Should I increase them further? I want to benchmark for "hnsw index fits in memory" use-case. MariaDB [(none)]> show variables like '%hnsw%'; +------------------------+-------------+ | Variable_name | Value | +------------------------+-------------+ | mhnsw_default_distance | euclidean | | mhnsw_default_m | 24 | | mhnsw_ef_search | 200 | | mhnsw_max_cache_size | 17179869184 | +------------------------+-------------+ 4 rows in set (0.001 sec) Just to let know, I am working on the vector plugin for MySQL (MyVector) Thanks!

The index build completed in around 4 hours. The undo log reached ~ 107GB -. The base table & index ibd files are in 4GB-6GB range -rw-rw---- 1 ubuntu ubuntu 107739086848 Feb 19 09:18 undo002 -rw-rw---- 1 ubuntu ubuntu 6287261696 Feb 19 09:43 t1#i#01.ibd -rw-rw---- 1 ubuntu ubuntu 1859 Feb 19 05:08 t1.frm -rw-rw---- 1 ubuntu ubuntu 4500488192 Feb 19 09:43 t1.ibd Let me know if i need to tune any other memory related parameter

Hi, shiyer22---, On Feb 19, shiyer22--- via discuss wrote:
Greetings!
I am doing some benchmarking on MariaDB and got to the deep-image-96-angular dataset from https://github.com/erikbern/ann-benchmarks.
This dataset has 9.99 million vectors of dimension 96. I am running the vector index create with M=24,ef_construction = 200 (similar parameters as pgvector in ANN-Benchmark). I wanted to check and confirm a couple of observations based on the progress of the index create at present -
ef_construction is not configurable in MariaDB, ef_search is, but it has no effect on the index construction.
MariaDB [(none)]> show processlist; +----+--------+-----------+------+---------+------+-------------------+----------------------------------------------------------+----------+ | Id | User | Host | db | Command | Time | State | Info | Progress | +----+--------+-----------+------+---------+------+-------------------+----------------------------------------------------------+----------+ | 4 | ubuntu | localhost | ann | Query | 4092 | copy to tmp table | ALTER TABLE t1 ADD VECTOR INDEX (v) M=24 DISTANCE=cosine | 9.636 |
MariaDB [(none)]> select trx_rows_modified from information_schema.innodb_trx; +-------------------+ | trx_rows_modified | +-------------------+ | 125215100 | +-------------------+ 1 row in set (0.004 sec)
-rw-rw---- 1 ubuntu ubuntu 40420507648 Feb 19 06:20 undo002
Can you please confirm that the undo log reaching 40GB+ is expected for a progress of 9%? Did I miss something? I have configured the hnsw cache size and buffer pool to 16GB each. Should I increase them further? I want to benchmark for "hnsw index fits in memory" use-case.
with the index being ~6GB range, you shouldn't need notably more than than for mhnsw_max_cache_size. If you'll watch the mariadbd memory usage in top, you'll likely see that it'll grow by 6GB and then will stop, even if mhnsw_max_cache_size will be much larger than that. There're no optimizations for bulk index construction at the moment, so your ALTER TABLE means many millions of updates internally, all done within one transaction. It seems to be possible for the undo log to grow that large.
MariaDB [(none)]> show variables like '%hnsw%'; +------------------------+-------------+ | Variable_name | Value | +------------------------+-------------+ | mhnsw_default_distance | euclidean | | mhnsw_default_m | 24 | | mhnsw_ef_search | 200 | | mhnsw_max_cache_size | 17179869184 | +------------------------+-------------+ 4 rows in set (0.001 sec)
Just to let know, I am working on the vector plugin for MySQL (MyVector)
Regards, Sergei Chief Architect, MariaDB Server and security@mariadb.org

Thanks Sergei! Is it possible to let me know the internally decided ef_construction value? That will help me in putting recall results in context with other benchmarks. Is the ef_construction computed based on M / dimension etc?

Hi! The implementation of the HNSW algorithm in MariaDB is not identical to the paper (hence the m from modified or MariaDB in mHNSW). You should *not* try to compare against equal M / ef_search values of other implementations, but rather compare QPS for a particular Recall target (say 95%, 98%, 99%) That will give you an apples-to-apples comparison. Vicențiu On Wed, 19 Feb 2025 at 14:46, shiyer22--- via discuss < discuss@lists.mariadb.org> wrote:
Thanks Sergei! Is it possible to let me know the internally decided ef_construction value? That will help me in putting recall results in context with other benchmarks. Is the ef_construction computed based on M / dimension etc? _______________________________________________ discuss mailing list -- discuss@lists.mariadb.org To unsubscribe send an email to discuss-leave@lists.mariadb.org
participants (3)
-
Sergei Golubchik
-
shiyer22@gmail.com
-
Vicențiu Ciorbaru