
Hi, shiyer22---, On Feb 19, shiyer22--- via discuss wrote:
Greetings!
I am doing some benchmarking on MariaDB and got to the deep-image-96-angular dataset from https://github.com/erikbern/ann-benchmarks.
This dataset has 9.99 million vectors of dimension 96. I am running the vector index create with M=24,ef_construction = 200 (similar parameters as pgvector in ANN-Benchmark). I wanted to check and confirm a couple of observations based on the progress of the index create at present -
ef_construction is not configurable in MariaDB, ef_search is, but it has no effect on the index construction.
MariaDB [(none)]> show processlist; +----+--------+-----------+------+---------+------+-------------------+----------------------------------------------------------+----------+ | Id | User | Host | db | Command | Time | State | Info | Progress | +----+--------+-----------+------+---------+------+-------------------+----------------------------------------------------------+----------+ | 4 | ubuntu | localhost | ann | Query | 4092 | copy to tmp table | ALTER TABLE t1 ADD VECTOR INDEX (v) M=24 DISTANCE=cosine | 9.636 |
MariaDB [(none)]> select trx_rows_modified from information_schema.innodb_trx; +-------------------+ | trx_rows_modified | +-------------------+ | 125215100 | +-------------------+ 1 row in set (0.004 sec)
-rw-rw---- 1 ubuntu ubuntu 40420507648 Feb 19 06:20 undo002
Can you please confirm that the undo log reaching 40GB+ is expected for a progress of 9%? Did I miss something? I have configured the hnsw cache size and buffer pool to 16GB each. Should I increase them further? I want to benchmark for "hnsw index fits in memory" use-case.
with the index being ~6GB range, you shouldn't need notably more than than for mhnsw_max_cache_size. If you'll watch the mariadbd memory usage in top, you'll likely see that it'll grow by 6GB and then will stop, even if mhnsw_max_cache_size will be much larger than that. There're no optimizations for bulk index construction at the moment, so your ALTER TABLE means many millions of updates internally, all done within one transaction. It seems to be possible for the undo log to grow that large.
MariaDB [(none)]> show variables like '%hnsw%'; +------------------------+-------------+ | Variable_name | Value | +------------------------+-------------+ | mhnsw_default_distance | euclidean | | mhnsw_default_m | 24 | | mhnsw_ef_search | 200 | | mhnsw_max_cache_size | 17179869184 | +------------------------+-------------+ 4 rows in set (0.001 sec)
Just to let know, I am working on the vector plugin for MySQL (MyVector)
Regards, Sergei Chief Architect, MariaDB Server and security@mariadb.org