Re: Inquiry Regarding Slow Index Creation with MariaDB 11.7 RC (SIFT1M)
Hi, kase,

First: please send questions like this to discuss@lists.mariadb.org. It's a public mailing list dedicated to MariaDB and how to use it better. I am subscribed, so I'll see your mail there, and you may be sure I will, because it won't be accidentally caught by my spam filter or sorted into some obscure folder. Furthermore, other subscribers will see your question and can reply if I am not available (e.g. I could be travelling).

Thank you.

On Jan 12, kase jojo wrote:
> Dear Sir
>
> I hope you are doing well. I recently read your blog https://mariadb.com/resources/blog/how-fast-is-mariadb-vector/ and was particularly impressed by the efficient index-building times demonstrated in your tests. However, when I attempted similar experiments on MariaDB 11.7 RC, using the SIFT1M dataset and building an index with M=32, I noticed that the index creation process was much slower than expected.
>
> In my case, I have been inserting data into the table gradually, and I wanted to inquire about the process you mentioned in your blog: "We build the index slowly as we insert the data row by row." Could you clarify how this process works? Specifically, I am curious to know if there are any steps or techniques you followed to ensure such efficient index construction, as it seems to differ from my experience.

To get faster inserts you need to use a smaller M. Try M=8, for example. It will reduce the recall, and you'll need to increase ef_search to compensate for that.

Look at it this way: MariaDB needs to do the work to get good recall. It has to do it *somewhere*. But you can decide where to do it. MariaDB can spend more time doing inserts, build a better index, and search in it quickly. Or it can insert faster, the index will be of worse quality, and it'll need to spend more time searching in it. It's a trade-off and you decide what is more important for your application.

See https://github.com/vuvova/ann-benchmarks/blob/dev/ann_benchmarks/algorithms/...

For faster inserts you can use M=8 and ef_search=800.

Of course, always make sure that the mhnsw_max_cache_size is big enough to hold your entire data set. SIFT1M is rather small, 300M should likely be enough. I'd use at least mhnsw_max_cache_size=1G to be safe, it's an upper limit, MariaDB won't use more memory than necessary anyway.

Regards,
Sergei
Chief Architect, MariaDB Server
and security@mariadb.org
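
The settings suggested in the reply above could be written out roughly as follows. This is only a sketch under stated assumptions, not the configuration used in the blog post: the table and column names (sift1m, id, v) are hypothetical, the session variable name mhnsw_ef_search is assumed to be what "ef_search" refers to, and setting mhnsw_max_cache_size at runtime is assumed to work on your build (otherwise put it in the server configuration file).

```sql
-- Hypothetical table for SIFT1M: 128-dimensional vectors, one row per vector.
-- M=8 trades index quality for faster inserts, as described in the reply.
CREATE TABLE sift1m (
  id INT PRIMARY KEY,
  v  VECTOR(128) NOT NULL,
  VECTOR INDEX (v) M=8
);

-- Upper limit for the in-memory vector index cache: 1G, in bytes.
-- It is a limit, not a reservation.
SET GLOBAL mhnsw_max_cache_size = 1073741824;

-- Search more candidates per query to compensate for the smaller M.
-- (Variable name assumed to correspond to "ef_search" in the reply.)
SET SESSION mhnsw_ef_search = 800;

-- Example nearest-neighbour query, run after the data has been inserted.
-- The query vector is taken from an existing row purely for illustration.
SET @q = (SELECT v FROM sift1m WHERE id = 1);

SELECT id
FROM sift1m
ORDER BY VEC_DISTANCE_EUCLIDEAN(v, @q)
LIMIT 10;
```

With M=32 the same table definition would build a better index at the cost of slower inserts, and a much smaller mhnsw_ef_search would then usually be enough for comparable recall.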
participants (1)
-
Sergei Golubchik