Hi, Rinat! On Oct 21, Rinat Ibragimov wrote:
+ + // Avoid creating partial n-grams by stopping at word boundaries. + // Underscore is treated as a word character too, since identifiers in + // programming languages often contain them. + if (!is_alnum(char_type) && !is_underscore(cs, next, end)) { + start= next + char_len; + next= start; + n_chars= 0; + continue;
I don't think it's a good idea. On the opposite, I would rather create ngrams over word boundaries, so that, say, I'd split "n-gram plugin" to
"n-g", "-gr", "gra", "ram", "am ", "m p", " pl", "plu", "lug", "ugi", "gin".
I see how this can increase index size: more different strings are included. Implementation will become a tiny bit simpler. But can it make searching better somehow?
Yes, the idea is that it'll rank "One can use n-gram plugin in MariaDB fulltext search" higher than "Plugins are useful for extending functionality of a program" because some ngrams encode that "n-gram" and "plugin" words must be together as a phrase.
I can't decide. From my point of view, the current approach is fine. Please pick a variant, and I'll try to implement that.
No, I cannot guess which approach will produce more relevant searches. Implement something and then we test what works better
+ } + + next += char_len; + n_chars++; + + if (n_chars == ngram_token_size) { + bool_info->position= static_cast<uint>(start - doc);
nope, we don't have MYSQL_FTPARSER_FULL_BOOLEAN_INFO::position and this plugin is definitely not the reason to break the API and add it.
There is a patch in the patchset that adds the "position" field. Without that field, it's possible to trigger an assertion in InnoDB's FTS code by executing:
INSTALL PLUGIN simple_parser SONAME 'mypluglib'; CREATE TABLE articles( a TEXT DEFAULT NULL, b TEXT DEFAULT NULL, FULLTEXT (a, b) WITH PARSER simple_parser ) ENGINE=InnoDB; INSERT INTO articles VALUES ('111', '1234 1234 1234'); DROP TABLE articles; UNINSTALL PLUGIN simple_parser;
Same assertion may be triggered by any fts plugin. Are you sure you want that patch to be removed?
I don't think InnoDB assert is a reason to extend the fulltext plugin API. What about this fix instead: diff --git a/storage/innobase/fts/fts0fts.cc b/storage/innobase/fts/fts0fts.cc --- a/storage/innobase/fts/fts0fts.cc +++ b/storage/innobase/fts/fts0fts.cc @@ -4669,7 +4669,7 @@ fts_tokenize_add_word_for_parser( ut_ad(boolean_info->position >= 0); position = boolean_info->position + fts_param->add_pos; */ - position = fts_param->add_pos; + position = fts_param->add_pos++; fts_add_token(result_doc, str, position); this fixed the crash for me.
+ n_chars= ngram_token_size - 1; + is_first= false; + } + } + + // If nothing was emitted yet, emit the remainder.
what reminder, what is it for?
It's "remainder", with "a" before "i". Remainder of the processed string. That way strings that are too short for having even a single n-gram, will still generate some index entries and will be discoverable by exact queries.
Perhaps the comment should say it? That if the string too short, it'll be emitted here?
+ case MYSQL_FTPARSER_FULL_BOOLEAN_INFO: + // Reusing InnoDB's boolean mode parser to handle all special characters. + while (fts_get_word(cs, &start, end, &word, &bool_info)) {
nope. InnoDB might not even be compiled it. Use param->mysql_parse()
Copied fts_get_word() implementation into the plugin code.
As far as I understand, param->mysql_parse() cannot be used here as it splits a string into words like default fulltext search does. This plugin however must generate n-grams.
Of course, it can. Note that fts_get_word() doesn't generate n-grams either, it gets the whole word and the n-gram plugin later splits it into n-grams. Similarly param->mysql_parse() will extract words for you and you'll split them into n-grams. Regards, Sergei VP of MariaDB Server Engineering and security@mariadb.org