Fulltext index tokenization rules / DOC_ID
Hi all, first question: is there a document that describes the rules how MariaDB builds the fulltext index words? I set "innodb_ft_min_token_size=1" to include also one letter words in the index. But this does not seem to work for words like "D'Artagnan": the "d" letter is not found in "information_schema.INNODB_FT_INDEX_TABLE". It works for "D Artagnan". So I assume that single letters that are followed by a quote are ignored when building the index? You could also post the link to the source code file, but I don't know whether my C knowledge is sufficient ;-) Second question: the doc (https://mariadb.com/kb/en/information-schema-innodb_ft_index_table-table/) states that "information_schema.INNODB_FT_INDEX_TABLE.doc_id" matches to the id column of the original table: "DOC_ID: Document ID of the newly added row, either an appropriate ID column or an internal InnoDB value." But this did not happen for me - the doc_id values do not match my ID column. I tried the datatypes "int" and "bigint". Best regards Wolfgang
Question 1: this was changed by https://jira.mariadb.org/browse/MDEV-20797 Question 2: still open. I asked the same here: https://stackoverflow.com/questions/78969640/mariadb-fulltext-index-tokeniza... Best regards Wolfgang
Hi, wknauf---, On Sep 12, wknauf--- via discuss wrote:
Question 1: this was changed by https://jira.mariadb.org/browse/MDEV-20797
It was https://github.com/MariaDB/server/commit/f076dc2f667d9270ca08421d466dcbc3527... and the commit message is - InnoDB should ignore the single word followed by apostrophe while tokenising the document. Example is that if the input string is O'brien then right now, InnoDB seperates into two tokens as O, brien. But after this patch, InnoDB can ignore the token 'O' and consider only 'brien'. Does it help?
Question 2: still open. I asked the same here: https://stackoverflow.com/questions/78969640/mariadb-fulltext-index-tokeniza...
I'll let InnoDB developer answer that Regards, Sergei Chief Architect, MariaDB Server and security@mariadb.org
On Thu, Sep 12, 2024 at 12:47 PM wknauf--- via discuss < discuss@lists.mariadb.org> wrote:
Question 1: this was changed by https://jira.mariadb.org/browse/MDEV-20797 Question 2: still open. I asked the same here: https://stackoverflow.com/questions/78969640/mariadb-fulltext-index-tokeniza...
To answer your question 2: If an FTS_DOC_ID column is not defined, InnoDB automatically adds a hidden FTS_DOC_ID column when the full-text index is created. If you create a full-text index at CREATE TABLE time and do not specify an FTS_DOC_ID column, InnoDB adds a hidden FTS_DOC_ID column. Example: CREATE TABLE t1 ( ID INT UNSIGNED AUTO_INCREMENT NOT NULL PRIMARY KEY, title VARCHAR(200), FULLTEXT(title) ) ENGINE = InnoDB; INSERT INTO t1(title) VALUES('mysql'); INSERT INTO t1(title) VALUES('database'); .... SELECT * FROM t1; ID title 1 mysql 2 database SET GLOBAL innodb_ft_aux_table="test/t1"; SELECT WORD, DOC_ID FROM INFORMATION_SCHEMA.INNODB_FT_INDEX_TABLE; WORD DOC_ID database 3 mysql 2 Here the hidden FTS_DOC_ID value is 2 and 3 for mysql and database. It doesn't have to match with ID column in the table. Another way to create your own FTS_DOC_ID column, the column must be defined as BIGINT UNSIGNED NOT NULL and named FTS_DOC_ID (all uppercase) Example: CREATE TABLE t1 ( FTS_DOC_ID BIGINT UNSIGNED AUTO_INCREMENT NOT NULL PRIMARY KEY, title VARCHAR(200), FULLTEXT(title) ) ENGINE = InnoDB; INSERT INTO t1(title) VALUES('mysql'); INSERT INTO t1(title) VALUES('database'); ..... SELECT * FROM t1; FTS_DOC_ID title 1 mysql 2 database SET GLOBAL innodb_ft_aux_table="test/t1"; SELECT WORD, DOC_ID FROM INFORMATION_SCHEMA.INNODB_FT_INDEX_TABLE; WORD DOC_ID database 2 mysql 1 SET GLOBAL innodb_ft_aux_table=default; Here FTS_DOC_ID for mysql and database is "1" and "2" FTS_DOC_ID doesn't have to be PRIMARY KEY. It can be part of UNIQUE INDEX as well Example: CREATE TABLE t1 ( ID INT UNSIGNED AUTO_INCREMENT NOT NULL PRIMARY KEY, FTS_DOC_ID BIGINT UNSIGNED NOT NULL, UNIQUE INDEX FTS_DOC_ID_INDEX (FTS_DOC_ID), title VARCHAR(200), FULLTEXT(title) ) ENGINE = InnoDB; INSERT INTO t1(FTS_DOC_ID, title) VALUES(1, 'mysql'); INSERT INTO t1(FTS_DOC_ID, title) VALUES(2, 'database'); SELECT * FROM t1; ID FTS_DOC_ID title 1 1 mysql 2 2 database SET GLOBAL innodb_ft_aux_table="test/t1"; SELECT WORD, DOC_ID FROM INFORMATION_SCHEMA.INNODB_FT_INDEX_TABLE; WORD DOC_ID database 2 mysql 1 -- Thirunarayanan B Software Engineer MariaDB
Thanks for the replies and the clarification! May I suggest to enhance https://mariadb.com/kb/en/information-schema-innodb_ft_index_table-table/ slightly: it currently explains for “INNODB_FT_INDEX_TABLE.doc_id”: * Document ID of the newly added row, either an appropriate ID column or an internal InnoDB value. I now learned that the rules for “appropriate id” are more complex than I thought 😉. But maybe there exists already a documentation that I did not find. Also, one question remains: if an invisible FTS_DOC_ID column is automatically added to my table, how can I view the values? I ask this, because in the process of analyzing the word split results, I wanted to match entries of “INNODB_FT_INDEX_TABLE “ to my real table or vice versa. This does not work: select FTS_DOC_ID from t1; I already found that FTS_DOC_ID is an invisible column, but the FTS_DOC_ID cannot be selected, so it must be something special. I found this JIRA issue about adding several types of invisibility: https://jira.mariadb.org/browse/MDEV-10177 - but I found no further information about this. My MariaDB version is slightly old – 10.3.39. There might have been changes to the behavior since then. Best regards Wolfgang Von: Thirunarayanan Balathandayuthapani <thiru@mariadb.com> Gesendet: Samstag, 14. September 2024 05:11 An: Wolfgang Knauf <WKnauf@hg-online.de> Cc: discuss@lists.mariadb.org Betreff: Re: [MariaDB discuss] Re: Fulltext index tokenization rules / DOC_ID ACHTUNG: Diese E-Mail stammt von einem externen Absender. Bitte achten Sie auf Anhänge oder externe Links. On Thu, Sep 12, 2024 at 12:47 PM wknauf--- via discuss <discuss@lists.mariadb.org<mailto:discuss@lists.mariadb.org>> wrote: Question 1: this was changed by https://jira.mariadb.org/browse/MDEV-20797 Question 2: still open. I asked the same here: https://stackoverflow.com/questions/78969640/mariadb-fulltext-index-tokeniza... To answer your question 2: If an FTS_DOC_ID column is not defined, InnoDB automatically adds a hidden FTS_DOC_ID column when the full-text index is created. If you create a full-text index at CREATE TABLE time and do not specify an FTS_DOC_ID column, InnoDB adds a hidden FTS_DOC_ID column. Example: CREATE TABLE t1 ( ID INT UNSIGNED AUTO_INCREMENT NOT NULL PRIMARY KEY, title VARCHAR(200), FULLTEXT(title) ) ENGINE = InnoDB; INSERT INTO t1(title) VALUES('mysql'); INSERT INTO t1(title) VALUES('database'); .... SELECT * FROM t1; ID title 1 mysql 2 database SET GLOBAL innodb_ft_aux_table="test/t1"; SELECT WORD, DOC_ID FROM INFORMATION_SCHEMA.INNODB_FT_INDEX_TABLE; WORD DOC_ID database 3 mysql 2 Here the hidden FTS_DOC_ID value is 2 and 3 for mysql and database. It doesn't have to match with ID column in the table. Another way to create your own FTS_DOC_ID column, the column must be defined as BIGINT UNSIGNED NOT NULL and named FTS_DOC_ID (all uppercase) Example: CREATE TABLE t1 ( FTS_DOC_ID BIGINT UNSIGNED AUTO_INCREMENT NOT NULL PRIMARY KEY, title VARCHAR(200), FULLTEXT(title) ) ENGINE = InnoDB; INSERT INTO t1(title) VALUES('mysql'); INSERT INTO t1(title) VALUES('database'); ..... SELECT * FROM t1; FTS_DOC_ID title 1 mysql 2 database SET GLOBAL innodb_ft_aux_table="test/t1"; SELECT WORD, DOC_ID FROM INFORMATION_SCHEMA.INNODB_FT_INDEX_TABLE; WORD DOC_ID database 2 mysql 1 SET GLOBAL innodb_ft_aux_table=default; Here FTS_DOC_ID for mysql and database is "1" and "2" FTS_DOC_ID doesn't have to be PRIMARY KEY. It can be part of UNIQUE INDEX as well Example: CREATE TABLE t1 ( ID INT UNSIGNED AUTO_INCREMENT NOT NULL PRIMARY KEY, FTS_DOC_ID BIGINT UNSIGNED NOT NULL, UNIQUE INDEX FTS_DOC_ID_INDEX (FTS_DOC_ID), title VARCHAR(200), FULLTEXT(title) ) ENGINE = InnoDB; INSERT INTO t1(FTS_DOC_ID, title) VALUES(1, 'mysql'); INSERT INTO t1(FTS_DOC_ID, title) VALUES(2, 'database'); SELECT * FROM t1; ID FTS_DOC_ID title 1 1 mysql 2 2 database SET GLOBAL innodb_ft_aux_table="test/t1"; SELECT WORD, DOC_ID FROM INFORMATION_SCHEMA.INNODB_FT_INDEX_TABLE; WORD DOC_ID database 2 mysql 1 -- Thirunarayanan B Software Engineer MariaDB
participants (4)
-
Sergei Golubchik
-
Thirunarayanan Balathandayuthapani
-
wknauf@hg-online.de
-
Wolfgang Knauf