Thanks for the additional info.

TokuDB probably isn't the most used engine with optimistic parallel replication. I worked with a TokuDB developer some time ago to make it work, but I'm not sure how well those fixes are maintained in MariaDB (i.e. I tried running the testcase tokudb_rpl.rpl_parallel_optimistic on 10.3.9 just now and it fails, seemingly because it hasn't even been run after some changes to the test framework). However, I do not see why TokuDB would cause the behaviour you describe: the maintenance of mysql.gtid_slave_pos is not storage-engine dependent, and you tested with the table using InnoDB. It would be useful to see if this problem is also present with earlier versions of MariaDB on the slave.

The GTID position is always maintained by MariaDB; that is why you see this despite not actively using GTID. I also do not see how the 10.2.4 master could influence the problem.

The table data you sent show rows that are seemingly randomly selected from transactions, and the problem only occurs in optimistic mode. That suggests it is related to the case where a transaction is run optimistically in parallel, causes a conflict, and needs to be rolled back. Maybe this rollback becomes ineffective in mysql.gtid_slave_pos for some reason? Though again, I don't immediately see how that could be.

It is cool that optimistic replication works well in your setup to avoid slave lag (but not cool that it causes this problem). I will try to see if I can reproduce it with simple test-generated traffic, but if you know of a way I could reproduce it, that would be useful.
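The kind of test-generated traffic I have in mind is simply many small transactions that all touch the same row, so that optimistic parallel apply on the slave frequently conflicts and has to roll back. Something along these lines (table name and values are made up for illustration):

  -- On the master: a throw-away table with a single "hot" row.
  CREATE TABLE test.t_conflict (id INT PRIMARY KEY, counter INT) ENGINE=InnoDB;
  INSERT INTO test.t_conflict VALUES (1, 0);

  -- Then run this in a loop from several concurrent client connections;
  -- on the slave, the parallel worker threads should frequently conflict
  -- on this row and get rolled back and retried.
  UPDATE test.t_conflict SET counter = counter + 1 WHERE id = 1;

If mysql.gtid_slave_pos on the slave keeps growing under that kind of load, it should be possible to turn it into a small testcase.

 - Kristian.

"Reinis Rozitis" <r@roze.lv> writes: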
Do you have any errors in the error log about failure to delete rows?
Nope, no errors.
Anything else special to your setup that might be causing this?
At some point I thought maybe tokudb_analyze_in_background / tokudb_auto_analyze was messing things up with its background checks (you can also see the row count growing here):
2018-09-29 11:05:48 134488 [Note] TokuDB: Auto scheduling background analysis for ./mysql/gtid_slave_pos_TokuDB, delta_activity 423840 is greater than 40 percent of 1059601 rows. - succeeded.
2018-09-29 11:09:35 134490 [Note] TokuDB: Auto scheduling background analysis for ./mysql/gtid_slave_pos_TokuDB, delta_activity 424359 is greater than 40 percent of 1060885 rows. - succeeded.
2018-09-29 11:13:23 134488 [Note] TokuDB: Auto scheduling background analysis for ./mysql/gtid_slave_pos_TokuDB, delta_activity 424888 is greater than 40 percent of 1062196 rows. - succeeded.
(it also triggers in conservative mode, but then it happens just because a single row is >40% of the table)
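If it matters, the background analysis could probably be ruled out entirely by turning it off, something along these lines (assuming the variables can be changed at runtime; they could of course also go into my.cnf):

  SET GLOBAL tokudb_analyze_in_background = OFF;
  -- 0 disables the auto-analyze activity threshold entirely
  SET GLOBAL tokudb_auto_analyze = 0;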
I tried switching off gtid_pos_auto_engines to use a single InnoDB gtid_slave_pos table, and it makes no difference - in conservative mode everything is fine, in optimistic mode the table fills up.
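(By "switch off" I mean roughly the following, plus removing the setting from my.cnf; details are from memory:)

  SET GLOBAL gtid_pos_auto_engines = '';
  -- verify it is empty, so the server will not auto-create per-engine
  -- gtid_slave_pos_* tables
  SELECT @@gtid_pos_auto_engines;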
The odd thing is that I'm actually not using gtid for the replication:
MariaDB [mysql]> show slave status\G
Slave_IO_State: Waiting for master to send event
Master_Host: 10.0.8.211
Master_User: repl
Master_Port: 3306
Connect_Retry: 60
Master_Log_File: mysql-bin.096519
Read_Master_Log_Pos: 79697585
Relay_Log_File: db-relay-bin.000142
Relay_Log_Pos: 78464847
Relay_Master_Log_File: mysql-bin.096519
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 0
Last_Error:
Skip_Counter: 0
Exec_Master_Log_Pos: 79697245
Relay_Log_Space: 595992008
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
..
Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 0
Last_SQL_Error:
Replicate_Ignore_Server_Ids:
Master_Server_Id: 211
Master_SSL_Crl:
Master_SSL_Crlpath:
Using_Gtid: No
Gtid_IO_Pos:
Replicate_Do_Domain_Ids:
Replicate_Ignore_Domain_Ids:
Parallel_Mode: optimistic
SQL_Delay: 0
SQL_Remaining_Delay: NULL
Slave_SQL_Running_State: Slave has read all relay log; waiting for the slave I/O thread to update it
Slave_DDL_Groups: 25
Slave_Non_Transactional_Groups: 284193
Slave_Transactional_Groups: 452098720
The other "special" thing maybe is that the master is still 10.2.4 - but that shouldn’t be the problem?
I have 2 slaves (both 10.3.9; I might try to downgrade back to 10.2.x or earlier 10.3.x versions, as I don't know the exact point when it started to happen) and the issue is triggered immediately when switching the parallel mode.
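(The switch itself is nothing special, roughly:)

  STOP SLAVE;
  SET GLOBAL slave_parallel_mode = 'optimistic';  -- or back to 'conservative'
  START SLAVE;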
Can you share the contents of the mysql.gtid_slave_pos table when this happens?
Sure,
MariaDB [mysql]> select @@gtid_slave_pos;
+--------------------+
| @@gtid_slave_pos   |
+--------------------+
| 0-211-211038653075 |
+--------------------+
1 row in set (0.000 sec)
MariaDB [mysql]> select * from gtid_slave_pos limit 10;
+-----------+--------+-----------+--------------+
| domain_id | sub_id | server_id | seq_no       |
+-----------+--------+-----------+--------------+
|         0 |  29488 |       211 | 210594092751 |
|         0 |  29490 |       211 | 210594092753 |
|         0 |  29957 |       211 | 210594093220 |
|         0 |  29958 |       211 | 210594093221 |
|         0 |  29961 |       211 | 210594093224 |
|         0 |  29962 |       211 | 210594093225 |
|         0 |  30095 |       211 | 210594093358 |
|         0 |  30096 |       211 | 210594093359 |
|         0 |  30247 |       211 | 210594093510 |
|         0 |  30275 |       211 | 210594093538 |
+-----------+--------+-----------+--------------+
10 rows in set (0.000 sec)
MariaDB [mysql]> select count(*) from gtid_slave_pos;
+----------+
| count(*) |
+----------+
|  2395877 |
+----------+
1 row in set (0.578 sec)
MariaDB [mysql]> select * from gtid_slave_pos_TokuDB limit 10;
+-----------+--------+-----------+--------------+
| domain_id | sub_id | server_id | seq_no       |
+-----------+--------+-----------+--------------+
|         0 |  29373 |       211 | 210594092636 |
|         0 |  29911 |       211 | 210594093174 |
|         0 |  29912 |       211 | 210594093175 |
|         0 |  30282 |       211 | 210594093545 |
|         0 |  30283 |       211 | 210594093546 |
|         0 |  30284 |       211 | 210594093547 |
|         0 |  30285 |       211 | 210594093548 |
|         0 |  30287 |       211 | 210594093550 |
|         0 |  30348 |       211 | 210594093611 |
|         0 |  30349 |       211 | 210594093612 |
+-----------+--------+-----------+--------------+
10 rows in set (0.001 sec)
MariaDB [mysql]> select count(*) from gtid_slave_pos_TokuDB;
+----------+
| count(*) |
+----------+
|   840001 |
+----------+
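If it helps, I can also check how far back the leftover rows reach compared to the current position, with something like:

  SELECT MIN(seq_no), MAX(seq_no), COUNT(*) FROM mysql.gtid_slave_pos;
  SELECT MIN(seq_no), MAX(seq_no), COUNT(*) FROM mysql.gtid_slave_pos_TokuDB;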
Is there something wrong with the purger? (something similar to https://jira.mariadb.org/browse/MDEV-12147 ?)
That bug is rather different - there the row count in the table is not growing, but the number of unpurged rows is.
True, I just searched Jira for similar issues (related to gtid_slave_pos) and saw that this one is still open.
The optimistic mode makes a big difference in our setup, as with conservative mode there are times when the slaves start to lag several days behind.
rr