Do you have any errors in the error log about failure to delete rows?
Nope, no errors.
Anything else special about your setup that might be causing this?
At some point I thought maybe tokudb_analyze_in_background / tokudb_auto_analyze messes things up as it does the background check (you can also see the row count growing here):

2018-09-29 11:05:48 134488 [Note] TokuDB: Auto scheduling background analysis for ./mysql/gtid_slave_pos_TokuDB, delta_activity 423840 is greater than 40 percent of 1059601 rows. - succeeded.
2018-09-29 11:09:35 134490 [Note] TokuDB: Auto scheduling background analysis for ./mysql/gtid_slave_pos_TokuDB, delta_activity 424359 is greater than 40 percent of 1060885 rows. - succeeded.
2018-09-29 11:13:23 134488 [Note] TokuDB: Auto scheduling background analysis for ./mysql/gtid_slave_pos_TokuDB, delta_activity 424888 is greater than 40 percent of 1062196 rows. - succeeded.

(It also triggers in conservative mode, but there it happens just because a single row is >40% of the table.)

I tried switching off gtid_pos_auto_engines to use a single InnoDB gtid_pos table, and it makes no difference: in conservative mode everything is fine, in optimistic mode the table fills up.

The odd thing is that I'm actually not using GTID for the replication:

MariaDB [mysql]> show slave status\G
                Slave_IO_State: Waiting for master to send event
                   Master_Host: 10.0.8.211
                   Master_User: repl
                   Master_Port: 3306
                 Connect_Retry: 60
               Master_Log_File: mysql-bin.096519
           Read_Master_Log_Pos: 79697585
                Relay_Log_File: db-relay-bin.000142
                 Relay_Log_Pos: 78464847
         Relay_Master_Log_File: mysql-bin.096519
              Slave_IO_Running: Yes
             Slave_SQL_Running: Yes
               Replicate_Do_DB:
           Replicate_Ignore_DB:
            Replicate_Do_Table:
        Replicate_Ignore_Table:
       Replicate_Wild_Do_Table:
   Replicate_Wild_Ignore_Table:
                    Last_Errno: 0
                    Last_Error:
                  Skip_Counter: 0
           Exec_Master_Log_Pos: 79697245
               Relay_Log_Space: 595992008
               Until_Condition: None
                Until_Log_File:
                 Until_Log_Pos: 0
..
         Seconds_Behind_Master: 0
 Master_SSL_Verify_Server_Cert: No
                 Last_IO_Errno: 0
                 Last_IO_Error:
                Last_SQL_Errno: 0
                Last_SQL_Error:
   Replicate_Ignore_Server_Ids:
              Master_Server_Id: 211
                Master_SSL_Crl:
            Master_SSL_Crlpath:
                    Using_Gtid: No
                   Gtid_IO_Pos:
       Replicate_Do_Domain_Ids:
   Replicate_Ignore_Domain_Ids:
                 Parallel_Mode: optimistic
                     SQL_Delay: 0
           SQL_Remaining_Delay: NULL
       Slave_SQL_Running_State: Slave has read all relay log; waiting for the slave I/O thread to update it
              Slave_DDL_Groups: 25
Slave_Non_Transactional_Groups: 284193
    Slave_Transactional_Groups: 452098720

The other "special" thing, maybe, is that the master is still on 10.2.4 - but that shouldn't be the problem? I have 2 slaves (both 10.3.9; I might try downgrading back to 10.2.x or to earlier 10.3.x versions, as I don't know the exact point when this started to happen), and the issue is triggered immediately when switching the parallel mode.
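For reference, this is roughly how I flip between the two modes to trigger it (a minimal sketch; as far as I know slave_parallel_mode can only be changed while the slave is stopped):

STOP SLAVE;
SET GLOBAL slave_parallel_mode = 'optimistic';  -- gtid_slave_pos starts growing right after this
START SLAVE;

Setting it back to 'conservative' the same way and the table behaves normally again.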
Can you share the contents of the mysql.gtid_slave_pos table when this happens?
Sure,

MariaDB [mysql]> select @@gtid_slave_pos;
+--------------------+
| @@gtid_slave_pos   |
+--------------------+
| 0-211-211038653075 |
+--------------------+
1 row in set (0.000 sec)

MariaDB [mysql]> select * from gtid_slave_pos limit 10;
+-----------+--------+-----------+--------------+
| domain_id | sub_id | server_id | seq_no       |
+-----------+--------+-----------+--------------+
|         0 |  29488 |       211 | 210594092751 |
|         0 |  29490 |       211 | 210594092753 |
|         0 |  29957 |       211 | 210594093220 |
|         0 |  29958 |       211 | 210594093221 |
|         0 |  29961 |       211 | 210594093224 |
|         0 |  29962 |       211 | 210594093225 |
|         0 |  30095 |       211 | 210594093358 |
|         0 |  30096 |       211 | 210594093359 |
|         0 |  30247 |       211 | 210594093510 |
|         0 |  30275 |       211 | 210594093538 |
+-----------+--------+-----------+--------------+
10 rows in set (0.000 sec)

MariaDB [mysql]> select count(*) from gtid_slave_pos;
+----------+
| count(*) |
+----------+
|  2395877 |
+----------+
1 row in set (0.578 sec)

MariaDB [mysql]> select * from gtid_slave_pos_TokuDB limit 10;
+-----------+--------+-----------+--------------+
| domain_id | sub_id | server_id | seq_no       |
+-----------+--------+-----------+--------------+
|         0 |  29373 |       211 | 210594092636 |
|         0 |  29911 |       211 | 210594093174 |
|         0 |  29912 |       211 | 210594093175 |
|         0 |  30282 |       211 | 210594093545 |
|         0 |  30283 |       211 | 210594093546 |
|         0 |  30284 |       211 | 210594093547 |
|         0 |  30285 |       211 | 210594093548 |
|         0 |  30287 |       211 | 210594093550 |
|         0 |  30348 |       211 | 210594093611 |
|         0 |  30349 |       211 | 210594093612 |
+-----------+--------+-----------+--------------+
10 rows in set (0.001 sec)

MariaDB [mysql]> select count(*) from gtid_slave_pos_TokuDB;
+----------+
| count(*) |
+----------+
|   840001 |
+----------+
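If it helps, here is a quick way to count how much of that is stale (a sketch; my understanding is that only the row with the highest sub_id per domain_id is still needed, and the older rows are supposed to be deleted in the background):

SELECT domain_id, COUNT(*) AS total_rows, MAX(sub_id) AS newest_sub_id
FROM mysql.gtid_slave_pos
GROUP BY domain_id;

With a single replication domain that should normally leave only a handful of rows, so essentially all of the rows here look like old positions that were never deleted.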
Is there something wrong with the purger? (Something similar to https://jira.mariadb.org/browse/MDEV-12147?)
That bug is rather different - there the row count in the table is not growing, but the number of unpurged rows is.
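To make the difference concrete, a rough sketch of how to tell the two apart (assuming an InnoDB gtid_slave_pos table; the Rows value from SHOW TABLE STATUS is an estimate that can be inflated by delete-marked rows the purge thread has not yet removed, while COUNT(*) sees only live rows):

SELECT COUNT(*) FROM mysql.gtid_slave_pos;
SHOW TABLE STATUS FROM mysql LIKE 'gtid_slave_pos';
SHOW GLOBAL STATUS LIKE 'Innodb_history_list_length';

In MDEV-12147, COUNT(*) stays flat while the estimate and the purge backlog grow; in your case COUNT(*) itself is growing.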
True, I just searched Jira for similar issues (related to gtid_slave_pos) and saw that this one is still open. Optimistic mode makes a big difference in our setup, as with conservative mode there are times when the slaves start to lag several days behind.