[Maria-developers] Parallel replication slave stops with duplicate key errors
Hi Kristian, I was trying out parallel replication and got an error that I am not sure what to do with. == Setup == I started a master with binlog_format=row, I ran sysbench there to load the dataset ("prepare" mode, about 10M rows IIRC) then I restarted the master and ran sysbench in "run" mode a few times. Then I restarted the master again an ran a "CREATE TABLE" statement manually. Just so that I have third binlog. The master state is: mysql> show binary logs; +-------------+-----------+ | Log_name | File_size | +-------------+-----------+ | pslp.000001 | 194918772 | | pslp.000002 | 28895350 | | pslp.000003 | 589 | +-------------+-----------+ == Non-parallel test run == Then I connected with a slave, with all default settings. The slave had empty datadir, so I made it a slave with this command: CHANGE MASTER TO MASTER_HOST='127.0.0.1', MASTER_PORT=3310, MASTER_USER='root'; All server settings were pretty much default. It took a slave one hour to catch up, but it did so, and encountered no errors. Ok, this was serial replication == Parallel test run == Then I killed the slave, deleted its data directory, put an empty directory instead I added this to slave.cnf file: slave_parallel_threads=4 slave_parallel_mode=aggressive and started the slave. It ran in parallel for a few minutes, I saw this: mysql> show processlist; +----+-------------+-----------+------+---------+------+-----------------------------------------------+------------------+----------+ | Id | User | Host | db | Command | Time | State | Info | Progress | +----+-------------+-----------+------+---------+------+-----------------------------------------------+------------------+----------+ | 3 | root | localhost | test | Query | 0 | init | show processlist | 0.000 | | 4 | system user | | NULL | Connect | 15 | Waiting for master to send event | NULL | 0.000 | | 5 | system user | | NULL | Connect | 0 | Write_rows_log_event::write_row(-1) | NULL | 0.000 | | 6 | system user | | NULL | Connect | 0 | Waiting for work from SQL thread | NULL | 0.000 | | 7 | system user | | NULL | Connect | 0 | Write_rows_log_event::write_row(-1) | NULL | 0.000 | | 8 | system user | | NULL | Connect | 0 | Waiting for work from SQL thread | NULL | 0.000 | | 9 | system user | | NULL | Connect | 14 | Waiting for room in worker thread event queue | NULL | 0.000 | +----+-------------+-----------+------+---------+------+-----------------------------------------------+------------------+----------+ 7 rows in set (0.00 sec) but then it stopped. The error log has this: 2016-06-15 15:09:58 140702586379008 [Note] Slave SQL thread initialized, starting replication in log 'FIRST' at position 0, relay log './psergey-desktop-relay-bin.000001' position: 4 2016-06-15 15:09:59 140702586985216 [Note] InnoDB: Online DDL : Start 2016-06-15 15:09:59 140702586985216 [Note] InnoDB: Online DDL : Start reading clustered index of the table and create temporary files 2016-06-15 15:09:59 140702586985216 [Note] InnoDB: Online DDL : End of reading clustered index of the table and create temporary files 2016-06-15 15:09:59 140702586985216 [Note] InnoDB: Online DDL : Start applying row log 2016-06-15 15:09:59 140702586985216 [Note] InnoDB: Online DDL : End of applying row log 2016-06-15 15:09:59 140702586985216 [Note] InnoDB: Online DDL : Completed 2016-06-15 15:11:18 140702586985216 [ERROR] Slave SQL: Error during XID COMMIT: failed to update GTID state in mysql.gtid_slave_pos: 1062: Duplicate entry '0-1021' for key 'PRIMARY', Gtid 0-1-1021, Internal MariaDB error code: 1942 2016-06-15 15:11:18 140702586985216 [ERROR] Slave (additional info): Duplicate entry '0-1021' for key 'PRIMARY' Error_code: 1062 2016-06-15 15:11:18 140702586985216 [Warning] Slave: Duplicate entry '0-1021' for key 'PRIMARY' Error_code: 1062 2016-06-15 15:11:18 140702586985216 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'pslp.000002' position 8266 2016-06-15 15:11:18 140702586682112 [ERROR] Slave (additional info): Commit failed due to failure of an earlier commit on which this one depends Error_code: 1964 2016-06-15 15:11:18 140702586682112 [Warning] Slave: Duplicate entry '821827' for key 'PRIMARY' Error_code: 1062 2016-06-15 15:11:18 140702586682112 [Warning] Slave: Deadlock found when trying to get lock; try restarting transaction Error_code: 1213 2016-06-15 15:11:18 140702586682112 [Warning] Slave: Commit failed due to failure of an earlier commit on which this one depends Error_code: 1964 2016-06-15 15:11:18 140702586682112 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'pslp.000002' position 8266 2016-06-15 15:11:18 140702602623744 [ERROR] Slave (additional info): Commit failed due to failure of an earlier commit on which this one depends Error_code: 1964 2016-06-15 15:11:18 140702602623744 [Warning] Slave: Commit failed due to failure of an earlier commit on which this one depends Error_code: 1964 2016-06-15 15:11:18 140702602623744 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'pslp.000002' position 8266 2016-06-15 15:11:18 140702602926848 [ERROR] Slave (additional info): Commit failed due to failure of an earlier commit on which this one depends Error_code: 1964 2016-06-15 15:11:18 140702602926848 [Warning] Slave: Commit failed due to failure of an earlier commit on which this one depends Error_code: 1964 2016-06-15 15:11:18 140702602926848 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'pslp.000002' position 8266 2016-06-15 15:11:18 140702586985216 [ERROR] Slave (additional info): Commit failed due to failure of an earlier commit on which this one depends Error_code: 1964 2016-06-15 15:11:18 140702586985216 [Warning] Slave: Duplicate entry '0-1021' for key 'PRIMARY' Error_code: 1062 2016-06-15 15:11:18 140702586985216 [Warning] Slave: Commit failed due to failure of an earlier commit on which this one depends Error_code: 1964 2016-06-15 15:11:18 140702586985216 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'pslp.000002' position 8266 2016-06-15 15:11:18 140702586682112 [ERROR] Slave (additional info): Commit failed due to failure of an earlier commit on which this one depends Error_code: 1964 2016-06-15 15:11:18 140702586682112 [Warning] Slave: Duplicate entry '821827' for key 'PRIMARY' Error_code: 1062 ... and there were lots of these errors. Eventually it stopped with: 2016-06-15 15:11:22 140702586682112 [Warning] Slave: Commit failed due to failure of an earlier commit on which this one depends Error_code: 1964 2016-06-15 15:11:22 140702586682112 [Warning] Slave: Commit failed due to failure of an earlier commit on which this one depends Error_code: 1964 2016-06-15 15:11:22 140702586682112 [Warning] Slave: Commit failed due to failure of an earlier commit on which this one depends Error_code: 1964 2016-06-15 15:11:22 140702586682112 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'pslp.000002' position 8266 2016-06-15 15:11:22 140702586379008 [Note] Slave SQL thread exiting, replication stopped in log 'pslp.000002' at position 8266 The status is: mysql> show slave status\G *************************** 1. row *************************** Slave_IO_State: Waiting for master to send event Master_Host: 127.0.0.1 Master_User: root Master_Port: 3310 Connect_Retry: 60 Master_Log_File: pslp.000003 Read_Master_Log_Pos: 589 Relay_Log_File: psergey-desktop-relay-bin.000003 Relay_Log_Pos: 8549 Relay_Master_Log_File: pslp.000002 Slave_IO_Running: Yes Slave_SQL_Running: No Replicate_Do_DB: Replicate_Ignore_DB: Replicate_Do_Table: Replicate_Ignore_Table: Replicate_Wild_Do_Table: Replicate_Wild_Ignore_Table: Last_Errno: 1942 Last_Error: Error during XID COMMIT: failed to update GTID state in mysql.gtid_slave_pos: 1062: Duplicate entry '0-1021' for key 'PRIMARY' Skip_Counter: 0 Exec_Master_Log_Pos: 8266 Relay_Log_Space: 223815659 Until_Condition: None Until_Log_File: Until_Log_Pos: 0 Master_SSL_Allowed: No Master_SSL_CA_File: Master_SSL_CA_Path: Master_SSL_Cert: Master_SSL_Cipher: Master_SSL_Key: Seconds_Behind_Master: NULL Master_SSL_Verify_Server_Cert: No Last_IO_Errno: 0 Last_IO_Error: Last_SQL_Errno: 1942 Last_SQL_Error: Error during XID COMMIT: failed to update GTID state in mysql.gtid_slave_pos: 1062: Duplicate entry '0-1021' for key 'PRIMARY' Replicate_Ignore_Server_Ids: Master_Server_Id: 1 Master_SSL_Crl: Master_SSL_Crlpath: Using_Gtid: No Gtid_IO_Pos: Replicate_Do_Domain_Ids: Replicate_Ignore_Domain_Ids: Parallel_Mode: aggressive Can you advise, what does this error mean? (is there any benefit in cc'ing this to maria-developers@ ? if yes, feel free to..) BR Sergei -- Sergei Petrunia, Software Developer MariaDB Corporation | Skype: sergefp | Blog: http://s.petrunia.net/blog
Sergey Petrunia <sergey@mariadb.com> writes:
I ran sysbench there to load the dataset ("prepare" mode, about 10M rows IIRC) then I restarted the master and ran sysbench in "run" mode a few times. Then I restarted the master again an ran a "CREATE TABLE" statement manually. Just so that I have third binlog.
Incidentally, you can also just use FLUSH LOGS to rotate the binlog to a new file (but master restart is fine as well, of course).
== Parallel test run == Then I killed the slave, deleted its data directory, put an empty directory instead I added this to slave.cnf file:
slave_parallel_threads=4 slave_parallel_mode=aggressive
2016-06-15 15:11:18 140702586985216 [ERROR] Slave SQL: Error during XID COMMIT: failed to update GTID state in mysql.gtid_slave_pos: 1062: Duplicate entry '0-1021' for key 'PRIMARY', Gtid 0-1-1021, Internal MariaDB error code: 1942
I noticed that gtid_slave_pos is non transactional.. I've made it
Right, so it seems that the error is caused by non-transactional mysql.gtid_slave_pos. I was able to repeat this by running test case rpl.rpl_parallel_optimistic with mysql.gtid_slave_pos being MyISAM. What happens is that optimistic/aggresive mode runs conflicting transactions in parallel and then sometimes needs to roll back. If the rollback occurs after updating a MyISAM mysql.gtid_slave_pos table, the GTID will be left in the table, and we get duplicate key error when the transaction is retried. Optimistic/aggressive mode is not really useful with non-transactional tables, but it is not supposed to break replication. Normally the code falls back to non-parallel replication of changes to non-transactional tables, but this is checked on the master, so does not take into account mysql.gtid_slave_pos. But then, if mysql.gtid_slave_pos is non-transactional, nothing can run in parallel under optimistic/aggressive mode, which is not very useful, so probably giving an error is better. I have filed MDEV-10242 for this, thanks for reporting the problem! Usually, mysql.gtid_slave_pos is created as InnoDB. Only for users that have MyISAM as default engine is it created as MyISAM (or if they explicitly ALTER TABLE'd it). So that is probably why it has not hit a lot of users so far. Your setup was using a data directory from mysql-test-run, which explicitly changes default to MyISAM, hence you got the problem. - Kristian.
participants (2)
-
Kristian Nielsen
-
Sergey Petrunia