Michael Widenius <michael.widenius@gmail.com> writes:
I was trying to run a test that fails in the upcoming bb-10.2-jan on the normal 10.2 tree, when I noticed this strange issue:
- Test fails with timeout when running with --debug - When looking at the trace file, I notice that we get a duplicate key error for the table gtid_slave_post (MyISAM table). Is this something normal ?
Like this: 2016-09-01 10:33:20 140078976283392 [ERROR] Slave SQL: Error during XID COMMIT: failed to update GTID state in mysql.gtid_slave_pos: 1062: Duplicate entry '0-53' for key 'PRIMARY', Gtid 0-1-52, Internal MariaDB error code: 1942 This happens because the mysql.gtid_slave_pos table is MyISAM (which is default in mysql-test-run, but not in the normal server install), and parallel replication needs to roll back a transaction after it has updated the table. Because of MyISAM, the gtid_slave_pos change cannot be rolled back. Maybe parallel replication could in this case manually undo its change in the table as part of the rollback. It's just a DELETE of the row previously inserted. In any case, currently the fix is to use InnoDB for the table: --- rpl_skr.test~ 2016-09-01 10:27:21.214633498 +0200 +++ rpl_skr.test 2016-09-01 10:35:50.660242337 +0200 @@ -8,6 +8,9 @@ --connection server_2 SET @old_parallel_threads=@@GLOBAL.slave_parallel_threads; --source include/stop_slave.inc +SET sql_log_bin=0; +ALTER TABLE mysql.gtid_slave_pos ENGINE=InnoDB; +SET sql_log_bin=1; SET GLOBAL slave_parallel_threads=10; SET GLOBAL slave_parallel_mode='conservative'; --source include/start_slave.inc
bb-10.2-jan tree is a working tree for a merge of MariaDB 10.2 and MySQL 5.7
When running rpl_skr in 10.2 it takes 2 seconds When running it in the bb-10.2-jan tree it takes either a long time or we get a timeout.
This is because of errorneous merge. The original code: if (waitee_buf_ptr) { lock_report_waiters_to_mysql(waitee_buf_ptr, start_mysql_thd, victim_trx_id); The bb-10.2-jan code: if (victim_trx && waitee_buf_ptr) { lock_report_waiters_to_mysql(waitee_buf_ptr, start_mysql_thd, victim_trx->id); So if victim_trx is NULL the waits are not reported to parallel replication at all, causing the stalls and/or hangs. victim_trx is NULL unless InnoDB itself detects a deadlock. I've attached a patch that fixes this, can also be pulled from here: https://github.com/knielsen/server/commits/montyrpl Or should I push it directly into bb-10.2-jan? This makes the rpl_skr.test complete correctly in < 1 second.
This is probably because of the new lock code in lock0lock.cc and lock0wait.cc which doesn't break conflicting transaction but instead waits for a timeout
The merge appears very rough. Shouldn't the waitee_buf be integrated into the new DeadlockChecker class? Why is it necessary to thd_report_wait_for() on internal transactions like here? /* m_trx->mysql_thd is NULL if it's an internal trx. So current_thd is used */ if (err == DB_LOCK_WAIT) { ut_ad(wait_for && wait_for->trx); wait_for->trx->abort_type = TRX_REPLICATION_ABORT; thd_report_wait_for(current_thd, wait_for->trx->mysql_thd); wait_for->trx->abort_type = TRX_SERVER_ABORT; } return(err); Maybe I should try to write a better patch for integrating this in the new InnoDB code. What do you think about changing this to use the async deadlock kill in background thread, as discussed in this thread? https://lists.launchpad.net/maria-developers/msg09902.html This would allow to simplify the code in lock0lock.cc, and avoid the locking hacks in innobase_kill_query()? - Kristian.