Kristian Nielsen <knielsen@knielsen-hq.org> writes:
> I will continue and look deeper in the rpl_deadlock_innodb failure and in the other issues.
Ok, I debugged the problem in rpl.rpl_deadlock_innodb where I get this failure:

    CURRENT_TEST: rpl.rpl_deadlock_innodb
    mysqltest: In included file "./include/wait_for_slave_param.inc":
    included from ./include/wait_for_slave_sql_error.inc at line 41:
    included from ./extra/rpl_tests/rpl_deadlock.test at line 84:
    included from /home/knielsen/my/10.0/work-10.0-mdev520/mysql-test/suite/rpl/t/rpl_deadlock_innodb.test at line 6:
    At line 115: Timeout in include/wait_for_slave_param.inc

This test case tries to replicate a transaction, but the slave is blocked by row locks held by a user transaction. So the slave transaction gets a "Lock wait timeout exceeded" error and retries the transaction. This repeats until @@global.slave_transaction_retries is exceeded. The test case waits for the maximum number of retries to happen and for the slave to stop with an error.

However, this does not happen in your parallel-replication tree. Instead, the slave loops endlessly, retrying (and timing out) the transaction.

The transaction is retried the correct number of times, and then an error is returned from execute_single_transaction(). But somehow this error is not caught correctly, and the slave is not stopped. Instead, execute_single_transaction() gets called again with the *same* transaction, and it fails again, and so on endlessly, until the test case itself times out and gives up waiting for the slave to stop with an error.

So far I have not found exactly where the error check is missing, but it must be somewhere up in the call chain of execute_single_transaction(). The error needs to be caught somewhere there, and that code must stop the slave and set the error code and message for SHOW SLAVE STATUS.

I hope you can sort it out from there, else ask again.

By the way, while debugging I found something else that may also be an error.
I was replicating three CREATE TABLE statements in sequence:

    CREATE TABLE t1 (a INT NOT NULL, KEY(a)) ENGINE=InnoDB;
    CREATE TABLE t2 (a INT) ENGINE=InnoDB;
    CREATE TABLE t3 (a INT NOT NULL, KEY(a)) ENGINE=InnoDB;

It looks as if those are executed as a single transaction (a single call to execute_single_transaction()). Is this on purpose? My guess is that your code may not correctly handle event groups that are not bracketed by BEGIN ... END.

Basically, if there is no BEGIN ... END, then the event is an event group by itself. However, the following events form a group with the following event(s) and do not constitute a group by themselves:

    INTVAR_EVENT
    RAND_EVENT
    USER_VAR_EVENT
    TABLE_MAP_EVENT
    ANNOTATE_ROWS_EVENT

----

The logic around event execution failure, retry, and so on (in the normal slave code) is quite tricky, and it seems likely that there will be other issues to deal with :-/. Hopefully the above can get you a bit further.

For the long term, I will try to get hold of Monty and discuss with him how to improve the slave SQL thread code. We already have multiple SQL threads for multi-slave, now your patch has multiple threads for your parallel replication, and we may get even more threads for other features. I am hoping we could make a general refactoring to properly support multiple threads, where all of the event apply and error handling code can be cleaned up and re-used by all the different features. Then your job will become a bit easier.

 - Kristian.
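
As a rough illustration of the two points above, here is a small Python model. It is not MariaDB code and not a proposed patch, only a sketch under stated assumptions. All names in it (split_event_groups, apply_groups, SlaveStopped, PREFIX_EVENTS) are hypothetical, events are represented as plain strings, and the literal tokens "BEGIN" and "END" stand in for whatever brackets a group in the real binlog. The retry count is assumed to mean "retries after the first failure", which may differ from how the server counts.

```python
from typing import Callable, List

# Event types that do not form a group by themselves; they attach to the
# event(s) that follow. The list is the one given in the message above.
PREFIX_EVENTS = {
    "INTVAR_EVENT", "RAND_EVENT", "USER_VAR_EVENT",
    "TABLE_MAP_EVENT", "ANNOTATE_ROWS_EVENT",
}


class SlaveStopped(Exception):
    """Hypothetical: the applier gave up; the message carries the last error."""


def split_event_groups(events: List[str]) -> List[List[str]]:
    """Split a flat event stream into event groups (simplified model).

    "BEGIN" opens a bracketed group that runs until "END". Outside such a
    bracket, every event ends its own group unless it is a prefix event.
    """
    groups: List[List[str]] = []
    current: List[str] = []
    in_bracket = False
    for ev in events:
        current.append(ev)
        if ev == "BEGIN":
            in_bracket = True
        elif ev == "END":
            in_bracket = False
        if not in_bracket and ev not in PREFIX_EVENTS:
            groups.append(current)
            current = []
    if current:
        groups.append(current)  # incomplete trailing group, kept as-is
    return groups


def apply_groups(groups: List[List[str]],
                 execute_group: Callable[[List[str]], None],
                 max_retries: int) -> int:
    """Apply groups in order and return how many were applied.

    A failing group is retried at most max_retries times. After that the
    error is propagated as SlaveStopped, so the same group is never
    submitted again by the caller's loop.
    """
    applied = 0
    for group in groups:
        failures = 0
        while True:
            try:
                execute_group(group)
                break
            except RuntimeError as err:  # stand-in for a temporary error
                failures += 1
                if failures > max_retries:
                    raise SlaveStopped(str(err)) from err
        applied += 1
    return applied
```

In this model the three CREATE TABLE statements come out of split_event_groups() as three separate groups, and a group that keeps failing with a lock wait timeout makes apply_groups() raise SlaveStopped once the retries are used up, rather than looping on the same group.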