Hi Kristian,
Just FYI, I can confirm the "Lock wait timeout exceeded; try restarting transaction" behaviour you described.
I duplicated and modified rpl_parallel_optimistic.test and ran it as storage/tokudb/mysql-test/tokudb_rpl/t/rpl_parallel_optimistic.test:
./mtr --suite=tokudb_rpl
Logging: ./mtr --suite=tokudb_rpl
vardir: /home/joce/mariadb-10.1.16/mysql-test/var
Checking leftover processes...
Removing old var directory...
Creating var directory '/home/joce/mariadb-10.1.16/mysql-test/var'...
Checking supported features...
MariaDB Version 10.1.16-MariaDB-debug
- SSL connections supported
- binaries are debug compiled
Using suites: tokudb_rpl
Collecting tests...
Installing system database...
==============================================================================
TEST RESULT TIME (ms) or COMMENT
--------------------------------------------------------------------------
worker[1] Using MTR_BUILD_THREAD 300, with reserved ports 16000..16019
worker[1] mysql-test-run: WARNING: running this script as _root_ will cause some tests to be skipped
tokudb_rpl.rpl_parallel_optimistic 'innodb_plugin,mix' [ fail ]
Test ended at 2016-08-08 01:26:34
CURRENT_TEST: tokudb_rpl.rpl_parallel_optimistic
mysqltest: In included file "./include/sync_with_master_gtid.inc":
included from /home/joce/mariadb-10.1.16/storage/tokudb/mysql-test/tokudb_rpl/t/rpl_parallel_optimistic.test at line 59:
At line 50: Failed to sync with master
The result from queries just before the failure was:
< snip >
DELETE FROM t1 WHERE a=2;
INSERT INTO t1 VALUES (2,5);
DELETE FROM t1 WHERE a=3;
INSERT INTO t1 VALUES(3,2);
DELETE FROM t1 WHERE a=1;
INSERT INTO t1 VALUES(1,2);
DELETE FROM t1 WHERE a=3;
INSERT INTO t1 VALUES(3,3);
DELETE FROM t1 WHERE a=2;
INSERT INTO t1 VALUES (2,6);
include/save_master_gtid.inc
SELECT * FROM t1 ORDER BY a;
a b
1 2
2 6
3 3
include/start_slave.inc
include/sync_with_master_gtid.inc
Timeout in master_gtid_wait('0-1-20', 120), current slave GTID position is: 0-1-3.
Slave state : Waiting for master to send event 127.0.0.1 root 16000 1 master-bin.000001 3468 slave-relay-bin.000002 796 master-bin.000001 Yes No 1205 Lock wait timeout exceeded; try restarting transaction 0 772 3790 None 0 No No 0 1205 Lock wait timeout exceeded; try restarting transaction 1 Slave_Pos 0-1-20 optimistic
I've no explanation so far for the DUPLICATE KEY error I've seen.
Jocelyn
On 15/07/2016 at 17:09, Kristian Nielsen wrote:
jocelyn fournier <jocelyn.fournier@gmail.com> writes:
> Thanks for the quick answer! I wonder if it would be possible to
> automatically disable the optimistic parallel replication for an
> engine if it does not implement it?

That would probably be good - though it would be better to just implement
the necessary API, it's a very small change (basically TokuDB just needs to
inform the upper layer of any lock waits that take place inside).
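
To illustrate what reporting a lock wait gives the upper layer, here is a minimal Python sketch. It is a toy model, not TokuDB or MariaDB code: `ParallelApplier`, `report_wait_for` and the sequence numbers are hypothetical names, and it assumes the replication layer reacts to a report by killing the later transaction whenever an earlier one (in commit order) is about to wait for it.

```python
# Toy model of lock-wait reporting in in-order parallel replication.
# All names are hypothetical; this is not the real server or engine API.

class ParallelApplier:
    """Tracks transactions that must commit in increasing seq order."""

    def __init__(self):
        self.killed = set()  # seq numbers chosen as deadlock victims

    def report_wait_for(self, waiter_seq, holder_seq):
        """Called by the engine just before `waiter` blocks on a lock
        held by `holder`. Returns True if the holder was killed."""
        # The holder cannot commit before the waiter (commit order), so if
        # the waiter is the earlier one, each would wait on the other.
        if waiter_seq < holder_seq:
            self.killed.add(holder_seq)  # victim is expected to roll back and retry
            return True
        # A later transaction waiting for an earlier one is harmless.
        return False


applier = ParallelApplier()
resolved = applier.report_wait_for(waiter_seq=1, holder_seq=2)
harmless = applier.report_wait_for(waiter_seq=3, holder_seq=1)
print(resolved, harmless, sorted(applier.killed))
```

In a model like this, an engine that never calls `report_wait_for` leaves the earlier transaction blocked until its own lock wait timeout fires, which would be consistent with a "Lock wait timeout exceeded" error.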
However, looking more at your description, you got a "key not found"
error. Not implementing the thd_report_wait_for() could lead to deadlocks,
but it shouldn't cause key not found. In fact, in optimistic mode, all
errors are treated as "deadlock" errors, the query is rolled back, and
run again, this time not in parallel.
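
The retry behaviour described above can be sketched roughly as follows. This is a hedged illustration only: `apply_optimistically`, `ConflictError` and the `speculative` flag are hypothetical names invented for the example, and the real server logic is considerably more involved.

```python
# Toy model of optimistic apply-and-retry. Hypothetical names throughout.

class ConflictError(Exception):
    """Stands in for any error raised during the optimistic attempt."""


def apply_optimistically(txn, wait_for_prior_commits):
    """Try txn speculatively; on any error, rerun it once all earlier
    transactions have committed."""
    try:
        return txn(speculative=True)
    except Exception:
        # Every error is treated like a deadlock. A real server would
        # roll back the partial work here before waiting.
        wait_for_prior_commits()
        # Second attempt, no longer in parallel with earlier transactions.
        return txn(speculative=False)


log = []


def txn(speculative):
    log.append("parallel" if speculative else "serial")
    if speculative:
        raise ConflictError("row changed by an earlier transaction")
    return "committed"


result = apply_optimistically(txn, lambda: log.append("waited"))
print(result, log)
```

Under this model an error in the speculative attempt is absorbed by the retry, so only an error from the second, serial attempt would be expected to surface to the user.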
So I'm wondering if there is something else going on. If transactions T1 and
T2 run in parallel, it's possible that they have a row conflict. But if T2
deleted a row expected by T1, I would expect T1 to wait on a row lock held
by T2, not get a duplicate key error. And if T1 has not yet inserted a row
expected by T2, then T2 would be rolled back and retried after T1 has
committed. The first can cause a deadlock, but neither case seems to cause
a duplicate key error.
Maybe TokuDB is doing something special with locks around replication, or
something else goes wrong. I guess TokuDB just hasn't been tested much with
parallel replication.
Does it work ok when running in conservative parallel mode?
- Kristian.