Hello All,
I have been running sysbench OLTP with a MariaDB 10.1 master-slave topology.  I have not seen any replication errors when slave parallel mode is conservative.

However, when I configure slave parallel mode to optimistic and slave parallel threads = 2, I get a lock timeout replication error with TokuDB.  Just before the lock timeout error fires (which requires a TokuDB lock timeout to occur), I see one of the replication threads waiting for a lock held by the other replication thread.  gdb shows the first thread waiting on a lock inside TokuDB.  The other thread is stalled while committing its transaction, in wait_for_prior_commit_2 <- wait_for_prior_commit <- THD::wait_for_prior_commit <- TC_LOG_MMAP::log_and_order <- ha_commit_trans.

Is TokuDB supposed to call the thd_report_wait_for() API just before a thread is about to wait on a TokuDB lock?
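
A minimal sketch of the call pattern being asked about, assuming the engine knows which thread owns the lock at the moment it is about to block. Everything here is a stand-in for illustration: FakeThd, FakeRowLock, request_lock and stub_thd_report_wait_for are hypothetical names, not TokuDB or MariaDB code. The stub only records the (waiter, holder) pair where the real server hook would be called.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Stand-in for the server's THD; hypothetical, for illustration only.
struct FakeThd {
    std::string name;
};

// Records (waiter, holder) pairs so the sketch can be inspected.
static std::vector<std::pair<const FakeThd*, const FakeThd*>> reported_waits;

// Stub standing in for the server-side thd_report_wait_for() hook.
// Assumption: the real hook takes the waiting and the holding thread.
void stub_thd_report_wait_for(const FakeThd* waiter, const FakeThd* holder) {
    reported_waits.emplace_back(waiter, holder);
}

// Hypothetical engine-side row lock with at most one owner at a time.
struct FakeRowLock {
    const FakeThd* owner = nullptr;
};

// Hypothetical engine lock request. Returns true if the lock was granted,
// false if the caller would have to block. Just before blocking, it tells
// the upper layer who is waiting on whom.
bool request_lock(FakeRowLock& lock, const FakeThd* requester) {
    assert(requester != nullptr);
    if (lock.owner == nullptr || lock.owner == requester) {
        lock.owner = requester;
        return true;
    }
    stub_thd_report_wait_for(requester, lock.owner);
    return false;  // a real engine would now sleep until release or timeout
}
```

The point of the sketch is only the ordering: the report happens before the thread goes to sleep, so the upper layer has a chance to react before the engine's own lock timeout fires.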



On Sun, Aug 7, 2016 at 7:50 PM, jocelyn fournier <jocelyn.fournier@gmail.com> wrote:
Hi Kristian,


Just FYI I confirm the "Lock wait timeout exceeded; try restarting transaction" behaviour you described.

I've duplicated & modified rpl_parallel_optimistic.test and ran it as storage/tokudb/mysql-test/tokudb_rpl/t/rpl_parallel_optimistic.test :

./mtr --suite=tokudb_rpl
Logging: ./mtr  --suite=tokudb_rpl
vardir: /home/joce/mariadb-10.1.16/mysql-test/var
Checking leftover processes...
Removing old var directory...
Creating var directory '/home/joce/mariadb-10.1.16/mysql-test/var'...
Checking supported features...
MariaDB Version 10.1.16-MariaDB-debug
 - SSL connections supported
 - binaries are debug compiled
Using suites: tokudb_rpl
Collecting tests...
Installing system database...
==============================================================================

TEST                                      RESULT   TIME (ms) or COMMENT
--------------------------------------------------------------------------

worker[1] Using MTR_BUILD_THREAD 300, with reserved ports 16000..16019
worker[1] mysql-test-run: WARNING: running this script as _root_ will cause some tests to be skipped
tokudb_rpl.rpl_parallel_optimistic 'innodb_plugin,mix' [ fail ]
        Test ended at 2016-08-08 01:26:34

CURRENT_TEST: tokudb_rpl.rpl_parallel_optimistic
mysqltest: In included file "./include/sync_with_master_gtid.inc":
included from /home/joce/mariadb-10.1.16/storage/tokudb/mysql-test/tokudb_rpl/t/rpl_parallel_optimistic.test at line 59:
At line 50: Failed to sync with master

The result from queries just before the failure was:
< snip >
DELETE FROM t1 WHERE a=2;
INSERT INTO t1 VALUES (2,5);
DELETE FROM t1 WHERE a=3;
INSERT INTO t1 VALUES(3,2);
DELETE FROM t1 WHERE a=1;
INSERT INTO t1 VALUES(1,2);
DELETE FROM t1 WHERE a=3;
INSERT INTO t1 VALUES(3,3);
DELETE FROM t1 WHERE a=2;
INSERT INTO t1 VALUES (2,6);
include/save_master_gtid.inc
SELECT * FROM t1 ORDER BY a;
a    b
1    2
2    6
3    3
include/start_slave.inc
include/sync_with_master_gtid.inc
Timeout in master_gtid_wait('0-1-20', 120), current slave GTID position is: 0-1-3.
Slave state : Waiting for master to send event    127.0.0.1 root    16000    1    master-bin.000001    3468 slave-relay-bin.000002    796    master-bin.000001    Yes    No                         1205    Lock wait timeout exceeded; try restarting transaction    0    772    3790    None        0 No                            No    0        1205    Lock wait timeout exceeded; try restarting transaction        1 Slave_Pos    0-1-20            optimistic


I've no explanation so far for the DUPLICATE KEY error I've seen.


  Jocelyn


On 15/07/2016 at 17:09, Kristian Nielsen wrote:
jocelyn fournier <jocelyn.fournier@gmail.com> writes:

Thanks for the quick answer! I wonder if it would be possible to
automatically disable the optimistic parallel replication for an
engine if it does not implement it?

That would probably be good - though it would be better to just implement
the necessary API, it's a very small change (basically TokuDB just needs to
inform the upper layer of any lock waits that take place inside).

However, looking more at your description, you got a "key not found"
error. Not implementing the thd_report_wait_for() could lead to deadlocks,
but it shouldn't cause key not found. In fact, in optimistic mode, all
errors are treated as "deadlock" errors, the query is rolled back, and
run again, this time not in parallel.

So I'm wondering if there is something else going on. If transactions T1 and
T2 run in parallel, it's possible that they have a row conflict. But if T2
deleted a row expected by T1, I would expect T1 to wait on a row lock held
by T2, not get a duplicate key error. And if T1 has not yet inserted a row
expected by T2, then T2 would be rolled back and retried after T1 has
committed. The first can cause deadlock, but neither case seems to cause
duplicate error.
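
A toy sketch of the deadlock case described above, under the assumption that the upper layer reacts to a reported wait by comparing commit order: if the waiter (T1) must commit before the lock holder (T2), the holder is marked for rollback and retry, since T2 would otherwise sit waiting for T1's commit while T1 waits for T2's lock. ReplTrx and resolve_reported_wait are hypothetical names for illustration, not the server's actual code.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a replicated transaction on the slave.
struct ReplTrx {
    std::uint64_t commit_seq;  // position in the master's commit order
    bool killed_for_retry;     // set when the transaction must be rolled back
};

// Sketch of the decision taken when the engine reports that `waiter`
// is blocked on a lock held by `holder`. Returns true if `holder` was
// marked for rollback and retry.
bool resolve_reported_wait(const ReplTrx& waiter, ReplTrx& holder) {
    assert(&waiter != &holder);
    if (waiter.commit_seq < holder.commit_seq) {
        // The holder cannot commit before the waiter, and the waiter
        // cannot proceed without the lock: break the cycle here.
        holder.killed_for_retry = true;
        return true;
    }
    // The holder commits first anyway, so an ordinary lock wait is fine.
    return false;
}
```

If the engine never reports the wait, nothing triggers this decision, and the only way out of the cycle is the engine's own lock wait timeout.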

Maybe TokuDB is doing something special with locks around replication, or
something else goes wrong. I guess TokuDB just hasn't been tested much with
parallel replication.

Does it work ok when running in conservative parallel mode?

  - Kristian.