Hello All,
I have been running sysbench OLTP with a MariaDB 10.1 master-slave topology. I have not seen any replication errors when slave_parallel_mode is conservative. However, when I set slave_parallel_mode to optimistic and slave_parallel_threads = 2, I get a lock timeout replication error with TokuDB.
Just before the lock timeout error fires (which requires a TokuDB lock timeout to occur), I see one of the replication threads waiting for a lock held by the other replication thread. gdb shows the first thread waiting on a lock inside TokuDB. The other thread is stalled while committing its transaction, in wait_for_prior_commit_2 <- wait_for_prior_commit <- THD::wait_for_prior_commit <- TC_LOG_MMAP::log_and_order <- ha_commit_trans.
Is TokuDB supposed to call the thd_report_wait_for API just before a thread is about to wait on a TokuDB lock?
On Sun, Aug 7, 2016 at 7:50 PM, jocelyn fournier <jocelyn.fournier@gmail.com> wrote:
Hi Kristian,
Just FYI I confirm the "Lock wait timeout exceeded; try restarting transaction" behaviour you described.
I duplicated and modified rpl_parallel_optimistic.test and ran it as storage/tokudb/mysql-test/tokudb_rpl/t/rpl_parallel_optimistic.test:
./mtr --suite=tokudb_rpl <1:33:48
Logging: ./mtr --suite=tokudb_rpl
vardir: /home/joce/mariadb-10.1.16/mysql-test/var
Checking leftover processes...
Removing old var directory...
Creating var directory '/home/joce/mariadb-10.1.16/mysql-test/var'...
Checking supported features...
MariaDB Version 10.1.16-MariaDB-debug
 - SSL connections supported
 - binaries are debug compiled
Using suites: tokudb_rpl
Collecting tests...
Installing system database...
==============================================================================
TEST RESULT TIME (ms) or COMMENT
--------------------------------------------------------------------------
worker[1] Using MTR_BUILD_THREAD 300, with reserved ports 16000..16019
worker[1] mysql-test-run: WARNING: running this script as _root_ will cause some tests to be skipped
tokudb_rpl.rpl_parallel_optimistic 'innodb_plugin,mix' [ fail ]
Test ended at 2016-08-08 01:26:34
CURRENT_TEST: tokudb_rpl.rpl_parallel_optimistic
mysqltest: In included file "./include/sync_with_master_gtid.inc":
included from /home/joce/mariadb-10.1.16/storage/tokudb/mysql-test/tokudb_rpl/t/rpl_parallel_optimistic.test at line 59:
At line 50: Failed to sync with master
The result from queries just before the failure was:
< snip >
DELETE FROM t1 WHERE a=2;
INSERT INTO t1 VALUES (2,5);
DELETE FROM t1 WHERE a=3;
INSERT INTO t1 VALUES(3,2);
DELETE FROM t1 WHERE a=1;
INSERT INTO t1 VALUES(1,2);
DELETE FROM t1 WHERE a=3;
INSERT INTO t1 VALUES(3,3);
DELETE FROM t1 WHERE a=2;
INSERT INTO t1 VALUES (2,6);
include/save_master_gtid.inc
SELECT * FROM t1 ORDER BY a;
a b
1 2
2 6
3 3
include/start_slave.inc
include/sync_with_master_gtid.inc
Timeout in master_gtid_wait('0-1-20', 120), current slave GTID position is: 0-1-3.
Slave state : Waiting for master to send event 127.0.0.1 root 16000 1 master-bin.000001 3468 slave-relay-bin.000002 796 master-bin.000001 Yes No 1205 Lock wait timeout exceeded; try restarting transaction 0 772 3790 None 0 No No 0 1205 Lock wait timeout exceeded; try restarting transaction 1 Slave_Pos 0-1-20 optimistic
I've no explanation so far for the DUPLICATE KEY error I've seen.
Jocelyn
On 15/07/2016 at 17:09, Kristian Nielsen wrote:
jocelyn fournier <jocelyn.fournier@gmail.com> writes:
Thanks for the quick answer! I wonder if it would be possible to automatically disable optimistic parallel replication for an engine if it does not implement it?
That would probably be good - though it would be better to just implement the necessary API, it's a very small change (basically TokuDB just needs to inform the upper layer of any lock waits that take place inside).
However, looking more at your description, you got a "key not found" error. Not implementing the thd_report_wait_for() could lead to deadlocks, but it shouldn't cause key not found. In fact, in optimistic mode, all errors are treated as "deadlock" errors, the query is rolled back, and run again, this time not in parallel.
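To make the rollback-and-retry behaviour concrete, here is a toy sketch in Python. It is not MariaDB code: the dict-based table and the names `apply` and `replicate_pair` are made up for illustration, and the model assumes just two transactions, T1 and T2, committed in that order on the master. T2 is first tried speculatively against the state before T1 has committed; any error is treated like a deadlock, and T2 is then applied again after T1.

```python
def apply(table, txn):
    """Return a new table with txn applied; raise KeyError on a missing or duplicate key."""
    work = dict(table)  # work on a copy, so a failed transaction leaves `table` unchanged
    for op, key, value in txn:
        if op == "insert":
            if key in work:
                raise KeyError(f"duplicate key {key}")
            work[key] = value
        else:  # "delete"
            if key not in work:
                raise KeyError(f"key {key} not found")
            del work[key]
    return work


def replicate_pair(table, t1, t2):
    """Apply T1 then T2 in commit order, trying T2 speculatively first.

    Returns (final_table, retried). Simplification of this toy model: the
    speculative result is always discarded and T2 is re-applied after T1;
    a real slave would keep the speculative work when no conflict occurred.
    """
    retried = False
    try:
        apply(table, t2)      # speculative run of T2 before T1 has committed
    except KeyError:
        retried = True        # any error counts as a "deadlock": T2 is rolled back
    table = apply(table, t1)  # T1 commits first, as on the master
    table = apply(table, t2)  # T2 runs again, this time after T1
    return table, retried
```

With T1 = `[("delete", 2, None)]` and T2 = `[("insert", 2, 5)]` on a table that already contains key 2, the speculative run of T2 fails in this model, and the retry after T1 succeeds.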
So I'm wondering if there is something else going on. If transactions T1 and T2 run in parallel, it's possible that they have a row conflict. But if T2 deleted a row expected by T1, I would expect T1 to wait on a row lock held by T2, not get a duplicate key error. And if T1 has not yet inserted a row expected by T2, then T2 would be rolled back and retried after T1 has committed. The first can cause deadlock, but neither case seems likely to cause a duplicate key error.
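The deadlock in the first case can be sketched as a small wait-for check. This is a toy Python model, not the server's implementation: `find_victim` and its arguments are hypothetical names, and the model only assumes that a later transaction cannot commit before an earlier one, so an earlier transaction blocked on a later one can never make progress unless the later one is rolled back.

```python
def find_victim(commit_order, reported_waits):
    """Pick a transaction to roll back, or None if no conflict is visible.

    commit_order:   transaction names in the order they committed on the master.
    reported_waits: {waiter: holder} lock waits that the storage engine has
                    reported to the upper layer (empty if it reports nothing).
    """
    position = {txn: i for i, txn in enumerate(commit_order)}
    for waiter, holder in reported_waits.items():
        # The holder already waits for every earlier transaction to commit.
        # If an earlier transaction is blocked on it, the two wait on each other.
        if position[waiter] < position[holder]:
            return holder  # roll back the later transaction and retry it
    return None
```

When the engine reports that T1 waits on T2, `find_victim(["T1", "T2"], {"T1": "T2"})` selects T2. When nothing is reported, the result is `None`, and in this model the only way out is for the engine's own lock wait timeout to expire.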
Maybe TokuDB is doing something special with locks around replication, or something else goes wrong. I guess TokuDB just hasn't been tested much with parallel replication.
Does it work ok when running in conservative parallel mode?
- Kristian.