Re: [Maria-developers] Problem with parallel replication in 10.2

1 Sep 2016

Michael Widenius <michael.widenius@gmail.com> writes:
...
I was trying to run a test that fails in the upcoming bb-10.2-jan on
the normal 10.2 tree, when I noticed this strange issue:
- Test fails with timeout when running with --debug
- When looking at the trace file, I notice that we get a duplicate key
error for the table gtid_slave_pos (MyISAM table).  Is this something
normal?
Like this:

2016-09-01 10:33:20 140078976283392 [ERROR] Slave SQL: Error during XID COMMIT: failed to update GTID state in mysql.gtid_slave_pos: 1062: Duplicate entry '0-53' for key 'PRIMARY', Gtid 0-1-52, Internal MariaDB error code: 1942

This happens because the mysql.gtid_slave_pos table is MyISAM (which is
the default in mysql-test-run, but not in a normal server install), and
parallel replication needs to roll back a transaction after it has updated
the table. Because the table is MyISAM, the gtid_slave_pos change cannot be
rolled back.

Maybe parallel replication could in this case manually undo its change in
the table as part of the rollback. It's just a DELETE of the row previously
inserted.
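
A minimal sketch of that idea as a toy model, not server code. The
GtidPosTable type and its methods are hypothetical; the key is assumed to be
(domain_id, sub_id), reading the '0-53' in the error above that way, and the
values 0, 53 and 52 are taken from the same error message.

```cpp
#include <cstdint>
#include <map>
#include <utility>

// Hypothetical stand-in for a non-transactional mysql.gtid_slave_pos:
// rows are keyed by (domain_id, sub_id) and survive a "rollback".
struct GtidPosTable {
    std::map<std::pair<uint32_t, uint64_t>, uint64_t> rows;  // key -> seq_no

    // Returns false on a duplicate key, mimicking error 1062.
    bool insert(uint32_t domain_id, uint64_t sub_id, uint64_t seq_no) {
        return rows.emplace(std::make_pair(domain_id, sub_id), seq_no).second;
    }

    // Compensating action: delete the row left behind by the failed attempt.
    void undo_insert(uint32_t domain_id, uint64_t sub_id) {
        rows.erase(std::make_pair(domain_id, sub_id));
    }
};

// The first attempt inserts its position row and is then rolled back for a
// retry. With manual_undo, the rollback also removes the row by hand.
bool retry_succeeds(bool manual_undo) {
    GtidPosTable table;
    table.insert(0, 53, 52);         // first attempt updates the table
    if (manual_undo) {
        table.undo_insert(0, 53);    // rollback compensates with a delete
    }
    return table.insert(0, 53, 52);  // retry inserts the same key again
}
```

Without the compensating delete the retry hits the duplicate key; with it
the retry goes through.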

In any case, currently the fix is to use InnoDB for the table:

--- rpl_skr.test~	2016-09-01 10:27:21.214633498 +0200
+++ rpl_skr.test	2016-09-01 10:35:50.660242337 +0200
@@ -8,6 +8,9 @@
 --connection server_2
 SET @old_parallel_threads=@@GLOBAL.slave_parallel_threads;
 --source include/stop_slave.inc
+SET sql_log_bin=0;
+ALTER TABLE mysql.gtid_slave_pos ENGINE=InnoDB;
+SET sql_log_bin=1;
 SET GLOBAL slave_parallel_threads=10;
 SET GLOBAL slave_parallel_mode='conservative';
 --source include/start_slave.inc
...
bb-10.2-jan tree is a working tree for a merge of MariaDB 10.2 and MySQL 5.7.
When running rpl_skr in 10.2 it takes 2 seconds.
When running it in the bb-10.2-jan tree it takes either a long time
or we get a timeout.

This is because of an erroneous merge. The original code:

    if (waitee_buf_ptr) {
            lock_report_waiters_to_mysql(waitee_buf_ptr,
                                         start_mysql_thd,
                                         victim_trx_id);

The bb-10.2-jan code:

    if (victim_trx && waitee_buf_ptr) {
            lock_report_waiters_to_mysql(waitee_buf_ptr,
                                         start_mysql_thd,
                                         victim_trx->id);

So if victim_trx is NULL the waits are not reported to parallel replication
at all, causing the stalls and/or hangs. victim_trx is NULL unless InnoDB
itself detects a deadlock.
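
To make the difference concrete, here is a small self-contained model of the
two conditions. Trx, WaiteeBuf and both functions are hypothetical stand-ins,
not the InnoDB types; each function just returns how many waits would be
handed on to parallel replication.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-ins; not the real InnoDB structures.
struct Trx { uint64_t id; };
struct WaiteeBuf { std::vector<uint64_t> waitees; };

// Original condition: report whenever a buffer of waitees exists,
// whether or not a deadlock victim was chosen.
std::size_t report_original(const Trx* victim_trx, const WaiteeBuf* waitee_buf) {
    (void)victim_trx;  // the victim does not gate the report
    if (waitee_buf) {
        return waitee_buf->waitees.size();
    }
    return 0;
}

// Merged condition: additionally requires a non-NULL victim, so ordinary
// lock waits (no deadlock found) are never reported.
std::size_t report_merged(const Trx* victim_trx, const WaiteeBuf* waitee_buf) {
    if (victim_trx && waitee_buf) {
        return waitee_buf->waitees.size();
    }
    return 0;
}
```

With two recorded waits and no victim, the original form reports both while
the merged form reports none, which is the case parallel replication depends
on.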

I've attached a patch that fixes this; it can also be pulled from here:

  https://github.com/knielsen/server/commits/montyrpl

Or should I push it directly into bb-10.2-jan? With the patch, rpl_skr.test
completes correctly in < 1 second.
...
This is probably because of the new lock code in lock0lock.cc and
lock0wait.cc which doesn't break conflicting transactions but instead
waits for a timeout.

The merge appears very rough. Shouldn't the waitee_buf be integrated into
the new DeadlockChecker class? Why is it necessary to call
thd_report_wait_for() on internal transactions like here?

    /* m_trx->mysql_thd is NULL if it's an internal trx. So current_thd is used */
    if (err == DB_LOCK_WAIT) {
            ut_ad(wait_for && wait_for->trx);
            wait_for->trx->abort_type = TRX_REPLICATION_ABORT;
            thd_report_wait_for(current_thd, wait_for->trx->mysql_thd);
            wait_for->trx->abort_type = TRX_SERVER_ABORT;
    }
    return(err);

Maybe I should try to write a better patch for integrating this in the new
InnoDB code.

What do you think about changing this to use the async deadlock kill in a
background thread, as discussed in this thread?

  https://lists.launchpad.net/maria-developers/msg09902.html

This would allow simplifying the code in lock0lock.cc and avoiding the
locking hacks in innobase_kill_query().

 - Kristian.

    

Kristian Nielsen