Kristian, On Tue, May 21, 2013 at 2:31 PM, Kristian Nielsen <knielsen@knielsen-hq.org> wrote:
GTIDs it's really easy to solve the problem -- rpl_slave_state has GTID of the last executed transaction, so on restart one needs just to ignore all relay logs and re-connect to master with the last GTID. Did
Yes. That is how it's done basically.
Slave restart should be crash safe with MariaDB GTID (for InnoDB DML - MyISAM DML or DDL in general is not crash safe of course, as neither MyISAM nor DDL is transactional).
It seems to me that what you said above doesn't always work quite like that. It's possible that there's something wrong with us applying GTID patch on top of 10.0.1 and maybe I'm missing something crucial (we didn't apply START SLAVE UNTIL and everything that was after that yet). But from time to time we see failures of rpl.rpl_gtid_crash test with the output like this: CURRENT_TEST: rpl.rpl_gtid_crash mysqltest: In included file "./include/wait_until_disconnected.inc": included from .../mysql-test/suite/rpl/t/rpl_gtid_crash.test at line 244: At line 24: Server failed to dissapear The result from queries just before the failure was: < snip > include/stop_slave.inc SET GLOBAL debug_dbug="+d,crash_commit_before"; START SLAVE; INSERT INTO t1 VALUES (5); include/stop_slave.inc SET GLOBAL debug_dbug="+d,crash_commit_after"; START SLAVE; INSERT INTO t1 VALUES (6); include/stop_slave.inc SET GLOBAL debug_dbug="+d,inject_crash_before_flush_rli"; START SLAVE; INSERT INTO t1 VALUES (7); Timeout in wait_condition.inc for SELECT COUNT(*) = 7 FROM t1 Id User Host db Command Time State Info Progress 2 root localhost:58472 test Query 0 init show full processlist 0.000 4 system user NULL Connect 30 Waiting for master to send event NULL 0.000 include/stop_slave.inc SET GLOBAL debug_dbug="+d,inject_crash_after_flush_rli"; START SLAVE; INSERT INTO t1 VALUES (8); More results from queries before failure can be found in .../mysql-test/var/log/rpl_gtid_crash.log - saving ... ***Warnings generated in error logs during shutdown after running tests: rpl.rpl_gtid_crash 130524 1:12:44 [Warning] Slave: Duplicate entry '7' for key 'PRIMARY' Error_code: 1062 130524 1:13:14 [Warning] Slave: Duplicate entry '7' for key 'PRIMARY' Error_code: 1062 Or another failure with this output: CURRENT_TEST: rpl.rpl_gtid_crash mysqltest: In included file "./include/check_slave_param.inc": included from ./include/check_slave_no_error.inc at line 24: included from ./include/rpl_end.inc at line 74: included from .../mysql-test/suite/rpl/t/rpl_gtid_crash.test at line 268: At line 31: Wrong value for slave parameter The result from queries just before the failure was: < snip > master-bin.000004 532 Gtid 1 574 BEGIN GTID 0-1-6 master-bin.000004 574 Query 1 666 use `test`; INSERT INTO t1 VALUES (5) master-bin.000004 666 Xid 1 697 COMMIT /* xid=35 */ master-bin.000004 697 Gtid 1 739 BEGIN GTID 0-1-7 master-bin.000004 739 Query 1 831 use `test`; INSERT INTO t1 VALUES (6) master-bin.000004 831 Xid 1 862 COMMIT /* xid=52 */ master-bin.000004 862 Gtid 1 904 BEGIN GTID 0-1-8 master-bin.000004 904 Query 1 996 use `test`; INSERT INTO t1 VALUES (7) master-bin.000004 996 Xid 1 1027 COMMIT /* xid=69 */ master-bin.000004 1027 Gtid 1 1069 BEGIN GTID 0-1-9 master-bin.000004 1069 Query 1 1161 use `test`; INSERT INTO t1 VALUES (8) master-bin.000004 1161 Xid 1 1192 COMMIT /* xid=89 */ master-bin.000004 1192 Gtid 1 1234 GTID 0-1-10 master-bin.000004 1234 Query 1 1342 use `test`; DROP TABLE `t1` /* generated by server */ **** SHOW RELAYLOG EVENTS on server_1 **** relaylog_name = 'No such row' SHOW RELAYLOG EVENTS IN 'No such row'; Log_name Pos Event_type Server_id End_log_pos Info Wrong value for Last_SQL_Errno. Expected '0', got '1062' More results from queries before failure can be found in .../mysql-test/var/log/rpl_gtid_crash.log - saving ... ***Warnings generated in error logs during shutdown after running tests: rpl.rpl_gtid_crash 130525 1:07:15 [Warning] Slave: Duplicate entry '8' for key 'PRIMARY' Error_code: 1062 Can you suggest what debugging information should I gather to understand what's happening in failures like this? Or maybe you know off the top of your head what changes were done since 10.0.1 that are required for crash recovery to work properly? Thank you, Pavel