Hi! On Fri, Nov 14, 2014 at 3:58 AM, Kristian Nielsen <knielsen@knielsen-hq.org> wrote:
Nirbhay Choubey <nirbhay@mariadb.com> writes:
##### Case 7: Stop slave in the middle of a transaction being filtered and
#             start it back with filtering disabled.
--echo # On master
connection master;
SET @@session.gtid_domain_id=1;
BEGIN;
INSERT INTO t2 VALUES(3);
INSERT INTO t3 VALUES(3);
sync_slave_with_master;
No, this does not work. Transactions are always binlogged as a whole on the master, during COMMIT.
You are right. My original intent was to test a transaction which modifies both MyISAM and InnoDB tables, where the first modification is done on the MyISAM table. In that case, the change to the MyISAM table is sent to the slave right away, while the rest of the transaction is sent on commit. I have modified the test accordingly.
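For reference, a minimal sketch of what such a case could look like (the ENGINE clauses and values here are illustrative assumptions, not necessarily what the actual test uses):

connection master;
CREATE TABLE t2 (a INT) ENGINE=MyISAM;  # non-transactional
CREATE TABLE t3 (a INT) ENGINE=InnoDB;  # transactional
SET @@session.gtid_domain_id=1;
BEGIN;
INSERT INTO t2 VALUES(3);  # MyISAM: binlogged at statement end, reaches the slave right away
INSERT INTO t3 VALUES(3);  # InnoDB: cached, binlogged and sent only at COMMIT
COMMIT;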
I'm still not sure you understand the scenario I had in mind. It's not about what happens on the master during the transaction. It is about what happens in case the slave disconnects in the middle of receiving an event group/transaction.
You are perhaps looking at an older version of the test. The latest says:

<cut>
##### Case 7: Stop slave before a transaction (involving MyISAM and InnoDB
#             table) being filtered commits and start it back with filtering
#             disabled.
...
</cut>
In general in replication, the major part of the work is not implementing the functionality for the normal case - that is usually relatively easy. The major part is handling and testing all the special cases that can occur in special scenarios, especially various error cases. The replication code is really complex in this respect, and the fact that things by their nature happen in parallel between different threads and different servers makes things even more complex.
What I wanted you to think about here is what happens if the slave is disconnected from the master after having received the first half of an event group, for example due to a network error. This will not normally happen in a mysql-test-case run, and if it happens at a production site for a user, it will be extremely hard to track down.
In this case, the second half of the event group could be received much later than the first half. The IO thread could have been stopped (or even the whole mysqld server could have been stopped) in between, and the replication could have been re-configured with CHANGE MASTER. Since the IO thread is doing the filtering, it seems very important to consider what will happen if, for example, filters are enabled while receiving the first half of the transaction, but disabled while receiving the second half:
Suppose we have this transaction:
BEGIN GTID 2-1-100
INSERT INTO t1 VALUES (1);
INSERT INTO t1 VALUES (2);
COMMIT;
What happens in the following scenario?
CHANGE MASTER TO master_use_gtid=current_pos, ignore_domain_ids=(2);
START SLAVE;
# slave IO thread connects to master
# slave receives:
BEGIN GTID 2-1-100;
INSERT INTO t1 VALUES (1);
# slave IO thread is disconnected from master
STOP SLAVE;
# slave mysqld process is stopped and restarted.
CHANGE MASTER TO master_use_gtid=no, ignore_domain_ids=();
START SLAVE;
# slave IO thread connects to master
# slave IO thread receives:
INSERT INTO t1 VALUES (2);
COMMIT;
Are you sure that this will work correctly? And what does "work correctly" mean in this case? Will the transaction be completely ignored? Or will it be completely replicated on the slave? The bug would be if the first half would be ignored, but the second half still written into the relay log.
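One way a test could pin down what "work correctly" means here is to check the slave's tables after the scenario completes (a sketch; t1 and the row values come from the example above):

connection slave;
SELECT * FROM t1;
# Correct outcomes: either no rows (the transaction was completely
# ignored) or both rows 1 and 2 (it was completely replicated).
# Seeing only row 2 would mean the first half was filtered while the
# second half was still written into the relay log - the bug.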
To test this, you would need to use DBUG error insertion. There are already some tests that do this; they use, for example:
SET GLOBAL debug_dbug="+d,binlog_force_reconnect_after_22_events";
The code will then (in debug builds) simulate a disconnect at some particular point in the replication stream, allowing this rare but important case to be tested. This is done using DBUG_EXECUTE_IF() in the code.
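In a test script, the usual shape (sketched here; the save/restore variable is illustrative) is to set the dbug keyword on the slave before starting it and to restore the old value afterwards:

connection slave;
SET @old_debug= @@GLOBAL.debug_dbug;
SET GLOBAL debug_dbug="+d,binlog_force_reconnect_after_22_events";
START SLAVE;
# ... generate events on the master; after 22 events the I/O thread
# simulates losing its connection to the master ...
SET GLOBAL debug_dbug= @old_debug;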
I had already added multiple cases under rpl_domain_id_filter_io_crash.test using DBUG_EXECUTE_IF("kill_io_slave_before_commit", ...) in the previous commit. Even though it is not exactly the scenario you suggest, it does try to kill the I/O thread when it receives a COMMIT/XID event (cases 0-3), in order to test what happens when the I/O thread exits before reading the complete transaction or event group, with filtering enabled before/after slave restart. Following your suggestion, I have now added two more cases (4 and 5) using DBUG_EXECUTE_IF("kill_slave_io_after_2_events", ...) to kill the I/O thread after it has read the first INSERT in a transaction. The outcome is as expected.
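For reference, the rough shape of such a case (a condensed sketch; the real rpl_domain_id_filter_io_crash.test differs in details such as table names, setup and cleanup, and this assumes IGNORE_DOMAIN_IDS=(1) was configured while the slave was stopped):

connection slave;
SET GLOBAL debug_dbug="+d,kill_slave_io_after_2_events";
START SLAVE;
connection master;
SET @@session.gtid_domain_id=1;  # this domain is filtered on the slave
BEGIN;
INSERT INTO t1 VALUES(1);        # the I/O thread is killed after reading this
INSERT INTO t1 VALUES(2);
COMMIT;
connection slave;
--source include/wait_for_slave_io_to_stop.inc
CHANGE MASTER TO IGNORE_DOMAIN_IDS=();  # restart with filtering disabled
START SLAVE;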
To work on replication without introducing nasty bugs, it is important to think through cases like this carefully, and to convince yourself that things will work correctly: disconnects at various points, crashes on the master or slave, errors while applying events or writing to the relay logs, and so on.
I agree.
Hope this helps,
Indeed.

Best,
Nirbhay