Hi!

On Fri, Nov 14, 2014 at 3:58 AM, Kristian Nielsen <knielsen@knielsen-hq.org> wrote:
Nirbhay Choubey <nirbhay@mariadb.com> writes:

>> > ##### Case 7: Stop slave into the middle of a transaction being filtered
>> and
>> > #             start it back with filtering disabled.
>> >
>> > --echo # On master
>> > connection master;
>> > SET @@session.gtid_domain_id=1;
>> > BEGIN;
>> > INSERT INTO t2 VALUES(3);
>> > INSERT INTO t3 VALUES(3);
>> > sync_slave_with_master;
>>
>> No, this does not work. Transactions are always binlogged as a whole on the
>> master, during COMMIT.
>>
>
> You are right. My original intent was to test a transaction which modifies
> both MyISAM and InnoDB tables, where the first modification is done in the
> MyISAM table. In that case the changes to the MyISAM table are sent to the
> slave right away, while the rest of the transaction is sent on commit. I
> have modified the test accordingly.

I'm still not sure you understand the scenario I had in mind. It's not about
what happens on the master during the transaction. It is about what happens in
case the slave disconnects in the middle of receiving an event
group/transaction.

You are perhaps looking at an older version of the test. The latest version says:

<cut>
##### Case 7: Stop slave before a transaction (involving MyISAM and InnoDB
#             table) being filtered commits and start it back with filtering
#             disabled. 
...
</cut>


In general in replication, the major part of the work is not implementing the
functionality for the normal case - that is usually relatively easy. The major
part is handling and testing all the special cases that can occur in special
scenarios, especially various error cases. The replication code is really
complex in this respect, and the fact that things by their nature happen in
parallel between different threads and different servers makes things even
more complex.

What I wanted you to think about here is what happens if the slave is
disconnected from the master after having received the first half of an event
group, for example due to a network error. This will not normally happen in a
mysql-test-case run, and if it happens at a user's production site, it will be
extremely hard to track down.

In this case, the second half of the event group could be received much later
than the first half. The IO thread could have been stopped (or even the whole
mysqld server could have been stopped) in-between, and the replication could
have been re-configured with CHANGE MASTER. Since the IO thread is doing the
filtering, it seems very important to consider what will happen if, e.g.,
filters are enabled while receiving the first half of the transaction, but
disabled while receiving the second half:

Suppose we have this transaction:

  BEGIN GTID 2-1-100
  INSERT INTO t1 VALUES (1);
  INSERT INTO t1 VALUES (2);
  COMMIT;

What happens in the following scenario?

  CHANGE MASTER TO master_use_gtid=current_pos, ignore_domain_ids=(2);
  START SLAVE;
  # slave IO thread connects to master;
  # slave receives: BEGIN GTID 2-1-100; INSERT INTO t1 VALUES (1);
  # slave IO thread is disconnected from master
  STOP SLAVE;
  # slave mysqld process is stopped and restarted.
  CHANGE MASTER TO master_use_gtid=no, ignore_domain_ids=();
  START SLAVE;
  # slave IO thread connects to master;
  # slave IO thread receives: INSERT INTO t1 VALUES (2); COMMIT;

Are you sure that this will work correctly? And what does "work correctly"
mean in this case? Will the transaction be completely ignored? Or will it be
completely replicated on the slave? The bug would be if the first half were
ignored, but the second half still written into the relay log.
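
For concreteness, a check at the end of such a scenario could look like this
(a sketch only, assuming t1 starts out empty):

  connection slave;
  # Correct outcomes: 0 rows (fully ignored) or 2 rows (fully replicated).
  # Exactly 1 row would mean the half-applied bug.
  SELECT COUNT(*) FROM t1;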

To test this, you would need to use DBUG error insertion. There are already
some tests that do this. They use for example

  SET GLOBAL debug_dbug="+d,binlog_force_reconnect_after_22_events";

The code will then (in debug builds) simulate a disconnect at some particular
point in the replication stream, allowing this rare but important case to be
tested. This is done using DBUG_EXECUTE_IF() in the code.
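
For example, a test could do something like the following (only a sketch; it
assumes the standard rpl setup with connections 'master' and 'slave', and
that this particular debug hook lives in the master's binlog dump thread):

  connection master;
  # Debug builds only: simulate a disconnect in the replication stream.
  SET GLOBAL debug_dbug="+d,binlog_force_reconnect_after_22_events";
  BEGIN;
  INSERT INTO t1 VALUES (1);
  INSERT INTO t1 VALUES (2);
  COMMIT;
  sync_slave_with_master;
  # Restore the default debug settings afterwards.
  connection master;
  SET GLOBAL debug_dbug="";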

I had already added multiple cases to rpl_domain_id_filter_io_crash.test in
the previous commit, using the debug keyword "kill_io_slave_before_commit"
(set via debug_dbug and checked with DBUG_EXECUTE_IF() in the code). While it
is not exactly the scenario you suggest, it does try to kill the I/O thread
when it receives the COMMIT/XID event (cases 0 - 3), in order to test what
happens when the I/O thread exits before reading the complete transaction or
event group, with filtering enabled or disabled before/after the slave
restart.

Following your suggestion, I have now added two more cases (4 and 5) using
the debug keyword "kill_slave_io_after_2_events" to kill the I/O thread after
it has read the first INSERT of a transaction. The outcome is as expected.
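
Roughly, each of the new cases looks like the following (a simplified sketch;
the table names, domain id and wait/cleanup steps here are illustrative, and
the actual test also verifies the relay log and table contents):

  connection slave;
  SET @@global.debug_dbug="+d,kill_slave_io_after_2_events";
  CHANGE MASTER TO master_use_gtid=current_pos, ignore_domain_ids=(1);
  START SLAVE;
  connection master;
  SET @@session.gtid_domain_id=1;
  BEGIN;
  # The I/O thread gets killed after relaying the first INSERT below.
  INSERT INTO t2 VALUES(4);
  INSERT INTO t3 VALUES(4);
  COMMIT;
  connection slave;
  --source include/wait_for_slave_io_to_stop.inc
  STOP SLAVE;
  # Restore debug settings and restart with filtering disabled; the
  # transaction must be either fully ignored or fully applied on the
  # slave, never half of each.
  SET @@global.debug_dbug="";
  CHANGE MASTER TO ignore_domain_ids=();
  START SLAVE;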

To work on replication without introducing nasty bugs, it is important to
think through cases like this carefully, and to convince yourself that things
will work correctly. Disconnects at various points, crashes on the master or
slave, errors during applying events or writing to the relay logs, and so on.

I agree.

Hope this helps,

Indeed. 

Best,
Nirbhay