Hi! On Fri, Nov 14, 2014 at 3:58 AM, Kristian Nielsen <knielsen@knielsen-hq.org> wrote:
Nirbhay Choubey <nirbhay@mariadb.com> writes:
##### Case 7: Stop slave in the middle of a transaction being filtered and
#             start it back with filtering disabled.
--echo # On master
connection master;
SET @@session.gtid_domain_id=1;
BEGIN;
INSERT INTO t2 VALUES(3);
INSERT INTO t3 VALUES(3);
sync_slave_with_master;
No, this does not work. Transactions are always binlogged as a whole on the master, during COMMIT.
You are right. My original intent was to test a transaction which modifies both MyISAM and InnoDB tables, where the first modification is done on the MyISAM table. In that case, the change to the MyISAM table is sent to the slave right away, while the rest of the transaction is sent on commit. I have modified the test accordingly.
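For reference, a minimal sketch of what such a case could look like (the ENGINE clauses and values here are illustrative assumptions, not necessarily what the actual test uses):

connection master;
CREATE TABLE t2 (a INT) ENGINE=MyISAM;  # non-transactional
CREATE TABLE t3 (a INT) ENGINE=InnoDB;  # transactional
SET @@session.gtid_domain_id=1;
BEGIN;
INSERT INTO t2 VALUES(3);  # MyISAM: binlogged at statement end, reaches the slave right away
INSERT INTO t3 VALUES(3);  # InnoDB: cached, binlogged and sent only at COMMIT
COMMIT;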
I'm still not sure you understand the scenario I had in mind. It's not about what happens on the master during the transaction. It is about what happens in case the slave disconnects in the middle of receiving an event group/transaction.
You are perhaps looking at an older version of the test. The latest says:

<cut>
##### Case 7: Stop slave before a transaction (involving MyISAM and InnoDB
#             table) being filtered commits and start it back with filtering
#             disabled.
...
</cut>
In general in replication, the major part of the work is not implementing the functionality for the normal case - that is usually relatively easy. The major part is handling and testing all the special cases that can occur in special scenarios, especially various error cases. The replication code is really complex in this respect, and the fact that things by their nature happen in parallel between different threads and different servers makes things even more complex.
What I wanted you to think about here is what happens if the slave is disconnected from the master after having received the first half of an event group, for example due to a network error. This will not normally happen in a mysql-test-case run, and if it happens at a production site for a user, it will be extremely hard to track down.
In this case, the second half of the event group could be received much later than the first half. The IO thread could have been stopped (or even the whole mysqld server could have been stopped) in between, and the replication could have been re-configured with CHANGE MASTER. Since the IO thread is doing the filtering, it seems very important to consider what will happen if, for example, filters are enabled while receiving the first half of the transaction, but disabled while receiving the second half:
Suppose we have this transaction:
BEGIN GTID 2-1-100
INSERT INTO t1 VALUES (1);
INSERT INTO t1 VALUES (2);
COMMIT;
What happens in the following scenario?
CHANGE MASTER TO master_use_gtid=current_pos, ignore_domain_ids=(2);
START SLAVE;
# slave IO thread connects to master
# slave receives:
BEGIN GTID 2-1-100;
INSERT INTO t1 VALUES (1);
# slave IO thread is disconnected from master
STOP SLAVE;
# slave mysqld process is stopped and restarted.
CHANGE MASTER TO master_use_gtid=no, ignore_domain_ids=();
START SLAVE;
# slave IO thread connects to master
# slave IO thread receives:
INSERT INTO t1 VALUES (2);
COMMIT;
Are you sure that this will work correctly? And what does "work correctly" mean in this case? Will the transaction be completely ignored? Or will it be completely replicated on the slave? The bug would be if the first half would be ignored, but the second half still written into the relay log.
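One way a test could pin down what "work correctly" means here is to check the slave's tables after the scenario completes (a sketch; t1 and the row values come from the example above):

connection slave;
SELECT * FROM t1;
# Correct outcomes: either no rows (the transaction was completely
# ignored) or both rows 1 and 2 (it was completely replicated).
# Seeing only row 2 would mean the first half was filtered while the
# second half was still written into the relay log - the bug.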
To test this, you would need to use DBUG error insertion. There are already some tests that do this; they use, for example:
SET GLOBAL debug_dbug="+d,binlog_force_reconnect_after_22_events";
The code will then (in debug builds) simulate a disconnect at some particular point in the replication stream, allowing this rare but important case to be tested. This is done using DBUG_EXECUTE_IF() in the code.
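In a test script, the usual shape (sketched here; the save/restore variable is illustrative) is to set the dbug keyword on the slave before starting it and to restore the old value afterwards:

connection slave;
SET @old_debug= @@GLOBAL.debug_dbug;
SET GLOBAL debug_dbug="+d,binlog_force_reconnect_after_22_events";
START SLAVE;
# ... generate events on the master; after 22 events the I/O thread
# simulates losing its connection to the master ...
SET GLOBAL debug_dbug= @old_debug;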
I had already added multiple cases under rpl_domain_id_filter_io_crash.test using DBUG_EXECUTE_IF("kill_io_slave_before_commit", ...) in the previous commit. Even though it is not exactly the scenario you suggest, it does try to kill the I/O thread when it receives a COMMIT/XID event (cases 0-3), in order to test what happens when the I/O thread exits before reading the complete transaction or event group, with filtering enabled before/after slave restart. Following your suggestion, I have now added two more cases (4 and 5) using DBUG_EXECUTE_IF("kill_slave_io_after_2_events", ...) to kill the I/O thread after it has read the first INSERT in a transaction. The outcome is as expected.
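For reference, the rough shape of such a case (a condensed sketch; the real rpl_domain_id_filter_io_crash.test differs in details such as table names, setup and cleanup, and this assumes IGNORE_DOMAIN_IDS=(1) was configured while the slave was stopped):

connection slave;
SET GLOBAL debug_dbug="+d,kill_slave_io_after_2_events";
START SLAVE;
connection master;
SET @@session.gtid_domain_id=1;  # this domain is filtered on the slave
BEGIN;
INSERT INTO t1 VALUES(1);        # the I/O thread is killed after reading this
INSERT INTO t1 VALUES(2);
COMMIT;
connection slave;
--source include/wait_for_slave_io_to_stop.inc
CHANGE MASTER TO IGNORE_DOMAIN_IDS=();  # restart with filtering disabled
START SLAVE;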
To work on replication without introducing nasty bugs, it is important to think through cases like this carefully, and to convince yourself that things will work correctly: disconnects at various points, crashes on the master or slave, errors while applying events or writing to the relay logs, and so on.
I agree.
Hope this helps,
Indeed.

Best,
Nirbhay