Hi Mark,

On Wed, Jul 20, 2016 at 10:38 AM, Mark Wadham <ubuntu@rkw.io> wrote:
Hi,

We have a repeatable failure to initiate IST with MariaDB 10.1.14 after performing a schema upgrade on a single node in RSU mode.  The failure is triggered by a delete query of the form:

delete from <table> where id >= <n>

run on the non-RSU cluster nodes while the RSU node is disconnected from the cluster.  On rejoining, the node determines that it is in sync with the other cluster nodes and no IST is performed, even though rows were deleted on the rest of the cluster in the meantime.  If we then delete those rows manually on the rejoined node, mysqld immediately crashes on the other nodes because they cannot apply the resulting write transaction.

The process we followed is:

1. Set up a 3-node cluster, nodes 0,1,2
2. Enable RSU on node 0:

SET GLOBAL wsrep_OSU_method='RSU';

3. Isolate node 0 from the cluster:

SET GLOBAL wsrep_cluster_address="gcomm://";

4. Perform a backward-compatible schema change, since this is the point of this process.  In our test we added a single column to a table with a default value of null.

As discussed on IRC (#mariadb), you do not really need to take the node off the cluster (step 3).
Just set the session value of wsrep_OSU_method to RSU and perform the schema change.
With RSU mode enabled, the node automatically desyncs itself from the cluster before
executing any DDL, and thus other nodes in the cluster are not impacted.
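
A minimal sketch of that sequence, run in a single session on the node being upgraded. The table name t1 and the column note are hypothetical placeholders, not taken from the original report:

```sql
-- Switch only this session to RSU; other sessions and the global
-- setting are left as they are.
SET SESSION wsrep_OSU_method = 'RSU';

-- Hypothetical backward-compatible change: add a nullable column.
-- Under RSU the DDL is applied on this node only and is not replicated.
ALTER TABLE t1 ADD COLUMN note VARCHAR(64) NULL DEFAULT NULL;

-- Return the session to the default method afterwards.
SET SESSION wsrep_OSU_method = 'TOI';
```

Because RSU does not replicate the DDL, the same statements would then need to be repeated on each remaining node in turn.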

Best,
Nirbhay
 

Additionally, we deleted some rows from a table on nodes 1 and 2, with:

delete from <table> where id >= 100;

which affected around 20 rows.

5. Rejoin the node to the cluster:

SET GLOBAL wsrep_cluster_address="<gcomm string from config file>";

At this point the node immediately rejoins without doing IST and believes it is in sync, yet the rows have been deleted on nodes 1 and 2 but are still present on node 0.
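
One way to confirm the divergence might be to run the same checks on every node and compare the results. This is only a sketch: the table name t1 is a hypothetical placeholder, and the id threshold is the one from the delete above.

```sql
-- Run on each of nodes 0, 1 and 2 and compare the output.

-- What the node thinks its own state is (e.g. Synced, Donor/Desynced).
SHOW GLOBAL STATUS LIKE 'wsrep_local_state_comment';

-- Whether the node is part of the primary component.
SHOW GLOBAL STATUS LIKE 'wsrep_cluster_status';

-- Cluster state UUID and last committed seqno; these are what a
-- joining node's position is compared against.
SHOW GLOBAL STATUS LIKE 'wsrep_cluster_state_uuid';
SHOW GLOBAL STATUS LIKE 'wsrep_last_committed';

-- The rows that were deleted on nodes 1 and 2 (hypothetical table name).
SELECT COUNT(*) FROM t1 WHERE id >= 100;
```

If the state UUID matches on all nodes while the row count differs on node 0, the data has diverged even though the node reports itself as in sync.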

Interestingly, if the delete query is:

delete from <table> where id = <n>;

there is no problem.  We have also not had any issue with syncing INSERT and UPDATE statements.  A combination of INSERT, UPDATE and DELETE ... WHERE id >= <n> resulted in the inserts and updates being synced but the deletes not.  It is as if the quorum somehow doesn't recognise a delete with "where id >=" as an event.

Our next test cases are:

1. Switching node 0 back to TOI mode before rejoining the cluster, although I can't really see how this would make a difference.

2. Upgrading to MariaDB 10.1.16 which was released a couple of days ago.

3. Testing whether regular IST is affected, i.e. IST that should occur normally without switching to RSU mode or dropping a node out of the cluster.


This seems like a pretty basic failure, and I'm concerned that it may also affect regular IST, i.e. a node falling behind the cluster for normal reasons without any involvement of RSU mode.  If the cluster could randomly drop delete statements, the whole system would effectively be useless.

If anyone can shed any light on why this may be happening we would be very grateful!

Thanks,
Mark

_______________________________________________
Mailing list: https://launchpad.net/~maria-discuss
Post to     : maria-discuss@lists.launchpad.net
Unsubscribe : https://launchpad.net/~maria-discuss
More help   : https://help.launchpad.net/ListHelp