Apologies for the cross-post; we submitted this question to the Codership forum a week ago and haven't received a response. https://groups.google.com/forum/#!topic/codership-team/OSnQ1FLloBI

Hello,

Hoping to get some perspective on an issue we saw yesterday. While scaling up from a single-node cluster, we inserted rows into a table on node 0 and selected the row count for that table on each subsequent node (scripts below). For each node brought online, we observed that connections to node 0 were suspended during SST, and that an additional IST followed. At that point the nodes reported that they were synced. However, replication sometimes did not complete correctly: the new nodes reported fewer rows than node 0. Each new node was added by itself, and all other nodes were restarted during each scale-up. Nodes that fell out of sync continued to report an incorrect row count even after a restart, and sometimes drifted further out of sync (i.e. the discrepancy between the row count on node 0 and on the other nodes increased).

We were conducting this experiment to see whether innodb_disallow_writes was functioning correctly, but we aren't sure whether this issue is related. Has anyone seen similar behavior while scaling up?

Insert:

#!/bin/bash
mysql -u$db_user -p$db_password -e "create database if not exists $db_name;"
mysql -u$db_user -p$db_password -e "use $db_name; create table if not exists $table_name (val int);"

for i in `seq $start_val $end_val`; do
  echo "inserting $i"
  mysql -u$db_user -p$db_password -e "use $db_name; insert into $table_name VALUES ($i); select count(*) from $table_name;"
  sleep 0.1
done

Select:

#!/bin/bash

for i in `seq $start_val $end_val`; do
  echo "statement $i"
  mysql -u$db_user -p$db_password -e "use replication_test; select count(*) from vals;"
  sleep 0.1
done
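
In case it is useful, here is a small sketch of how the per-node counts could be compared automatically rather than by eye. The node hostnames in the commented usage are assumptions, and it reuses the same $db_user/$db_password/$db_name/$table_name variables as the scripts above:

```shell
#!/bin/bash
# Sketch: flag replication divergence by comparing row counts across nodes.

# counts_match: return 0 if every count equals the first (reference) count,
# 1 as soon as any count differs.
counts_match() {
  local ref=$1; shift
  local c
  for c in "$@"; do
    [ "$c" = "$ref" ] || return 1
  done
  return 0
}

# Example usage against a live cluster (commented out; node hostnames are
# assumptions for illustration):
# nodes=(node0 node1 node2)
# counts=()
# for host in "${nodes[@]}"; do
#   counts+=("$(mysql -h"$host" -u$db_user -p$db_password -N -B \
#       -e "select count(*) from $db_name.$table_name;")")
# done
# counts_match "${counts[@]}" || echo "divergence detected: ${counts[*]}"
```

The -N -B flags suppress the column header and format the output as a bare value, so the count can be captured directly into a shell variable.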

Thank you,
Shannon

Cloud Foundry Services, Pivotal