Apologies for the cross-post; we submitted this question to the Codership forum a week ago and haven't received a response. https://groups.google.com/forum/#!topic/codership-team/OSnQ1FLloBI

Hello,

Hoping to get some perspective on an issue we saw yesterday. While scaling up from a single-node cluster, we inserted rows into a table on node 0 and selected the row count for that table on each subsequent node (scripts below). For each node brought online, we observed that connections to node 0 were suspended during SST, and that an additional IST followed. At that point the nodes reported that they were synced. However, replication sometimes did not complete correctly: the new nodes reported fewer rows than node 0. Each new node was added by itself, and all other nodes were restarted during each scale-up. Nodes that fell out of sync continued to report an incorrect row count even after a restart, and sometimes drifted further out of sync (i.e. the discrepancy between the row count on node 0 and on the other nodes increased).

We were conducting this experiment to see whether innodb_disallow_writes was functioning correctly, but we aren't sure whether this issue is related. Has anyone seen similar behavior while scaling up?

Insert:

#!/bin/bash
mysql -u$db_user -p$db_password -e "create database if not exists $db_name;"
mysql -u$db_user -p$db_password -e "use $db_name; create table if not exists $table_name (val int);"

for i in `seq $start_val $end_val`; do
  echo "inserting $i"
  mysql -u$db_user -p$db_password -e "use $db_name; insert into $table_name VALUES ($i); select count(*) from $table_name;"
  sleep 0.1
done

Select:

#!/bin/bash

for i in `seq $start_val $end_val`; do
  echo "statement $i"
  mysql -u$db_user -p$db_password -e "use replication_test; select count(*) from vals;"
  sleep 0.1
done
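
In case it is useful, here is a small sketch of how the per-node counts could be compared automatically rather than by eye. The node hostnames in the commented usage are assumptions, and it reuses the same $db_user/$db_password/$db_name/$table_name variables as the scripts above:

```shell
#!/bin/bash
# Sketch: flag replication divergence by comparing row counts across nodes.

# counts_match: return 0 if every count equals the first (reference) count,
# 1 as soon as any count differs.
counts_match() {
  local ref=$1; shift
  local c
  for c in "$@"; do
    [ "$c" = "$ref" ] || return 1
  done
  return 0
}

# Example usage against a live cluster (commented out; node hostnames are
# assumptions for illustration):
# nodes=(node0 node1 node2)
# counts=()
# for host in "${nodes[@]}"; do
#   counts+=("$(mysql -h"$host" -u$db_user -p$db_password -N -B \
#       -e "select count(*) from $db_name.$table_name;")")
# done
# counts_match "${counts[@]}" || echo "divergence detected: ${counts[*]}"
```

The -N -B flags suppress the column header and format the output as a bare value, so the count can be captured directly into a shell variable.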

Thank you,
Shannon

Cloud Foundry Services, Pivotal