Hello,
Hoping to get some perspective on an issue we saw yesterday. While scaling up from a single-node cluster, we inserted into a table on node 0 and selected the row count for that table on all subsequent nodes (scripts below). For each node that was brought online, we observed that connections to node 0 were suspended during SST, and that there was an additional IST. At this point the nodes reported that they were synced.

However, sometimes replication did not occur correctly: the additional nodes reported fewer rows than node 0. Each new node was added one at a time, and all other nodes were restarted during each scale-up. Nodes that became out of sync continued to report an incorrect row count even after a restart, and sometimes drifted further out of sync (i.e., the discrepancy between the row count on node 0 and on the other nodes increased).
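For context, "synced" here is going by the standard Galera wsrep status variables; a minimal per-node check along the lines of the sketch below (node address and credentials are placeholders) is what we have in mind. Once writes stop, wsrep_last_committed should converge to the same value on every node if they are genuinely in sync.

#!/bin/bash
# Sketch: report one node's Galera sync state and last-committed transaction position.
# The node address and credentials are placeholders.
node_host="$1"
mysql -h"$node_host" -u"$db_user" -p"$db_password" -e \
  "show global status where Variable_name in
   ('wsrep_local_state_comment', 'wsrep_cluster_size', 'wsrep_last_committed');"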
We were conducting this experiment to see whether innodb_disallow_writes was functioning correctly, but we aren't sure whether this issue is related. Has anyone seen similar behavior while scaling up?
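For completeness, what we have in mind for the innodb_disallow_writes side of the experiment is simply polling that variable alongside the node's wsrep state while a scale-up is in progress; a rough sketch (assuming a Galera-enabled build, e.g. Percona XtraDB Cluster, that exposes this global) would be:

#!/bin/bash
# Sketch: poll innodb_disallow_writes and the node's wsrep state once a second.
# Assumes a Galera-enabled build that exposes the innodb_disallow_writes global.
while true; do
  mysql -u"$db_user" -p"$db_password" -e \
    "show global variables like 'innodb_disallow_writes';
     show global status like 'wsrep_local_state_comment';"
  sleep 1
done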
Insert:
#!/bin/bash
# Expects db_user, db_password, db_name, table_name, start_val and end_val
# to be set in the environment.
mysql -u"$db_user" -p"$db_password" -e "create database if not exists $db_name;"
mysql -u"$db_user" -p"$db_password" -e "use $db_name; create table if not exists $table_name (val int);"
for i in $(seq "$start_val" "$end_val"); do
  echo "inserting $i"
  mysql -u"$db_user" -p"$db_password" -e "use $db_name; insert into $table_name VALUES ($i); select count(*) from $table_name;"
  sleep 0.1
done
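We drive this with environment variables, roughly as follows (the script name and values here are just examples):

db_user=root db_password=secret db_name=replication_test table_name=vals \
  start_val=1 end_val=500 ./insert.sh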
Select:
#!/bin/bash
# Repeatedly count the rows on this node so the result can be compared against node 0.
for i in $(seq "$start_val" "$end_val"); do
  echo "statement $i"
  mysql -u"$db_user" -p"$db_password" -e "use replication_test; select count(*) from vals;"
  sleep 0.1
done
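The mismatch itself shows up by running the same count against every node and comparing the results; a minimal sketch (node addresses are placeholders) would be:

#!/bin/bash
# Sketch: compare the row count reported by each node.
# The node addresses below are placeholders.
nodes="node0.example.com node1.example.com node2.example.com"
for host in $nodes; do
  count=$(mysql -h"$host" -u"$db_user" -p"$db_password" -N -B \
    -e "select count(*) from replication_test.vals;")
  echo "$host: $count rows"
done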
Thank you,
Shannon
Cloud Foundry Services, Pivotal