Hi Brad,

Do you set wsrep_desync=ON on the node before running the backup?  This looks like a case of flow control being triggered.
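For reference, this is roughly the sequence I mean (a sketch; where exactly you toggle it around xtrabackup is up to your backup script):

```sql
-- On the node being backed up, before starting xtrabackup:
-- take the node out of flow control accounting so a slow backup
-- does not throttle the rest of the cluster.
SET GLOBAL wsrep_desync = ON;

-- ... run the backup ...

-- After the backup completes, let the node catch up and rejoin
-- flow control as usual.
SET GLOBAL wsrep_desync = OFF;
```

You can confirm the state with SHOW GLOBAL STATUS LIKE 'wsrep_local_state_comment'; it should show "Donor/Desynced" while the node is desynced.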

On Fri, Dec 11, 2015 at 1:29 AM Brad Jorgensen <brad@debtpaypro.com> wrote:
We have a three node (db1, db2, db3) galera cluster with MariaDB 10.0.22
on CentOS 6.7.  A couple days ago I upgraded to 10.1.9.  Xtrabackup
(2.3.2) is run every night on each node at 1am, 2am, and 3am
respectively.  Before the backup starts, the node is desynced.

The first night after upgrading to 10.1.9, the problem began.  All
connections were going to db1 until the backup started when db1 was
removed from the routing pool and new connections began going to db2.
At that time there was little traffic aside from the backup; much of it
is probably monitoring queries.  Our monitoring shows that running
threads went from about 2 just before the backup finished around 1:32am
to about 150 just after.  At the same time, the running threads on db2
went from 1 to 10.  After the backup completed, all new connections were
going to db1 again.  The running threads on db1 continued to slowly grow
until the queries that are stuck took up all of the server processes on
our application servers and we were alerted around 3:50am.  I checked
the process list and almost all of the queries were in the "query end"
state and I think they were all write queries.  I tried to kill most of
them but they just stayed in the same state.  I restarted db2 to try to
kick the cluster without losing data.  I had to force the shutdown since
three threads never ended after about 10 minutes of waiting.  The
running threads on db1 returned to normal.  db2 had to do a full SST
which took until 6:05 to complete.  At that time, the running processes
on db1 began to increase again.  When db2 was back up I downgraded to
10.0.22 and rejoined it to the cluster.  I tried to restart db1, but it
needed a full SST so I left it down.  A bit later I took down db3 to
downgrade it, too, and that went fine.  The cluster was fine through the
day during normal business operation.

The next night only db2 and db3 were up and were running 10.0.22.  What
appears to be the same problem started at 3:31am, when xtrabackup paused
galera ("Provider paused at
8c53b634-9514-11e4-b8bd-dab05673fb36:875650526") on db3 for the backup.
At that time the running threads on db2 shot up and slowly increased
until I shut it down at 6:28.  Again I had to force the shutdown because
three threads never ended.  db3 showed nothing unusual in the logs.  I
captured the InnoDB engine status from db2 three times, a few minutes
apart, before I restarted; the output is attached.

Additionally, I attached an excerpt from the logs on db2 and db3 during
the second incident, as well as the my.cnf from one of the servers; it's
basically the same for the others.  I'm working on getting a clean set
of logs from the first incident, but from what I initially saw, they are
basically the same as the second set of logs.  If the problem arises
again, I'm prepared to collect more information, including SHOW GLOBAL
STATUS output.
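In particular, I plan to grab the Galera flow-control counters (these are standard wsrep status variables; I don't yet know what values to expect for our workload):

```sql
-- Flow-control and replication queue indicators on each node:
SHOW GLOBAL STATUS LIKE 'wsrep_flow_control_paused';  -- fraction of time replication was paused since last FLUSH STATUS
SHOW GLOBAL STATUS LIKE 'wsrep_flow_control_sent';    -- pause messages this node sent
SHOW GLOBAL STATUS LIKE 'wsrep_flow_control_recv';    -- pause messages this node received
SHOW GLOBAL STATUS LIKE 'wsrep_local_recv_queue';     -- write-sets waiting to be applied locally
SHOW GLOBAL STATUS LIKE 'wsrep_local_send_queue';     -- write-sets waiting to be sent
```

If wsrep_flow_control_paused climbs during the backup window, or wsrep_local_recv_queue grows on the paused node, that would suggest flow control is involved.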

Our environment hasn't changed for at least a month and the issue first
appeared after upgrading to 10.1.9, but since it didn't go away after
downgrading, I'm not sure where the issue is.

I found a few mentions of what might be the same problem:
http://marialog.archivist.info/2015-04-03.txt
https://bugs.launchpad.net/percona-xtradb-cluster/+bug/1149755

_______________________________________________
Mailing list: https://launchpad.net/~maria-discuss
Post to     : maria-discuss@lists.launchpad.net
Unsubscribe : https://launchpad.net/~maria-discuss
More help   : https://help.launchpad.net/ListHelp
--
Guillaume Lefranc
Remote DBA Services Manager
MariaDB Corporation