We have a three-node Galera cluster (db1, db2, db3) running MariaDB 10.0.22 on CentOS 6.7. A couple of days ago I upgraded to 10.1.9. Xtrabackup (2.3.2) runs every night on each node, at 1am, 2am, and 3am respectively, and the node being backed up is desynced before the backup starts.

The problem began the first night after the upgrade to 10.1.9. All connections were going to db1 until the backup started, at which point db1 was removed from the routing pool and new connections began going to db2. At that time there was little traffic aside from the backup; most of it was probably monitoring queries. Our monitoring shows that running threads on db1 went from about 2 just before the backup finished around 1:32am to about 150 just after, while running threads on db2 went from 1 to 10. After the backup completed, all new connections went to db1 again. The running threads on db1 continued to grow slowly until the stuck queries had used up all of the server processes on our application servers, and we were alerted around 3:50am. I checked the process list: almost all of the queries were in the "query end" state, and I believe they were all writes. I tried to kill most of them, but they just stayed in the same state.

I restarted db2 to try to kick the cluster without losing data. I had to force the shutdown because three threads still had not ended after about 10 minutes of waiting. The running threads on db1 then returned to normal. db2 had to do a full SST, which took until 6:05am to complete; at that point the running threads on db1 began to increase again. Once db2 was back up I downgraded it to 10.0.22 and rejoined it to the cluster. I tried to restart db1, but it needed a full SST, so I left it down. A bit later I took down db3 to downgrade it as well, and that went fine. The cluster was fine through the day under normal business load.

The next night only db2 and db3 were up, both running 10.0.22. What appears to be the same problem started at 3:31am, when xtrabackup paused Galera on db3 for the backup ("Provider paused at 8c53b634-9514-11e4-b8bd-dab05673fb36:875650526"). At that time the running threads on db2 shot up and kept slowly increasing until I shut it down at 6:28am. I had to force the shutdown again because three threads never ended. db3 showed nothing unusual in its logs.

I captured the InnoDB engine status from db2 three times, a few minutes apart, before restarting it; the output is attached. I have also attached an excerpt of the logs from db2 and db3 during the second incident, as well as the my.cnf from one of the servers (it is essentially the same on the others). I'm working on getting a clean set of logs from the first incident, but from what I saw initially they are essentially the same as the second set. If the problem occurs again I'm prepared to gather more information, including SHOW GLOBAL STATUS.

Our environment hadn't changed for at least a month, and the issue first appeared after the upgrade to 10.1.9, but since it didn't go away after downgrading, I'm not sure where the problem lies.

I found a few mentions of what might be the same problem:
http://marialog.archivist.info/2015-04-03.txt
https://bugs.launchpad.net/percona-xtradb-cluster/+bug/1149755
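
For context, the nightly backup on each node is wrapped roughly like the following. This is only a sketch of the procedure I described above; the actual script, backup directory, and credentials are different, and I'm assuming the explicit desync is the relevant part here:

    # sketch of the nightly backup wrapper; paths and credentials are placeholders
    mysql -e "SET GLOBAL wsrep_desync=ON;"
    innobackupex --galera-info /backups/$(date +%F)
    mysql -e "SET GLOBAL wsrep_desync=OFF;"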
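
When the stuck queries piled up on db1 I killed them from the process list, roughly like this (a sketch of what I did, not the exact commands; as noted above, the threads stayed in "query end" even after being killed):

    # generate and run KILL statements for every thread stuck in "query end"
    # --force keeps going if a thread id has already disappeared
    mysql -N -e "SELECT CONCAT('KILL ', id, ';') FROM information_schema.processlist WHERE state = 'query end';" | mysql --force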
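
If the problem shows up again, this is roughly the information I plan to capture on the stalled node in addition to the error logs (the output file name is just an example):

    # snapshot of thread, wsrep, and InnoDB state on the stalled node
    mysql -e "SHOW GLOBAL STATUS LIKE 'Threads%'; SHOW GLOBAL STATUS LIKE 'wsrep%'; SHOW FULL PROCESSLIST; SHOW ENGINE INNODB STATUS\G" > /tmp/stall-$(hostname)-$(date +%F-%H%M).txt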