Hi Benoit,

Indeed, a slow node can impact the rest of the cluster. That is why, as Jamie pointed out, DNS round robin is not a viable method for distributing load across a Galera cluster. Several solutions exist:
- HAProxy with a Galera check script
- our own MariaDB MaxScale which includes a Galera Monitor
- glbd (small load balancing daemon which comes with Galera)

Regards,

On Mon, Dec 14, 2015 at 10:18 AM Jamie Gibbard <Jamie.Gibbard@netnames.com> wrote:
You should consider a better method than DNS round robin for connecting to your DB servers.

Think about using an HAProxy load-balancing node with the clustercheck script (https://github.com/olafz/percona-clustercheck).

This would ensure that a node is not only accessible on its MySQL port, but ready for action!
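
To illustrate what "ready for action" could mean in practice, here is a minimal Python sketch of the decision such a health check makes. It is not the clustercheck script itself; the function names are made up for this example, and collecting the values (for instance with SHOW GLOBAL STATUS LIKE 'wsrep_%') is assumed to happen elsewhere. It only relies on the standard Galera status variables wsrep_local_state, wsrep_cluster_status and wsrep_ready.

```python
# Sketch only: decide whether a Galera node should receive traffic.
# "status" is a dict of wsrep status variables; how it is collected
# (e.g. SHOW GLOBAL STATUS LIKE 'wsrep_%') is left out here.

SYNCED = "4"  # wsrep_local_state value reported by a fully synced node


def is_node_ready(status):
    """Return True only if the node is synced and in the primary component."""
    return (
        str(status.get("wsrep_local_state")) == SYNCED
        and status.get("wsrep_cluster_status") == "Primary"
        and status.get("wsrep_ready") == "ON"
    )


def http_response(status):
    """Map the node state to the HTTP status line a load balancer check would see."""
    if is_node_ready(status):
        return "HTTP/1.1 200 OK"
    return "HTTP/1.1 503 Service Unavailable"
```

A load balancer polling such a check over HTTP would stop sending clients to a node that still accepts TCP connections on its MySQL port but has dropped out of the primary component or is not synced.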




-----Original Message-----
From: Maria-discuss [mailto:maria-discuss-bounces+jamie.gibbard=netnames.com@lists.launchpad.net] On Behalf Of Benoit Panizzon
Sent: 14 December 2015 08:31
To: MariaDB discuss
Subject: [Maria-discuss] Galera Cluster: Cluster Blocked, when one node down?

Hello

We use MariaDB Galera Cluster for our email service platform.

We decided to use Galera to create a high availability platform.

After a year of operation, we are starting to realize that Galera failures seem to be the most common cause of the outages we have had.

So I wonder whether others operating Galera clusters also observe this situation:

All our services using DB connections use a DNS round-robin name to connect to one of our three Galera instances.

While testing this setup, we usually killed one instance or disconnected the node from the network to simulate an outage. In this situation, everything works as expected: the clients connect to the two remaining nodes and there is no service outage.

When the node is restarted, it is re-synced quickly and service with three nodes is restored.

Now we have experienced a few Galera cluster failures, which seem to happen this way:
One of the nodes comes under heavy load (DDoS attacks, memory leaks, or similar), which makes the whole physical machine laggy for a short time. The affected MariaDB node is then thrown out of the cluster by the two other nodes, probably because it can no longer sync fast enough.

But as the node is not completely 'down', it still accepts connections from the DB clients but does not reply to them, and seems to remain in a 'db locked' state. Strangely, this then also affects the two remaining nodes, which also go into 'locked' mode and no longer reply to queries within the time expected by the application. Of course, this causes more DB clients (IMAP, SMTP-Auth, etc.) to spawn and create DB connections, worsening the whole situation.

The situation can seemingly only be resolved by shutting down the MariaDB node that got thrown out of the cluster. The situation then normalizes with the two remaining nodes, and the third one can be restarted.

Is this expected behaviour? Is there a way to tell a MariaDB node that got excluded from the cluster to shut itself down completely, so that it does NOT accept any more connections from clients and block the whole service?

Regards

-Benoît Panizzon-
--
I m p r o W a r e   A G    -    Leiter Commerce Kunden
______________________________________________________

Zurlindenstrasse 29             Tel  +41 61 826 93 00
CH-4133 Pratteln                Fax  +41 61 826 93 01
Schweiz                         Web  http://www.imp.ch
______________________________________________________

_______________________________________________
Mailing list: https://launchpad.net/~maria-discuss
Post to     : maria-discuss@lists.launchpad.net
Unsubscribe : https://launchpad.net/~maria-discuss
More help   : https://help.launchpad.net/ListHelp
NetNames, 25 Canada Square, Canary Wharf, London E14 5LQ, UK | Tel: +44 207 015 9200 | NetNames Limited, Registered in England and Wales, Company number: 3169594, VAT Number: GB 739633893
--
Guillaume Lefranc
Remote DBA Services Manager
MariaDB Corporation