Thanks for your reply!

If the server does an SST, the problem is far more dramatic than when it does an IST. This morning one server crashed and, upon restarting, it did an SST instead of an IST, and the issue was horrible. Even before becoming available, it blocked the donor for 15 minutes with messages like this one:

2022-07-27 12:02:42 7 [Note] WSREP: Processing event queue:... 20.9% ( 496/2376 events) complete.

For a while the queue was even growing faster than it was being processed.

The same server crashed again, so I started another one and it did an SST as well. The problem was not as dramatic, however processing the event queue still lasted 5 minutes and blocked the donor completely for that time.

On very rare occasions the SST does not cause such issues (twice in 6 months, against 2 or 3 dozen occurrences of the issue), and I didn’t change any settings in between!? Very confusing.

When servers do an SST, I usually kill the CHECK TABLE FOR UPGRADE that occurs, as it appears to slow things down even more.

Notably, this morning I had 3 servers running, one went haywire and caused another one to go down! I ended up with a single server, which I had to restart because it would complain about not being wsrep ready.

It’s been a very bad day today, as those 4 servers are in production and we received dozens of calls from our customers. Now I’m back with 2 servers and will wait until tonight to restart the 2 others because of that issue.

IMO it’s a bug, since on very rare occasions it starts smoothly. But still, I have found Galera to be unreliable, and my company is asking me to install a more reliable solution ASAP or we will lose customers! So any help would be much appreciated.

I’m thinking of using 3 servers with replication instead, keeping load balancing based on source IPs, but I’m worried that this might be less reliable.
We have 2 spare servers in another location, synced with replication, but it happened too often that after a server crash the replication would no longer start and had to be set up again from scratch, which suggests it is even less reliable.

Sorry for the long story, but I’m no Galera expert and I’m having lots of issues I can’t find any info or solution about.

This is another issue I’m facing with replication, although it seems to be caused by Galera Cluster: https://jira.mariadb.org/browse/MDEV-29132

From: William Edwards <wedwards@cyberfusion.nl>
Sent: Wednesday, 27 July 2022 11:58
To: Cédric Counotte <cedric.counotte@1check.com>
Cc: maria-discuss@lists.launchpad.net
Subject: Re: [Maria-discuss] MariaDB server horribly slow on start

On 27 Jul 2022, at 11:46, Cédric Counotte <cedric.counotte@1check.com> wrote:

Hello all. I hope I’m at the right place to ask this question. I opened a bug here: https://jira.mariadb.org/browse/MDEV-28969, however I was told to use this mailing list.

We have 4 MariaDB servers in a Galera Cluster, and it happens that a server has to be restarted (be it after a crash, which I have to open a bug for, or for maintenance).

When that happens, the restarted server causes a huge slowdown on the whole cluster, and it lasts for 10 to 30 minutes at the very least! And by huge, I mean huge: we end up with 500 to 800 pending queries on all servers, as you can see in the attached screenshots.

I’ve attached the configuration of one of the servers for reference, in case this is the source of the issue.

Any way to solve this would be greatly appreciated.

You seem to be focusing on effect. What is the cause? SST?

Regards,
3C.
[Attached image: image001.png]

_______________________________________________
Mailing list: https://launchpad.net/~maria-discuss
Post to     : maria-discuss@lists.launchpad.net
Unsubscribe : https://launchpad.net/~maria-discuss
More help   : https://help.launchpad.net/ListHelp