When processing the queue, it seems the donor is blocking all queries. At least that’s what it looked like, but maybe it’s just even slower. I’m not sure what the cause is, but I only notice this problem after a server restart, be it an SST or an IST.

We have scripts running every 1, 2, 3 and 5 minutes that process data on the DB, and for about 30 minutes after a server start I have to kill them or disable cron altogether to avoid worsening the issue. At some point I believed that was enough to cope with this slowness, but the fact is it’s not.

We are processing between 5,000 and 10,000 queries per second. In “normal circumstances”, a single server is enough. But after a single server start, even 4 servers cannot keep up with the delays. If I restart 2 servers, the issue is even more dramatic.

Sorry, I’m trying to focus on the cause, but apart from restarting a server there is no other cause for the issue.

I upgraded Ubuntu from 21.10 to 22.04 at night 3 days ago, and all servers did an IST, but the slowness still occurred, even though there is little traffic at night, something like 2,000 queries per second. All 4 servers ended up with 100+ queries stuck for entire minutes!

Is there a way to avoid dramatic slowness on server start?

I’ve read about optimizer_search_depth, which could cause slow queries when set to anything other than 0, but regardless of the value I set for it, the issue is exactly the same, so it’s currently set to 0.

From: William Edwards <wedwards@cyberfusion.nl>
Sent: Wednesday, 27 July 2022 12:45
To: Cédric Counotte <cedric.counotte@1check.com>
Cc: maria-discuss@lists.launchpad.net
Subject: Re: [Maria-discuss] MariaDB server horribly slow on start

Hi,

On 27 Jul 2022, at 12:37, Cédric Counotte <cedric.counotte@1check.com> wrote:

Thanks for your reply!

If the server does an SST, the problem is way more dramatic than when it does an IST.
This morning one server crashed and upon restarting it did an SST instead of an IST, and the issue was horrible. Even before being available, it blocked the donor for 15 minutes with messages like these:

2022-07-27 12:02:42 7 [Note] WSREP: Processing event queue:... 20.9% ( 496/2376 events) complete.

Does the issue occur while these messages are logged?

For a while, the queue was being processed more slowly than it was growing. The same server crashed again, so I started another one and it did an SST, but the problem was not as dramatic. However, processing the event queue lasted 5 minutes and blocked the donor completely for that time.

On very rare occasions the SST does not cause such issues, but very rarely (twice in 6 months, against 2 or 3 dozen occurrences of the issue), and I haven’t changed any settings since!? Very confusing.

When servers do an SST, I usually kill the CHECK TABLE FOR UPGRADE that occurs, as it appears to slow things down even more.

Notably, this morning I had 3 servers running, one went haywire and caused another one to go down! I ended up with a single server, which I had to restart because it complained about not being wsrep ready.

It’s been a very bad day today, as those 4 servers are in production and we received dozens of calls from our customers.

Again, I’d focus on cause. The effect is clear.

Now I’m back with 2 servers and will wait until tonight to restart the 2 others because of that issue.

IMO it’s a bug, as on very rare occasions it starts smoothly. But still, I have found Galera to be unreliable and my company is asking me to install a more reliable solution ASAP or we will lose customers! So any help would be much appreciated.

Whether something’s a bug is not an opinion.

I’m thinking of using 3 servers with replication instead, keeping load balancing using source IPs, but I’m worried that this might be less reliable.
We have 2 spare servers in another location, synced with replication, but too often after a server crash the replication would no longer start and had to be set up again from scratch, which makes it look even less reliable.

Sorry for the long story, but I’m no Galera expert

Then you could indeed wonder if your company should be using Galera …

and I’m having lots of issues I can’t find any info or solution about.

This is another issue I’m facing with replication, although it seems to be caused by Galera Cluster: https://jira.mariadb.org/browse/MDEV-29132

From: William Edwards <wedwards@cyberfusion.nl>
Sent: Wednesday, 27 July 2022 11:58
To: Cédric Counotte <cedric.counotte@1check.com>
Cc: maria-discuss@lists.launchpad.net
Subject: Re: [Maria-discuss] MariaDB server horribly slow on start

On 27 Jul 2022, at 11:46, Cédric Counotte <cedric.counotte@1check.com> wrote:

Hello all.

I hope I’m at the right place to ask this question. I opened a bug here: https://jira.mariadb.org/browse/MDEV-28969, however I was told to use this mailing list.

We have 4 MariaDB servers in a Galera Cluster, and from time to time a server has to be restarted, be it after a crash (which I still have to open a bug for) or for maintenance.

When that happens, the restarted server causes a huge slowdown on the whole cluster, and it lasts for 10 to 30 minutes at the very least! And by huge, I mean huge: we end up with 500 to 800 pending queries on all servers, as you can see in the attached screenshots.

I’ve attached the configuration of one of the servers for reference, in case this is the source of the issue.

Any way to solve this would be greatly appreciated.

You seem to be focusing on effect. What is the cause? SST?

Regards,
3C.
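
The cron workaround mentioned in this thread (killing the periodic scripts for a while after a restart) could in principle be automated with a small guard that checks the node's Galera state before doing any work. The following is only a minimal sketch under stated assumptions: the function name `node_ready` and the `max_recv_queue` threshold are hypothetical, and the status values are passed in as a plain dictionary rather than fetched from the server. The variable names `wsrep_local_state_comment`, `wsrep_ready` and `wsrep_local_recv_queue` are standard Galera status variables. Whether such a guard would actually help with the slowdown described here is not established by the thread.

```python
# Hypothetical guard for cron-driven scripts: skip the run while the local
# Galera node is not fully synced or still has a backlog in its receive queue.


def node_ready(status, max_recv_queue=10):
    """Return True if the node looks safe to load with batch work.

    `status` maps Galera status variable names to their string values,
    e.g. the rows of SHOW GLOBAL STATUS LIKE 'wsrep_%'.
    `max_recv_queue` is an assumed threshold, not a recommended value.
    """
    synced = status.get("wsrep_local_state_comment") == "Synced"
    ready = status.get("wsrep_ready") == "ON"
    try:
        backlog = int(status.get("wsrep_local_recv_queue", "0"))
    except ValueError:
        # An unparsable value is treated as "not ready".
        return False
    return synced and ready and backlog <= max_recv_queue


if __name__ == "__main__":
    # Sample values only; a real script would query the server instead.
    sample = {
        "wsrep_local_state_comment": "Donor/Desynced",
        "wsrep_ready": "ON",
        "wsrep_local_recv_queue": "0",
    }
    if node_ready(sample):
        print("node ready, running job")
    else:
        print("node not ready, skipping this run")
```

A cron script could call such a check first and exit early when it returns False, so that periodic jobs stay out of the way while a node is acting as donor or still catching up.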
[image001.png]

_______________________________________________
Mailing list: https://launchpad.net/~maria-discuss
Post to     : maria-discuss@lists.launchpad.net
Unsubscribe : https://launchpad.net/~maria-discuss
More help   : https://help.launchpad.net/ListHelp