On Thu, Jul 28, 2022 at 12:07 PM Cédric Counotte <cedric.counotte@1check.com> wrote:
Well, one server crashed twice a few days ago. I asked my service provider (OVH) to look into it, but they asked me to test the hardware myself; I found an NVMe disk with 17,000+ errors and am still waiting for their feedback on this.
It sounds like you need: 1) ZFS 2) Better monitoring
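For the monitoring part, even a periodic SMART sweep would have flagged a disk accumulating errors long before it took a node down. A minimal sketch, assuming smartmontools is installed and the usual /dev/nvmeXn1 naming (adjust the glob to your layout):

    #!/usr/bin/env bash
    # Quick NVMe health survey across the local drives (run as root).
    for dev in /dev/nvme?n1; do
        echo "=== ${dev} ==="
        # Overall verdict: PASSED/FAILED.
        smartctl -H "${dev}"
        # Full NVMe health log: watch "Media and Data Integrity Errors",
        # "Error Information Log Entries" and "Data Units Written" (lifetime writes).
        smartctl -A "${dev}"
    done

Wire that into whatever alerting you already have (cron + mail, Prometheus, etc.) so a drive with a climbing error counter gets noticed before it crashes the server.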
Only our 2 oldest servers are experiencing crashes (and they are only 6 months old!), and it turns out the NVMe drives in their RAID arrays have very different lifetime write totals: one disk is at 58 TB (and it is not a replacement) while the other is at 400+ TB within the same array! All the other servers have identical written-data figures on both disks of their RAID, so it seems we were given used disks and those are the ones having issues.
Welcome to the cloud. But this is not a bad thing; it's better than having multiple disks in the same array fail at the same time. ZFS would help you by catching those errors before the database ingests them. With normal non-ZFS RAID, it is plausible and even quite probable that corrupted data will be loaded from disk and propagate to other nodes, either via a state transfer or via corrupted binlogs. ZFS prevents that by verifying every block's checksum at read time and repairing any errors from the redundant copies.

Under the current circumstances, I wouldn't trust your data integrity until you run a full extended table check on all tables on all nodes, and probably pt-table-checksum across the nodes to make sure they still agree with each other.
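A rough sketch of both checks, assuming root access on each node and Percona Toolkit installed; credentials and scope are placeholders:

    #!/usr/bin/env bash
    # 1) Extended check of every table on this node. This runs CHECK TABLE ... EXTENDED
    #    under the hood, so expect it to be slow and I/O heavy.
    mysqlcheck --all-databases --check --extended

    # 2) Cross-node comparison with pt-table-checksum (Percona Toolkit).
    #    It records per-chunk checksums in percona.checksums by default; rows where
    #    this_crc differs from master_crc on a node point at diverged data.
    pt-table-checksum --host=localhost --user=root --ask-pass

Run the extended check on every node, not just one, since each node has its own local copy of the data on its own disks.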
I still haven't had time to produce a crash dump and open an issue for those (to confirm the cause), as I kept having to deal with server restarts to try to reduce the slowness, for 30 minutes to an hour at a time.
You need to be careful with those restarts: a state transfer from a node with failing disks can actually result in the corrupted data propagating to the node being bootstrapped.
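On the crash dumps: to get mariadbd to actually write a core file when it crashes, something along these lines is usually needed (the Debian-style config path and the systemd unit name are assumptions; adjust for your install):

    #!/usr/bin/env bash
    # 1) Ask the server to dump core on crash.
    printf '[mysqld]\ncore-file\n' | sudo tee /etc/mysql/mariadb.conf.d/90-core.cnf

    # 2) Lift the core size limit for the systemd service.
    sudo mkdir -p /etc/systemd/system/mariadb.service.d
    printf '[Service]\nLimitCORE=infinity\n' | sudo tee /etc/systemd/system/mariadb.service.d/core.conf
    sudo systemctl daemon-reload

    # 3) Make sure the kernel writes cores somewhere predictable.
    sudo mkdir -p /var/crash
    echo 'kernel.core_pattern=/var/crash/core.%e.%p' | sudo tee /etc/sysctl.d/90-core.conf
    sudo sysctl --system

    sudo systemctl restart mariadb

With that in place, the next crash leaves a core in /var/crash that you can pull a backtrace from when you file the issue.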
There were issues with the slave thread crashing, for which I filed an issue and resolved it by updating MariaDB. There are still issues with slave threads stopping for no apparent reason, so I have written a script to restart them and filed an issue for that as well.
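A simplified sketch of what such a watchdog can look like (this is not the actual script; the local login, log path and interval are assumptions):

    #!/usr/bin/env bash
    # Restart the replication threads if either of them has stopped.
    while true; do
        status=$(mysql -e "SHOW SLAVE STATUS\G" 2>/dev/null)
        sql_running=$(echo "$status" | awk '/Slave_SQL_Running:/ {print $2}')
        io_running=$(echo "$status" | awk '/Slave_IO_Running:/ {print $2}')

        if [[ "$sql_running" == "No" || "$io_running" == "No" ]]; then
            {
                echo "$(date -Is) replication stopped (SQL=$sql_running IO=$io_running), restarting"
                echo "$status" | grep -E 'Last_(SQL|IO)_Error'
            } >> /var/log/slave-watchdog.log
            mysql -e "START SLAVE"
        fi
        sleep 30
    done

The main value of the log is that it preserves Last_SQL_Error / Last_IO_Error for the bug report, since those are cleared once the slave is restarted.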
I don't think you can meaningfully debug anything until you have verified that your hardware is reliable. Do your OVH servers have ECC memory?
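Checking is quick; a sketch assuming dmidecode is available (and optionally the edac-utils package for live error counters):

    # "Error Correction Type" should say something like "Multi-bit ECC"; "None" means no ECC.
    sudo dmidecode -t memory | grep -i 'error correction'

    # If edac-utils is installed, show corrected/uncorrected memory error counts so far.
    edac-util -v

Without ECC, a single flaky DIMM can corrupt pages silently, and no amount of database-level checking will tell you where the corruption came from.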
The original objective was to have 2 usable clusters in different sites, synced with each other using replication; however, all these issues have not allowed us to move forward with that.
With 4 nodes across 2 DCs, you are going to lose writability if you lose a DC, even if it is the secondary DC: the surviving 2 nodes are only half of the cluster, which is not a quorum. Your writes are also going to be very slow, because every commit has to be replicated to and acknowledged by every other node before it completes, and with 2 of those nodes on the far side of a WAN that round trip is always in the critical path. I would seriously question whether Galera is the correct solution for you - and writing to multiple nodes at once will make things far worse on top of that.
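You can watch the quorum arithmetic directly on any node; a primary component needs strictly more than half of the known nodes, so 2 survivors out of 4 is not enough:

    # Galera's own view of cluster size and quorum, from any node.
    mysql -e "SHOW GLOBAL STATUS WHERE Variable_name IN
              ('wsrep_cluster_size', 'wsrep_cluster_status', 'wsrep_ready')"
    # wsrep_cluster_status must be 'Primary' for the node to accept writes; after
    # losing a whole DC in a 2+2 layout it drops to 'non-Primary' (2/4 is not > 50%).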
Not to mention that we are now using the OVH load balancer, and that appliance sometimes decides all our servers are down and starts showing error 503 to our customers while the servers are running just fine (no restart, no issue, nothing). So that's one more issue to deal with; for that one we'll get a dedicated server and configure our own load balancer that we actually have control over.
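For what it's worth, the usual way to keep a balancer honest about node state is to give it a cluster-aware health check instead of a bare TCP probe; a minimal clustercheck-style sketch (the credentials file and the HTTP-style output convention are placeholders):

    #!/usr/bin/env bash
    # Report healthy only when this node is Synced and in the Primary component.
    # Meant to be exposed to the balancer via xinetd or a systemd socket unit.
    creds=/etc/mysql/monitor.cnf   # [client] section with a low-privilege monitoring user

    state=$(mysql --defaults-extra-file="$creds" -Nse \
            "SHOW GLOBAL STATUS LIKE 'wsrep_local_state'" 2>/dev/null | awk '{print $2}')
    status=$(mysql --defaults-extra-file="$creds" -Nse \
            "SHOW GLOBAL STATUS LIKE 'wsrep_cluster_status'" 2>/dev/null | awk '{print $2}')

    if [ "$state" = "4" ] && [ "$status" = "Primary" ]; then   # 4 = Synced
        printf 'HTTP/1.1 200 OK\r\n\r\nGalera node is synced\r\n'
        exit 0
    else
        printf 'HTTP/1.1 503 Service Unavailable\r\n\r\nGalera node is not synced\r\n'
        exit 1
    fi

That way the balancer only takes a node out when the node itself says it is not serving, instead of guessing from TCP behaviour.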
I think you need to take a long hard look at what you are trying to achieve and re-assess: 1) whether it is actually achievable sensibly within the constraints you imposed, and 2) what the best workable compromise is between what you want and what you can reasonably have. Right now, I don't think you have a solution that is likely to be workable.