We're considering moving to a cluster environment from a 1 master - 3 slave configuration that has proven inflexible, but such a severe limitation seems unusable in production.
Is this still the case in 2020? If so, what is the reasoning behind such an architectural decision?
Thank you.
Be warned: by default, every DDL statement executed on a Galera cluster will BLOCK the complete cluster (not only the table, or even just the database the table resides in)! This is the default "Total Order Isolation" (TOI) mode.
DDL statements can't be killed: once issued, a statement runs until it completes or an error occurs.
Issuing, for example, a column-altering DDL statement on a large table will take the complete cluster out of commission until every node has completed the migration. Only some operations (e.g. changing DEFAULT values) are always quick and won't interfere with the cluster's operations.
Long-running, cluster-blocking DDL statements are of course a no-go on a production system. There are several ways to resolve this issue: Percona, for example, provides a script to migrate a table "online" (pt-online-schema-change), or you can use the "Rolling Schema Upgrade" (RSU) method for data-compatible changes. More about that in the Galera documentation.
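To make the two workarounds a bit more concrete, here is a minimal Python sketch that only builds the statements and the command line; it does not connect to anything. The helper names (rsu_statements, pt_osc_command), the database "shop" and the table "orders" are made up for illustration. It assumes that wsrep_OSU_method can be switched per session on your version (check the documentation for your release), and it uses only the --alter, --dry-run and --execute options and the D=...,t=... DSN of pt-online-schema-change. Keep in mind that in RSU mode the DDL is applied on the local node only, so the same steps have to be repeated on every node.

```python
from typing import List


def rsu_statements(ddl: str) -> List[str]:
    """Wrap one DDL statement so that it runs in RSU mode on the current node.

    Assumption: wsrep_OSU_method is settable at session scope on this server.
    The session is switched back to the default TOI mode afterwards.
    """
    return [
        "SET SESSION wsrep_OSU_method = 'RSU'",
        ddl.strip().rstrip(";"),
        "SET SESSION wsrep_OSU_method = 'TOI'",
    ]


def pt_osc_command(database: str, table: str, alter: str,
                   execute: bool = False) -> List[str]:
    """Build an argv list for pt-online-schema-change.

    Defaults to --dry-run, so nothing is changed unless execute=True.
    """
    return [
        "pt-online-schema-change",
        "--alter", alter,
        f"D={database},t={table}",
        "--execute" if execute else "--dry-run",
    ]


if __name__ == "__main__":
    # Hypothetical table and column, for illustration only.
    for stmt in rsu_statements("ALTER TABLE orders ADD COLUMN note VARCHAR(255);"):
        print(stmt + ";")
    print(pt_osc_command("shop", "orders", "ADD COLUMN note VARCHAR(255)"))
```

The argv list can be passed to subprocess.run as-is, which avoids shell quoting problems with the --alter clause. Running with the default --dry-run first is a cheap way to see whether the tool accepts the change before anything is touched.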
In my opinion, this behaviour should definitely be added as an "observation" to this page - it's definitely not something you'd expect coming from a "normal" MariaDB/MySQL system.