Concerning table_open_cache: the live value is currently 13869 (although the config sets 16384). Global status shows the following on the new node, 9 hours after start:
+-----------------------------------+---------+
| Table_open_cache_active_instances |       1 |
| Table_open_cache_hits             | 2136757 |
| Table_open_cache_misses           |  185097 |
| Table_open_cache_overflows        |  146153 |
+-----------------------------------+---------+
+---------------+--------+
| Opened_tables | 159629 |
+---------------+--------+
I’ve updated table_open_cache to 65536 on 2 servers and Table_open_cache_overflows stopped increasing.
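For reference, table_open_cache is a dynamic variable, so the live value can be raised without a restart. A minimal sketch (the 65536 value is just the one tried above; a runtime change is lost on restart unless the config file is updated as well):

```sql
-- Raise the live value without restarting (dynamic variable;
-- persists only until restart, so also update the config file):
SET GLOBAL table_open_cache = 65536;

-- Confirm the value in effect and watch whether overflows keep growing:
SHOW GLOBAL VARIABLES LIKE 'table_open_cache';
SHOW GLOBAL STATUS LIKE 'Table_open_cache_overflows';
```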
In /usr/lib/systemd/system/mariadb.service, I see this:
LimitNOFILE=32768
So I don’t understand why MariaDB decided to reduce the configured value. I’m also not sure whether changing the config will have any effect on the live value. I’ll try setting both to 65536
this evening and see if it helps.
Is it safe to increase both limits? Maybe to the value I use during mariabackup, which is 919200?
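For the systemd side, a drop-in override is usually safer than editing the packaged unit file, since /usr/lib/systemd/system/mariadb.service can be overwritten on upgrade. A minimal sketch (the drop-in path follows standard systemd conventions; the 65536 value is just the one discussed above):

```ini
# /etc/systemd/system/mariadb.service.d/limits.conf
[Service]
LimitNOFILE=65536
```

After creating the file, run systemctl daemon-reload and restart MariaDB. Note that MariaDB also has its own open_files_limit variable, and the server can shrink table_open_cache at startup when open_files_limit is too low (a note to that effect may appear in the error log), which could explain the 13869 live value despite 16384 in the config.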
All active nodes are used for writing, and the HTTP load is spread evenly across them. The ratio is 1.2% writes, at 7780 reads/sec and 125 writes/sec at peak. Both reads and writes are currently
distributed across all active nodes by a round-robin load balancer.
During yesterday’s test, the 2 existing nodes were active at first. Seeing queries starting to get stuck, I activated the new node to spread the load, hoping for some improvement;
however, it only made things worse, so I deactivated it again.
From: Gordan Bobic <gordan.bobic@gmail.com>
Sent: Thursday, 28 July 2022 08:15
To: Cédric Counotte <cedric.counotte@1check.com>
Cc: jocelyn fournier <jocelyn.fournier@gmail.com>; Marko Mäkelä <marko.makela@mariadb.com>; Mailing-List mariadb <maria-discuss@lists.launchpad.net>; Pierre LAFON <pierre.lafon@1check.com>
Subject: Re: [Maria-discuss] MariaDB server horribly slow on start
You should hardly ever need to adjust table_open_cache_instances away from the default, but it sounds like you may need to bump your table_open_cache by 2-4x.
Are you using more than one node for writing? And does the new node get used for anything immediately after joining?
On Thu, 28 Jul 2022, 01:20 Cédric Counotte, <cedric.counotte@1check.com> wrote:
I just tested using this setting, but it made no difference:
innodb_max_dirty_pages_pct_lwm=0.001
Tested with 2 existing nodes running and sync'ed, and one new node being attached.
As soon as the SST sync ended, the first 2 nodes started piling up queries, mostly stuck in "commit".
If the new node was used, all queries got stuck in "Opening tables".
Please see attached screenshots showing part of the queues, which quickly reach 100+.
It took about 20 minutes to recover from that situation without using the new node at all! It took another 20 minutes for the new node to become usable (i.e. not too slow).
Restarted the server; it did an IST with the exact same result, just worse: pending queries above 200 this time!
How is it possible for queries to be stuck for several minutes (10!!!) when in the nominal state they take less than 0.2 seconds!?
I saw this in the logs just after the IST; however, table_open_cache_instances = 16 is set in the config file, and the live value is indeed 1!?
2022-07-27 23:55:09 11 [Warning] Detected table cache mutex contention at instance 1: 30% waits. Additional table cache instance cannot be activated: consider raising table_open_cache_instances. Number of active instances: 1.
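When a live value differs from the config file like this, it may help to confirm what the server actually applied. A sketch (note that table_open_cache_instances is static, so any config change only takes effect after a restart):

```sql
-- Value actually in effect on the running server:
SHOW GLOBAL VARIABLES LIKE 'table_open_cache_instances';

-- Instances currently activated (instances are brought online
-- on demand under contention, up to the configured maximum):
SHOW GLOBAL STATUS LIKE 'Table_open_cache_active_instances';
```

Running `my_print_defaults mysqld mariadb` on the server shows which options it actually reads from its config files; a live value of 1 despite 16 in a config file often means the setting sits in a group the server ignores or is overridden by a later file.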
FWIW we have 350+ databases, each with 100 tables, and doing a mariabackup requires ulimit -n 919200. The servers are processing about 3,000 queries/second during these tests, while the daytime peak is 10,000. A single server is capable of handling the peak load (verified earlier this morning)!