
Nothing in the syslog or error log to suggest that MariaDB crashed or was restarted in any way. Just what I saw in the graphs. There were some memory pressure events in error log around that time too, reporting a number of pages being released - which could be related.

LimitNOFILE was set to 200000 (well below the table open cache) and mysql user ulimit for files was set to 999999. I've upped the first to 2097152 and the second to unlimited. Will apply these to the servers in turn and see if this makes a difference.

Thanks again for all your help!

Derick

On 27/03/2025 21:43, Gordan Bobic via discuss wrote:
On Thu, 27 Mar 2025 at 20:14, Derick Turner <derick@e-learndesign.co.uk> wrote:
We had another event today.
Everything went from fine with respect to cache hits (99.9% open table cache) and INNODB buffer pool all good (22GB size) to 15% Open table cache hit with 0 file opens and 3.11 misses and INNODB buffer pool size of 475MB. The graphs on SSM were interesting (and where I got that information)

Are you saying that your buffer pool dropped from 22GB to 475MB? The only thing that can cause that is if mysqld/mariadbd crashed and was restarted.
Do you have enough file handles? The defaults in the MariaDB systemd service aren't particularly generous, it is possible your increase of table_open_cache didn't actually fully take effect because you are maxed out on file handles.
Do:
    systemctl edit mariadb
and add:
    [Service]
    LimitNOFILE=1048576
then:
    systemctl daemon-reload
    systemctl restart mariadb
and see if that makes a difference.
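To confirm that the new limit actually reached the running server, something like the following might help. It is only a sketch, assuming a Linux host where the process limits are readable under /proc; the helper name soft_nofile is made up for illustration, and the pidof lookup assumes the binary is called mariadbd (it may be mysqld on older installs).

```shell
#!/bin/sh
# Print the soft "Max open files" limit from a /proc/<pid>/limits style file.
# The columns are: "Max open files", soft limit, hard limit, units,
# so the soft limit is the fourth whitespace-separated field.
soft_nofile() {
    awk '/^Max open files/ { print $4 }' "$1"
}

# Example use against the running server (assumes the process is mariadbd):
#   soft_nofile "/proc/$(pidof mariadbd)/limits"
#
# What systemd has configured for the unit:
#   systemctl show mariadb -p LimitNOFILE
#
# What the server itself ended up with:
#   mariadb -e "SHOW GLOBAL VARIABLES LIKE 'open_files_limit'"
#   mariadb -e "SHOW GLOBAL VARIABLES LIKE 'table_open_cache'"
```

If the value reported for the process is lower than what was put in the override, the drop-in probably has not taken effect yet, and comparing open_files_limit with table_open_cache shows whether the cache setting can be honoured.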
Unfortunately it is rather difficult to guess what's going on based purely on the data points you mentioned thus far.
Only unusual entry in the error log was:
2025-03-27 17:37:56 3194063 [Warning] InnoDB: A long wait (152 seconds) was observed for dict_sys.latch
(17:35 was when SSM was showing everything nose-diving)
This wait time kept growing over the next few minutes till:
2025-03-27 17:41:17 3193777 [Warning] InnoDB: A long wait (354 seconds) was observed for dict_sys.latch
I'd already switched our webservers off of the stricken DB server but everything came unstuck after that last error log entry.
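Since each warning reports how long the wait had already lasted, subtracting that from the log timestamp gives a rough idea of when the latch wait began. A small sketch, assuming the error log line format shown above; the function name latch_wait_start is hypothetical, and only the time of day is computed, not the date.

```shell
#!/bin/sh
# For every dict_sys.latch long-wait warning in an error log, print the
# estimated time of day at which the wait started, followed by the
# reported wait length in seconds.
latch_wait_start() {
    awk '/was observed for dict_sys\.latch/ {
        # Field 2 is the HH:MM:SS part of the timestamp.
        split($2, t, ":")
        # Locate "(152 seconds)" and take the number after the bracket.
        match($0, /\([0-9]+ seconds\)/)
        secs = substr($0, RSTART + 1, RLENGTH - 1) + 0
        start = t[1] * 3600 + t[2] * 60 + t[3] - secs
        # A negative value means the wait began before midnight.
        if (start < 0) start += 86400
        printf "%02d:%02d:%02d %d\n", int(start / 3600), int(start % 3600 / 60), start % 60, secs
    }' "$1"
}

# Example (the log path is an assumption and varies by install):
#   latch_wait_start /var/log/mysql/error.log
```

For the two entries quoted above this gives 17:35:24 and 17:35:23, i.e. both warnings point back to the same stall, which lines up with the 17:35 nose-dive on the graphs.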
What would be causing the dict_sys.latch issue? What can be done to fix it?

There seem to be at least 13 still open bugs (plus probably some more that have been merged for next release) that could be causing this:
https://jira.mariadb.org/browse/MDEV-34988?jql=status%20%3D%20Open%20AND%20t...
-- Derick Turner - He/Him