
■ Environment
●Cluster: Galera Cluster (3 nodes)
●OS: CentOS 7.4
●DBMS: MariaDB 10.6.15
●DB Uptime: 509 days

■ Issue Overview
●Time of Occurrence: Between 00:00 and 02:00
●Initial Symptom: Single-row INSERT and DELETE queries were delayed by several seconds and eventually stalled
●Around 00:34: A large number of UPDATE queries targeting the same PK led to X (exclusive) lock waits and an increase in active sessions
●00:35: CPU usage on the DB server hit 100% and stayed at critical levels; thread count spiked
●00:41: Galera node DB01 shut down automatically

Error log excerpt:
[ERROR][FATAL] InnoDB: innodb_fatal_semaphore_wait_threshold for dict_sys.latch was exceeded. See : https://mariadb.com/kb/en/how-to-produce-a-full-stack-trace-for-mysqld/

■ Root Cause (Internal Analysis)
●A wait on dict_sys.latch exceeded innodb_fatal_semaphore_wait_threshold (default: 600 seconds)
●When this threshold is exceeded, InnoDB deliberately aborts the MariaDB server process
●dict_sys.latch is a global latch protecting the InnoDB data dictionary, and it can become a severe bottleneck under high concurrency

❗ What’s Unusual:
●No clear sign of typical row-lock contention or a massive spike in transaction volume
●Even single-row INSERT and DELETE queries were delayed by thousands of seconds, which is highly abnormal
●No obvious external factors (lock contention, CPU saturation, or connection floods) were identified
●We strongly suspect internal engine behavior or a bug

❓ Questions and Request for Input
●Has anyone experienced a similar issue related to dict_sys.latch in a Galera Cluster environment?
●Are there known bugs or release notes for MariaDB 10.6.x or Galera that mention severe delays or process termination related to this latch?
●Are there any known workarounds or best practices to prevent this from recurring?

Your experience and advice would be greatly appreciated. Thank you in advance!
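
In case it helps anyone compare against their own nodes, here is a minimal Python sketch for finding this fatal message in an error log. It assumes the message wording matches the excerpt quoted in this post. The function name find_fatal_latch_waits and the example log path are hypothetical and are not part of MariaDB.

```python
import re
from typing import Iterable, List, Tuple

# Pattern based only on the fatal line quoted in this post (an assumption:
# other MariaDB versions may word the message differently).
FATAL_RE = re.compile(
    r"innodb_fatal_semaphore_wait_threshold for (?P<latch>\S+) was exceeded"
)


def find_fatal_latch_waits(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (line_number, latch_name) for every fatal semaphore-wait line."""
    hits = []
    for lineno, line in enumerate(lines, start=1):
        match = FATAL_RE.search(line)
        if match:
            hits.append((lineno, match.group("latch")))
    return hits


# Example use against a log file (the path is hypothetical):
#
#     with open("/var/log/mariadb/mariadb.log", errors="replace") as log:
#         for lineno, latch in find_fatal_latch_waits(log):
#             print(f"line {lineno}: fatal semaphore wait on {latch}")
```

Running this on each of the three nodes would at least show whether the same latch was involved elsewhere in the cluster. It does not explain why the latch was held for so long.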