I've been investigating recurring failures of an application, and I have reason to believe that the culprit lies somewhere in the database backend. To this end, I started collecting metrics from the backend MariaDB Galera cluster (currently running MariaDB version 10.3.16), hoping that the failures would show up in the collected metrics.
Indeed, about 12 hours before the application started failing spectacularly, the value of Innodb_row_lock_time reported by the 'Master' node (i.e. the node to which the application directs its writes) started growing at a rate never recorded before. I'm not sure the failures are linked to this metric, but it's the only trend I've been able to find that correlates with the failures. Here's a link to a graph showing this over the week preceding the last failure:
Innodb_row_lock_time change per minute
Note that the graph displays the change in the metric, not its current value. The MariaDB servers are polled every 90 seconds, and each datapoint in the graph is normalized to change per minute. The big drop near the end of the graph marks the time when the MariaDB service was restarted on the 'Master' node, which reset the counter.
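To make the graph's units concrete, here is a minimal sketch of how a per-minute change can be derived from a cumulative counter polled every 90 seconds. The function name `per_minute_change` and the sample values are made up for illustration; this is not the actual code of the monitoring tool, only an assumption about how such a normalization typically works. Innodb_row_lock_time itself is reported in milliseconds.

```python
def per_minute_change(samples):
    """Convert cumulative counter samples into per-minute change.

    samples: list of (timestamp_in_seconds, counter_value) in polling order.
    Returns a list of (timestamp_in_seconds, change_per_minute).
    """
    rates = []
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        elapsed = t1 - t0
        if elapsed <= 0:
            # Skip duplicate or out-of-order polls.
            continue
        # Scale the raw difference to a 60-second window.
        rates.append((t1, (v1 - v0) * 60.0 / elapsed))
    return rates


# Hypothetical Innodb_row_lock_time readings (ms), polled every 90 seconds.
# The last sample imitates a counter reset after a service restart.
example = [(0, 1000), (90, 1300), (180, 1900), (270, 50)]
rates = per_minute_change(example)
```

With this kind of normalization, a service restart shows up as a single large negative datapoint, because the counter starts again from zero.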
My question is how to investigate this symptom further and, if possible, identify the culprit queries or operations. I also log the InnoDB Monitor output and slow queries to logfiles, but I haven't been able to find anything out of the ordinary during the period when Innodb_row_lock_time was growing rapidly (although I'm no DB expert).
Is there any other logging functionality I can enable to get more information on this? And if the InnoDB Monitor output should already contain the information needed, what exactly should I be looking for? What kind of operation could lead to rows being locked for so long, considering how suddenly the issue appeared? Is there any way of knowing which rows were locked and whether it was a read or a write lock?
Lastly, is there any reason to think this could be attributed to a MariaDB bug rather than to the application misbehaving?
Thank you in advance,
George