Hello,
I'm trying to troubleshoot a replication performance issue between two versions of MariaDB. I know we are running two EOL versions, but I want to rule out other causes before prioritizing a migration to a newer version.
I have built a new dedicated replication server (B) to replace the current dedicated replica (A). A and B run identical OS versions and near-identical hardware in a RAID 10 configuration. The main difference is that B has newer disks that slightly outperform A's; I benchmarked both servers to confirm this. A and B are configured identically (other than server-id).
Replica A (and the primary) is running MariaDB 10.3.31.
Replica B is running 10.3.39.
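To rule out config drift behind the "configured identically" claim, I can diff the server variables on both hosts with something like the following (the LIKE patterns are only an illustrative subset of settings that affect replication and InnoDB behaviour):
-- Run on both A and B and diff the two result sets.
SELECT variable_name, variable_value
FROM information_schema.global_variables
WHERE variable_name LIKE 'innodb%'
   OR variable_name LIKE 'slave%'
   OR variable_name LIKE 'binlog%'
ORDER BY variable_name;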
A few other things to note:
- B is feeding off of A, not the primary -- that is, until we sunset A.
- A is handling moderate read-only traffic.
- B is just running as a replica with no other responsibilities.
I have external monitoring that alerts when either replica hits 500+ Seconds_Behind_Master. Since standing B up, I have seen it fall behind a few times a day. The interesting thing is that A is not falling behind the primary.
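For reference, the check behind that alert is essentially the following, run against each replica:
SHOW SLAVE STATUS\G
-- the alert fires when Seconds_Behind_Master in this output reaches 500 or more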
Upon further examination, a scheduled job that executes hundreds of similar delete operations against the same table (which has approx. 3 million records) is where B is largely getting hung up. Something like this:
DELETE FROM table WHERE x = 800000 AND y = 40000
The table has an index on just x.
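For context, the relevant parts of the table look roughly like this (the table and column names are already anonymized placeholders, and the column types are my assumption -- INT would match the 4-byte key_length reported in the plans below):
CREATE TABLE `table` (
  -- other columns omitted
  x INT NOT NULL,  -- assumed type
  y INT NOT NULL,  -- assumed type
  KEY table_index (x)
) ENGINE=InnoDB;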
Running ANALYZE FORMAT=JSON SELECT * FROM table WHERE x = 800000 AND y = 40000 on each replica shows a striking difference in total time spent:
REPLICA B:
{
  "query_block": {
    "select_id": 1,
    "r_loops": 1,
    "r_total_time_ms": 490.13,
    "table": {
      "table_name": "table",
      "access_type": "ref",
      "possible_keys": ["table_index"],
      "key": "table_index",
      "key_length": "4",
      "used_key_parts": ["x"],
      "ref": ["const"],
      "r_loops": 1,
      "rows": 795555,
      "r_rows": 31,
      "r_total_time_ms": 490.08,
      "filtered": 100,
      "r_filtered": 100,
      "index_condition": "table.y = '40000'"
    }
  }
}
vs:
REPLICA A:
{
  "query_block": {
    "select_id": 1,
    "r_loops": 1,
    "r_total_time_ms": 106.37,
    "table": {
      "table_name": "table",
      "access_type": "ref",
      "possible_keys": ["table_index"],
      "key": "table_index",
      "key_length": "4",
      "used_key_parts": ["x"],
      "ref": ["const"],
      "r_loops": 1,
      "rows": 1146482,
      "r_rows": 31,
      "r_total_time_ms": 106.34,
      "filtered": 100,
      "r_filtered": 100,
      "index_condition": "table.y = '40000'"
    }
  }
}
That is a significant difference performance-wise. From a hardware point of view, B should have a slight edge over A, yet the contrary is true by a large margin.
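One detail that stands out when comparing the two plans is that the optimizer's row estimate for the same lookup differs noticeably (rows: 795555 on B vs 1146482 on A), which makes me wonder whether the InnoDB index statistics have diverged between the two servers. Comparing them on each replica would look something like this (assuming persistent statistics, which are on by default):
-- Cardinality as the optimizer sees it:
SHOW INDEX FROM `table`;
-- The persistent statistics behind it:
SELECT index_name, stat_name, stat_value
FROM mysql.innodb_index_stats
WHERE table_name = 'table';
-- Recomputing the statistics if they look stale:
ANALYZE TABLE `table`;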
As a short-term "fix", I added a secondary index:
create index table_index2 on table(x, y);
This significantly speeds up the operation and, for now, seems to take care of much of the replication lag on B:
REPLICA B (with the new index):
{
  "query_block": {
    "select_id": 1,
    "r_loops": 1,
    "r_total_time_ms": 0.253,
    "table": {
      "table_name": "table",
      "access_type": "ref",
      "possible_keys": ["table_index", "table_index2"],
      "key": "table_index2",
      "key_length": "8",
      "used_key_parts": ["x", "y"],
      "ref": ["const", "const"],
      "r_loops": 1,
      "rows": 31,
      "r_rows": 31,
      "r_total_time_ms": 0.2037,
      "filtered": 100,
      "r_filtered": 100
    }
  }
}
This seems reasonable as a general optimization, but it provides little comfort in understanding the root cause of the performance issue on B. I'm hesitant to retire A in favor of B until I can get to the bottom of this.
The question I have for the community is: what would you recommend I do to get further perspective on this issue and determine a course of action?