Hello,
I'm trying to troubleshoot a replication performance issue between two versions of MariaDB. I know we are running two EOL versions, but I want to rule out other issues before prioritizing a migration to a newer version.
I have built a new dedicated replication server (B) to replace the current dedicated replica (A). A and B run identical OS versions and near-identical hardware in a RAID 10 configuration. The main difference is that B has newer disks that slightly outperform A's; I benchmarked both A and B to validate this. Both A and B are configured identically (other than server-id).
Replica A (and the primary) is running MariaDB 10.3.31.
Replica B is running 10.3.39.
A few other things to note:
- B is replicating from A, not from the primary -- that is, until we sunset A.
- A is handling moderate read-only traffic.
- B is running only as a replica, with no other responsibilities.
I have external monitoring that alerts when either replica hits 500+ Seconds_Behind_Master. Since standing B up, I am seeing it fall behind a few times a day. The interesting thing is that A is not falling behind the primary.
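(For reference, the monitor is essentially doing the following on each replica; the 500-second threshold is ours, the rest is just a generic sketch of the check, not the exact wrapper we run:)

-- Read replication status on the replica; the monitor reads the
-- Seconds_Behind_Master column from this output and alerts at >= 500.
SHOW SLAVE STATUS;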
Upon further examination, a scheduled job that executes hundreds of similar delete operations
off the same table (that has approx. 3 million records) is where B is largely getting hung up. Something like this:
DELETE FROM table WHERE x = 800000 AND y = 40000
The table has an index on just x.
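For context, here is a simplified sketch of what the relevant part of the table looks like. Everything except x, y, and table_index is a placeholder; the INT types are my assumption based on the key_length of 4 reported in the plans below.

-- Simplified sketch only; the real table has more columns.
CREATE TABLE `table` (
  id BIGINT NOT NULL AUTO_INCREMENT,  -- placeholder primary key
  x  INT NOT NULL,                    -- assumed 4-byte INT (key_length "4")
  y  INT NOT NULL,                    -- assumed 4-byte INT
  PRIMARY KEY (id),
  KEY table_index (x)                 -- the only secondary index before my change
) ENGINE=InnoDB;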
Running ANALYZE FORMAT=JSON SELECT * FROM table WHERE x = 800000 AND y = 40000 on each replica shows a striking difference in total time spent:
REPLICA B:
{
"query_block": {
"select_id": 1,
"r_loops": 1,
"r_total_time_ms": 490.13,
"table": {
"table_name": "table",
"access_type": "ref",
"possible_keys": ["table_index"],
"key": "table_index",
"key_length": "4",
"used_key_parts": ["x"],
"ref": ["const"],
"r_loops": 1,
"rows": 795555,
"r_rows": 31,
"r_total_time_ms": 490.08,
"filtered": 100,
"r_filtered": 100,
"index_condition": "table.y = '40000'"
}
}
}
vs:
REPLICA A:
{
"query_block": {
"select_id": 1,
"r_loops": 1,
"r_total_time_ms": 106.37,
"table": {
"table_name": "table",
"access_type": "ref",
"possible_keys": ["table_index"],
"key": "table_index",
"key_length": "4",
"used_key_parts": ["x"],
"ref": ["const"],
"r_loops": 1,
"rows": 1146482,
"r_rows": 31,
"r_total_time_ms": 106.34,
"filtered": 100,
"r_filtered": 100,
"index_condition": "table.y = '40000'"
}
}
}
That is a significant difference performance-wise. From a hardware point of view, B should have a slight edge over A, but the contrary is true by a large margin.
As a short-term "fix", I added a secondary index:
CREATE INDEX table_index2 ON table (x, y);
This significantly speeds up the operation and, for now, seems to deal with much of the replication lag seen on B:
REPLICA B (with the new index):
{
"query_block": {
"select_id": 1,
"r_loops": 1,
"r_total_time_ms": 0.253,
"table": {
"table_name": "table",
"access_type": "ref",
"possible_keys": [
"table_index",
"table_index2"
],
"key": "table_index2,
"key_length": "8",
"used_key_parts": ["x", "y"],
"ref": ["const", "const"],
"r_loops": 1,
"rows": 31,
"r_rows": 31,
"r_total_time_ms": 0.2037,
"filtered": 100,
"r_filtered": 100
}
}
}
This seems reasonable as a general optimization, but it gives me little insight into the root cause of the performance issue on B. I'm hesitant to retire A in favor of B until I can get to the bottom of this.
My question for the community is: what would you recommend I do to get further perspective on this issue and determine a course of action?
Thanks everyone,
Mike