On Sun, 14 Jan 2024, 04:39 Michael Caplan via discuss, <discuss@lists.mariadb.org> wrote:
Hello,
I'm trying to troubleshoot a replication performance issue between two versions of MariaDB. I know that we are using two EOL versions, but I want to rule out all other issues before prioritizing a migration to a newer version.
I have built a new dedicated replication server (B) to replace the current dedicated replica (A). A and B are both running identical OSes and versions and near-identical hardware in a RAID 10 configuration. The main difference is that B has newer disks that slightly outperform A's; I benchmarked both A and B to validate this. Both A and B are configured identically (other than server-id).
Replica A (and primary) is running MariaDB 10.3.31. Replica B is running 10.3.39.
A few other things to note:
- B is not feeding off of the primary, but off A -- that is, until we sunset A.
- A is handling moderate read-only traffic.
- B is just running as a replica with no other responsibilities.
I have external monitoring that alerts when either hits 500+ Seconds_Behind_Master. Since standing B up, I have seen it fall behind a few times a day. The interesting thing is that A is not falling behind the primary.
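(For reference, the lag figure is the Seconds_Behind_Master column from the standard replica status output; the monitoring essentially polls something like the following on each replica, and the 500-second threshold is just our alerting choice:)

-- run on each replica; the alert fires when Seconds_Behind_Master >= 500
SHOW SLAVE STATUS\G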
Upon further examination, a scheduled job that executes hundreds of similar delete operations against the same table (which has approx. 3 million records) is where B is largely getting hung up. Something like this:
DELETE FROM table WHERE x = 800000 AND y = 40000
The table has an index on just x.
Running *ANALYZE FORMAT=JSON SELECT * FROM table WHERE x = 800000 AND y = 40000* on A and B shows a striking difference in total time spent:
REPLICA B: { "query_block": { "select_id": 1, "r_loops": 1, * "r_total_time_ms": 490.13,* "table": { "table_name": "table", "access_type": "ref", "possible_keys": ["table_index"], "key": "table_index", "key_length": "4", "used_key_parts": ["x"], "ref": ["const"], "r_loops": 1, "rows": 795555, "r_rows": 31, "r_total_time_ms": 490.08, "filtered": 100, "r_filtered": 100, "index_condition": "table.y = '40000'" } } }
vs:
REPLICA A: { "query_block": { "select_id": 1, "r_loops": 1, * "r_total_time_ms": 106.37,* "table": { "table_name": "table", "access_type": "ref", "possible_keys": ["table_index"], "key": "table_index", "key_length": "4", "used_key_parts": ["x"], "ref": ["const"], "r_loops": 1, "rows": 1146482, "r_rows": 31, "r_total_time_ms": 106.34, "filtered": 100, "r_filtered": 100, "index_condition": "table.y = '40000'" } } }
That is a *significant* difference performance-wise. From a hardware point of view, B should have a slight edge over A, but the contrary is true by a large margin.
As a short term "fix", I added a secondary index: *create index table_index2 on table(x, y); *This significantly speeds up this operation, and, for now, seems to deal with much of the experienced replication lab in B:
SERVER B: { "query_block": { "select_id": 1, "r_loops": 1, * "r_total_time_ms": 0.253,* "table": { "table_name": "table", "access_type": "ref", "possible_keys": [ "table_index", "table_index2" ], "key": "table_index2, "key_length": "8", "used_key_parts": ["x", "y"], "ref": ["const", "const"], "r_loops": 1, "rows": 31, "r_rows": 31, "r_total_time_ms": 0.2037, "filtered": 100, "r_filtered": 100 } } }
This seems reasonable as a *general optimization*, but provides little comfort in understanding the root cause of my performance issue with B. I'm hesitant to retire A in favor of B until I can get to the bottom of this.
The question I have for the community is: *what would you recommend that I do to get further perspective on the issue to determine a course of action?*
Given that the explain uses the same index on both, a cold buffer pool could be one possible explanation. Did you make sure the configuration is the same on both servers, particularly the innodb_buffer_pool% and innodb_log% settings? (A quick way to compare them is shown below.)

Otherwise:
- Use binlog_format=ROW on all servers.
- Make sure all your tables have primary keys defined.
- It sounds like you should have that secondary index anyway, on all the servers.
- If a lot of rows match, use pt-archiver instead, which will delete matching rows in small batches by primary key, which will minimize lag (see the example command below).
- Increase slave_parallel_threads (but not too high).
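As a rough sketch of both suggestions (the host, database, and table names as well as the batch size below are placeholders, not your actual values):

-- run on both A and B and diff the output
SHOW GLOBAL VARIABLES LIKE 'innodb_buffer_pool%';
SHOW GLOBAL VARIABLES LIKE 'innodb_log%';

# delete matching rows in small batches on the primary, so each replica
# only has to apply small transactions (DSN values are placeholders)
pt-archiver --source h=primary-host,D=mydb,t=table \
  --where "x = 800000 AND y = 40000" \
  --purge --limit 1000 --commit-each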