Replication Speed / Lag 10.3.31 vs 10.3.39
Hello,

I'm trying to troubleshoot a replication performance issue between two versions of MariaDB. I know that we are using two EOL versions, but I want to rule out all other issues before prioritizing a migration to a new version.

I have built a new dedicated replication server (B) to replace the current dedicated replica (A). A and B are both running identical OSes and versions and near-identical hardware in a RAID 10 configuration. The main difference is that B has newer disks that slightly outperform A's; I benchmarked both A and B to validate this. Both A and B are configured identically (other than server-id).

Replica A (and the primary) is running MariaDB 10.3.31. Replica B is running 10.3.39.

A few other things to note:

* B is not feeding off of the primary, but off A -- that is, until we sunset A.
* A is handling moderate read-only traffic.
* B is just running as a replica with no other responsibilities.

I have external monitoring to alert when either hits 500+ Seconds_Behind_Master. Since standing B up, I am seeing it fall behind a few times a day. The interesting thing is that A is not falling behind the primary.

Upon further examination, a scheduled job that executes hundreds of similar delete operations against the same table (which has approx. 3 million records) is where B is largely getting hung up. Something like this:

DELETE FROM table WHERE x = 800000 AND y = 40000

The table has an index on just x.

Running ANALYZE FORMAT=JSON SELECT * FROM table WHERE x = 800000 AND y = 40000 on B and on A shows a striking difference in total time spent:

REPLICA B:
{
  "query_block": {
    "select_id": 1,
    "r_loops": 1,
    "r_total_time_ms": 490.13,
    "table": {
      "table_name": "table",
      "access_type": "ref",
      "possible_keys": ["table_index"],
      "key": "table_index",
      "key_length": "4",
      "used_key_parts": ["x"],
      "ref": ["const"],
      "r_loops": 1,
      "rows": 795555,
      "r_rows": 31,
      "r_total_time_ms": 490.08,
      "filtered": 100,
      "r_filtered": 100,
      "index_condition": "table.y = '40000'"
    }
  }
}

vs:

REPLICA A:
{
  "query_block": {
    "select_id": 1,
    "r_loops": 1,
    "r_total_time_ms": 106.37,
    "table": {
      "table_name": "table",
      "access_type": "ref",
      "possible_keys": ["table_index"],
      "key": "table_index",
      "key_length": "4",
      "used_key_parts": ["x"],
      "ref": ["const"],
      "r_loops": 1,
      "rows": 1146482,
      "r_rows": 31,
      "r_total_time_ms": 106.34,
      "filtered": 100,
      "r_filtered": 100,
      "index_condition": "table.y = '40000'"
    }
  }
}

That is a significant difference performance-wise. From a hardware point of view, B should have a slight edge over A. But the contrary is true, by a large margin.

As a short-term "fix", I added a secondary index:

create index table_index2 on table(x, y);

This significantly speeds up the operation and, for now, seems to deal with much of the replication lag experienced on B:

SERVER B:
{
  "query_block": {
    "select_id": 1,
    "r_loops": 1,
    "r_total_time_ms": 0.253,
    "table": {
      "table_name": "table",
      "access_type": "ref",
      "possible_keys": ["table_index", "table_index2"],
      "key": "table_index2",
      "key_length": "8",
      "used_key_parts": ["x", "y"],
      "ref": ["const", "const"],
      "r_loops": 1,
      "rows": 31,
      "r_rows": 31,
      "r_total_time_ms": 0.2037,
      "filtered": 100,
      "r_filtered": 100
    }
  }
}

This seems reasonable as a general optimization, but provides little comfort in understanding the root cause of my performance issue with B. I'm hesitant to retire A in favor of B until I can get to the bottom of this.

The question I have for the community is: what would you recommend that I do to get further perspective on the issue and determine a course of action?
Thanks everyone,

Mike
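For readers following along, a minimal sketch of how the lag and the query plan can be inspected directly on a replica (placeholder table and column names as in the post above; note that ANALYZE actually executes the statement it is given, which is presumably why a SELECT with the same WHERE clause was used as a stand-in for the DELETE):

-- Replication delay as reported by the replica itself (Seconds_Behind_Master):
SHOW SLAVE STATUS\G

-- Non-destructive look at the plan the replicated DELETE would use:
EXPLAIN DELETE FROM table WHERE x = 800000 AND y = 40000;

-- Timed read path for the same predicate, without deleting anything:
ANALYZE FORMAT=JSON SELECT * FROM table WHERE x = 800000 AND y = 40000;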
On Sun, 14 Jan 2024, 04:39 Michael Caplan via discuss, < discuss@lists.mariadb.org> wrote:
Given the explain uses the same index on both, a cold buffer pool could be one possible explanation. Did you make sure the configuration is the same on both servers, particularly the innodb_buffer_pool% and innodb_log% settings (see the sketch below)? Otherwise:

- Use binlog_format=ROW on all servers.
- Make sure all your tables have primary keys defined.
- It sounds like you should have that secondary index anyway, on all the servers.
- If a lot of rows match, use pt-archiver instead, which will delete matching rows in small batches by primary key, which will minimize lag.
- Increase slave_parallel_threads (but not too high).
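A minimal sketch of how the settings mentioned above could be compared side by side on A and B (run from the mysql client on each server; these are only the variable-name patterns, nothing specific to this setup is assumed):

SHOW GLOBAL VARIABLES LIKE 'innodb_buffer_pool%';
SHOW GLOBAL VARIABLES LIKE 'innodb_log%';
SHOW GLOBAL VARIABLES LIKE 'binlog_format';
SHOW GLOBAL VARIABLES LIKE 'slave_parallel%';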
Michael Caplan via discuss <discuss@lists.mariadb.org> writes:
I'm trying to troubleshoot a replication performance issue between two versions of MariaDB.
Replica A (and primary) is running MariaDB 10.3.31. Replica B is running 10.3.39.
* B is not feeding off of the primary, but off A -- that is, until we sunset A.
* A is handling moderate read-only traffic.
* B is just running as a replica with no other responsibilities.
I have external monitoring to alert when either hits 500+ Seconds_Behind_Master. Since standing B up, I am seeing it fall behind a few times a day. The interesting thing is that A is not falling behind the primary.
Upon further examination, a scheduled job that executes hundreds of similar delete operations against the same table (which has approx. 3 million records) is where B is largely getting hung up. Something like this:
DELETE FROM table WHERE x = 800000 AND y = 40000
The table has an index on just x.
I assume that the table is using InnoDB. What is the approximate size of the table on disk, and what is the size of the InnoDB buffer pool?
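A minimal sketch of one way to pull those two numbers, assuming InnoDB and reusing the placeholder table name from the post (substitute the real schema and table name):

-- Approximate on-disk size of the table, data plus indexes, in MB:
SELECT table_name,
       ROUND((data_length + index_length) / 1024 / 1024) AS size_mb
FROM information_schema.TABLES
WHERE table_name = 'table';

-- Configured InnoDB buffer pool size, in bytes:
SHOW GLOBAL VARIABLES LIKE 'innodb_buffer_pool_size';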
Running ANALYZE FORMAT=JSON SELECT * FROM table WHERE x = 800000 AND y = 40000 on B and on A shows a striking difference in total time spent:
REPLICA B: { "query_block": {
"key": "table_index",
"rows": 795555, "r_rows": 31, "r_total_time_ms": 490.08,
REPLICA A: { "query_block": {
"key": "table_index",
"rows": 1146482, "r_rows": 31, "r_total_time_ms": 106.34,
So the queries execute the same on either server (same query plan). IIUC, the query needs to examine a large number of rows, ~1M, to find the few matching rows, 31 here.
As a short-term "fix", I added a secondary index: create index table_index2 on table(x, y); This significantly speeds up the operation and, for now, seems to deal with much of the replication lag experienced on B:
"query_block": {
"key": "table_index2,
"rows": 31, "r_rows": 31, "r_total_time_ms": 0.2037,
Here, the query only needs to touch the few matching rows, 31 again. But the query is still taking more than 40% of the time of the query that examined almost a million rows.

This all suggests that the bottleneck is disk I/O. It would be surprising to see a query spend 0.2 *seconds* to read 31 rows through an index entirely from memory. Unless there are some huge blobs being selected in each row?

What happens if the query is re-run immediately afterwards? Does the fast query (using table_index2) then complete much faster? That would be expected if the first run is bottlenecked on I/O, since the second run would run completely from cache.

If I/O is the bottleneck, due to "cold" data not being in the buffer pool, that could explain why B is lagging behind while A is not. Since A is serving read traffic, this traffic may be keeping the table in the buffer pool, allowing the queries to run a little faster, while on B the replication lag will spike while the table is being read into the buffer pool.

This is rather speculative, of course; it is hard to determine the root performance bottleneck from only this kind of data. If the table really has only 3 million rows, and 1 million of those are scanned in 0.5 seconds by a single query, it's a bit hard to imagine that I/O for that table alone could cause 500+ seconds of lag. Still, with a 31-row index scan taking 0.2 seconds, I would start by examining I/O as the likely bottleneck.

Optimistic parallel replication might be an effective means to eliminate the lag, as it would allow these queries to run in parallel on slave B. If the bottleneck is I/O, this could help even if there are many conflicts between the parallel queries, as it will populate the buffer pool faster.

Hope this helps,

 - Kristian.
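For anyone wanting to try the optimistic parallel replication suggested above, a minimal sketch for MariaDB 10.3 (the thread count of 4 is an arbitrary illustrative value; changing these variables requires the replication threads to be stopped, and SET GLOBAL does not persist across a restart, so the same settings would also need to go into the server's configuration file):

STOP SLAVE;
SET GLOBAL slave_parallel_mode = 'optimistic';
SET GLOBAL slave_parallel_threads = 4;
START SLAVE;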
participants (3)
- Gordan Bobic
- Kristian Nielsen
- Michael Caplan