On Sun, 14 Jan 2024, 04:39 Michael Caplan via discuss, <discuss@lists.mariadb.org> wrote:
Hello,

I'm trying to troubleshoot a replication performance issue between two versions of MariaDB.  I know that we are using two EOL versions, but I want to rule out all other issues before prioritizing a migration to a newer version.

I have built a new dedicated replication server (B) to replace the current dedicated replica (A).  A and B run identical OS versions on near-identical hardware in a RAID 10 configuration.  The main difference is that B has newer disks that slightly outperform A's; I benchmarked both A and B to validate this.  Both A and B are configured identically (other than server-id).

Replica A (and the primary) is running MariaDB 10.3.31.  Replica B is running 10.3.39.

A few other things to note:
  • B is not replicating from the primary, but from A -- that is, until we sunset A.
  • A is handling moderate read-only traffic
  • B is just running as a replica with no other responsibilities

I have external monitoring that alerts when either replica hits 500+ Seconds_Behind_Master.  Since standing B up, I am seeing it fall behind a few times a day.  The interesting thing is that A is not falling behind the primary.
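
For reference, the lag figure comes straight from the replica status output on each server; Seconds_Behind_Master is the field the monitoring alerts on:

SHOW SLAVE STATUS\G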

Upon further examination, a scheduled job that executes hundreds of similar DELETE operations against the same table (which has approx. 3 million records) is where B is largely getting hung up.  Something like this:

DELETE FROM table WHERE x = 800000 AND y = 40000

The table has an index on just x.
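
For illustration, the relevant shape of the table is roughly the following; the columns x and y and the single-column index are from above, everything else (types, primary key) is a stand-in:

CREATE TABLE `table` (
  id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,  -- assumed; the actual PK is not shown in this thread
  x INT NOT NULL,  -- INT matches the key_length of 4 in the plans below
  y INT NOT NULL,  -- stand-in type
  -- ... other columns ...
  KEY table_index (x)
) ENGINE=InnoDB;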

Running ANALYZE FORMAT=JSON SELECT * FROM table WHERE x = 800000 AND y = 40000 on both A and B shows a striking difference in total time spent:

REPLICA B:
{
  "query_block": {
    "select_id": 1,
    "r_loops": 1,
    "r_total_time_ms": 490.13,
    "table": {
      "table_name": "table",
      "access_type": "ref",
      "possible_keys": ["table_index"],
      "key": "table_index",
      "key_length": "4",
      "used_key_parts": ["x"],
      "ref": ["const"],
      "r_loops": 1,
      "rows": 795555,
      "r_rows": 31,
      "r_total_time_ms": 490.08,
      "filtered": 100,
      "r_filtered": 100,
      "index_condition": "table.y = '40000'"
    }
  }
}
vs:
REPLICA A:
{
  "query_block": {
    "select_id": 1,
    "r_loops": 1,
    "r_total_time_ms": 106.37,
    "table": {
      "table_name": "table",
      "access_type": "ref",
      "possible_keys": ["table_index"],
      "key": "table_index",
      "key_length": "4",
      "used_key_parts": ["x"],
      "ref": ["const"],
      "r_loops": 1,
      "rows": 1146482,
      "r_rows": 31,
      "r_total_time_ms": 106.34,
      "filtered": 100,
      "r_filtered": 100,
      "index_condition": "table.y = '40000'"
    }
  }
}

That is a significant difference performance-wise.  From a hardware point of view, B should have a slight edge over A, but the contrary is true by a large margin.
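
One detail that stands out when comparing the two plans: the optimizer's rows estimate differs between B (795555) and A (1146482), which hints that the two servers hold different index statistics.  A way to compare them directly, assuming persistent statistics (the 10.3 default) and a placeholder schema name, is:

SELECT index_name, stat_name, stat_value, sample_size, last_update
FROM mysql.innodb_index_stats
WHERE database_name = 'your_db'  -- placeholder
  AND table_name = 'table'
ORDER BY index_name, stat_name;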


As a short-term "fix", I added a secondary index:

create index table_index2 on table (x, y);

This significantly speeds up the operation and, for now, seems to deal with much of the replication lag experienced on B:

REPLICA B:
{
  "query_block": {
    "select_id": 1,
    "r_loops": 1,
    "r_total_time_ms": 0.253,
    "table": {
      "table_name": "table",
      "access_type": "ref",
      "possible_keys": [
        "table_index",
        "table_index2"
      ],
      "key": "table_index2,
      "key_length": "8",
      "used_key_parts": ["x", "y"],
      "ref": ["const", "const"],
      "r_loops": 1,
      "rows": 31,
      "r_rows": 31,
      "r_total_time_ms": 0.2037,
      "filtered": 100,
      "r_filtered": 100
    }
  }
}

This seems reasonable as a general optimization, but it does little to explain the root cause of my performance issue with B.  I'm hesitant to retire A in favor of B until I can get to the bottom of this.

The question I have for the community is: what would you recommend I do to get further perspective on this issue and determine a course of action?

Given that the explain uses the same index on both, a cold buffer pool could be one possible explanation.
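
To test the cold-pool theory, compare pool fill and read rates on both servers, e.g.:

SELECT POOL_SIZE, FREE_BUFFERS, DATABASE_PAGES, PAGES_READ_RATE, HIT_RATE
FROM information_schema.INNODB_BUFFER_POOL_STATS;

A large FREE_BUFFERS count or a high PAGES_READ_RATE on B relative to A would point to a pool that is still warming up.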

Did you make sure the configuration is the same on both servers, particularly the innodb_buffer_pool% and innodb_log% settings?
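
For example, run these on both servers and diff the output:

SHOW GLOBAL VARIABLES LIKE 'innodb_buffer_pool%';
SHOW GLOBAL VARIABLES LIKE 'innodb_log%';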

Otherwise:

- Use binlog_format=ROW on all servers (see the config sketch after this list)

- Make sure all your tables have primary keys defined

- It sounds like you should have that secondary index anyway, on all the servers

- If a lot of rows match, use pt-archiver instead, which will delete matching rows in small batches by primary key, minimizing lag (see the example after this list)

- Increase slave_parallel_threads (but not too high)
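
As a sketch of the configuration side (the values here are illustrative, not tuned recommendations):

# my.cnf, on all servers
[mysqld]
binlog_format = ROW
slave_parallel_threads = 4  # start small and measure before raising

And a hedged example of the pt-archiver approach, with placeholder connection details and the WHERE clause from the problem query (run against the primary so the deletes replicate down):

pt-archiver \
  --source h=primary-host,D=your_db,t=table \
  --where "x = 800000 AND y = 40000" \
  --limit 500 --commit-each \
  --purge

--purge deletes the matching rows instead of copying them elsewhere; --limit together with --commit-each keeps each transaction small, which is what minimizes the replication lag.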