[Maria-discuss] Long email about a replication issue
Hi All,

This weekend I had to repair a replication problem on one of our clusters. I've attempted to get to the root cause but am not sure where I can go from here.

Replication broke on Friday night because of foreign key errors. I initially restored the missing row and restarted replication, which then broke again due to a different table's FK error. I fixed this and started replication again. All was well for an hour and then it broke again with a similar issue. We called it a night and decided to address it on Saturday morning, as there was no immediate production impact.

The next day I discovered that records were missing from a bunch of tables, so I recovered the cluster using a backup. I initially recovered the slaves attached to M2 from a backup plus the binlogs from M2. This caused the slaves to break. So instead I recovered from the binlog on M1 before pointing them back to M2. This worked fine and everything in the cluster is currently well. Clearly something is wrong with the binlogs of M2.

This is the first time I have seen this issue on this cluster, which has been running fine for the last year. There have been no recent changes to the version of MariaDB or to the applications using the cluster. We are running version 10.0.23-MariaDB-log. We are running replication with GTIDs, semi-synchronous replication between the masters, and regular pt-table-checksum jobs. slave_sql_verify_checksum is on but master_verify_checksum is off.

The setup is M1<->M2 (with attached slaves). M1 is the active master receiving all writes. Access is controlled through an F5 and I don't think any errant transactions have occurred on the inactive master (M2). I've checked this by grepping the binlogs for the M2 server_id.

The initial record that broke replication was attached to a "user" table record. This user was created on Friday at 16:21. Replication broke around 11:30 PM that night. The user record had GTID 0-1-36823254 (recovered from M1).

I've looked into the appropriate binlog from M2...

The head:

    mysqlbinlog mysql-bin.002133 | grep GTID | head
    #160521  5:02:32 server id 1  end_log_pos 392        CRC32 0x97a118a8  GTID 0-1-36613622
    #160521  5:02:32 server id 1  end_log_pos 768        CRC32 0x1859f6fb  GTID 0-1-36613623

The tail:

    mysqlbinlog mysql-bin.002133 | grep GTID | tail
    #160528 11:58:04 server id 1  end_log_pos 103889368  CRC32 0x66e7c623  GTID 0-1-36851211
    #160528 11:58:04 server id 1  end_log_pos 103971892  CRC32 0xe55c4842  GTID 0-1-36851212

If I grep for the specific GTID on M2 I get nothing...

    mysqlbinlog mysql-bin.002133 | grep "GTID 0-1-36823254"    <- note the GTID is between the head & tail range

If I grep for this record by email address I also get nothing. So I must conclude that this record (and a bunch of others) never got to master 2.

We have pretty extensive monitoring of these servers and we should have been alerted to replication lag, locking etc., but we got nothing until replication broke due to the FK errors. You would expect replication to break here because of the gap in the GTIDs. This did not happen, and I'm almost certain that GTID replication could not have been deactivated and the positions messed around with.

I'm unsure of where to go now. Any ideas? Any thoughts are appreciated.

Regards,

Rhys
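A minimal sketch of the kind of binlog check described above (not part of the original mail): scanning every binlog on M2 for the GTID that is known to exist on M1. The mysql-bin.0* glob and running it from the binlog directory are assumptions; the file naming and GTID value are taken from the mail.

    # Sketch: search each M2 binlog for the GTID recovered from M1.
    for f in mysql-bin.0*; do
        if mysqlbinlog "$f" | grep -q "GTID 0-1-36823254"; then
            echo "found in $f"
        fi
    done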
<Rhys.Campbell@swisscom.com> writes:
> This weekend I had to repair a replication problem on one of our clusters. I've attempted to get to the root cause but am not sure where I can go from here.
Is this pure MariaDB replication, or is it Galera? I think it is the former, but the term "cluster" is somewhat overloaded, which is why I ask...
> The setup is M1<->M2 (with attached slaves). M1 is the active master receiving all writes. Access is controlled through an F5 and I don't think any errant transactions have occurred on the inactive master (M2). I've checked this by grepping the binlogs for the M2 server_id.
> The initial record that broke replication was attached to a "user" table record. This user was created on Friday at 16:21. Replication broke around 11:30 PM that night. The user record had GTID 0-1-36823254 (recovered from M1).
> I've looked into the appropriate binlog from M2...
> If I grep for the specific GTID on M2 I get nothing...
> If I grep for this record by email address I also get nothing. So I must conclude that this record (and a bunch of others) never got to master 2.
> We have pretty extensive monitoring of these servers and we should have been alerted to replication lag, locking etc., but we got nothing until replication broke due to the FK errors. You would expect replication to break here because of the gap in the GTIDs. This did not happen, and I'm almost certain that GTID replication could not have been deactivated and the positions messed around with.
Yeah, even if the slave was set to MASTER_USE_GTID=no, the GTIDs should still have been there in the M2 binlog.
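Whether a slave is using GTID positioning is reported in the Using_Gtid field of SHOW SLAVE STATUS; a minimal check on M2 might look like this (a sketch, with credentials and host options omitted):

    # Prints Using_Gtid: No / Slave_Pos / Current_Pos for the slave connection.
    mysql -e "SHOW SLAVE STATUS\G" | grep -i 'Using_Gtid'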
> I'm unsure of where to go now. Any ideas? Any thoughts are appreciated.
I guess you need to figure out why M2 did not apply those transactions. Some suggestions:

- Check the error log on M2 for disconnects/reconnects around the time of the transactions that are missing (or any disconnects/reconnects). Such messages should also say at what position M2 disconnected and reconnected; this could be compared to the problem GTID, and could show if transactions were skipped because of reconnecting at a wrong position.
- Also check for local slave stop/start messages in the M2 error log, to see if anything looks related or could indicate changes in the replication config (most replication changes require stopping the slave threads).
- You can also check the binlog on M1 for any out-of-order GTIDs, which could cause problems at slave reconnect (seems unlikely though) - see the sketch after this list.
- Replication filtering could cause this - double-check that no filtering was turned on or something. Also stuff like --gtid-ignore-duplicates.

Good luck,

 - Kristian.
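A rough sketch of the out-of-order check from the list above (not from Kristian's mail), assuming the same binlog naming and domain 0 / server id 1 as in the output quoted earlier; it flags any GTID sequence number that does not strictly increase:

    # Run against the M1 binlogs; prints nothing if sequence numbers only increase.
    mysqlbinlog mysql-bin.0* \
        | grep -o 'GTID 0-1-[0-9]*' \
        | awk -F- '{ if (prev != "" && $3 + 0 <= prev + 0)
                         print "out of order: " prev " then " $3;
                     prev = $3 }'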
Hi Kristian,

Thanks for your inputs...

- Standard MariaDB. No Galera.
- Error log is not interesting at all. The last replication connection error happened in March of this year.
- Last STOP SLAVE was back in 2015 (apart from Friday night).
- Last CHANGE MASTER statement was logged over a year ago (150127).
- gtid_ignore_duplicates = OFF on all hosts.
- Replication filtering - none configured on either host (we don't use it here).

I'll get my script onto it another day to check for out-of-order GTIDs.

Thanks again,

Rhys
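For reference, a minimal sketch (not part of the thread) of how the filtering and gtid_ignore_duplicates checks above can be run from the shell on each host; credentials and host options are omitted and depend on the local setup:

    # gtid_ignore_duplicates setting on this host.
    mysql -e "SHOW GLOBAL VARIABLES LIKE 'gtid_ignore_duplicates'"
    # Any configured replication filters show up in SHOW SLAVE STATUS
    # (Replicate_Do_DB, Replicate_Ignore_Table, Replicate_Wild_Do_Table, ...).
    mysql -e "SHOW SLAVE STATUS\G" | grep -i 'Replicate_'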