[Maria-discuss] Fix different gtid-positions on domain 0 in multi-master
Hello,

After moving half of the databases on our primary master-master cluster to a different cluster, we have a problem on our backup server, which is now a slave of both clusters.

Topology before:           master1 <-> master2 -> backup-slave
Topology during migration: master1 <-> master2 -> backup-slave <-> master3 <-> master4
Final topology:            master1 <-> master2 -> backup-slave <- master3 <-> master4

The gtid_domain_ids before/during the migration were:

  master1/master2 : 0
  master3/master4 : 20

Replication settings:

  master2 -> backup: Replicate_Do_Domain_Ids: 0
  master3 -> backup: Replicate_Do_Domain_Ids: 20
  backup -> master3: Replicate_Do_DB: DbToMigrate1,DbToMigrate2,etc

Server version: 10.1.9

So after the migration, a typical gtid position would be:

  master1/2: 0-1-12345
  master3/4: 0-1-12345,20-21-5678

As long as I keep the connection backup->master3 running (safe, because no new transactions on the migrated databases occur on master1/2 anymore), the position on domain 0 gets recorded on master3. The problem is that, as soon as I stop that connection, master2 and master3 have different gtid positions for domain 0, and a stop/start of the replication master3->backup results in the error:

  "Got fatal error 1236 from master when reading data from binary log: 'Error: connecting slave requested to start from GTID 0-1-3898746614, which is not in the master's binlog'"

I have tried moving master1/2 to domain_id 1 and removing domain_id 0 from the gtid_slave_pos on backup, but starting the replication master2->backup results in the error:

  "Got fatal error 1236 from master when reading data from binary log: 'Could not find GTID state requested by slave in any binlog files. Probably the slave state is too old and required binlog files have been purged.'"

I have tried to find a way to insert an empty transaction with the last gtid of domain_id 0 on master3, to bring master2/master3 back in sync on that domain, but I could not find a way to do that in MariaDB.

I tried to find a way to purge domain 0 from master3/master4, but the only way I have found so far is doing a "RESET MASTER" on master3, which would break replication between master3 and master4.

Are there other ways to fix this issue, so I can have reliable replication master3->backup without having to keep the dummy replication backup->master3 indefinitely?

Regards,
Reinder Cuperus
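For reference, a setup like the one described above could be configured roughly as follows on the backup server, using one multi-source connection per master, each restricted to one domain. This is only a sketch: the connection names, hostnames and credentials are placeholders, not the actual configuration used here.

  CHANGE MASTER 'from_master2' TO
      MASTER_HOST='master2', MASTER_USER='repl', MASTER_PASSWORD='...',
      MASTER_USE_GTID=slave_pos,
      DO_DOMAIN_IDS=(0);

  CHANGE MASTER 'from_master3' TO
      MASTER_HOST='master3', MASTER_USER='repl', MASTER_PASSWORD='...',
      MASTER_USE_GTID=slave_pos,
      DO_DOMAIN_IDS=(20);

  START ALL SLAVES;

The backup->master3 connection with its per-database filter would, in the same spirit, use replicate_do_db on master3, e.g. in its server config:

  [mysqld]
  replicate_do_db = DbToMigrate1
  replicate_do_db = DbToMigrate2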
Reinder Cuperus <reinder@snt.utwente.nl> writes:
The problem is that, as soon as I stop that connection, master2 and master3 have different gtid positions for domain 0, and a stop/start of the replication master3->backup results in the error: "Got fatal error 1236 from master when reading data from binary log: 'Error: connecting slave requested to start from GTID 0-1-3898746614, which is not in the master's binlog'"
Yes. backup sees that it is ahead of master3 in domain 0, so it aborts to avoid risk of diverging replication.
I have tried moving master1/2 to domain_id:1, and removing the domain_id:0 from the gtid_slave_pos on backup, but starting the replication master2->backup results in the error: Got fatal error 1236 from master when reading data from binary log: 'Could not find GTID state requested by slave in any binlog files. Probably the slave state is too old and required binlog files have been purged.'
Yes. Because backup now sees that it is far behind in domain 0 (as it sees the world), and aborts to not silently lose transactions.
I tried finding a way to purge domain 0 from master3/master4, but the only way I have found so far is doing a "RESET MASTER" on master3, which would break replication between master3 and master4.
Yes, I guess this is what you need. You have made a copy and removed half of the data, and now you need to similarly remove half of the binlog. Even if there are no actual transactions left from a domain in non-purged binlogs, the binlogs still remember the history of all domains, in order to not silently lose transactions for a slave that gets far behind.

It would be useful in general to be able to purge a domain from a binlog. But currently the only way I can think of is RESET MASTER.

You can see how this binlog history looks by checking @@gtid_binlog_state, and in the GTID_LIST events at the head of each binlog file.
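For example, the recorded domain history can be inspected like this (the binlog file name below is just an example):

  SELECT @@gtid_binlog_state;
  SHOW BINARY LOGS;
  SHOW BINLOG EVENTS IN 'mariadb-bin.000042' LIMIT 5;

The GTID_LIST event shows up among the first few events of each binlog file.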
I have tried to find a way to insert an empty transaction, with the last gtid of domain_id 0, on master3, to bring master2/master3 back in sync on that domain, but I could not find a way to do that in MariaDB.
The server will not binlog an empty transaction, but a dummy transaction should work, eg. create and drop a dummy table, for example:

  CREATE TABLE dummy_table (a INT PRIMARY KEY);
  SET gtid_domain_id= 0;
  SET gtid_server_id= 1;
  SET gtid_seq_no= 3898746614;
  DROP TABLE dummy_table;

Maybe this way you can make the binlogs look like they are in sync to the replication, not sure. It might be tricky, but then you do seem to have a good grasp of the various issues involved.
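If the dummy transaction goes through as intended, the effect should afterwards be visible in the binlog state on master3, e.g. (the values here are just the ones from the example above):

  SELECT @@gtid_binlog_pos;
  -- should now show 0-1-3898746614 for domain 0, alongside domain 20
  SELECT @@gtid_binlog_state;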
Are there other ways to fix this issue, so I can have reliable replication master3->backup without having to keep the dummy replication backup->master3 indefinitely?
I guess you would need to stop traffic to master3/master4 while getting them in sync with one another, and then do RESET MASTER on both and SET GLOBAL gtid_slave_pos="" to start replication from scratch. You would then also need to have server 'backup' up-to-date with master3 before the RESET MASTER, and remove domain id 20 from the gtid_slave_pos on backup after the RESET MASTER.

So that is quite intrusive.

 - Kristian.
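A rough sketch of the corresponding adjustment on backup after such a RESET MASTER (the position value is only illustrative, and gtid_slave_pos can, as far as I know, only be changed while the slave connections are stopped):

  STOP ALL SLAVES;
  -- e.g. from "0-1-3898746614,20-21-5678": drop domain 20, keep domain 0
  SET GLOBAL gtid_slave_pos = "0-1-3898746614";
  START ALL SLAVES;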
On Fri, Mar 17, 2017 at 08:33:07AM +0100, Kristian Nielsen wrote:
Reinder Cuperus <reinder@snt.utwente.nl> writes:
The problem is that, as soon as I stop that connection, master2 and master3 have different gtid positions for domain 0, and a stop/start of the replication master3->backup results in the error: "Got fatal error 1236 from master when reading data from binary log: 'Error: connecting slave requested to start from GTID 0-1-3898746614, which is not in the master's binlog'"
Yes. backup sees that it is ahead of master3 in domain 0, so it aborts to avoid risk of diverging replication.
I have tried moving master1/2 to domain_id:1, and removing the domain_id:0 from the gtid_slave_pos on backup, but starting the replication master2->backup results in the error: Got fatal error 1236 from master when reading data from binary log: 'Could not find GTID state requested by slave in any binlog files. Probably the slave state is too old and required binlog files have been purged.'
Yes. Because backup now sees that it is far behind in domain 0 (as it sees the world), and aborts to not silently lose transactions.
I tried finding a way to purge domain 0 from master3/master4, but the only way I have found so far is doing a "RESET MASTER" on master3, which would break replication between master3 and master4.
Yes, I guess this is what you need. You have made a copy and removed half of the data, and now you need to similarly remove half of the binlog. Even if there are no actual transactions left from a domain in non-purged binlogs, the binlogs still remember the history of all domains, in order to not silently lose transactions for a slave that gets far behind.
It would be useful in general to be able to purge a domain from a binlog. But currently the only way I can think of is RESET MASTER.
You can see how this binlog history looks by checking @@gtid_binlog_state, and in the GTID_LIST events at the head of each binlog file.
Thank you for this insightful information.
I have tried to find a way to insert an empty transaction, with the last gtid of domain_id 0, on master3, to bring master2/master3 back in sync on that domain, but I could not find a way to do that in MariaDB.
The server will not binlog an empty transaction, but a dummy transaction should work, eg. create and drop a dummy table, for example:
  CREATE TABLE dummy_table (a INT PRIMARY KEY);
  SET gtid_domain_id= 0;
  SET gtid_server_id= 1;
  SET gtid_seq_no= 3898746614;
  DROP TABLE dummy_table;
Maybe this way you can make the binlogs look like they are in sync to the replication, not sure. It might be tricky, but then you do seem to have a good grasp of the various issues involved.
I have created a test setup with Vagrant and will use it to test this method. So far this seems to be the solution with the least impact. If at a later point MariaDB gets an option to purge a domain, I'll use that to remove this unused domain from all the machines.
Are there other ways to fix this issue, so I can have reliable replication master3->backup without having to keep the dummy replication backup->master3 indefinitely?
I guess you would need to stop traffic to master3/master4 while getting them in sync with one another, and then do RESET MASTER on both and SET GLOBAL gtid_slave_pos="" to start replication from scratch. You would then also need to have server 'backup' up-to-date with master3 before the RESET MASTER, and remove domain id 20 from the gtid_slave_pos on backup after the RESET MASTER.
So that is quite intrusive.
Stopping traffic to master3/4 might indeed be quite disruptive. I could imagine the following sequence working in that case, assuming master3 is primary in keepalived:

  - master3> FLUSH TABLES WITH READ LOCK; STOP SLAVE;
  - master4> wait till the slave is quiet (one way to check this is sketched below) --> STOP SLAVE;
  - master3> RESET MASTER; SET GLOBAL gtid_slave_pos="";
  - master4> RESET MASTER; SET GLOBAL gtid_slave_pos="";
  - master3> UNLOCK TABLES;
  - master4> START SLAVE;
  - master3> START SLAVE;

Properly scripted, and run at a moment when there are no long transactions and no replication lag, the impact would not be noticeable to the end users. But failure would mean broken replication and a need to rebuild master4, causing a temporary loss of redundancy.

Thank you for the information.

Regards,
Reinder Cuperus
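One possible way to implement the "wait till the slave is quiet" step in the sequence above is MASTER_GTID_WAIT() (the position value and timeout here are only illustrative):

  -- on master3, after FLUSH TABLES WITH READ LOCK; STOP SLAVE;
  SELECT @@gtid_binlog_pos;          -- e.g. "0-1-3898746614,20-21-5678"

  -- on master4, wait until that position has been applied, then stop
  SELECT MASTER_GTID_WAIT('0-1-3898746614,20-21-5678', 60);
  -- returns 0 once the position is reached, -1 on timeout
  STOP SLAVE;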