[Maria-discuss] GTID and missing domain
Greetings,

I've had this issue before and never quite got to the bottom of it. It keeps biting me and I'm hoping I can figure out how to definitively solve it.

In brief, my replica server has this for gtid_slave_pos:

0-303-67739600,1-303-7360639083,100-303-4337869

And my primary server has:

gtid_binlog_pos 0-303-67739600,1-303-7363061243,100-303-4338582
gtid_binlog_state 0-302-67690294,0-301-67719794,0-303-67739600,1-301-7350472534,1-302-7350381758,1-303-7363061243,100-302-4242958,100-301-4332195,100-303-4338582

That's all well and good and I can connect that way. But if I do this on the replica server:

stop slave;
select @@global.gtid_slave_pos;
0-303-67739600,1-303-7360639083,100-303-4337869
set global gtid_slave_pos = '1-303-7360639083,100-303-4337869';
start slave;

I get:

Got fatal error 1236 from master when reading data from binary log: 'Could not find GTID state requested by slave in any binlog files. Probably the slave state is too old and required binlog files have been purged.

Even though I'm positive there are no domain 0 transactions (again, hasn't been in service for years). If I add the `0-303-67739600` back in to gtid_slave_pos I can reconnect.

At this point, I have only one server of a four server chain that I can actually connect to with gtid (as above). So I'm a little reluctant to do too much experimenting. But just to ask the question, if I:

FLUSH BINARY LOGS DELETE_DOMAIN_ID=(0)

on the master, would I then be able to connect to it via

set global gtid_slave_pos = '1-303-7360639083,100-303-4337869';

from the replica?

At this point, I cannot connect any of my other servers (which currently are all replicating from the same master) to each other, even the ones that don't show the 0- domain in the gtid-binlog vars. I'm hoping if I can figure out the above scenario it might help me deal with the rest, but I just keep feeling like I'm missing something.

TIA,

Dan
mariadb@Biblestuph.com writes:
> And my primary server has:
> gtid_binlog_pos 0-303-67739600,1-303-7363061243,100-303-4338582
> gtid_binlog_state 0-302-67690294,0-301-67719794,0-303-67739600,1-301-7350472534,1-302-7350381758,1-303-7363061243,100-302-4242958,100-301-4332195,100-303-4338582
> set global gtid_slave_pos = '1-303-7360639083,100-303-4337869'; start slave;
> Got fatal error 1236 from master when reading data from binary log: 'Could not find GTID state requested by slave in any binlog files. Probably the slave state is too old and required binlog files have been purged.
> Even though I'm positive there are no domain 0 transactions (again, hasn't been in service for years).
Yes.

You write that "there are no domain 0 transactions". But from the point of view of the database, there _are_ domain 0 transactions, even though they may be long in the past. These are seen in gtid_binlog_pos (and gtid_binlog_state).

When your slave has the 0-domain in the gtid_slave_pos, the master knows that the slave is missing no transactions. When you delete the 0-domain from the slave, this is the same conceptually as saying the slave is missing _all_ transactions in domain 0, and the master must send them all (or error out if they have been purged, as here).

In general, when a slave connects, the master needs to send all transactions in a domain that the slave did not apply yet - otherwise the slave will be missing transactions and have the wrong data. This holds regardless of how old those missing transactions might be. If a slave connects two years after last being active, the system should still give a reasonable error, not silently let the slave continue with incorrect data.

That is why you get the error.
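The reasoning above can be sketched in a few lines of Python. This is only a simplified model of the idea, not MariaDB's actual connection logic: it compares the domain IDs present in the master's gtid_binlog_state with those in the replica's gtid_slave_pos, using the values from this thread. The function names are hypothetical and chosen for illustration.

```python
def gtid_domains(gtid_list: str) -> set[int]:
    """Return the domain IDs found in a comma-separated GTID list.

    Each GTID has the form domain_id-server_id-sequence_number.
    """
    domains = set()
    for gtid in gtid_list.split(","):
        gtid = gtid.strip()
        if not gtid:
            continue
        domain_id, _server_id, _seq_no = gtid.split("-")
        domains.add(int(domain_id))
    return domains


def domains_replica_must_fetch_in_full(master_binlog_state: str,
                                       replica_slave_pos: str) -> list[int]:
    """Domains the master knows about but the replica position omits.

    In this simplified model, an omitted domain means "the replica has
    applied nothing in this domain", so the master would have to send the
    whole domain history (or fail if that history has been purged).
    """
    missing = gtid_domains(master_binlog_state) - gtid_domains(replica_slave_pos)
    return sorted(missing)


# Values taken from the thread.
MASTER_BINLOG_STATE = (
    "0-302-67690294,0-301-67719794,0-303-67739600,"
    "1-301-7350472534,1-302-7350381758,1-303-7363061243,"
    "100-302-4242958,100-301-4332195,100-303-4338582"
)
POS_WITH_DOMAIN_0 = "0-303-67739600,1-303-7360639083,100-303-4337869"
POS_WITHOUT_DOMAIN_0 = "1-303-7360639083,100-303-4337869"

print(domains_replica_must_fetch_in_full(MASTER_BINLOG_STATE, POS_WITH_DOMAIN_0))     # []
print(domains_replica_must_fetch_in_full(MASTER_BINLOG_STATE, POS_WITHOUT_DOMAIN_0))  # [0]
```

With domain 0 dropped from the replica position, the model reports domain 0 as needing a full resend, which corresponds to the situation where the master returns error 1236 because those old binlogs are gone.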
> if I:
> FLUSH BINARY LOGS DELETE_DOMAIN_ID=(0)
> on the master, would I then be able to connect to it via
> set global gtid_slave_pos = '1-303-7360639083,100-303-4337869';
Yes.

With this command, we are re-defining the history of the master to say that there were never any transactions in domain 0. Therefore, any slave that connects cannot be missing any such transactions.

Hope this helps,

- Kristian.
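As a rough illustration of what "re-defining the history" means for the positions quoted in this thread, the Python sketch below drops every domain 0 entry from the master's gtid_binlog_state and then checks which domains a replica position would still be missing. It is a conceptual model only, not how the server implements DELETE_DOMAIN_ID, and the helper names are hypothetical.

```python
def delete_domain(binlog_state: str, domain_id: int) -> str:
    """Return the GTID list with every entry of the given domain removed."""
    kept = [
        gtid for gtid in binlog_state.split(",")
        if int(gtid.split("-")[0]) != domain_id
    ]
    return ",".join(kept)


def missing_domains(master_state: str, replica_pos: str) -> list[int]:
    """Domains present in the master state but absent from the replica position."""
    def domains(gtid_list: str) -> set[int]:
        return {int(g.split("-")[0]) for g in gtid_list.split(",") if g}
    return sorted(domains(master_state) - domains(replica_pos))


# Values taken from the thread.
STATE_BEFORE = (
    "0-302-67690294,0-301-67719794,0-303-67739600,"
    "1-301-7350472534,1-302-7350381758,1-303-7363061243,"
    "100-302-4242958,100-301-4332195,100-303-4338582"
)
REPLICA_POS = "1-303-7360639083,100-303-4337869"

# Model of the master's state once domain 0 has been deleted from it.
STATE_AFTER = delete_domain(STATE_BEFORE, 0)

print(missing_domains(STATE_BEFORE, REPLICA_POS))  # [0]
print(missing_domains(STATE_AFTER, REPLICA_POS))   # []
```

Once the master's state no longer mentions domain 0, the replica position without a 0- entry leaves nothing unaccounted for in this model.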
Kristian,

Thank you! You helped me on this once before but I _think_ I've finally got it now.

At this point domain 0 is removed from all four servers (both slave_pos and binlog_state) and it looks like they're able to connect around to each other as expected. Gives me much more confidence in being able to bounce them around as needed in the future.

Thanks again!

Dan

On 2/27/2021 11:03 AM, Kristian Nielsen wrote: