Pavel Ivanov <pivanof@google.com> writes:
> assume we have server S1 that is master working with domain_id=0, server S2 is master working with domain_id=1, servers S3 and S4 are slaves and replicate from both of these masters, i.e. they have both domains in their databases. Now let's say S1 has last GTID 0-1-100, S2 has last GTID 1-2-100. Before S3 and S4 were able to fully catch up with S1 and S2 power got cut out from S1 and S2. As replication from two masters goes independently it's possible that S3 will have last transactions 0-1-100, 1-2-99 while S4 will have last transactions 0-1-99, 1-2-100. As my masters are out I want either S3 or S4 to

Right, this will be a common situation.

> connect to S3 because S3 doesn't have 1-2-100. Ideally I'd want for S3 to replicate from S4 in domain 1 and S4 to replicate from S3 in domain 0, and when they are equal in their position I can declare one of them

Yes, this is the idea.

> master for both domains. But it looks like there are no tools to do such operation.

Actually, I am implementing this right now, and should have something working next week. The idea is to have START SLAVE UNTIL master_gtid_pos='xxx'.

To make S3 the new master, we temporarily point S3 to replicate from S4, and do START SLAVE UNTIL master_gtid_pos='0-1-99,1-2-100'. This will replicate 1-2-100 to S3 and then stop. After this, S3 is strictly ahead of S4, and we can continue with S3 as the master and S4 as the slave.

Note that S3 will ask to start at 0-1-100 but stop at 0-1-99. S4 will allow this because it has the stop position 0-1-99 in the binlog - so there is no problem that the start position 0-1-100 is missing. This requires support for START SLAVE UNTIL master_gtid_pos, of course.

This is the general method to promote S1 as a master among slaves S1, S2, ..., Sn:

- Let X be the current GTID state of server S2. Temporarily point S1 to replicate from S2 and execute START SLAVE UNTIL master_gtid_pos=X. Execute MASTER_GTID_WAIT(X); when this returns, we know S1 is strictly ahead of S2.

- Repeat with the remaining servers S3, S4, ..., Sn.

- Now we know S1 is ahead of all other servers, so we can make it the new master and point the other slaves to replicate from it.

START SLAVE UNTIL master_gtid_pos is not available in the current code, but I am implementing it now (and after that MASTER_GTID_WAIT()).

----

There is actually another possible answer, related to strict mode.

In strict mode, sequence numbers are always increasing. So it is safe to allow a slave to connect to a master starting at a GTID not (yet) present in the master binlog. If there really is a hole, we will give the error as soon as the hole is reached (as we discussed in the previous mail).

So if we implement this, one could just connect S3 to S4 (and get no error), wait for it to catch up, then make S3 master.

I am not sure if it is a good idea to allow connecting at a future GTID in strict mode. It does seem to go a bit against the idea of "strict"; on the other hand, the error is still caught later.
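
To make the general promotion method described earlier in this mail concrete, here is a rough Python sketch that only models the GTID bookkeeping; it does not talk to any server. All function names (parse_gtid_pos, replicate_until, promote, and so on) are hypothetical and exist only for illustration. It assumes sequence numbers increase within a domain, and it treats START SLAVE UNTIL master_gtid_pos=X as "advance every domain that is behind X up to X, leave the others alone".

```python
# Hypothetical model: a GTID position maps domain_id -> (server_id, seq_no).
# Assumption: sequence numbers only grow within a replication domain.

def parse_gtid_pos(text):
    """Parse '0-1-99,1-2-100' into {0: (1, 99), 1: (2, 100)}."""
    pos = {}
    for gtid in text.split(","):
        domain, server, seq = (int(part) for part in gtid.strip().split("-"))
        pos[domain] = (server, seq)
    return pos


def format_gtid_pos(pos):
    """Inverse of parse_gtid_pos, with domains in ascending order."""
    return ",".join(f"{d}-{s}-{n}" for d, (s, n) in sorted(pos.items()))


def replicate_until(slave, until):
    """Model of replicating up to the position `until` and then stopping.

    Domains where the slave is behind are advanced to the stop position;
    domains where it is already at or past the stop position are untouched.
    """
    result = dict(slave)
    for domain, (server, seq) in until.items():
        if domain not in result or result[domain][1] < seq:
            result[domain] = (server, seq)
    return result


def is_at_or_ahead(a, b):
    """True if position a has reached position b in every domain of b."""
    return all(d in a and a[d][1] >= seq for d, (_, seq) in b.items())


def promote(candidate, others):
    """Catch the candidate up with each other slave in turn."""
    for other in others:
        candidate = replicate_until(candidate, other)
    return candidate


s3 = parse_gtid_pos("0-1-100,1-2-99")
s4 = parse_gtid_pos("0-1-99,1-2-100")
new_master = promote(s3, [s4])
print(format_gtid_pos(new_master))  # 0-1-100,1-2-100
```

With the S3/S4 positions from the example, the candidate picks up 1-2-100 from S4 and keeps its own 0-1-100, after which it has reached S4's position in both domains.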

The main reason for giving the error in non-strict mode is to avoid the case where a slave asks for 0-1-3 in [0-1-1 0-1-2 0-2-1 0-2-2 0-2-3 ...] and ends up silently doing nothing, endlessly skipping server_id=2 events while waiting for a 0-1-3 that never shows up. This problem does not occur in strict mode, as it enforces monotonic sequence numbers.

 - Kristian.