[Maria-developers] GTID and failovers with multi-domain replication
Kristian,
You've mentioned that in future implementation of parallel replication MariaDB will use multiple domains to replicate from one master to a slave. It means for me that I need to understand how multiple domains will work. And I realized that I don't know how the following situation can be handled. I don't know how parallel replication will work, so I'll explain situation in terms of several masters.
Let's assume we have server S1 that is master working with domain_id=0, server S2 is master working with domain_id=1, servers S3 and S4 are slaves and replicate from both of these masters, i.e. they have both domains in their databases. Now let's say S1 has last GTID 0-1-100, S2 has last GTID 1-2-100. Before S3 and S4 were able to fully catch up with S1 and S2 power got cut out from S1 and S2. As replication from two masters goes independently it's possible that S3 will have last transactions 0-1-100, 1-2-99 while S4 will have last transactions 0-1-99, 1-2-100.
As my masters are out I want either S3 or S4 to temporarily become master. But it looks like I won't be able to do so: S3 won't connect to S4 because S4 doesn't have 0-1-100 and S4 won't connect to S3 because S3 doesn't have 1-2-100. Ideally I'd want for S3 to replicate from S4 in domain 1 and S4 to replicate from S3 in domain 0, and when they are equal in their position I can declare one of them master for both domains. But it looks like there are no tools to do such operation.
How would you suggest to resolve such situation?
Thank you,
Pavel
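To make the deadlock in the scenario above concrete, here is a minimal sketch that compares two GTID positions domain by domain. The helper names `parse_gtid_pos` and `domains_behind` are made up for this illustration, and comparing by sequence number assumes that sequence numbers only increase within a domain.

```python
def parse_gtid_pos(pos):
    """Parse '0-1-100,1-2-99' into {domain_id: (server_id, seq_no)}."""
    result = {}
    for gtid in pos.split(","):
        domain_id, server_id, seq_no = (int(part) for part in gtid.strip().split("-"))
        result[domain_id] = (server_id, seq_no)
    return result


def domains_behind(own_pos, other_pos):
    """Return the domains in which own_pos is behind other_pos.

    Assumption for this sketch: within one domain a higher seq_no
    always means a later transaction.
    """
    own, other = parse_gtid_pos(own_pos), parse_gtid_pos(other_pos)
    behind = []
    for domain_id, (_, other_seq) in sorted(other.items()):
        own_seq = own.get(domain_id, (None, 0))[1]
        if own_seq < other_seq:
            behind.append(domain_id)
    return behind


s3 = "0-1-100,1-2-99"
s4 = "0-1-99,1-2-100"
print(domains_behind(s3, s4))  # [1]
print(domains_behind(s4, s3))  # [0]
```

Each server is behind the other in exactly one domain, so neither position is ahead of the other as a whole.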
Pavel Ivanov <pivanof@google.com> writes:
> assume we have server S1 that is master working with domain_id=0, server S2 is master working with domain_id=1, servers S3 and S4 are slaves and replicate from both of these masters, i.e. they have both domains in their databases. Now let's say S1 has last GTID 0-1-100, S2 has last GTID 1-2-100. Before S3 and S4 were able to fully catch up with S1 and S2 power got cut out from S1 and S2. As replication from two masters goes independently it's possible that S3 will have last transactions 0-1-100, 1-2-99 while S4 will have last transactions 0-1-99, 1-2-100. As my masters are out I want either S3 or S4 to
Right, this will be a common situation.
> connect to S3 because S3 doesn't have 1-2-100. Ideally I'd want for S3 to replicate from S4 in domain 1 and S4 to replicate from S3 in domain 0, and when they are equal in their position I can declare one of them
Yes, this is the idea.
> master for both domains. But it looks like there are no tools to do such operation.
Actually, I am implementing this right now, should have something working next week.
The idea is to have START SLAVE UNTIL master_gtid_pos='xxx'.
To make S3 the new master, we temporarily point S3 to replicate from S4, and do START SLAVE UNTIL master_gtid_pos='0-1-99,1-2-100'. This will replicate 1-2-100 to S3 and then stop. After this, S3 is strictly ahead of S4, and we can continue with S3 the master and S4 the slave.
Note that S3 will ask to start at 0-1-100 but stop at 0-1-99. S4 will allow this because it has the stop position 0-1-99 in the binlog - so there is no problem that the start position 0-1-100 is missing. This requires support for START SLAVE UNTIL master_gtid_pos, of course.
This is the general method to promote S1 as a master among slaves S1, S2, ..., Sn:
- Let X be the current GTID state of server S2. Temporarily point S1 to replicate from S2, execute START SLAVE UNTIL master_gtid_pos=X. Execute MASTER_GTID_WAIT(X), when this stops we know S1 is strictly ahead of S2.
- Repeat with the remaining servers S3, S4, ..., Sn.
- Now we know S1 is ahead of all other servers, so we can make it the new master and point the other slaves to replicate from it.
START SLAVE UNTIL master_gtid_pos is not available in the current code, but I am implementing it now (and after that MASTER_GTID_WAIT()).
----
There is actually another possible answer, related to strict mode. In strict mode, sequence numbers are always increasing. So it is safe to allow a slave to connect to a master starting at a GTID not (yet) present in the master binlog. If there really is a hole, we will give the error as soon as the hole is reached (as we discussed in the previous mail).
So if we implement this, one could just connect S3 to S4 (and get no error), wait for it to catch up, then make S3 master.
Not sure if it is a good idea to allow connect at a future GTID in strict mode. It does seem to go a bit against the idea with "strict", on the other hand the error is still caught later.
The main reason for giving the error in non-strict mode is to avoid that slave asks for 0-1-3 in [0-1-1 0-1-2 0-2-1 0-2-2 0-2-3 ...] and ends up silently doing nothing, endlessly skipping server_id=2 events waiting for 0-1-3 that never shows up. This problem does not occur in strict mode, as it enforces monotonic sequence numbers.
- Kristian.
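The general promotion method described in this message can be sketched as a small plan builder. This is only an illustration under stated assumptions: `promotion_plan` and the host names are hypothetical, the `START SLAVE UNTIL master_gtid_pos` and `MASTER_GTID_WAIT` statements are written as proposed in this thread (they were not yet implemented when it was written), and the `CHANGE MASTER TO` line is a placeholder that leaves out whatever GTID connection options a given server version needs.

```python
def promotion_plan(candidate, others):
    """Return (server, statement) pairs that promote `candidate`.

    `others` maps each remaining slave's host name to its current GTID
    position, e.g. {"s4": "0-1-99,1-2-100"}.
    """
    plan = []
    for host, gtid_pos in others.items():
        # Temporarily replicate from the other slave, but only up to the
        # position it had when we looked (X in the description above).
        plan.append((candidate, "STOP SLAVE"))
        plan.append((candidate, f"CHANGE MASTER TO MASTER_HOST='{host}'"))
        plan.append((candidate, f"START SLAVE UNTIL master_gtid_pos='{gtid_pos}'"))
        # Block until the candidate has applied everything up to X.
        plan.append((candidate, f"SELECT MASTER_GTID_WAIT('{gtid_pos}')"))
    # The candidate is now ahead of every other server, so repoint them.
    for host in others:
        plan.append((host, "STOP SLAVE"))
        plan.append((host, f"CHANGE MASTER TO MASTER_HOST='{candidate}'"))
        plan.append((host, "START SLAVE"))
    return plan


for server, statement in promotion_plan("s3", {"s4": "0-1-99,1-2-100"}):
    print(f"{server}: {statement};")
```

The sketch only builds the list of statements. Actually running them, and handling a wait that times out or a slave that stops with an error, is left to the operator.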
I love the idea with START SLAVE UNTIL. Looks very clean and reasonable. And I don't like the idea of special treatment of this in strict mode.
> Not sure if it is a good idea to allow connect at a future GTID in strict mode. It does seem to go a bit against the idea with "strict", on the other hand the error is still caught later.
This is exactly the problem. Strict mode should be about a strict discipline on the dba's side. If he connects S3 to replicate from S4 just for the sake of catch-up and he intends to make S3 master later then he must say that explicitly by issuing the command START SLAVE UNTIL. If he issues regular START SLAVE that may mean that he really wants S4 to be a master and doesn't intend to switch later. And then he will be surprised why replication doesn't progress.
Pavel
On Wed, May 8, 2013 at 12:22 AM, Kristian Nielsen <knielsen@knielsen-hq.org> wrote:
[...]
Pavel Ivanov <pivanof@google.com> writes:
> I love the idea with START SLAVE UNTIL. Looks very clean and reasonable. And I don't like the idea of special treatment of this in strict mode.
Yeah. Let's go with START SLAVE UNTIL and keep the error when starting from a non-existing GTID and no UNTIL. I hope to have a patch for START SLAVE UNTIL master_gtid_pos=xxx sometime next week.
- Kristian.
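The connect-time rule the thread settles on can be modelled roughly as follows. This is one reading of the discussion, not the server's actual logic: `check_connect` and `GtidNotInBinlog` are hypothetical names, and the binlog is simplified to a flat list of GTID strings.

```python
class GtidNotInBinlog(Exception):
    """Raised when a slave asks to start from a GTID the master does not have."""


def check_connect(binlog, start, until=None):
    """Accept or reject a slave connection, checked per domain.

    `binlog` is the master's binlog as a list of 'domain-server-seq'
    strings; `start` and `until` map domain_id to a GTID string.
    """
    present = set(binlog)
    until = until or {}
    for domain_id, gtid in start.items():
        if gtid in present:
            continue
        # Start position is missing: acceptable only if this domain has a
        # stop position that the master does have in its binlog.
        if until.get(domain_id) in present:
            continue
        raise GtidNotInBinlog(f"start position {gtid} not found in master binlog")


# S3 connects to S4 with an UNTIL position: accepted, although 0-1-100 is missing.
s4_binlog = ["0-1-98", "0-1-99", "1-2-99", "1-2-100"]
check_connect(s4_binlog, {0: "0-1-100", 1: "1-2-99"}, until={0: "0-1-99", 1: "1-2-100"})

# The non-strict example from earlier in the thread, without UNTIL: rejected.
hole = ["0-1-1", "0-1-2", "0-2-1", "0-2-2", "0-2-3"]
try:
    check_connect(hole, {0: "0-1-3"})
except GtidNotInBinlog as err:
    print("rejected:", err)
```

Rejecting the second request up front is what keeps the slave from silently skipping server_id=2 events while it waits for a 0-1-3 that never arrives.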