[Maria-developers] MariaDB allows for slave to connect with non-existent GTID
Kristian,

I've realized that the way slaves are processed on the master now allows them to connect even if they request a non-existent GTID. Is this "works as intended" and will it be different in the "strict mode", or did you not want such things to happen even in non-strict mode?

I've attached a test reproducing the problem. It sets up replication 1->2->3 and then: executes a transaction on 1; executes a transaction on 2; disconnects 3; executes one more transaction on 1; fails over to 2; executes one more transaction on 2. Then, when it tries to connect 3 to 1, the connection should fail, because 1 doesn't have the last transaction that 3 has.

I agree that this scenario is artificial, but in reality the scenario can be e.g. like this: 2 is a slave of 1; it gets disconnected and executes one extra transaction; a backup is taken from 2; 2 gets restarted and restored to the state of 1; a failover happens and 2 becomes master; then 3 restores from the backup taken earlier from 2. If during all this time 1, and then 2, executed some transactions as masters, 3 will be able to connect to 2 even though its last transaction doesn't exist on 2...

Pavel
Pavel Ivanov <pivanof@google.com> writes:
> I've realized that the way slaves are processed now on the master allows them to connect even if they request non-existent GTID.
What happens here is that S3 requests GTID 0-2-3 from S1. S1 has in its binlog:

    0-1-1 0-1-2 0-1-3 0-2-4

So there is a "hole" in the binlog of S1; a transaction went missing. However, the code allows S3 to start replicating with 0-2-4 as the first event, because we can be sure that this is the first event that we _do_ have that follows the requested 0-2-3.

Now, if S1 had had only "0-1-1 0-1-2 0-1-3" in the binlog, then S3 would not have been allowed to connect. This is mainly to protect against the case where no further 0-2-* events ever appear, which would cause S3 to skip events forever waiting for such an event.
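[Editor's note: the master-side decision described above can be sketched in a few lines. This is a simplified toy model for illustration only, not MariaDB's actual implementation; the encoding of a GTID as a `(domain, server_id, seqno)` tuple and the function name `find_start_event` are assumptions made for the sketch.]

```python
def find_start_event(binlog, last_applied):
    """binlog: list of (domain, server_id, seqno) tuples in binlog order.
    last_applied: the (domain, server_id, seqno) the slave reports.
    Return the first event to send, or raise if none qualifies."""
    domain, _server, seqno = last_applied
    for event in binlog:
        d, _s, n = event
        if d == domain and n > seqno:
            # First event in the requested domain past the slave's position:
            # even though 0-2-3 itself is missing, 0-2-4 qualifies.
            return event
    # No later event in the domain: refuse the connection, so the slave
    # does not skip events forever waiting for a 0-2-* that may never come.
    raise ValueError("no event in domain %d follows seqno %d" % (domain, seqno))

# S1's binlog from the example: 0-1-1 0-1-2 0-1-3 0-2-4
binlog = [(0, 1, 1), (0, 1, 2), (0, 1, 3), (0, 2, 4)]

print(find_start_event(binlog, (0, 2, 3)))  # -> (0, 2, 4): S3 may connect
# With only 0-1-1..0-1-3 in the binlog, the same request is refused:
try:
    find_start_event(binlog[:3], (0, 2, 3))
except ValueError as e:
    print("refused:", e)
```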
> Is it "works as intended" and will be different in the "strict mode" or you didn't want for such things to happen even in non-strict mode?
I am not sure. But my immediate impression is that this is the most consistent behaviour.

In MariaDB GTID, we keep track of only the last applied GTID (within each domain), and rely on the binlog sequence being identical between different servers. In this particular example we could detect that this was violated, but it was kind of accidental. If S3 had been stopped one event earlier or later, then we would not be able to detect the error. So catching this error case does not really seem to buy much in general.

Also, when using stuff like --replicate-wild-ignore-table, holes can easily appear, and allowing a slave to connect "in the middle of a hole" seems reasonable.

But I am open to arguments for the opposite.

 - Kristian.
On Sun, May 5, 2013 at 11:09 PM, Kristian Nielsen <knielsen@knielsen-hq.org> wrote:
Pavel Ivanov <pivanof@google.com> writes:
>> I've realized that the way slaves are processed now on the master allows them to connect even if they request non-existent GTID.
>
> What happens here is that S3 requests GTID 0-2-3 from S1.
> S1 has in binlog: 0-1-1 0-1-2 0-1-3 0-2-4
> So there is a "hole" in the binlog of S1, a transaction got missing.
> However, the code allows S3 to start replicating with 0-2-4 as the first event. Because we can be sure that this is the first event that we _do_ have that follows the requested 0-2-3.
> Now, if S1 had had only "0-1-1 0-1-2 0-1-3" in the binlog, then S3 would not be allowed to connect. Mainly to protect against the case where no further 0-2-* events ever appear, which would cause S3 to skip events forever waiting for such event.
>
>> Is it "works as intended" and will be different in the "strict mode" or you didn't want for such things to happen even in non-strict mode?
>
> I am not sure. But my immediate impression is that this is the most consistent behaviour.
> In MariaDB GTID, we keep track of only the last applied GTID (within each domain), and rely on binlog sequence being identical between different servers. In this particular example we could detect that this was violated, but it was kind of accidental. If S3 had been stopped one event earlier or later, then we would not be able to detect the error. So catching this error case does not really seem to buy much in general.
I'd say that if S3 stopped one event earlier, then there would have been no error at all. If S3 stopped one event later, then sure, it wouldn't be possible to detect the error, but it would be detected in strict mode. What I'm not feeling comfortable with is this: if S3 is stopped as it is and tries to connect to S1 immediately, it will cause an error. Also, if there was no failover to S2 and S2 didn't author any new GTIDs, it will cause an error as well. It looks like the difference between error and non-error is very vague and fragile.
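[Editor's note: the fragility Pavel describes can be illustrated with the same kind of toy model as above (again an illustrative sketch, not MariaDB code; the `can_connect` helper is hypothetical). The identical slave position is refused or accepted depending only on whether the master's binlog happens to contain a later event in the domain yet.]

```python
def can_connect(binlog, last_applied):
    """True if the master's binlog has any event in the slave's domain
    past the slave's reported seqno (the non-strict acceptance rule)."""
    domain, _server, seqno = last_applied
    return any(d == domain and n > seqno for d, _s, n in binlog)

slave_pos = (0, 2, 3)  # S3's last applied GTID; missing on S1 in both cases

# S3 reconnects before any 0-2-* event reaches S1: refused with an error.
print(can_connect([(0, 1, 1), (0, 1, 2), (0, 1, 3)], slave_pos))             # False
# S3 reconnects after 0-2-4 has reached S1's binlog: silently accepted.
print(can_connect([(0, 1, 1), (0, 1, 2), (0, 1, 3), (0, 2, 4)], slave_pos))  # True
```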
> Also, when using stuff like --replicate-wild-ignore-table, holes can easily appear, and allowing a slave to connect "in the middle of a hole" seems reasonable.
So what you are saying is: when stuff like --replicate-wild-ignore-table is used, the slave will have holes in its binlog compared to the master. But in that case slaves won't ever have a GTID that is missing on the master. However, if we have a 2nd slave with different table filtering, it will have different holes in its binlog. In that case, if we fail over and make this 2nd slave the master, it's quite possible that the 1st slave will connect to the new master with a GTID that does not exist there. I see how this is a kind of valid situation from the MariaDB point of view, but I don't see how it makes sense to do this in real life.

So I see your point and I can't argue that this behavior should change by default (except that it probably won't make sense for anybody to use such a feature), but we would really like this situation to be detected and replication to be stopped, either in "gtid strict mode" or in some other mode that we could turn on.

Thank you,
Pavel
Pavel Ivanov <pivanof@google.com> writes:
> I'd say if S3 stopped one event earlier then there would have been no error at all. If S3 stopped one event later then sure it wouldn't be possible to detect the error, but it will be detected in strict mode.
Ah, that is a good point.
> But what I'm not feeling comfortable with is if S3 is stopped as it is and if it tries to connect to S1 immediately it will cause error. Also if there was no failover to S2 and S2 didn't author any new GTIDs then it will cause error as well. It looks like difference between error and non-error is very vague and fragile.
Right. So, in fact, from this it appears that the most consistent behaviour is actually to give an error in this case (the slave requests 0-2-3; the master is missing it but has 0-2-4). Especially so in strict mode.
>> Also, when using stuff like --replicate-wild-ignore-table, holes can easily appear, and allowing a slave to connect "in the middle of a hole" seems reasonable.
>
> So what you are saying is when stuff like --replicate-wild-ignore-table is used slave will have holes in binlogs compared to master. But in that case slaves won't ever have GTID that is missing on master. But if we have 2nd slave with different table
Agreed, it is still an indication of something not being configured right if a slave requests something that is missing on the master.
> filtering it will have different holes in binlogs. In this case if we failover and make this 2nd slave master then it's quite possible that 1st slave will connect to new master with GTID that does not exist there. I see how this is kind of valid situation from MariaDB point of view, but I don't see how it makes sense to do this in real life.
>
> So I see your point and I can't argue that this behavior should change by default (except that it probably won't make any sense for anybody to use such feature), but we would really like this situation to be detected and replication to be stopped either in "gtid strict mode" or in some other mode that we could turn on.
From your arguments above I'm leaning more towards giving an error now.
I think this is what I'll do (further comments welcome though):

1. In GTID strict mode, give an error.

2. In non-strict mode, do as the current code.

The main use case for (2) will be to recover from the error in (1): temporarily clear GTID strict mode, replicate across the problematic point, then re-enable strict mode.

I think it is important to have a clear overall strategy for handling all these different error cases, and I think it is taking shape. We will have a strict mode, which will be the recommended mode, and it will generally give an error as soon as incorrect/dodgy usage is detected. Non-strict mode will generally try to handle things without error. People can use non-strict mode if they think they know not to make mistakes, or if they prefer some inconsistencies to having to deal with errors (but then they should not complain if they shoot themselves in the foot).

In general, in strict mode, if you get an error, you can handle it by temporarily switching to non-strict mode to get past the error point. But then at least you get to know about the potential problem and have a chance to react to it.

I will put this on the queue to implement. Thanks for the comments!

 - Kristian.
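[Editor's note: the proposed two-mode behaviour can be sketched roughly as follows. This is an illustrative model only, not the actual MariaDB implementation; the names `resolve_start` and `GtidError` and the tuple GTID encoding are made up for the sketch.]

```python
class GtidError(Exception):
    pass

def resolve_start(binlog, last_applied, strict):
    """Decide where to start sending events to a connecting slave.
    binlog: (domain, server_id, seqno) tuples; last_applied: slave's GTID."""
    domain, _server, seqno = last_applied
    later = [e for e in binlog if e[0] == domain and e[2] > seqno]
    if last_applied in binlog:
        # Normal case: the slave's position exists on the master. Send what
        # follows it, or None meaning "connect and wait for new events".
        return later[0] if later else None
    if strict or not later:
        # Strict mode always rejects a missing GTID; non-strict mode still
        # rejects when there is nothing after it to skip ahead to.
        raise GtidError("GTID %d-%d-%d not found in master binlog" % last_applied)
    # Non-strict escape hatch: skip ahead past the hole. This is also how one
    # replicates across the problematic point after a strict-mode error.
    return later[0]

binlog = [(0, 1, 1), (0, 1, 2), (0, 1, 3), (0, 2, 4)]
print(resolve_start(binlog, (0, 2, 3), strict=False))  # (0, 2, 4)
try:
    resolve_start(binlog, (0, 2, 3), strict=True)
except GtidError as e:
    print("strict mode:", e)
```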
participants (2)
- Kristian Nielsen
- Pavel Ivanov