OK, I did some quick testing of the latest 10.0-base. There are a few points I'm unhappy with at the moment. They are not necessarily related to MDEV-4820, so I should probably file new bugs for them. I can do that later if you want.

1. When the master doesn't have binlogs and gtid_slave_pos is ahead of the GTID that the slave tries to connect with, you give the error "The binlog on the master is missing the GTID ... requested by the slave (even though both a prior and a subsequent number does exist), and GTID strict mode is enabled". I find this error message very confusing: the presence of a subsequent GTID in such a situation is questionable, and there is certainly no prior GTID in the master's binlog.

2. The error message "An attempt was made to binlog GTID ... which would create an out-of-order sequence number with existing GTID ..., and gtid strict mode is enabled" is confusing too, because it is not issued when the slave actually tries to write the event to the binlog. Apparently the error condition is checked when the slave considers executing the event it has just received from the master. If that event contains changes only to tables matching the replicate-wild-ignore-table filter, it would never be binlogged on the slave in non-strict mode. So there is no "attempt to binlog" involved, and the wording becomes hard to understand.

3. There is the error message "Specified GTID ... conflicts with the binary log which contains a more recent GTID .... If MASTER_GTID_POS=CURRENT_POS is used, the binlog position will override the new value of @@gtid_slave_pos". It looks like it is issued inconsistently. My binlog had an empty Gtid_list, then 0-1-26, 0-1-27, 0-1-28, 0-2-29 and 0-2-30, and both gtid_slave_pos and gtid_binlog_pos were set to '0-2-30'. In this situation I was able to set gtid_slave_pos to '0-1-29' successfully and got the "slave has diverged" error after START SLAVE. Then I was able to set gtid_slave_pos to '0-2-29' and got the "Attempt was made to binlog out-of-order" error after START SLAVE. I'd think that, at least in strict mode, MariaDB shouldn't allow setting gtid_slave_pos to a value that is clearly in the past.

4. Now a real bug. Start three servers S1, S2 and S3 without binlogs. Set gtid_slave_pos to the same value on all of them. Connect S2 to replicate from S1. Execute a few transactions on S1. Perform a failover, making S1 replicate from S2. Now connect S3 to replicate from S2. At this point S3 should be able to replicate successfully, because it has the same db state that S2 had in the beginning (S3 has the same gtid_slave_pos that S2 had initially), and S2 has all the binlogs needed to move from the current position on S3 to the current position on S2. Yet S3 gets an error that the starting GTID doesn't exist in S2's binlogs.

I think that to fix this bug we should stop using gtid_slave_pos as the indication of the current db state. We should make it possible to change gtid_binlog_pos when there are no events in the binlogs. And when gtid_binlog_pos is changed, we should force a binlog rotation so that we have a Gtid_list with the initial value of gtid_binlog_pos. Then gtid_binlog_pos could always be used for setting the initial db state, which makes more sense than using gtid_slave_pos. But this will probably break the detection of slaves trying to connect using a GTID from before the start of the binlogs...

5. From a completely different area, but also a GTID-related bug. Take a database from a previous MySQL version (I tested with a database from 5.1), start MariaDB on it, run mysql_upgrade, and then try to set gtid_slave_pos to something. At this point I got the error "unable to load slave state from gtid_slave_pos table". This error was apparently remembered from MariaDB's startup, and reading of the gtid_slave_pos table wasn't retried after mysql_upgrade actually created it.

Pavel

On Fri, Aug 16, 2013 at 6:27 AM, Kristian Nielsen <knielsen@knielsen-hq.org> wrote:
Ok, I've pushed to 10.0-base a patch for MDEV-4820.
revid:knielsen@knielsen-hq.org-20130816131025-etjrvmfvupsjzq83
As far as I can determine (and I checked quite carefully), this fixes all the problems you mentioned in the bug description and in your test cases. But I could have misunderstood something.
Note that for the problem "For some reason at this point server 1 doesn't have any errors and doesn't replicate anything from server 2. Oops", the error is caught not when slave connects, but instead when the first event is received, which should be just as good. The reason is briefly explained in the changeset comment, and is to not re-introduce the bug MDEV-4485.
The error message for "alternate future" I formulated like this:
"Connecting slave requested to start from GTID %u-%u-%llu, which is not in the master's binlog. Since the master's binlog contains GTIDs with higher sequence numbers, it probably means that the slave has diverged due to executing extra errorneous transactions"
I did not want to use the term "alternate future" as this seems to be not standard terminology. The MySQL manual uses the related term "diverge".
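As a rough illustration of the condition this message describes, here is a minimal Python sketch. It is not the server's code and makes simplifying assumptions: the names `Gtid`, `parse_gtid` and `classify_start_position` are hypothetical, the binlog is modelled as a plain list of GTID strings, and Gtid_list events and purged binlogs are ignored.

```python
from typing import Iterable, NamedTuple


class Gtid(NamedTuple):
    """A GTID in domain-server-sequence form, e.g. 0-1-29."""
    domain_id: int
    server_id: int
    seq_no: int


def parse_gtid(text: str) -> Gtid:
    """Parse 'domain-server-seq' into a Gtid (hypothetical helper)."""
    domain, server, seq = (int(part) for part in text.split("-"))
    return Gtid(domain, server, seq)


def classify_start_position(requested: str, master_binlog: Iterable[str]) -> str:
    """Classify a slave's requested start GTID against the master's binlog.

    Assumed simplification of the check described above:
    - "ok": the exact GTID is present in the master's binlog;
    - "diverged": it is absent, but the same domain has higher sequence numbers;
    - "not_in_binlog": it is absent and nothing newer exists in that domain.
    """
    want = parse_gtid(requested)
    same_domain = [g for g in map(parse_gtid, master_binlog)
                   if g.domain_id == want.domain_id]
    if want in same_domain:
        return "ok"
    if any(g.seq_no > want.seq_no for g in same_domain):
        return "diverged"
    return "not_in_binlog"


# Binlog contents taken from the test case in point 3 of the reply.
binlog = ["0-1-26", "0-1-27", "0-1-28", "0-2-29", "0-2-30"]
print(classify_start_position("0-1-29", binlog))  # diverged
print(classify_start_position("0-2-29", binlog))  # ok
```

With the binlog from point 3 of the reply, a request for 0-1-29 is classified as diverged, because that exact GTID is absent while 0-2-30 has a higher sequence number in the same domain.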
I am not sure if you will be happy with the fix, but if not, please explain clearly if
1. You observe incorrect behavior (e.g. lost transactions, or an alternate future not caught by an error), and if so, describe as clearly as possible how to reproduce it; or
2. The behaviour is correct, but you are unhappy about the wording of the error messages, or how the code is implemented.
- Kristian.
PS. I hope it is clear that I greatly value your feedback. You and Elena are the only ones who have seriously worked to help improve the MariaDB GTID, and your input has already been very valuable.