Kristian,
Could you say whether you are working on these? Is there an ETA? This is blocking us from pushing MariaDB into testing in the near-production environment, and I'm hesitant to implement fixes myself because I expect you would do it completely differently.
Thank you, Pavel
On Mon, Aug 19, 2013 at 6:49 AM, Pavel Ivanov <pivanof@google.com> wrote:
Ok. Actually, I think we should expose the real binlog state (what is stored in the Gtid_list event at the start of the binlog). So something like a variable
@@GLOBAL.gtid_binlog_state
Example value: '0-1-100,0-2-101'
And you get an error if you set it unless the binlog is empty.
Would this be what you need?
Yep, sounds like what we need.
Thanks, Pavel
On Mon, Aug 19, 2013 at 4:28 AM, Kristian Nielsen <knielsen@knielsen-hq.org> wrote:
Pavel Ivanov <pivanof@google.com> writes:
I think to fix this bug we should stop using gtid_slave_pos as an indication of the current db state. We should make it possible to
Agree.
change gtid_binlog_pos when there are no events in binlogs. And when
Ok. Actually, I think we should expose the real binlog state (what is stored in the Gtid_list event at the start of the binlog). So something like a variable
@@GLOBAL.gtid_binlog_state
Example value: '0-1-100,0-2-101'
And you get an error if you set it unless the binlog is empty.
Would this be what you need?
it kind of makes more sense than using gtid_slave_pos. But this will probably break the detection of slaves trying to connect using a GTID from before the start of the binlogs...
I do not think it will break that (but we will see).
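The example value '0-1-100,0-2-101' proposed above for @@GLOBAL.gtid_binlog_state is a comma-separated list of GTIDs, each of the form domain_id-server_id-seq_no. A minimal Python sketch of reading such a value, assuming one entry per domain/server pair; the function name parse_gtid_state is hypothetical and not part of MariaDB:

```python
def parse_gtid_state(state):
    """Parse a string like '0-1-100,0-2-101' into {(domain_id, server_id): seq_no}."""
    result = {}
    for item in state.split(","):
        item = item.strip()
        if not item:
            # An empty string stands for an empty state.
            continue
        parts = item.split("-")
        if len(parts) != 3:
            raise ValueError("malformed GTID: %r" % item)
        domain_id, server_id, seq_no = (int(p) for p in parts)
        result[(domain_id, server_id)] = seq_no
    return result


print(parse_gtid_state("0-1-100,0-2-101"))
# prints {(0, 1): 100, (0, 2): 101}
```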
5. A bug from a completely different area, but also GTID-related. Take a database from a previous MySQL version (I tested with a database from 5.1), start MariaDB on it, run mysql_upgrade, and then try to set gtid_slave_pos to something. At this point I got the error "unable to load slave state from gtid_slave_pos table". This error was apparently remembered from MariaDB's startup, and reading the gtid_slave_pos table wasn't retried after mysql_upgrade actually created it.
Ok, I will take a look. I think there is an existing bug report on that. IIRC there is some locking issue (the variable can be accessed from a place where table locks cannot be taken to read gtid_slave_pos table), but I will see what can be done.
1. When the master doesn't have binlogs and gtid_slave_pos is ahead of the GTID that the slave tries to connect with, you give the error "The binlog on the master is missing the GTID ... requested by the slave (even though both a prior and a subsequent number does exist), and GTID strict mode is enabled". I find this error message very confusing: the presence of a subsequent GTID in such a situation is questionable, but there is certainly no prior GTID in the master's binlog.
Hm, this sounds like a bug. Do you have a testcase?
But with @@GLOBAL.gtid_binlog_state implemented and set correctly, you will instead get the correct error message: that the position the slave requests to connect at has been purged from the master's binlog.
2. The error message "An attempt was made to binlog GTID ... which would create an out-of-order sequence number with existing GTID ..., and gtid strict mode is enabled" is confusing too, because it's issued not when the slave actually tries to write an event to the binlog. Apparently the error condition is checked when the slave considers executing the event that was just received from the master. If this event contains changes only to tables matching the replicate-wild-ignore-table filter, then it would never be binlog'ed on the slave in non-strict mode. So there's no "attempt to binlog" involved, and the error wording becomes hard to understand.
Right, I see. Thanks!
One problem here is that with non-transactional changes (DDL or MyISAM), we _do_ need to check this _before_ executing the event, because we cannot roll back after the event.
But I agree of course that this is a bug. I will try to find a way to fix it. Maybe the check can be delayed until the first event that we are actually going to execute (not filter).
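The idea of delaying the check until the first event that is actually executed can be sketched roughly as follows. This is only an illustration in Python, not MariaDB's implementation: the names apply_event_group, OutOfOrderError and is_filtered are hypothetical, events are represented simply as table names, and the ordering rule (sequence number must be greater than the last one binlogged) is an assumption made for the example.

```python
class OutOfOrderError(Exception):
    """Raised when an event group would create an out-of-order sequence number."""


def apply_event_group(seq_no, last_binlogged_seq_no, events, is_filtered):
    """Return the events to execute, checking order only at the first unfiltered event."""
    to_execute = []
    checked = False
    for event in events:
        if is_filtered(event):
            # Filtered events are never executed or binlogged, so they
            # must not trigger the out-of-order check.
            continue
        if not checked:
            # Check before executing anything, since non-transactional
            # changes cannot be rolled back afterwards.
            if seq_no <= last_binlogged_seq_no:
                raise OutOfOrderError(
                    "seq_no %d is not after %d" % (seq_no, last_binlogged_seq_no)
                )
            checked = True
        to_execute.append(event)
    return to_execute


def ignore_db(table):
    # Hypothetical stand-in for a replicate-wild-ignore-table filter.
    return table.startswith("ignored_db.")


# A group touching only filtered tables passes without any check.
print(apply_event_group(29, 30, ["ignored_db.t1"], ignore_db))
# prints []
```

With this shape, a group that contains at least one unfiltered event and an old sequence number still fails before anything is executed.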
3. There's the error message "Specified GTID ... conflicts with the binary log which contains a more recent GTID .... If MASTER_GTID_POS=CURRENT_POS is used, the binlog position will override the new value of @@gtid_slave_pos". It looks like it's issued inconsistently. My binlog had an empty Gtid_list, then 0-1-26, 0-1-27, 0-1-28, 0-2-29 and 0-2-30, and both gtid_slave_pos and gtid_binlog_pos were set to '0-2-30'. In this situation I was able to set gtid_slave_pos to '0-1-29' successfully and got a "slave has diverged" error after START SLAVE. Then I was able to set gtid_slave_pos to '0-2-29' and got the error "Attempt was made to binlog out-of-order" after START SLAVE. I'd think that, at least in strict mode, MariaDB shouldn't allow setting gtid_slave_pos to a value that is clearly in the past.
Right, thanks, I will check. (I can understand that 0-1-29 did not give error, though you are probably right that it should; but that 0-2-29 did not give error is surprising).
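To make the "clearly in the past" case from point 3 concrete, here is a small Python sketch using the GTIDs from that report. It assumes a simplified rule, namely that a position is "behind" when its sequence number is lower than the highest one in the binlog for the same domain, regardless of server id; this rule and the names parse_gtid and is_behind_binlog are illustrative assumptions, not the server's actual check.

```python
def parse_gtid(text):
    """Split 'domain_id-server_id-seq_no' into a tuple of three ints."""
    domain_id, server_id, seq_no = (int(p) for p in text.split("-"))
    return domain_id, server_id, seq_no


def is_behind_binlog(requested, binlog_gtids):
    """True if the requested seq_no is lower than the highest one binlogged in its domain."""
    domain_id, _server_id, seq_no = parse_gtid(requested)
    parsed = [parse_gtid(g) for g in binlog_gtids]
    highest = max((s for d, _srv, s in parsed if d == domain_id), default=None)
    return highest is not None and seq_no < highest


binlog = ["0-1-26", "0-1-27", "0-1-28", "0-2-29", "0-2-30"]
for candidate in ("0-1-29", "0-2-29", "0-2-30"):
    print(candidate, is_behind_binlog(candidate, binlog))
# prints:
# 0-1-29 True
# 0-2-29 True
# 0-2-30 False
```

Under this assumed rule, both '0-1-29' and '0-2-29' would be rejected, while the current position '0-2-30' would be accepted.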
4. Now a real bug. Start three servers S1, S2 and S3 without binlogs. Set gtid_slave_pos to the same value on all of them. Connect S2 to replicate from S1. Execute a few transactions on S1. Perform a failover, making S1 replicate from S2. Now connect S3 to replicate from S2. At this point S3 should be able to replicate successfully, because it has the same db state as S2 had in the beginning (S3 has the same gtid_slave_pos as S2 had initially), and S2 has all the binlogs needed to move from the current position on S3 to the current position on S2. Yet S3 gets an error that the starting GTID doesn't exist in S2's binlogs.
This should also be fixed by setting @@GLOBAL.gtid_binlog_state.
- Kristian.