Pavel Ivanov <pivanof@google.com> writes:
I took 10.0-base r3685 and started a new, freshly bootstrapped server with server_id = 1. Its @@global.gtid_binlog_pos, @@global.gtid_slave_pos and @@global.gtid_current_pos are all empty. Then I execute:
    set global gtid_binlog_state = '0-10-10';
After that @@global.gtid_binlog_pos = '0-10-10' as expected, but both @@global.gtid_slave_pos and @@global.gtid_current_pos are still empty. Because of that, the server won't be able to replicate from the master. If I set gtid_binlog_state to '0-1-10' instead, @@global.gtid_current_pos changes to '0-1-10' and everything is fine.
The short answer is that you should just set both gtid_slave_pos and gtid_binlog_state on the new server:

    SET GLOBAL gtid_binlog_state = '0-10-10';
    SET GLOBAL gtid_slave_pos = @@GLOBAL.gtid_binlog_pos;

For the longer answer, let me try to explain:

The gtid_binlog_pos and the gtid_slave_pos are different concepts in MariaDB. The former is the last GTID logged into the binlog (for each domain). The latter is the last GTID replicated by the slave. These become different because on the one hand a slave can use --log-slave-updates=0 (so the binlog is not updated), and on the other hand I did not want to add the overhead of updating gtid_slave_pos for every transaction on the master. So a GTID that goes into one of them may or may not go into the other.

Now let us set up a slave with

    CHANGE MASTER TO master_host= ... , master_use_gtid=slave_pos;

The slave starts replication at the value of gtid_slave_pos. Every replicated GTID updates gtid_slave_pos, so to switch master we can just point it to the new host and it will continue from the correct point.

But suppose we promote a new master, and later want the old master to become a slave. The old master did not update gtid_slave_pos, so the point at which to start is the last GTID logged to the binlog, gtid_binlog_pos. Thus to start the old master replicating as a slave one should use:

    SET GLOBAL gtid_slave_pos = @@GLOBAL.gtid_binlog_pos;
    CHANGE MASTER TO master_host= ... , master_use_gtid=slave_pos;

and then things will proceed correctly with the new slave server.

So this is how you should think of the variables. The gtid_slave_pos is the position at which to start replication for a slave. The gtid_binlog_pos is the last GTID logged into the binlog.

Now, this creates an asymmetry - to switch a server to replicate from a new master, the user has to know whether the server was a master or a slave before, and do it differently depending on which it is.
So I wanted to provide a way to avoid this asymmetry, and I implemented CHANGE MASTER TO master_use_gtid=current_pos for this. In this mode, when the slave connects, it looks into both the gtid_slave_pos and the gtid_binlog_pos to decide which of these has the most recent GTID - and then uses that GTID as the point to start replication at.

If the server was a master before, then the last GTID in the binlog will have the server's own server_id; _and_ the sequence number will be bigger than what is in the gtid_slave_pos, because sequence numbers on a master are always generated bigger than any seen before. So in this case we use the last GTID in the binlog to connect to. Otherwise we use the gtid_slave_pos.

So that is _all_ that gtid_current_pos is - it is a way for the server to guess whether it was a master or a slave before, and act accordingly. A bit of magic for casual users who do not want to be aware of whether the server they are setting up as a slave was already a slave before, or a master.

So the point is that if you want to use gtid_current_pos on a newly set up server, you need to provide correct values for _both_ gtid_binlog_pos/gtid_binlog_state _and_ gtid_slave_pos, because gtid_current_pos is the result of combining the two.
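To make the rule concrete, here is a rough Python model of how gtid_current_pos comes out of the other two positions. It is only a sketch of the behaviour as described in this mail, not the server code: the function names are made up for illustration, and it assumes a single GTID per domain on each side.

```python
def parse_gtid_list(text):
    """Parse 'domain-server_id-seq_no[,...]' into {domain: (server_id, seq_no)}."""
    pos = {}
    for item in filter(None, (s.strip() for s in text.split(","))):
        domain, server_id, seq_no = (int(x) for x in item.split("-"))
        pos[domain] = (server_id, seq_no)
    return pos


def format_gtid_list(pos):
    """Format {domain: (server_id, seq_no)} back into a GTID list string."""
    return ",".join(f"{d}-{s}-{n}" for d, (s, n) in sorted(pos.items()))


def current_pos(slave_pos, binlog_pos, own_server_id):
    """Hypothetical model: start from gtid_slave_pos, then prefer the binlog
    GTID in a domain only if it carries our own server_id and a higher
    sequence number (i.e. this server looks like it was a master)."""
    result = parse_gtid_list(slave_pos)
    for domain, (server_id, seq_no) in parse_gtid_list(binlog_pos).items():
        if server_id != own_server_id:
            continue  # the server_id check discussed in this thread
        if domain not in result or seq_no > result[domain][1]:
            result[domain] = (server_id, seq_no)
    return format_gtid_list(result)


# The two cases from the report, on a server with server_id = 1:
print(repr(current_pos("", "0-10-10", 1)))  # '' (foreign server_id is ignored)
print(repr(current_pos("", "0-1-10", 1)))   # '0-1-10'
```

With gtid_slave_pos also set to '0-10-10', the model returns '0-10-10', which matches the advice to set both values on a new server.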
It looks like the problem is in the server_id check in the first loop in rpl_slave_state::iterate(). Can it be removed from there?
I think so - in strict mode, the most recent GTID will always be the one with the highest sequence number, so the server_id check is not needed. On the other hand, if things are done correctly, the server_id check will make no difference, as a GTID with a different server_id cannot get into the binlog without also getting into gtid_slave_pos.

But for now I have other, more critical things I want to fix first - I think this is not a critical thing; just setting gtid_slave_pos on the new server should make things work for you? (Else let me know if I missed something.)

 - Kristian.