Pavel Ivanov <pivanof@google.com> writes:
But this must be the same problem with normal replication? Whenever the slave decides to rotate the relay log, it will write a format description event created by itself with no following format description created on the master. So it seems this must work somehow, though I'll frankly admit I do not understand the details of how this works (do you know?)
Yes, I investigated this. During normal replication (when the relay log is rotated automatically due to max_size), the slave's format description is written at the beginning of the new relay log file, but right after that there is code which, if description_event_for_queue->binlog_version >= 4, writes description_event_for_queue into the relay log too. It also ensures that the event has created = 0 and artificial_event set to 1. So the SQL thread still gets the master's format description and doesn't roll back the transaction.
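To make the size-triggered rotation case concrete, here is a minimal toy model in C++. It is only a sketch of the behaviour described above, not the server code: the FormatDescription struct, its origin field and the open_new_relay_log() function are hypothetical names invented for illustration, and only the binlog_version >= 4 check, created = 0 and the artificial flag come from the description.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy stand-in for a format description event (hypothetical, not the server's class).
struct FormatDescription {
    std::string origin;   // "slave" or "master"; illustration only
    int binlog_version;
    long created;         // 0 here models "not a fresh master start"
    bool artificial;
};

// Hypothetical model of what ends up at the head of a new relay log file
// when the slave rotates the relay log by itself.
std::vector<FormatDescription> open_new_relay_log(
        const FormatDescription& slave_fd,
        const FormatDescription* description_event_for_queue) {
    assert(slave_fd.origin == "slave");  // sanity check for the model only

    std::vector<FormatDescription> log;
    // The slave's own format description always comes first.
    log.push_back(slave_fd);

    // The master's description is re-written only for binlog_version >= 4,
    // with created = 0 and the artificial flag set.
    if (description_event_for_queue != nullptr &&
        description_event_for_queue->binlog_version >= 4) {
        FormatDescription copy = *description_event_for_queue;
        copy.created = 0;
        copy.artificial = true;
        log.push_back(copy);
    }
    return log;
}
```

In this model, a master event with binlog_version 4 yields a two-event file whose second entry has created == 0, while a version 3 event yields only the slave's own description.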
When the IO thread reconnects to the master, the first event it receives is Rotate. For a Rotate event, queue_event() executes process_io_rotate(). Inside there, if mi->rli.relay_log.description_event_for_queue->binlog_version >= 4, it forcefully replaces description_event_for_queue with a new event with binlog_version = 3. Then it does the actual relay log rotation, during which description_event_for_queue is not written into the new log file (and it shouldn't be, as it's not the master's at this point anyway). The next
I see. So one possible solution is to do the same in the reconnect case as what we do in the relay-log rotation initiated by the slave due to size: write out the description_event_for_queue to the relay log with created=0 and artificial=1. I have attached a patch for this, what do you think? Do you have the possibility to test if this works (e.g. when we get a reconnect when the master's description event is incompatible with the slave's)? [This code really is criminally ugly, even by replication standards. But I do not really know how to fix it in any reasonable way :-(]
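As a rough illustration of the idea (this is not the attached patch, just a toy model of the behaviour discussed in this thread), the reconnect case could be sketched like this in C++. FdEvent, RelayLogModel, on_rotate_after_reconnect() and the proposed_fix switch are all hypothetical names; the placeholder values for the slave's own event are assumptions too.

```cpp
#include <cassert>
#include <vector>

// Toy format description event (hypothetical, not the server's class).
struct FdEvent {
    bool from_master;
    int binlog_version;
    long created;
    bool artificial;
};

// Toy relay log state: the event used for queueing plus the current file's events.
struct RelayLogModel {
    FdEvent description_event_for_queue;
    std::vector<FdEvent> current_file;
};

// Hypothetical model of handling the Rotate event received after a reconnect.
// With proposed_fix == false it mirrors the behaviour described in the thread;
// with proposed_fix == true it also writes the master's description.
void on_rotate_after_reconnect(RelayLogModel& rl, bool proposed_fix) {
    // Remember the master's description before it gets replaced.
    FdEvent masters = rl.description_event_for_queue;

    // A version >= 4 event is forcefully replaced by a binlog_version = 3 one.
    if (rl.description_event_for_queue.binlog_version >= 4) {
        rl.description_event_for_queue = FdEvent{false, 3, 0, false};
    }

    // The new relay log file starts with the slave's own description
    // (created = 1 is just a non-zero placeholder in this model).
    rl.current_file.clear();
    rl.current_file.push_back(FdEvent{false, 4, 1, false});

    // Proposed behaviour: also write the master's description,
    // with created = 0 and the artificial flag set.
    if (proposed_fix && masters.from_master && masters.binlog_version >= 4) {
        masters.created = 0;
        masters.artificial = true;
        rl.current_file.push_back(masters);
    }
}
```

Without the switch, the new file in this model holds only the slave's description; with it, the SQL thread side would also find the master's description, marked so that it is not mistaken for a master restart.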
Maybe I messed up with testing, I'll try to retest again...
FYI: I took unmodified 10.0-base, commented out those two lines, ran rpl.rpl_gtid_reconnect, and it failed.

Thanks,

 - Kristian.