
Krisitan, The attached patch doesn't work properly because the format description event it saves is different from what master has sent. It has different number_of_event_types and potentially it has garbage (or incomplete data) in post_header_len. And regarding the fix to Gtid_log_event that we were talking about: I've tried to do that but I felt like I don't understand the code quite well at the moment to do that. So apparently all peek*() functions should have Format_description added as parameter and it's easy to pass one from queue_event(), but then e.g. mysql_binlog_send() will need description event too to pass it to send_event_to_slave() and is_until_reached() so that those could pass it to peek*(). For correctness this format description should be read from binlog, right? That would be pretty non-trivial to do. Also gtid_state_from_pos() needs Format_descirption, but it will apparently meet that at the beginning of the binlog file, so it can be created before it will be actually needed. Although constructor of Format_description_log_event needs another instance of Format_description. Should it be just created like Format_description_log_event(BINLOG_VERSION)? Pavel On Thu, Aug 29, 2013 at 6:24 AM, Kristian Nielsen <knielsen@knielsen-hq.org> wrote:
Pavel Ivanov <pivanof@google.com> writes:
But this must be the same problem with normal replication? Whenever the slave decides to rotate the relay log, it will write a format description event created by itself with no following format description created on the master. So it seems this must work somehow, though I'll frankly admit I do not understand the details of how this works (do you know?)
Yes, I investigated this. During normal replication (when relay log is rotated automatically due to max_size) slave's format description is written at the beginning of the new realy log file, but right after that there's code that if description_event_for_queue->binlog_version
= 4 then it writes description_event_for_queue into relay log too. Also it ensures that the event has created = 0 and artificial_event set to 1. So SQL thread still gets the master's format description and doesn't rollback the transaction.
When IO thread reconnects to master the first event it receives is Rotate. For Rotate event queue_event() executes process_io_rotate(). Inside there if mi->rli.relay_log.description_event_for_queue->binlog_version >= 4 it forcefully replaces description_event_for_queue with new event with binlog_version = 3. Then it does the actual relay log rotation during which description_event_for_queue is not written into the new log file (and it shouldn't as it's not master's at this point anyway). The next
I see.
So one possible solution is to do the same at the reconnect case as what we do in relay-log rotate initiated by slave due to size: Write out the description_event_for_queue to the relay log with created=0 and artificial=1.
I have attached a patch for this, what do you think?
Do you have the possibility to test if this works (eg. when we get a reconnect when the master's description event is incompatible with the slave's)?
[This code really is criminally ugly, even for replication standard. But I do not really know how to fix it in any reasonable way :-(]
Maybe I messed up with testing, I'll try to retest again...
FYI: I took unmodified 10.0-base, commented out those two lines, ran rpl.rpl_gtid_reconnect, and it failed.
Thanks,
- Kristian.