Pavel Ivanov
You have the following comment in queue_event() in sql/slave.cc:
/* Do not queue any format description event that we receive after a reconnect where we are skipping over a partial event group received before the reconnect.
(If we queued such an event, and it was the first format_description event after master restart, the slave SQL thread would think that the partial event group before it in the relay log was from a previous master crash and should be rolled back). */
I don't understand which failure scenario you are talking about here, and I claim that this bypassing of queuing into the relay log is incorrect.
It is this code, in Format_description_log_event::do_apply_event():
/*
As a transaction NEVER spans on 2 or more binlogs:
if we have an active transaction at this point, the master died
while writing the transaction to the binary log, i.e. while
flushing the binlog cache to the binlog. XA guarantees that master has
rolled back. So we roll back.
Note: this event could be sent by the master to inform us of the
format of its binlog; in other words maybe it is not at its
original place when it comes to us; we'll know this by checking
log_pos ("artificial" events have log_pos == 0).
*/
if (!is_artificial_event() && created && thd->transaction.all.ha_list)
{
  /* This is not an error (XA is safe), just an information */
  rli->report(INFORMATION_LEVEL, 0,
              "Rolling back unfinished transaction (no COMMIT "
              "or ROLLBACK in relay log). A probable cause is that "
              "the master died while writing the transaction to "
              "its binary log, thus rolled back too.");
const_cast
When the IO thread reconnects, it rotates the relay log and, as I said, writes a format description event at the beginning of the new file. But it writes an event that it created itself, i.e. not the one that the master has sent. And since the format description event from the master is not written into the relay log, from this point on the SQL thread uses the format description generated by the slave, which may differ from the one generated by the master. It may lead to a broken replication and SQL
But this must be the same problem with normal replication? Whenever the slave decides to rotate the relay log, it will write a format description event created by itself with no following format description created on the master. So it seems this must work somehow, though I'll frankly admit I do not understand the details of how this works (do you know?)
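To make it easier to reason about which format description events can trigger the rollback, here is a minimal sketch of the condition quoted above from Format_description_log_event::do_apply_event(). The struct FdEventView and the function triggers_rollback() are hypothetical names invented for illustration, not server code, and treating `created` as "non-zero means set" is an assumption based on how the quoted `if` uses it.

```cpp
#include <cassert>

// Hypothetical, flattened view of the three inputs tested by the quoted `if`.
// None of these names exist in the server; they only mirror the condition.
struct FdEventView {
  bool artificial;        // stands in for is_artificial_event()
  long created;           // stands in for `created`; 0 is assumed to mean "not set"
  bool transaction_open;  // stands in for thd->transaction.all.ha_list != nullptr
};

// Returns true exactly when the quoted condition would roll back the
// partial transaction: a non-artificial event with `created` set, seen
// while a transaction is still open.
inline bool triggers_rollback(const FdEventView &ev) {
  return !ev.artificial && ev.created != 0 && ev.transaction_open;
}
```

Under this reading, a format description event avoids the rollback path if it is artificial, if its `created` field is zero, or if no transaction is open when it is applied.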
Another somewhat related question: Gtid_log_event::peek() (as well as the Gtid_log_event constructor from const char* buf) is implemented with the assumption that Format_description_log_event::common_header_len is always equal to LOG_EVENT_HEADER_LEN. While currently it's true I
Agreed, it looks like a bug. Would you be able to help with this? It is a bit hard for me to test such a fix as I do not have an easy way to generate binlogs with different header lengths, but I think perhaps that your team has such capability?
- Kristian.
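
The header-length concern raised above can be illustrated with a small sketch. It is not the server's code: FormatDescription, read_u64_le() and peek_seq_no() are hypothetical names, and the layout (an 8-byte little-endian sequence number immediately after the common header) is an assumption made for the example. The point is only that the offset is taken from the format description's common_header_len instead of a fixed 19-byte constant.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for the one field of the format description
// event that matters here; not the server's real class.
struct FormatDescription {
  std::uint8_t common_header_len;
};

// Fixed header size that the current code is said to assume (19 bytes).
constexpr std::size_t kLogEventHeaderLen = 19;

// Read a little-endian 64-bit value from p.
inline std::uint64_t read_u64_le(const unsigned char *p) {
  std::uint64_t v = 0;
  for (int i = 7; i >= 0; --i) v = (v << 8) | p[i];
  return v;
}

// Hypothetical peek helper: reads the 8-byte value that is assumed to
// follow the common header, using the header length advertised by the
// format description rather than kLogEventHeaderLen.
// Returns false if the buffer is too short.
inline bool peek_seq_no(const unsigned char *buf, std::size_t len,
                        const FormatDescription &fdev, std::uint64_t *out) {
  assert(buf != nullptr && out != nullptr);
  const std::size_t header_len = fdev.common_header_len;
  if (len < header_len + 8) return false;
  *out = read_u64_le(buf + header_len);
  return true;
}
```

With a format description that advertises a longer common header (say 27 bytes), a reader that hard-codes kLogEventHeaderLen would start reading 8 bytes too early and return the wrong value, which is the failure mode the question describes.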