Re: [Maria-developers] Incorrect format description event skipping in queue_event()

28 Aug 2013


      ...
So suppose now we have a two-statement transaction:
BEGIN; UPDATE t1 SET a=a+1; UPDATE t2 SET b=b-1; COMMIT
and that slave IO thread reconnects in the middle (between the two updates).
First we receive the BEGIN and the first UPDATE and write it to relay log 1:
Format_Description (slave)
    Format_Description (master)
    Gtid_list
    Binlog_checkpoint
    GTID BEGIN 0-1-1
    Query UPDATE t1 SET a=a+1
Then we disconnect and reconnect and write the remaining events to
relay log 2:
Format_Description (slave)
    # master format description received but skipped
    # GTID 0-1-1 skipped
    # Query UPDATE t1 skipped
    UPDATE t2 SET b=b-1
    XID COMMIT
and the SQL thread sees an intact transaction. (I am not sure I got 100% right
exactly which non-transaction events are written, but this is the overall
idea).
And so, the reason it is ok to skip the format description event from the
master in relay log 2 is that this _only_ happens when we know for sure that
the same master format description event was already written to an earlier
relay log and seen by the slave. Or put another way, if we had not reconnected
to the master at this point, we would not have received the master's format
description event at this point in the first place, but things would still
have worked without it obviously.
Makes sense? Or did I miss some case that could make this cause problems?
Yes, this causes problems because SQL thread sees Format_Description
(slave) from relay log 2 and replaces its description_event_for_exec
with it. So from this moment it continues to parse relay log events
using slave's format description when all events actually created (and
understood by IO thread) using master's format description.
...
...
When IO thread is reconnecting it rotates relay log and as I said it
writes format description event at the beginning of the new file. But
it writes an event that it created itself, i.e. not the one that
master have sent. And as format description event from master is not
written into relay log SQL thread from this point on starts to use
format description generated by slave which may be different from the
one generated by master. It may lead to a broken replication and SQL
But this must be the same problem with normal replication? Whenever the slave
decides to rotate the relay log, it will write a format description event
created by itself with no following format description created on the
master. So it seems this must work somehow, though I'll frankly admit I do not
understand the details of how this works (do you know?)
...
= 4 then it writes description_event_for_queue into relay log too.
Also it ensures that the event has created = 0 and artificial_event
set to 1. So SQL thread still gets the master's format description and
doesn't rollback the transaction.
When IO thread reconnects to master the first event it receives is
Rotate. For Rotate event queue_event() executes process_io_rotate().
Inside there if
mi->rli.relay_log.description_event_for_queue->binlog_version >= 4 it
forcefully replaces description_event_for_queue with new event with
binlog_version = 3. Then it does the actual relay log rotation during
which description_event_for_queue is not written into the new log file
(and it shouldn't as it's not master's at this point anyway). The next
event IO thread receives is Format_Description. If it's written to
relay log at this point then SQL thread will proceed using master's
Yes, I investigated this. During normal replication (when relay log is
rotated automatically due to max_size) slave's format description is
written at the beginning of the new realy log file, but right after
that there's code that if description_event_for_queue->binlog_version
format description. And I believe I tested the situation when IO
thread reconnects in the middle of transaction and writes format
description from master (I've commented those lines you pointed to)
and I didn't see that transaction rolled back, everything was fine.
Maybe I messed up with testing, I'll try to retest again...
...
...
Another somewhat related question: Gtid_log_event::peek() (as well as
Gtid_log_event constructor from const char* buf) is implemented with
assumption that Format_description_log_event::common_header_len is
always equal to LOG_EVENT_HEADER_LEN. While currently it's true I
Agree, it looks like a bug.
Do you have the possibility to help with this? It is a bit hard for me to test
such a fix as I do not have an easy way to generate binlogs with different
header lengths, but I think perhaps that your team has such capability?
I can try, although first quick look showed that it can be tricky:
peek() is used not only in IO thread, but during scanning binlogs too.
I didn't look yet if code has correct description_event available at
this point.


Pavel

Re: [Maria-developers] Incorrect format description event skipping in queue_event()

Pavel Ivanov