Pavel Ivanov <pivanof@google.com> writes:
> You have the following comment in queue_event() in sql/slave.cc:
>
>     /*
>       Do not queue any format description event that we receive after a
>       reconnect where we are skipping over a partial event group
>       received before the reconnect.
>
>       (If we queued such an event, and it was the first
>       format_description event after master restart, the slave SQL
>       thread would think that the partial event group before it in the
>       relay log was from a previous master crash and should be rolled
>       back).
>     */
>
> I don't understand which failure scenario you are talking about here,
> and I claim that this bypassing of queuing into the relay log is
> incorrect.
It is this code, in Format_description_log_event::do_apply_event():

    /*
      As a transaction NEVER spans on 2 or more binlogs:
      if we have an active transaction at this point, the master died
      while writing the transaction to the binary log, i.e. while
      flushing the binlog cache to the binlog. XA guarantees that master
      has rolled back. So we roll back.
      Note: this event could be sent by the master to inform us of the
      format of its binlog; in other words maybe it is not at its
      original place when it comes to us; we'll know this by checking
      log_pos ("artificial" events have log_pos == 0).
    */
    if (!is_artificial_event() && created && thd->transaction.all.ha_list)
    {
      /* This is not an error (XA is safe), just an information */
      rli->report(INFORMATION_LEVEL, 0,
                  "Rolling back unfinished transaction (no COMMIT "
                  "or ROLLBACK in relay log). A probable cause is that "
                  "the master died while writing the transaction to "
                  "its binary log, thus rolled back too.");
      const_cast<Relay_log_info*>(rli)->cleanup_context(thd, 1);
    }

Try it: comment out these lines in slave.cc

    if (unlikely(mi->gtid_reconnect_event_skip_count && !mi->gtid_event_seen))
      gtid_skip_enqueue= true;

and run the rpl.rpl_gtid_reconnect test. It will fail, because the code
above rolls back part of a transaction.

So let me first explain the background for the reconnect-handling code,
and then why I think the skipping you pointed out is ok. And if you still
think there is an issue after that, please be persistent: this is a really
tricky part of the replication code that is very hard to get right, so
there could definitely still be bugs.

The issue is when the slave IO thread reconnects to the master, maybe
because of an intermittent network issue or a master shutdown, or because
of an explicit STOP SLAVE IO_THREAD. The SQL thread keeps running while
the reconnect happens. If the reconnect happens in the middle of an event
group, we have half a group written to the relay log, and potentially
already partially executed by the SQL thread.
So we need to write the second half of the event group to the relay log,
no more and no less. However, since we connect at a GTID position, we will
receive from the master all the events from the start of the event group.
So what we do is remember how many events of the event group we already
wrote to the relay log before the disconnect; after reconnect, we then
skip that number of events before writing the rest of the event group to
the relay log.

[Note that for reconnect after a full STOP SLAVE there is no issue, as in
that case we clear all relay logs and start over from a well-defined GTID
position. BTW, this is Bug#69758 in the MySQL 5.6 GTID implementation;
basically, the developers totally ignored the slave reconnect issue ... :-/]

So suppose now we have a two-statement transaction:

    BEGIN;
    UPDATE t1 SET a=a+1;
    UPDATE t2 SET b=b-1;
    COMMIT;

and that the slave IO thread reconnects in the middle (between the two
updates). First we receive the BEGIN and the first UPDATE and write them
to relay log 1:

    Format_Description (slave)
    Format_Description (master)
    Gtid_list
    Binlog_checkpoint
    GTID BEGIN 0-1-1
    Query UPDATE t1 SET a=a+1

Then we disconnect, reconnect, and write the remaining events to relay
log 2:

    Format_Description (slave)
    # master format description received but skipped
    # GTID 0-1-1 skipped
    # Query UPDATE t1 skipped
    UPDATE t2 SET b=b-1
    XID COMMIT

and the SQL thread sees an intact transaction. (I am not sure I got 100%
right exactly which non-transaction events are written, but this is the
overall idea.)

And so, the reason it is ok to skip the format description event from the
master in relay log 2 is that this _only_ happens when we know for sure
that the same master format description event was already written to an
earlier relay log and seen by the slave.
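The bookkeeping above can be sketched as follows. This is a toy model,
not the actual queue_event() code from sql/slave.cc; the skip_count
parameter plays the role of mi->gtid_reconnect_event_skip_count, and
events are represented as plain labels:

```cpp
#include <string>
#include <vector>

// Toy model of the IO thread's relay-log bookkeeping after a GTID-based
// reconnect (hypothetical; the real logic lives in queue_event()).
// skip_count is how many events of the interrupted event group were
// already written to the relay log before the disconnect.
std::vector<std::string>
events_to_enqueue(const std::vector<std::string> &from_master, int skip_count)
{
  std::vector<std::string> enqueued;
  for (const std::string &ev : from_master)
  {
    if (skip_count > 0)
    {
      // The master re-sends its format description event on reconnect;
      // drop it without decrementing the counter, since the same event
      // was already written to an earlier relay log.
      if (ev == "Format_Description")
        continue;
      // Skip the events of the interrupted group that were already
      // written to the relay log before the disconnect.
      --skip_count;
      continue;
    }
    enqueued.push_back(ev);
  }
  return enqueued;
}
```

For the two-statement transaction interrupted after the first UPDATE, the
remembered count is 2 (the GTID event and the first Query), so after the
reconnect only the second Query and the XID get written to relay log 2.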
Or put another way: if we had not reconnected to the master at this point,
we would not have received the master's format description event at this
point in the first place, but things would obviously still have worked
without it. Makes sense? Or did I miss some case that could make this
cause problems?
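To spell out the failure mode the skip avoids, here is a minimal model
(hypothetical, not the real server code) of the rollback test in
do_apply_event() quoted earlier; the real code checks
thd->transaction.all.ha_list for the open-transaction part:

```cpp
// Hypothetical model of the rollback decision in
// Format_description_log_event::do_apply_event(): a real (non-artificial)
// format description event with 'created' set, applied while a
// transaction is still open, makes the SQL thread roll that transaction
// back.
bool would_roll_back(bool is_artificial_event,  // artificial events have log_pos == 0
                     bool created,              // set on the first FD after master restart
                     bool transaction_open)     // partially-applied event group exists
{
  return !is_artificial_event && created && transaction_open;
}
```

In the reconnect scenario, queuing the master's format description event
in the middle of the half-applied group would make all three conditions
true at once, rolling back half the transaction; skipping the enqueue
keeps that combination from ever arising.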
> When the IO thread is reconnecting, it rotates the relay log and, as I
> said, it writes a format description event at the beginning of the new
> file. But it writes an event that it created itself, i.e. not the one
> that the master sent. And as the format description event from the
> master is not written into the relay log, the SQL thread from this
> point on starts to use the format description generated by the slave,
> which may be different from the one generated by the master. It may
> lead to broken replication and SQL
But isn't this the same problem as with normal replication? Whenever the
slave decides to rotate the relay log, it writes a format description
event created by itself, with no following format description created on
the master. So it seems this must work somehow, though I'll frankly admit
I do not understand the details of how it works (do you know?).
> Another somewhat related question: Gtid_log_event::peek() (as well as
> the Gtid_log_event constructor from const char *buf) is implemented
> with the assumption that
> Format_description_log_event::common_header_len is always equal to
> LOG_EVENT_HEADER_LEN. While currently it's true I
Agree, it looks like a bug. Do you have the possibility to help with this?
It is a bit hard for me to test such a fix, as I do not have an easy way
to generate binlogs with different header lengths, but I think perhaps
your team has such a capability?

 - Kristian.
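For what it's worth, the shape of a fix would be to thread the actual
common_header_len through to the parsing code instead of hard-coding the
LOG_EVENT_HEADER_LEN constant. A rough sketch of the idea (hypothetical
helper, not the real Gtid_log_event::peek() signature, and with a
simplified post-header layout: just the first 8 bytes after the common
header, read little-endian):

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical sketch: parse a GTID event's sequence number using the
// common header length taken from the Format_description event currently
// in effect, rather than assuming the fixed 19-byte default.
bool peek_gtid_seq_no(const unsigned char *buf, size_t buf_len,
                      uint8_t common_header_len, uint64_t *seq_no_out)
{
  if (buf_len < (size_t)common_header_len + 8)
    return false;                          // event shorter than header + seq_no
  uint64_t v;
  memcpy(&v, buf + common_header_len, 8);  // assumes a little-endian host
  *seq_no_out = v;
  return true;
}
```

The length check also guards against a truncated event, which the
constant-based code would misparse rather than reject.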