Thanks for the pointers, Kristian. It turned out that it was an incorrect merge of GTID-related changes into 10.0.1 tree. In particular I didn't quite understand the meaning of LINE_FOR_FIRST_MYSQL_5_6 in sql/rpl_mi.cc and thus left it at the value of 19. Because of that every time when master.info was read slave didn't see the value of using_gtid variable and didn't purge relay logs. Now I've got this fixed for us.

BTW, while debugging I've noticed that master.info file wasn't properly truncated when it was re-written. I always saw some left-over text after the line "using_gtid=1". After finish of rpl.rpl_gtid_crash it could be even several lines of extra text with one more line "using_gtid=1" at the end. So probably you will want to fix this at some point...


Pavel


On Mon, May 27, 2013 at 2:11 AM, Kristian Nielsen <knielsen@knielsen-hq.org> wrote:
Pavel Ivanov <pivanof@google.com> writes:

> See logs in attachment. It looks clear to me that relay logs get replayed
> after crash and don't get deleted. I'm not sure though if the reason of
> that can be seen in the logs.

No, I agree it's not clear from the logs one way or the other, so good that
you mention this.

> Can you point me to the code where relay logs are deleted? I'll try to
> check why it's not called...

It is this, in slave.cc start_slave_threads():

  if (mi->using_gtid != Master_info::USE_GTID_NO &&
      !mi->slave_running && !mi->rli.slave_running)
  {
    purge_relay_logs(&mi->rli, NULL, 0, &errmsg);
    mi->master_log_name[0]= 0;
    mi->master_log_pos= 0;
  }

When both slave IO and SQL threads are stopped, we purge the relay logs before
starting. And surely they must be stopped when we first start them after a
crash ...

And then in get_master_version_and_clock(), if we deleted the relay logs then
we start from the GTID position:

  if (mi->using_gtid != Master_info::USE_GTID_NO && !mi->master_log_name[0])
  {
    ...

I would suggest printouts to the error log around those two places in the code
and check that both are executed as expected at startup. Maybe either if()
condition becomes false for some reason, or purge_relay_logs() fails to purge,
but I do not see how from the code at the moment.

It would also help if I had implemented MDEV-4491
(https://mariadb.atlassian.net/browse/MDEV-4491), to give better printouts in
the error log about where and how the slave is actually connecting to the
master. I will try to get this done soon.

One thing did spring to mind as I was looking at this code, the XtraDB
--innodb_recovery_update_relay_log option. This can overwrite relay-log
information during crash recovery, so might be able to interfere. But it is
off by default and does printouts to error log that are not seen in your logs,
so it seems this is not related ...

Thanks for helping look into this. I actually plan to change this code to
something more robust, but it would be good to find the real problem here,
otherwise I may end up just hiding the bug, not fixing it ...

 - Kristian.