Hi Kristian!

On Mon, May 2, 2016 at 2:10 PM, Kristian Nielsen <knielsen@knielsen-hq.org> wrote:
Nirbhay Choubey <nirbhay@mariadb.com> writes:

[Cc: maria-developers@, please always keep these discussions on the mailing list]

> In Galera cluster, the state transfer scripts perform FTWRL and
> copy data along with the last of all available binlog files to the
> joiner node.
>
> After MDEV-181, I understand that the binlog checkpoint can be
> in any of the binary log files (and not necessarily the last one).
>
> This seemingly has caused MDEV-9423, in which the joiner node
> complains of the missing binlog file.
>
> Now the question is: is FTWRL not sufficient to ensure that the
> checkpoint is always in the last binlog file?

So if I understand correctly, the issue is related to having binlog files
available during XA crash recovery. When the binlog file is rotated, there
is a small window where both the latest and the previous binlog files are
needed for crash recovery. The binlog checkpoint is the earliest binlog file
that is needed for crash recovery, and it can be seen from the binlog
checkpoint event.

So the problem here is that a copy is made just after binlog rotation, and
Galera only copies the most recent, mostly-empty binlog file, leaving
insufficient information for XA recovery, right?

Correct.
 

One option to solve this is to always copy the last two binlog files. While
it is theoretically possible to have the binlog checkpoint more than two
files back, I think it will not occur in practice. 
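
A minimal sketch of this first option, assuming the state transfer script can read the binlog index file (a plain-text file listing one binlog file per line). The function name `last_binlogs` and the `count` parameter are hypothetical and are not part of any existing Galera script:

```python
from pathlib import Path


def last_binlogs(index_path, count=2):
    """Return the last `count` entries of a binlog index file.

    Assumption: the index file lists one binlog file per line, oldest
    first. If fewer than `count` files exist, all of them are returned.
    """
    if count <= 0:
        raise ValueError("count must be positive")
    lines = Path(index_path).read_text().splitlines()
    # Skip blank lines so that a trailing newline does not count as an entry.
    names = [line.strip() for line in lines if line.strip()]
    return names[-count:]
```

The list returned here would then be handed to whatever copies files to the joiner. Note that this only covers the common case; a checkpoint more than two files back would still be missed.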

Another option is to wait for the binlog checkpoint to reach the current
binlog file. You can see this done in the test suite:

  mysql-test/include/wait_for_binlog_checkpoint.inc

The binlog checkpointing happens asynchronously. I *think* it can complete
even while FTWRL is active, but I am not 100% sure.

The checkpoint happens after InnoDB has made its commits durable with
fsync() or similar - only after that is it safe to discard the old binlog
data and still have correct crash recovery.
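
The waiting option above could be sketched roughly as a polling loop. This is only an illustration, not the logic of wait_for_binlog_checkpoint.inc or of any patch: it assumes the caller supplies a hypothetical `fetch_events` callable that returns the rows of SHOW BINLOG EVENTS for the current binlog file as dicts, and that the Info column of a Binlog_checkpoint row holds the checkpoint file name in the same form as `current_binlog`.

```python
import time


def checkpoint_reached(events, current_binlog):
    """Return True if any Binlog_checkpoint row names the current binlog.

    Assumption: each row is a dict with "Event_type" and "Info" keys.
    """
    for row in events:
        if row["Event_type"] == "Binlog_checkpoint" and row["Info"] == current_binlog:
            return True
    return False


def wait_for_checkpoint(fetch_events, current_binlog,
                        timeout=30.0, interval=0.1, sleep=time.sleep):
    """Poll until the checkpoint reaches current_binlog or the timeout expires.

    fetch_events is a hypothetical callable supplied by the caller; it is
    invoked at least once, even with timeout=0. Returns True on success,
    False on timeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        if checkpoint_reached(fetch_events(), current_binlog):
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval)
```

A timeout is included because, as noted, it is not certain that the checkpoint can complete while FTWRL is held; without one the donor could wait forever.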

While copying the last 2 binlog files would have solved this, I have worked out
a solution where the donor node waits for the binlog checkpoint event for the last
binlog file to be logged before proceeding with the file transfer.

http://lists.askmonty.org/pipermail/commits/2016-June/009483.html

By the way, I initially tried reusing is_xidlist_idle_nolock()/COND_xid_list to implement the
waiting mechanism. But since binlog checkpoint events are written asynchronously after
xid_count falls to 0, that did not work. So I later came up with the above patch.

Best,
Nirbhay

 

 - Kristian.