Hi Kristian!

On Mon, May 2, 2016 at 2:10 PM, Kristian Nielsen <knielsen@knielsen-hq.org> wrote:
Nirbhay Choubey <nirbhay@mariadb.com> writes:
[Cc: maria-developers@, please always keep these discussions on the mailing list]
In Galera cluster, the state transfer scripts perform FTWRL and copy data along with the last of all available binlog files to the joiner node.
After MDEV-181, I understand that the binlog checkpoint can be in any of the binary log files (and not necessarily the last one).
This seems to have caused MDEV-9423, in which the joiner node complains of a missing binlog file.
Now the question is: is FTWRL not sufficient to ensure that the binlog checkpoint is always in the last binlog file?
So if I understand correctly, the issue is related to having binlog files available during XA crash recovery. When the binlog file is rotated, there is a small window where both the latest and the previous binlog files are needed for crash recovery. The binlog checkpoint is the earliest binlog file that is needed for crash recovery, and it can be seen from the binlog checkpoint event.
So the problem here is that a copy is made just after binlog rotation, and Galera only copies the most recent, mostly-empty binlog file, leaving insufficient information for XA recovery, right?
Correct.
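The recovery requirement described above can be sketched as follows. This is only an illustration, not code from the server or the state transfer scripts: the function name and the sample file names are made up, and it assumes the binlog file names are already known in creation order and that the file named by the latest binlog checkpoint event has been read out.

```python
def binlogs_needed_for_recovery(index_names, checkpoint_name):
    """Return the binlog files from the checkpoint file through the newest one.

    index_names: binlog file names in creation order (oldest first).
    checkpoint_name: file named by the most recent binlog checkpoint event.
    Both inputs are assumed to be supplied by the caller.
    """
    if checkpoint_name not in index_names:
        raise ValueError(f"checkpoint file {checkpoint_name!r} not in index")
    start = index_names.index(checkpoint_name)
    return index_names[start:]


# Hypothetical state right after a rotation: the checkpoint still points at
# the previous file, so the newest file alone is not enough for recovery.
names = ["mysql-bin.000001", "mysql-bin.000002", "mysql-bin.000003"]
needed = binlogs_needed_for_recovery(names, "mysql-bin.000002")
```

Here `needed` holds both 000002 and 000003, which is the case where copying only the last file leaves the joiner short of one file.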
One option to solve this is to always copy the last two binlog files. While it is theoretically possible to have the binlog checkpoint more than two files back, I think it will not occur in practice.
Another option is to wait for the binlog checkpoint to reach the current binlog file. You can see this done in the test suite:
mysql-test/include/wait_for_binlog_checkpoint.inc
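A rough sketch of such a wait loop, for illustration only. It is not the contents of the include file above. It assumes a caller-supplied `fetch_events` callable (hypothetical) that returns the events of the current binlog file as `(event_type, info)` pairs, for example taken from SHOW BINLOG EVENTS, and it assumes that a binlog checkpoint event appears with type `Binlog_checkpoint` and the checkpoint file name in the info column.

```python
import time


def checkpoint_reached(events, current_file):
    """True if the last Binlog_checkpoint event names current_file."""
    last = None
    for event_type, info in events:
        if event_type == "Binlog_checkpoint":
            last = info
    return last == current_file


def wait_for_checkpoint(fetch_events, current_file, timeout=30.0,
                        interval=0.1, sleep=time.sleep, clock=time.monotonic):
    """Poll until the checkpoint reaches current_file or the timeout expires.

    fetch_events is a hypothetical callable supplied by the caller; sleep and
    clock are injectable so the loop can be exercised without real delays.
    Returns True on success, False on timeout.
    """
    deadline = clock() + timeout
    while True:
        if checkpoint_reached(fetch_events(), current_file):
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

Returning False on timeout rather than blocking forever lets the caller decide whether to abort the transfer or fall back to copying more than one binlog file.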
The binlog checkpointing happens asynchronously. I *think* it can complete even while FTWRL is active, but I am not 100% sure.
The checkpoint happens after InnoDB has made its commits durable with fsync() or similar - only after that is it safe to discard the old binlog data and still have correct crash recovery.
While copying the last 2 binlog files would have solved this, I have worked out a solution where the donor node waits for the binlog checkpoint event for the last binlog file to be logged before proceeding with the file transfer.

http://lists.askmonty.org/pipermail/commits/2016-June/009483.html

By the way, I initially tried reusing is_xidlist_idle_nolock()/COND_xid_list to implement the waiting mechanism. But since binlog checkpoint events are written asynchronously after xid_count falls to 0, that did not work. So I later came up with the above patch.

Best,
Nirbhay
- Kristian.