[Maria-developers] Why do we need fsync() in commit() in internal two-phase commit?
Currently, when an InnoDB/XtraDB transaction is committed with the binlog enabled, we do three fsync()'s:

1. Inside prepare() in InnoDB
2. When writing to the binlog
3. Inside commit() in InnoDB

The fsync()s are done when --innodb-flush-log-at-trx-commit=1 and sync_binlog=1; these settings are needed to be able to recover into a consistent state between binlog and InnoDB after a crash during commit.

This got me thinking: why is this really needed?

- I understand why we need the fsync() in prepare(): otherwise, after a crash, we might have a transaction in the binlog that is missing in InnoDB and that we cannot (currently) recover.

- I understand why we need the fsync() in the binlog write: otherwise the commit in InnoDB may reach the disk before the binlog write, and after a crash we might have a transaction in InnoDB that is missing in the binlog and cannot be recovered.

But why do we need the fsync() in commit()?

We do not need it to ensure durability or consistency. If we crash after commit() returns (or just after the binlog write finishes), but before the InnoDB commit reaches disk, the crash recovery at next server start will re-commit the transaction inside InnoDB.

In fact, it seems to me the only reason for the third fsync() is that we call TC_LOG_BINLOG::unlog() after InnoDB commit() returns. And unlog() may decide to rotate the binlog once it has been called for all transactions written to the current log file. And during recovery, we only read the latest binlog, so transactions in older binlogs must have reached disk for recovery to work.

Do you agree that this is the only reason the third fsync() is needed?

If so, it seems it would not be too hard to avoid that fsync(). E.g. we could recover from the last two binlog files instead of only one. We would need a mechanism for InnoDB to tell the binlog that transaction `Xid' reached the disk, in an asynchronous way (after returning from commit()).

[Just wanted to confirm (or the opposite) this reasoning... as we have been talking about a way to avoid both the fsync() in prepare() /and/ the fsync() in commit(), that may be a better project to implement than just avoiding the one in commit().]

 - Kristian.
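The crash cases discussed above can be checked with a toy model. This is only a sketch of the reasoning, not MariaDB code: the function names, the set-based "disk state", and the recovery rule (re-commit a prepared transaction if and only if its XID is found in the scanned binlog) are simplifying assumptions made for illustration.

```python
def recover(prepared, committed, scanned_binlog_xids):
    """Toy crash recovery: decide the fate of prepared-but-uncommitted XIDs.

    prepared / committed: XIDs whose prepare / commit record reached disk
    in the engine before the crash.
    scanned_binlog_xids: XIDs found in the binlog file(s) recovery reads.
    Returns the set of XIDs committed in the engine after recovery.
    """
    result = set(committed)
    for xid in prepared - result:
        if xid in scanned_binlog_xids:
            result.add(xid)  # found in the binlog: re-commit
        # otherwise the transaction is rolled back (not added)
    return result


def consistent(engine_committed, all_binlog_xids):
    """Engine and binlog agree when they hold exactly the same transactions."""
    return engine_committed == all_binlog_xids


# Case A: crash after the binlog fsync; the InnoDB commit never reached disk.
case_a = recover(prepared={1}, committed=set(), scanned_binlog_xids={1})

# Case B: same crash, but the binlog was rotated first, so XID 1 sits in an
# older file and recovery scans only the (empty) latest one.
case_b = recover(prepared={1}, committed=set(), scanned_binlog_xids=set())

# Case C: no fsync in prepare(); the binlog has XID 1 but the engine lost it.
case_c = recover(prepared=set(), committed=set(), scanned_binlog_xids={1})

print(consistent(case_a, {1}))  # True
print(consistent(case_b, {1}))  # False
print(consistent(case_c, {1}))  # False
```

Case A is the situation where the third fsync() is not needed; case B is the rotation problem that makes it necessary today.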
Hi, Kristian!

On Oct 26, Kristian Nielsen wrote:
> Currently, when an InnoDB/XtraDB transaction is committed with the binlog enabled, we do three fsync()'s:
>
> 1. Inside prepare() in InnoDB
> 2. When writing to the binlog
> 3. Inside commit() in InnoDB
> ...
> why do we need the fsync() in commit()?
>
> We do not need it to ensure durability or consistency. If we crash after commit() returns (or just binlog write finishes), but before the InnoDB commit reaches disk, the crash recovery at next server start will re-commit the transaction inside InnoDB.
>
> In fact, it seems to me the only reason for the third fsync() is that we call TC_LOG_BINLOG::unlog() after InnoDB commit() returns. And unlog() may decide to rotate the binlog once it has been called for all transactions written to the current log file. And during recovery, we only read the latest binlog, so transactions in older binlogs must have reached disk for recovery to work.
>
> Do you agree that this is the only reason the third fsync() is needed?

Yes, sounds logical.

> If so, it seems it would not be too hard to avoid that fsync(). E.g. we could recover from the last two binlog files instead of only one. We would need a mechanism for InnoDB to tell the binlog that transaction `Xid' reached the disk, in an asynchronous way (after returning from commit()).

Reading two, three, or any number of binlogs is not a solution: it only increases the chance that recovery works, but does not guarantee it. For a correct solution we'll need a way to call unlog() asynchronously.

Regards,
Sergei
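One way to picture an asynchronous unlog() is a tracker that remembers, per binlog file, which XIDs the engine has not yet reported as durable. The sketch below is purely illustrative and not an existing MariaDB interface: the class `BinlogTracker`, its methods, and the file names are hypothetical, and the state lives in memory only, whereas a real design would also have to persist it across a crash.

```python
class BinlogTracker:
    """Tracks, per binlog file, XIDs not yet known to be durable in the engine."""

    def __init__(self):
        self.pending = {}   # binlog file name -> set of XIDs awaiting unlog()
        self.file_of = {}   # XID -> binlog file name it was written to

    def log(self, binlog_file, xid):
        """Record that `xid` was written (and synced) to `binlog_file`."""
        self.pending.setdefault(binlog_file, set()).add(xid)
        self.file_of[xid] = binlog_file

    def unlog(self, xid):
        """Hypothetical asynchronous notification: the engine reports that the
        commit of `xid` reached disk, possibly long after commit() returned."""
        binlog_file = self.file_of.pop(xid)
        self.pending[binlog_file].discard(xid)
        if not self.pending[binlog_file]:
            del self.pending[binlog_file]

    def files_needed_for_recovery(self):
        """Binlog files recovery must still scan (sorted by zero-padded name)."""
        return sorted(self.pending)


tracker = BinlogTracker()
tracker.log("binlog.000001", 10)
tracker.log("binlog.000001", 11)
tracker.log("binlog.000002", 12)   # written after a rotation
tracker.unlog(11)
tracker.unlog(12)
print(tracker.files_needed_for_recovery())  # ['binlog.000001']
```

Here the older file stays on the recovery list until XID 10 is reported durable, regardless of how many rotations have happened since. That is the guarantee a fixed "last N binlogs" scan cannot give.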
Participants (2): Kristian Nielsen, Sergei Golubchik