Hi, Kristian!

On Oct 26, Kristian Nielsen wrote:
Currently, when an InnoDB/XtraDB transaction is committed with the binlog enabled, we do three fsync()'s:
1. Inside prepare() in InnoDB
2. When writing to the binlog
3. Inside commit() in InnoDB
... why do we need the fsync() in commit()?
We do not need it to ensure durability or consistency. If we crash after commit() returns (or even just after the binlog write finishes), but before the InnoDB commit reaches disk, crash recovery at the next server start will re-commit the transaction inside InnoDB.
In fact, it seems to me the only reason for the third fsync() is that we call TC_LOG_BINLOG::unlog() after InnoDB commit() returns. And unlog() may decide to rotate the binlog once it has been called for all transactions written to the current log file. And during recovery, we only read the latest binlog, so transactions in older binlogs must have reached disk for recovery to work.
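The rotation problem described above can be illustrated with a toy model. This is only a sketch under stated assumptions, not MariaDB code: the function `recover`, its parameters, and the list-of-XIDs representation of a binlog file are all hypothetical. It assumes recovery commits a prepared transaction if its XID is found in the scanned binlog and rolls it back otherwise.

```python
def recover(prepared_xids, binlog_files, files_to_scan=1):
    """Decide the fate of each transaction the engine left in the prepared state.

    prepared_xids: XIDs the storage engine reports as prepared but not committed.
    binlog_files:  binlog files, oldest first; each is a list of XIDs whose
                   binlog write reached disk (hypothetical representation).
    files_to_scan: how many of the newest binlog files recovery reads.
    """
    scanned = set()
    for xids in binlog_files[-files_to_scan:]:
        scanned.update(xids)
    # Found in the scanned binlog: re-commit. Not found: roll back.
    return {xid: ("commit" if xid in scanned else "rollback")
            for xid in prepared_xids}


# Transaction 7 was written to the old binlog, the binlog was rotated, and the
# server crashed before the engine commit of 7 reached disk (no third fsync).
binlogs = [[5, 6, 7], [8]]
decisions = recover(prepared_xids=[7, 8], binlog_files=binlogs)
# Transaction 7 is rolled back although it is in the binlog, so the engine
# and the binlog now disagree.
```

In this model, scanning only the newest file loses transaction 7, which is why every transaction in an older binlog must already be durable in the engine before rotation.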
Do you agree that this is the only reason the third fsync() is needed?
Yes, sounds logical.
If so, it seems it would not be too hard to avoid that fsync(). E.g. we could recover from the last two binlog files instead of only one. We would need a mechanism for InnoDB to tell the binlog that transaction `Xid' reached the disk, in an asynchronous way (after returning from commit()).
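One way to picture such an asynchronous notification is sketched below. It is a toy model, not the server's implementation: the class `BinlogTracker` and its methods are hypothetical names, and it assumes that rotation is simply deferred until the engine has reported every transaction in the current binlog file as durable.

```python
class BinlogTracker:
    """Tracks XIDs written to the current binlog file that the engine has
    not yet reported as durable (hypothetical model)."""

    def __init__(self):
        self.pending = set()
        self.rotations = 0

    def log(self, xid):
        # Called when the transaction is written to the binlog.
        self.pending.add(xid)

    def unlog(self, xid):
        # Called by the engine later, once the commit of `xid` is on disk;
        # this no longer has to happen before commit() returns.
        self.pending.discard(xid)

    def try_rotate(self):
        # Rotation is safe only when no transaction in this file could
        # still need the file during crash recovery.
        if self.pending:
            return False
        self.rotations += 1
        return True


tracker = BinlogTracker()
tracker.log(7)
tracker.log(8)
tracker.unlog(7)
first_attempt = tracker.try_rotate()   # 8 is not durable yet
tracker.unlog(8)
second_attempt = tracker.try_rotate()  # all transactions are durable
```

The point of the sketch is that the durability report is decoupled from commit(), so commit() itself does not need to fsync; only rotation waits.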
Reading two, three, or any number of binlogs is not a solution - it only increases the chance that recovery will work, but does not guarantee it. For a correct solution we'll need a way to call unlog() asynchronously.

Regards,
Sergei