-----------------------------------------------------------------------
                              WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

TASK...........: Extend crash recovery to recover non-prepared transactions from binlog
CREATION DATE..: Wed, 27 Oct 2010, 13:08
SUPERVISOR.....:
IMPLEMENTOR....:
COPIES TO......:
CATEGORY.......: Server-RawIdeaBin
TASK ID........: 164 (http://askmonty.org/worklog/?tid=164)

VERSION........: Server-9.x
STATUS.........: Un-Assigned
PRIORITY.......: 60
WORKED HOURS...: 0
ESTIMATE.......: 0 (hours remain)
ORIG. ESTIMATE.: 0

PROGRESS NOTES:

DESCRIPTION:

Currently, when an xa-capable storage engine commits a transaction that
is also written to the binary log, we do three fsync() calls, which is
expensive: one in prepare(), one for the binlog write(), and one in
commit(). These multiple fsync()s are needed to ensure that crash
recovery can recover into a consistent state after a crash that happens
during commit, so that any transaction that becomes committed in the
storage engine is also committed in the binlog, and vice versa. This is
essential for replication, as otherwise a crashed master may not be able
to recover into a usable state without a full restore from backup. The
fsync()s are also needed to ensure durability, that is, that any
transaction reported to the client as committed remains committed after
a crash.

This worklog is about optimising this so that we can reliably recover
into a consistent state (and preserve durability) with just a single
fsync(), the one for the binary log. In effect, this will allow running
with --innodb-flush-log-at-trx-commit=2, or even =0, and still preserve
both durability of InnoDB commits and consistent crash recovery.

The idea is that along with every commit, InnoDB will also store an
identification of the transaction associated with that commit (in a
transaction-safe way, e.g. using the InnoDB transaction log). In fact,
InnoDB already does this, since it records the binlog position of the
last commit done at any point in time.

Then during crash recovery, we can do as follows:

 - First let each engine recover into an engine-consistent state. This
   may be missing some transactions due to
   --innodb-flush-log-at-trx-commit=0.

 - Ask each transactional engine for the ID of the last transaction
   committed.

 - By virtue of MWL#116, we know that the order of commits in the
   storage engine and in the binlog is the same. Thus, the transactions
   missing in the engine are exactly those that are written in the
   binlog after the last transaction committed in the engine.

 - Recover the missing transactions in the engine by re-playing the
   binary log from that point to the end (similar to how it would be
   applied on a slave).

There are some restrictions/requirements:

 - This will only work for engines that implement the MWL#116
   commit_ordered() handlerton method.

 - After a crash, different engines (e.g. PBXT vs. InnoDB) may be in
   different states (different last transaction committed). It will be
   necessary to recover each engine separately, at least up to a point
   where they are consistent. This should be simple enough for
   transactions that only involve tables from one engine. For
   cross-engine transactions, it can probably work for row-based
   replication, simply by ignoring updates to tables in other engines
   when applying the binlog events. But for statement-based replication
   it will not work easily [*].

(Two rough, illustrative sketches of the commit path and of the
recovery pass follow below.)
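To make the fsync() accounting concrete, here is a minimal, hypothetical
C++ sketch of the two commit paths. The Engine and Binlog types and all
of their methods are made-up stand-ins for illustration only, not actual
server APIs; the point is just that the current path pays three fsync()
calls per commit, while the proposed path keeps only the binlog fsync()
and relies on MWL#116 commit ordering plus binlog replay for the rest.

  // Hypothetical sketch (not real server code): the current three-fsync
  // commit path vs. the proposed single-fsync path.
  #include <cstdint>

  struct Engine {                      // stand-in for a handlerton
    void prepare_in_log() {}           // write prepare record to engine log
    void commit_in_log() {}            // write commit record to engine log
    void fsync_log() {}                // flush the engine log to disk
    void commit_ordered(std::uint64_t binlog_pos) {
      // MWL#116: commit in binlog order and record the binlog position
      // of this commit inside the engine's transaction log (no fsync).
      last_committed_binlog_pos = binlog_pos;
      commit_in_log();
    }
    std::uint64_t last_committed_binlog_pos = 0;
  };

  struct Binlog {                      // stand-in for the binary log
    std::uint64_t append() { return ++end_pos; }  // write the trx's events
    void fsync() {}
    std::uint64_t end_pos = 0;
  };

  // Current behaviour: three fsync() calls for every (group of) commits.
  void commit_with_three_fsyncs(Engine &e, Binlog &b) {
    e.prepare_in_log();  e.fsync_log();       // 1: engine prepare
    b.append();          b.fsync();           // 2: binlog write
    e.commit_in_log();   e.fsync_log();       // 3: engine commit
  }

  // Proposed behaviour: only the binlog is fsync()ed; the engine commit
  // becomes durable lazily (as with --innodb-flush-log-at-trx-commit=0/2)
  // and, if lost in a crash, is recovered from the binlog afterwards.
  void commit_with_one_fsync(Engine &e, Binlog &b) {
    e.prepare_in_log();                       // no fsync
    std::uint64_t pos = b.append();
    b.fsync();                                // the single fsync
    e.commit_ordered(pos);                    // no fsync
  }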
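And a similarly hypothetical sketch of the recovery pass itself, under
the same caveat that BinlogEvent, engine_recover(), apply() and friends
are illustrative names, not existing server interfaces. It follows the
steps above: recover each engine, ask it for the binlog position of its
last commit, then scan the binlog from the oldest such position and
apply each transaction only to the engines that have not yet committed
it, which is also how cross-engine row-based transactions would be
handled (updates for engines that are already past the event are simply
skipped).

  // Hypothetical sketch (not real server code) of binlog-based crash
  // recovery for engines implementing commit_ordered() (MWL#116).
  #include <algorithm>
  #include <cstdint>
  #include <vector>

  struct BinlogEvent {                 // one committed trx in the binlog
    std::uint64_t end_pos;             // binlog position just after this commit
    // ... the transaction's row events, grouped per engine, go here ...
  };

  struct Engine {                      // stand-in for a transactional engine
    std::uint64_t last_committed_binlog_pos;  // recorded by commit_ordered()
    void engine_recover() {}           // the engine's own crash recovery
    void apply(const BinlogEvent &) {} // replay only this engine's updates
  };

  void binlog_crash_recovery(std::vector<Engine> &engines,
                             const std::vector<BinlogEvent> &binlog) {
    // 1. Let each engine recover into an engine-consistent state; it may
    //    still be missing the tail of its committed transactions.
    for (Engine &e : engines)
      e.engine_recover();

    // 2. Ask each engine for the binlog position of its last commit, and
    //    start scanning the binlog from the oldest of those positions.
    std::uint64_t scan_from = UINT64_MAX;
    for (const Engine &e : engines)
      scan_from = std::min(scan_from, e.last_committed_binlog_pos);

    // 3. Since commits happen in binlog order (MWL#116), everything in
    //    the binlog after an engine's recorded position is exactly what
    //    that engine is missing: replay it there, slave-style, skipping
    //    engines that already have it.
    for (const BinlogEvent &ev : binlog) {
      if (ev.end_pos <= scan_from)
        continue;
      for (Engine &e : engines)
        if (ev.end_pos > e.last_committed_binlog_pos)
          e.apply(ev);
    }
  }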
This way, it should be possible to greatly reduce the fsync() overhead
when using the binlog with a transactional engine.

[*] One could imagine trying to make it work even for statement-based
binlog events, at least for MVCC engines. When running the updates
against one engine, any select against the other engine could try to
use an MVCC snapshot corresponding to the appropriate transaction back
in time. But I think this is not realistic / worth it to implement.

ESTIMATED WORK TIME

ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v4.0.0)