[Maria-developers] WL#187 New (by Knielsen): Consistent (but non-durable) crash recovery from binlog 2-phase commit

21 Mar 2011

      -----------------------------------------------------------------------
                              WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: Consistent (but non-durable) crash recovery from binlog 2-phase commit
CREATION DATE..: Mon, 21 Mar 2011, 11:54
SUPERVISOR.....: 
IMPLEMENTOR....: 
COPIES TO......: 
CATEGORY.......: Server-RawIdeaBin
TASK ID........: 187 (http://askmonty.org/worklog/?tid=187)
VERSION........: Server-9.x
STATUS.........: Un-Assigned
PRIORITY.......: 60
WORKED HOURS...: 0
ESTIMATE.......: 0 (hours remain)
ORIG. ESTIMATE.: 0

PROGRESS NOTES:

DESCRIPTION:

Overview
--------

Current MySQL needs no fewer than three fsync() calls for every commit in order
to ensure the ability to recover into a consistent state (between master
binlog and storage engine(s)). Without this, after a master crash, the binlog
may be inconsistent with the data in tables on the master, and replication
slaves will diverge from the master, possibly causing replication to
eventually break.

MWL#116 implements group commit, so that each of the three fsync() calls
can be shared among multiple transaction commits.

MWL#164 describes a way to reduce this to a single fsync() call shared
among multiple commits, by recovering any transactions lost in storage engines
from the binlog, thereby avoiding the need for fsync() calls in the engines.

Both MWL#116 and MWL#164 improve performance without reducing functionality or
ability to do crash recovery.
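
As a rough back-of-the-envelope illustration of the schemes above, the
following Python sketch counts fsync() calls for a number of commits. The
function name, the scheme labels and the fixed group size are my own
assumptions for illustration only; real group sizes vary with load.

```python
def fsyncs_needed(commits, scheme, group_size=1):
    """Count fsync() calls for `commits` transactions (illustrative model only).

    scheme: "plain"  -> three fsync() calls for every single commit
            "mwl116" -> three fsync() calls shared by each commit group
            "mwl164" -> one fsync() call shared by each commit group
    """
    per_unit = {"plain": 3, "mwl116": 3, "mwl164": 1}[scheme]
    if scheme == "plain":
        return per_unit * commits
    groups = -(-commits // group_size)  # ceiling division: partial groups still sync
    return per_unit * groups
```

With 100 commits arriving in groups of 10, this model gives 300 calls without
group commit, 30 with MWL#116 and 10 with MWL#164.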

This worklog describes a method to reduce the overhead to *zero* fsync() calls
per commit, although one will still want to fsync() the binlog periodically,
say once per second. After a master crash, crash recovery is still
guaranteed to recover the system into a consistent state, where every
transaction committed in the storage engine is also committed in the binlog,
and vice versa. However, the cost of this is that durability is no longer
guaranteed: after a crash, a number of transactions committed before the crash
may be lost.

Thus this worklog is similar to the InnoDB feature
innodb_flush_log_at_trx_commit=2 (or 0). That feature makes InnoDB not call
fsync() after every commit, possibly losing some transactions in case of
crash, but still guaranteeing that the table data can be recovered into a
consistent state. This worklog would do the same, but for the binlog and
InnoDB together, whereas innodb_flush_log_at_trx_commit=2/0 only guarantees
consistency for InnoDB in isolation.
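
The difference between consistency and durability described in this overview
can be made concrete with a toy model. In the Python sketch below, all names
are hypothetical, and it assumes that recovery simply falls back to the prefix
of transactions covered by the last fsync(); how recovery would actually reach
that state is not specified here.

```python
class ToyServer:
    """Toy model: commits are acknowledged before anything is fsync()'ed."""

    def __init__(self):
        self.acknowledged = []  # transactions for which the client saw COMMIT succeed
        self.synced_upto = 0    # number of transactions covered by the last fsync()

    def commit(self, txn):
        # No fsync() on the client path; the commit is acknowledged at once.
        self.acknowledged.append(txn)

    def periodic_fsync(self):
        # Stands in for the occasional fsync(), e.g. once per second.
        self.synced_upto = len(self.acknowledged)

    def crash_and_recover(self):
        """Return (binlog, engine) contents after recovery."""
        survivors = self.acknowledged[:self.synced_upto]
        # Assumption: both sides are brought back to the same synced prefix.
        return list(survivors), list(survivors)
```

If t1 and t2 are committed, an fsync() happens, and t3 is committed just before
a crash, then t3 is lost even though the client saw it succeed, but the binlog
and the engine still agree on t1 and t2.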

Idea
----

This worklog builds on top of the group commit framework implemented in
MWL#116.

The basic idea is similar to MWL#163, "release of row locks in InnoDB during
prepare() phase". As soon as we have successfully run the prepare step in
InnoDB, we make the transaction committed in memory so that it is visible to
other transactions from that point on and all row locks are released.

However, at this point we go further than MWL#163, and return successful
commit to the client connection. The rest of the group commit work is
delegated to a separate background thread: writing and fsync()'ing to the
binlog, running commit_ordered() and commit() inside the storage engine.

The client is thus free to continue with new transactions without waiting for
any fsync() delay. The background thread will do an fsync() call, but since no
client is waiting for it, the background thread is free to do a long wait
(e.g. 1 second) between fsync() calls, potentially collecting lots of
transactions for a single huge group commit.
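
The division of labour described above can be sketched in a few lines of
Python. This is only a minimal model under my own assumptions: the class and
attribute names are invented, the binlog is a plain list, and fsync() is
replaced by a counter. It is not how the server code would look.

```python
import queue
import threading


class BackgroundGroupCommit:
    """Clients return right after the in-memory commit; a worker batches the rest."""

    def __init__(self, flush_interval=1.0):
        self.flush_interval = flush_interval  # long wait between fsync() calls
        self.visible = []      # transactions committed in memory (visible to others)
        self.binlog = []       # stand-in for the binlog file
        self.fsync_calls = 0   # counts simulated fsync() calls
        self._pending = queue.Queue()
        self._stop = threading.Event()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def commit(self, txn):
        # After prepare: make the transaction visible, queue it, return to client.
        self.visible.append(txn)
        self._pending.put(txn)
        return True

    def _run(self):
        while True:
            # Sleep until the interval expires or shutdown is requested.
            stopping = self._stop.wait(self.flush_interval)
            batch = []
            while True:
                try:
                    batch.append(self._pending.get_nowait())
                except queue.Empty:
                    break
            if batch:
                self.binlog.extend(batch)  # write the whole group to the binlog
                self.fsync_calls += 1      # one simulated fsync() for the group
            if stopping:
                return

    def stop(self):
        # Wake the worker, let it flush what is queued, and wait for it to exit.
        self._stop.set()
        self._worker.join()
```

In this model commit() never blocks on the worker, and everything queued
between two wake-ups of the worker shares a single simulated fsync().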

Note that any replication slave _will_ be waiting for the background thread;
we must wait for fsync() of the binary log before sending events to the
slaves, or we could in case of crash end up with more transactions on the
slaves than on the master, which would break replication. So there will be a
cost of extra replication latency.
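
The rule for slaves can be stated as a one-line filter. In this Python sketch
the function name and the use of list indexes as binlog positions are
assumptions made for illustration.

```python
def events_for_slave(binlog, synced_upto, slave_pos):
    """Return the events that may be sent to a slave positioned at slave_pos.

    Only events below synced_upto (already covered by an fsync() of the binlog)
    are eligible; anything newer must wait for the background thread.
    """
    return binlog[slave_pos:synced_upto]
```

A slave that has caught up to the synced position receives nothing until the
next fsync(), which is where the extra replication latency comes from.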

Discussion
----------

A main complication with this worklog is that the main part of the commit will
run in a background thread, which is different from the thread that ran the
main transaction.

This requires that storage engines are checked, and modified as needed, so
that they do not depend on thread local storage in that part of the code, and
also do not do things like release a mutex that was acquired from another
thread, etc.

We also need to handle the issue that we need the THD object of the
original transaction/thread to be able to finish the commit in the background
thread. I think we will need to remove the THD object from the original thread
before returning to the client and keep a reference to it in the queue of
transactions waiting to group commit in the background thread. The background
thread must then use this THD for the last part of commit and afterwards
return the THD object to a pool of spare THDs; when/if the client starts a new
transaction, it will then need to obtain a new THD from this pool of spare
THDs.
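
The THD hand-off described above might look roughly like the following Python
sketch. Everything here is hypothetical: the THD class is a bare stand-in for
the real server object, and the pool and queue are plain deques with no
locking.

```python
from collections import deque


class THD:
    """Bare stand-in for the server's per-connection THD object."""

    def __init__(self, ident):
        self.ident = ident
        self.pending_txn = None


class Connection:
    def __init__(self, spare_thds, commit_queue):
        self.spare_thds = spare_thds      # pool of spare THDs
        self.commit_queue = commit_queue  # transactions waiting for group commit
        self.thd = None

    def start_transaction(self):
        if self.thd is None:
            # Raises IndexError if the pool is empty; a real server would have
            # to allocate a new THD or wait for one to be returned.
            self.thd = self.spare_thds.popleft()

    def commit(self, txn):
        # Detach the THD from this connection and hand it to the background thread.
        self.thd.pending_txn = txn
        self.commit_queue.append(self.thd)
        self.thd = None


def background_finish(commit_queue, spare_thds):
    """Finish the queued commits and return their THDs to the spare pool."""
    finished = []
    while commit_queue:
        thd = commit_queue.popleft()
        finished.append(thd.pending_txn)  # last part of commit would run here
        thd.pending_txn = None
        spare_thds.append(thd)
    return finished
```

The point of the sketch is that the connection never touches a THD again once
it has been queued; it picks up a different one for its next transaction.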

----

Currently, the MWL#116 API has prepare_ordered() being called _after_
prepare(). I am not sure if there is any particularly good reason for
this. The only reason I know of is that this is how the Facebook
innodb_release_locks_early patch did things; however, when I asked, I did not
get any answer from Facebook as to why they did it in this order. If we switched
this, we should be able to return to the client even earlier, already before
prepare(), so we could keep the fsync() in prepare().

Alternatively, we should use MWL#164 to not have to call fsync() in prepare();
then we can return to the client only after prepare(), and still avoid any
fsync() latency visible to the client.

Note that we have to make the transaction committed to memory and visible to
other transactions before returning to the client, just as with
innodb_release_locks_early in MWL#163. The client would be quite confused if
COMMIT returns ok, yet the committed transaction is not visible to following
SELECT statements!

----

The first transaction to enter the group commit queue becomes the group commit
leader, and would just signal the background thread to continue the commit
sequence and then return to the client. Subsequent transactions that enqueue
after would not signal the background thread; instead they would set a flag in
their entry in the queue, so that the background thread knows that they are
asynchronous and should not be woken up after commit is done (the original
thread may no longer exist at that point, or may be doing something else).
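
The leader/follower behaviour above can be sketched as follows in Python. The
class names, the is_async attribute and the signal counter are assumptions for
illustration; in particular, the sketch only models the flag on followers, as
described, and counts signals instead of using a real condition variable.

```python
class QueueEntry:
    def __init__(self, txn):
        self.txn = txn
        # Set on followers: the background thread must not try to wake the
        # originating thread once the commit is done.
        self.is_async = False


class GroupCommitQueue:
    def __init__(self):
        self.entries = []
        self.signals_sent = 0  # how often the background thread was signalled

    def enqueue(self, txn):
        """Queue a transaction; return True if it became the group commit leader."""
        entry = QueueEntry(txn)
        is_leader = not self.entries
        self.entries.append(entry)
        if is_leader:
            self.signals_sent += 1  # leader signals the background thread
        else:
            entry.is_async = True   # follower only marks its own entry
        return is_leader

    def run_group_commit(self):
        """Background thread: take the whole queue and commit it as one group."""
        group, self.entries = self.entries, []
        return [entry.txn for entry in group]
```

Only one signal is sent per group, no matter how many transactions pile up
behind the leader before the background thread runs.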

----

The main benefit of this worklog would be that we could get crash safe
replication state on the master without any fsync() overhead visible to the
client.

ESTIMATED WORK TIME

ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v4.0.0)
