-----------------------------------------------------------------------
                              WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: Consistent (but non-durable) crash recovery from binlog 2-phase commit
CREATION DATE..: Mon, 21 Mar 2011, 11:54
SUPERVISOR.....:
IMPLEMENTOR....:
COPIES TO......:
CATEGORY.......: Server-RawIdeaBin
TASK ID........: 187 (http://askmonty.org/worklog/?tid=187)
VERSION........: Server-9.x
STATUS.........: Un-Assigned
PRIORITY.......: 60
WORKED HOURS...: 0
ESTIMATE.......: 0 (hours remain)
ORIG. ESTIMATE.: 0

PROGRESS NOTES:

DESCRIPTION:

Overview
--------

Current MySQL needs no fewer than three fsync() calls for every commit
in order to ensure the ability to recover into a consistent state
(between the master binlog and the storage engine(s)). Without them,
after a master crash, the binlog may be inconsistent with the data in
tables on the master, and replication slaves will diverge from the
master, possibly causing replication to eventually break.

MWL#116 implements group commit, so that each of the three fsync()
calls can be shared among multiple transaction commits. MWL#164
describes a way to reduce this to a single fsync() call shared among
multiple commits, by recovering any transactions lost in storage
engines from the binlog, thereby avoiding the need for fsync() calls in
the engines. Both MWL#116 and MWL#164 improve performance without
reducing functionality or the ability to do crash recovery.

This worklog describes a method to reduce the overhead to *zero*
fsync() calls per commit (though one will still want to fsync() the
binlog periodically, say once per second). After a master crash, crash
recovery is still guaranteed to recover the system into a consistent
state, where every transaction committed in the storage engine is also
committed in the binlog, and vice versa.
However, the cost of this is that durability is no longer guaranteed:
after a crash, a number of transactions committed before the crash may
be lost.

Thus this worklog is similar to the InnoDB setting
innodb_flush_log_at_trx_commit=2 (or 0). That setting makes InnoDB not
call fsync() after every commit, possibly losing some transactions in
case of a crash, but still guaranteeing that the table data can be
recovered into a consistent state. This worklog would do the same, but
for the binlog and InnoDB together, whereas
innodb_flush_log_at_trx_commit=2/0 only guarantees consistency for
InnoDB in isolation.

Idea
----

This worklog builds on top of the group commit framework implemented in
MWL#116.

The basic idea is similar to MWL#163, "release of row locks in InnoDB
during prepare() phase". As soon as we have successfully run the
prepare step in InnoDB, we make the transaction committed in memory so
that it is visible to other transactions from that point on and all row
locks are released.

However, at this point we go further than MWL#163, and return
successful commit to the client connection. The rest of the group
commit work is delegated to a separate background thread: writing and
fsync()'ing to the binlog, and running commit_ordered() and commit()
inside the storage engine.

The client is thus free to continue with new transactions without
waiting for any fsync() delay. The background thread will do an fsync()
call, but since no client is waiting for it, the background thread is
free to do a long wait (e.g. 1 second) between fsync() calls,
potentially collecting lots of transactions for a single huge group
commit.

Note that any replication slave _will_ be waiting for the background
thread; we must wait for fsync() of the binary log before sending
events to the slaves, or we could in case of a crash end up with more
transactions on the slaves than on the master, which would break
replication. So there will be a cost of extra replication latency.
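
The flow described above can be sketched as a toy model in Python. This
is only an illustration under stated assumptions, not server code: the
class name AsyncGroupCommitter, its methods, and the in-memory lists
standing in for the storage engine and the binlog are all hypothetical,
and a counter takes the place of a real fsync() call.

```python
import queue
import threading


class AsyncGroupCommitter:
    """Toy model of asynchronous group commit (all names are hypothetical).

    commit() returns as soon as the transaction is visible in memory; a
    background thread later makes whole batches durable with one simulated
    fsync() per batch.
    """

    def __init__(self, flush_interval=1.0):
        self.flush_interval = flush_interval  # seconds between group commits
        self.pending = queue.Queue()          # transactions awaiting group commit
        self.visible = []                     # committed in memory, seen by readers
        self.durable = []                     # stands in for the fsync()'ed binlog
        self.fsync_calls = 0                  # counts simulated fsync() calls
        self._stop = threading.Event()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def commit(self, txn):
        # The engine prepare() step would run here (omitted in this sketch).
        self.visible.append(txn)   # visible to other transactions from now on
        self.pending.put(txn)      # hand the rest of the commit to the background
        return "ok"                # the client never waits for an fsync()

    def _flush_once(self):
        # Drain everything queued so far into one batch.
        batch = []
        while True:
            try:
                batch.append(self.pending.get_nowait())
            except queue.Empty:
                break
        if batch:
            self.durable.extend(batch)  # write the whole batch to the "binlog"
            self.fsync_calls += 1       # one simulated fsync() for the batch
            # commit_ordered() and commit() in the engine would follow here.
        return len(batch)

    def _run(self):
        # Event.wait() returns False on timeout and True once close() is called.
        while not self._stop.wait(self.flush_interval):
            self._flush_once()
        self._flush_once()  # final group commit on clean shutdown

    def close(self):
        self._stop.set()
        self._worker.join()
```

With a long flush_interval, several calls to commit() all return "ok"
while the durable list is still empty, which is the window in which a
crash would lose those transactions. After close(), the same
transactions have been made durable by a single simulated fsync().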

Discussion
----------

A main complication with this worklog is that the main part of the
commit will run in a background thread, which is different from the
thread that ran the main transaction. This requires that storage
engines are checked, and modified as needed, so that they do not depend
on thread-local storage in that part of the code, and also do not do
things like release a mutex that was acquired from another thread, etc.

We also need to handle the issue that we need the THD object of the
original transaction/thread to be able to finish the commit in the
background thread. I think we will need to remove the THD object from
the original thread before returning to the client and keep a reference
to it in the queue of transactions waiting to group commit in the
background thread. The background thread must then use this THD for the
last part of commit and afterwards return the THD object to a pool of
spare THDs; when/if the client starts a new transaction, it will then
need to obtain a new THD from this pool of spare THDs.

----

Currently, the MWL#116 API has prepare_ordered() being called _after_
prepare(). I am not sure if there is any particularly good reason for
this. The only reason I know of is that this is how the Facebook
innodb_release_locks_early patch did things; however, when I asked, I
did not get any answer from Facebook as to why they did it in this
order.

If we switched this, we should be able to return to the client even
earlier, already before prepare(), so we could keep the fsync() in
prepare().

Alternatively, we should use MWL#164 to not have to call fsync() in
prepare(); then we can return to the client only after prepare(), and
still avoid any fsync() latency visible to the client.

Note that we have to make the transaction committed to memory and
visible to other transactions before returning to the client, just as
with innodb_release_locks_early in MWL#163.
The client would be quite confused if COMMIT returned ok, yet the
committed transaction was not visible to subsequent SELECT statements!

----

The first transaction to enter the group commit queue becomes the group
commit leader, and would just signal the background thread to continue
the commit sequence and then return to the client. Subsequent
transactions that enqueue afterwards would not signal the background
thread; instead they would set a flag in their entry in the queue, so
that the background thread knows that they are asynchronous and should
not be woken up after commit is done (the original thread may no longer
exist at that point, or may be doing something else).

----

The main benefit of this worklog would be that we could get crash-safe
replication state on the master without any fsync() overhead visible to
the client.

ESTIMATED WORK TIME

ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v4.0.0)